Remove subseries (rows in a data frame) that meet a condition

I have a data frame with a time series (column 1) and values (column 2) that are features of each subseries of the time series. How can I remove the subseries that meet a condition?

The picture shows what I want to do.

I tried writing a loop that creates an additional column marking which rows should be removed, but this solution is very computationally expensive (the column has 10 million records). Code (slow solution):

import numpy as np 
import pandas as pd 

# sample data (smaller than actual df) 
# length of df = 100; should be 10000000 in the actual data frame 
time_ser = 100*[25] 
max_num = 20 
distance = np.random.uniform(0,max_num,100) 
to_remove= 100*[np.nan] 

data_dict = {'time_ser':time_ser, 
      'distance':distance, 
      'to_remove': to_remove 
      } 

df = pd.DataFrame(data_dict) 

subser_size = 3 
maxdist = 18 


# loop which creates an additional column which indicates which indexes should be removed. 
# Takes first value in a subseries and checks if it meets the condition. 
# If it does, all values in subseries (i.e. rows) should be removed ('wrong'). 

for i,d in zip(range(len(df)), df.distance): 
    if d >= maxdist: 
        df.to_remove.iloc[i:i+subser_size] = 'wrong' 
    else: 
        df.to_remove.iloc[i] ='good'

Source

2017-06-12 Marta

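You can use a list comprehension to create the arrays of indexes, join them with numpy.concatenate, and remove duplicates with numpy.unique. Then use drop: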

np.random.seed(123) 
time_ser = 100*[25] 
max_num = 20 
distance = np.random.uniform(0,max_num,100) 
to_remove= 100*[np.nan] 

data_dict = {'time_ser':time_ser, 
      'distance':distance, 
      'to_remove': to_remove 
      } 

df = pd.DataFrame(data_dict) 
print (df) 
    distance time_ser to_remove 
0 13.929384  25  NaN 
1 5.722787  25  NaN 
2 4.537029  25  NaN 
3 11.026295  25  NaN 
4 14.389379  25  NaN 
5 8.462129  25  NaN 
6 19.615284  25  NaN 
7 13.696595  25  NaN 
8 9.618638  25  NaN 
9 7.842350  25  NaN 
10 6.863560  25  NaN 
11 14.580994  25  NaN

subser_size = 3 
maxdist = 18 

print (df.index[df['distance'] >= maxdist]) 
Int64Index([6, 38, 47, 84, 91], dtype='int64') 

arr = [np.arange(i, min(i+subser_size,len(df))) for i in df.index[df['distance'] >= maxdist]] 
idx = np.unique(np.concatenate(arr)) 
print (idx) 
[ 6 7 8 38 39 40 47 48 49 84 85 86 91 92 93] 
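
The index-building step above can also be sketched without a Python-level loop by using NumPy broadcasting. This is only an illustrative alternative, not the answerer's method; the names starts, pos, and idx_broadcast are made up for this sketch, and the sample frame is rebuilt so the block runs on its own.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'time_ser': 100 * [25],
                   'distance': np.random.uniform(0, 20, 100)})

subser_size = 3
maxdist = 18

# 0-based positions where a subseries to remove starts
starts = np.flatnonzero(df['distance'].to_numpy() >= maxdist)

# broadcast: one row per start, one column per offset inside the subseries
pos = (starts[:, None] + np.arange(subser_size)).ravel()

# keep only positions inside the frame; np.unique removes duplicates
# that appear when two subseries overlap
pos = np.unique(pos[pos < len(df)])

# translate positions into index labels
# (the same numbers here, because the frame has the default integer index)
idx_broadcast = df.index[pos]
```

Because this version works on positions and converts to labels only at the end, it should not depend on the frame having a default integer index.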

df = df.drop(idx) 
print (df) 
    distance time_ser to_remove 
0 13.929384  25  NaN 
1 5.722787  25  NaN 
2 4.537029  25  NaN 
3 11.026295  25  NaN 
4 14.389379  25  NaN 
5 8.462129  25  NaN 
9 7.842350  25  NaN 
10 6.863560  25  NaN 
11 14.580994  25  NaN 
... 
...
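
For a frame with millions of rows, the same rows can also be selected with a boolean mask instead of a list of indexes. The following is a minimal sketch under the assumption that rows are in time order; it uses a rolling maximum, which is a different technique from the one in the answer, and the names hit, to_drop, and df_kept are hypothetical.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'time_ser': 100 * [25],
                   'distance': np.random.uniform(0, 20, 100)})

subser_size = 3
maxdist = 18

# 1 where a subseries to remove starts, 0 elsewhere
hit = (df['distance'] >= maxdist).astype(int)

# a row is flagged if it, or any of the previous subser_size - 1 rows, is a hit;
# min_periods=1 lets the first rows use a shorter window instead of giving NaN
to_drop = hit.rolling(subser_size, min_periods=1).max().astype(bool)

# keep only the rows that are not flagged
df_kept = df[~to_drop]
```

The mask works on row order rather than index labels, so it does not need the integer arithmetic on index values that the list comprehension relies on.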

If you need the values in a new column instead of dropping the rows, use loc (run this on the original df, before drop):

df['to_remove'] = 'good' 
df.loc[idx, 'to_remove'] = 'wrong' 
print (df) 
    distance time_ser to_remove 
0 13.929384  25  good 
1 5.722787  25  good 
2 4.537029  25  good 
3 11.026295  25  good 
4 14.389379  25  good 
5 8.462129  25  good 
6 19.615284  25  wrong 
7 13.696595  25  wrong 
8 9.618638  25  wrong 
9 7.842350  25  good 
10 6.863560  25  good 
11 14.580994  25  good
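
The labeling step can also be written as a single expression. This is a small sketch, assuming the default integer index so that idx holds valid index labels; it uses numpy.where together with Index.isin rather than the two loc assignments above, and df_good is a hypothetical name.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'time_ser': 100 * [25],
                   'distance': np.random.uniform(0, 20, 100)})

subser_size = 3
maxdist = 18

# same index construction as in the answer
arr = [np.arange(i, min(i + subser_size, len(df)))
       for i in df.index[df['distance'] >= maxdist]]
idx = np.unique(np.concatenate(arr))

# label every row in one pass: 'wrong' if its index is in idx, 'good' otherwise
df['to_remove'] = np.where(df.index.isin(idx), 'wrong', 'good')

# the labels can be used later to filter the rows
df_good = df[df['to_remove'] == 'good']
```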

Source

2017-06-12 08:51:34 jezrael

Thank you for accepting. You can also upvote - click the small triangle above the '0', above the accept mark. Thanks. – jezrael
