Efficient grouping in pandas based on another Series

I need to perform a grouped operation based on another boolean column in a DataFrame. It is easiest to see with an example. I have the following DataFrame:

b   id 
0 False  0 
1 True  0 
2 False  0 
3 False  1 
4 True  1 
5 True  2 
6 True  2 
7 False  3 
8 True  4 
9 True  4 
10 False  4

and I would like to get a column that is True when the element in column b is True and it is the last time b is True for the given id:

b   id lastMention 
0 False  0  False 
1 True  0  True 
2 False  0  False 
3 False  1  False 
4 True  1  False 
5 True  2  True 
6 True  3  True 
7 False  3  False 
8 True  4  False 
9 True  4  True 
10 False  4  False

Although it is inefficient, I have code that achieves this:

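def lastMentionFun(df):
    # df is the sub-frame for a single id
    b = df['b']
    a = b.sum()
    if a > 0:
        # largest index label among the rows of this group where b is True
        maxInd = b[b].index.max()
        df.loc[maxInd, 'lastMention'] = True
    return df

df['lastMention'] = False
# group_keys=False keeps the original index on the result in recent pandas versions
df = df.groupby('id', group_keys=False).apply(lastMentionFun)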

Could someone suggest the correct pythonic approach to make this faster?

Source

2017-03-21 splinter

You can first filter the rows where column b is True, then use groupby and aggregate with max to get the maximum index value for each id:

print (df[df.b].reset_index().groupby('id')['index'].max()) 
id 
0 1 
1 4 
2 6 
4 9 
Name: index, dtype: int64

Then use loc with those index values to replace False with True:

df['lastMention'] = False 
df.loc[df[df.b].reset_index().groupby('id')['index'].max(), 'lastMention'] = True 

print (df) 
     b id lastMention 
0 False 0  False 
1 True 0   True 
2 False 0  False 
3 False 1  False 
4 True 1   True 
5 True 2  False 
6 True 2   True 
7 False 3  False 
8 True 4  False 
9 True 4   True 
10 False 4  False
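For readers who want to run the idea end to end, here is a minimal self-contained sketch of the same filter, groupby, max, loc sequence. The sample data is copied from the first table in the question; the helper name `mark_last_true` and its parameters are hypothetical and not part of the original answer. It uses `index.to_series()` instead of `reset_index()` so that it does not depend on the index column being named `index`.

```python
import pandas as pd

# Sample data copied from the first table in the question.
df = pd.DataFrame({
    'b': [False, True, False, False, True, True, True, False, True, True, False],
    'id': [0, 0, 0, 1, 1, 2, 2, 3, 4, 4, 4],
})

def mark_last_true(frame, flag='b', key='id'):
    """Hypothetical helper: True at the largest index label per `key` where `flag` is True."""
    only_true = frame[frame[flag]]
    # Largest index label among the True rows of each group.
    last_idx = only_true.index.to_series().groupby(only_true[key]).max()
    result = pd.Series(False, index=frame.index)
    result.loc[last_idx.to_numpy()] = True
    return result

df['lastMention'] = mark_last_true(df)
print(df)
```

Groups with no True value in b (id 3 here) never appear in the filtered frame, so they are left as False without any special handling.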

Another solution: get the max index values with groupby and apply, then test the membership of the index values with isin. The output is a boolean array:

print (df[df.b].groupby('id').apply(lambda x: x.index.max())) 
id 
0 1 
1 4 
2 6 
4 9 
dtype: int64 

df['lastMention'] = df.index.isin(df[df.b].groupby('id').apply(lambda x: x.index.max())) 
print (df) 
     b id lastMention 
0 False 0  False 
1 True 0  True 
2 False 0  False 
3 False 1  False 
4 True 1  True 
5 True 2  False 
6 True 2  True 
7 False 3  False 
8 True 4  False 
9 True 4  True 
10 False 4  False
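If the rows are already in the order that defines "last", a variant that avoids the Python-level lambda may be worth trying. The sketch below is a different technique from the isin approach above, not the answerer's method: it relies on row position rather than on the largest index label, and it assumes the sample data from the first table in the question.

```python
import pandas as pd

# Sample data copied from the first table in the question.
df = pd.DataFrame({
    'b': [False, True, False, False, True, True, True, False, True, True, False],
    'id': [0, 0, 0, 1, 1, 2, 2, 3, 4, 4, 4],
})

# Keep only the rows where b is True.
true_rows = df[df['b']]

# keep='last' flags every row except the last occurrence of each id as a
# duplicate, so negating the mask leaves True only on the last True row per id.
is_last_true = ~true_rows.duplicated(subset='id', keep='last')

# Rows where b is False are missing from the mask; fill them with False.
df['lastMention'] = is_last_true.reindex(df.index, fill_value=False)
print(df)
```

With a monotonically increasing index, as in the example, this gives the same rows as the max-index version; with an unsorted index the two can differ.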

Source

2017-03-21 12:12:10 jezrael

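I'm not sure this is the most efficient way, but it uses only built-in functions. The main idea is to take the 'cumsum' of b within each id and then check that it equals the group's max, i.e. the last one. (pd.merge is used only to bring the max values back into the table. Is there a better way to do this?)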

import numpy as np
import pandas as pd
# running count of True values of b within each id
df['cum_b'] = df.groupby('id')['b'].cumsum()
df = pd.merge(df, df[['id','cum_b']].groupby('id', as_index=False).max(), how='left', on='id', suffixes=('','_max'))
df['lastMention'] = np.logical_and(df.b, df.cum_b == df.cum_b_max)

P.S. The data frame given in this example changes slightly between the first snippet and the second snippet.
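On the question of whether the merge can be avoided: one possibility is `groupby(...).transform('max')`, which returns the per-group maximum already aligned to the original rows. The following is a sketch under that assumption, using the sample data from the first table in the question; the intermediate names `cum_b` and `cum_b_max` mirror the columns above but are kept as standalone Series here.

```python
import pandas as pd

# Sample data copied from the first table in the question.
df = pd.DataFrame({
    'b': [False, True, False, False, True, True, True, False, True, True, False],
    'id': [0, 0, 0, 1, 1, 2, 2, 3, 4, 4, 4],
})

# Running count of True values within each id (cast to int to make the sum explicit).
cum_b = df['b'].astype(int).groupby(df['id']).cumsum()

# Per-id maximum of the running count, broadcast back to every row of the group.
cum_b_max = cum_b.groupby(df['id']).transform('max')

# A row is the last mention if b is True there and the running count has
# reached its group maximum.
df['lastMention'] = df['b'] & (cum_b == cum_b_max)
print(df)
```

Because `transform` keeps the original index, no helper columns or suffixes are needed and the index of `df` is left unchanged.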

Source

2017-03-21 12:13:35
