You can try notnull and numpy.in1d:
df_new1 = df.groupby('id').apply(lambda x: pd.Series(dict(
new_col1=(x['foo'].notnull()).sum(),
new_col2=np.in1d(x['bar'],'P').sum(),
new_col3=np.in1d(x['bar'],'C').sum(),
new_col4=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum(),
new_col5=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, not yet paid'])).sum(),
new_col6=(np.in1d(x['Status'],['Approved, paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum()
)))
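Another, faster solution is to convert the values to 0 and 1 with factorize, then create the inverted columns with abs, and finally use groupby and sum. Note that factorize numbers values in order of first appearance, so this works only when each column holds exactly two distinct values and they first appear in the same order as in the test data (for example 'C' before 'P' in bar):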
df['new_col1'] = df['foo'].notnull().astype(int)
df['new_col2'] = df['bar'].factorize()[0]
df['new_col3'] = (df['new_col2'] - 1).abs()
df['Status'] = df['Status'].factorize()[0]
df['invertStatus'] = (df['Status'] - 1).abs()
df['Current_Status'] = df['Current_Status'].factorize()[0]
df['invertCurrent_Status'] = (df['Current_Status'] - 1).abs()
df['new_col4'] = df['Status'] & df['invertCurrent_Status']
df['new_col5'] = df['Status'] & df['Current_Status']
df['new_col6'] = df['invertStatus'] & df['invertCurrent_Status']
print(df.groupby('id')[['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']].sum())
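Because the factorize approach depends on which integer code each value receives, it may help to see how those codes are assigned. The following is a minimal sketch with made-up two-value Series (the variable names are hypothetical and not part of the original answer):

```python
import pandas as pd

# factorize numbers the distinct values in order of first appearance
c_first = pd.Series(['C', 'P', 'P', 'P'])
p_first = pd.Series(['P', 'C', 'P', 'P'])

codes_c, uniques_c = pd.factorize(c_first)
codes_p, uniques_p = pd.factorize(p_first)

print(list(codes_c), list(uniques_c))
print(list(codes_p), list(uniques_p))

# summing the codes counts whichever value appeared second,
# so the same column name can end up counting 'P' or 'C'
print(codes_c.sum(), codes_p.sum())
```

If the first row of a column might hold either value, building the 0/1 column from an explicit comparison avoids this dependence on row order.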
Alternatively, you can create boolean Series, which is the fastest solution:
df['new_col1'] = df['foo'].notnull()
df['new_col2'] = np.in1d(df['bar'], 'P')
df['new_col3'] = ~df['new_col2']
Status = np.in1d(df['Status'],'Approved, not yet paid')
invertStatus = ~Status
Current_Status = np.in1d(df['Current_Status'],'Approved, not yet paid')
invertCurrent_Status = ~Current_Status
df['new_col4'] = Status & invertCurrent_Status
df['new_col5'] = Status & Current_Status
df['new_col6'] = invertStatus & invertCurrent_Status
#print df
print(df.groupby('id')[['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']].sum().astype(int))
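For reference, here is a self-contained sketch of the boolean approach run on the test DataFrame from this answer. It is an illustration under a few assumptions: it uses np.isin in place of np.in1d (newer NumPy releases deprecate np.in1d), it collects the flags in a separate frame named flags rather than adding columns to df, and the constants PAID and NOT_PAID are names introduced here. As in the answer, the inverted masks mean "paid" only because each status column holds exactly two values.

```python
import numpy as np
import pandas as pd

PAID = 'Approved, paid'
NOT_PAID = 'Approved, not yet paid'

# the test DataFrame from this answer
df = pd.DataFrame({
    'id': [1] * 5 + [2] * 6,
    'foo': [23, 63, 84, 125, 12, 23, 63, 84, 125, 216, 12],
    'bar': list('CPPPC') + list('CPPPPC'),
    'Status': [PAID, NOT_PAID, PAID, NOT_PAID, PAID,
               PAID, NOT_PAID, PAID, NOT_PAID, NOT_PAID, PAID],
    'Current_Status': [PAID, PAID, PAID, NOT_PAID, PAID,
                       PAID, PAID, PAID, NOT_PAID, PAID, PAID],
})

# boolean masks computed once on whole columns (no per-group lambda)
is_p = np.isin(df['bar'], ['P'])
status_np = np.isin(df['Status'], [NOT_PAID])
current_np = np.isin(df['Current_Status'], [NOT_PAID])

flags = pd.DataFrame({
    'id': df['id'],
    'new_col1': df['foo'].notnull(),
    'new_col2': is_p,
    'new_col3': ~is_p,
    'new_col4': status_np & ~current_np,   # not yet paid -> paid
    'new_col5': status_np & current_np,    # not yet paid -> not yet paid
    'new_col6': ~status_np & ~current_np,  # paid -> paid
})

# summing booleans per group counts the True values
result = flags.groupby('id').sum().astype(int)
print(result)
```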
Timings:
In [25]: len(df)
Out[25]: 110000
In [26]: %timeit a(df)
10 loops, best of 3: 24.7 ms per loop
In [27]: %timeit b(df1)
10 loops, best of 3: 39.3 ms per loop
In [28]: %timeit c(df2)
10 loops, best of 3: 46 ms per loop
In [29]: %timeit d(df3)
10 loops, best of 3: 103 ms per loop
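The numbers above come from IPython's %timeit. Outside IPython, a comparable measurement could be taken with the standard timeit module. The sketch below is only an illustration: the randomly generated frame and the function count_p are hypothetical stand-ins, not the functions timed above, and the printed time will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

# hypothetical frame: random ids and a two-valued column, used only for timing
rng = np.random.default_rng(0)
n = 110000
df = pd.DataFrame({
    'id': rng.integers(0, 1000, size=n),
    'bar': rng.choice(['P', 'C'], size=n),
})

def count_p(frame):
    # build the boolean column first, then do a plain groupby sum
    return frame.assign(is_p=frame['bar'].eq('P')).groupby('id')['is_p'].sum()

runs = 10
seconds = timeit.timeit(lambda: count_p(df), number=runs)
print('%.1f ms per loop' % (seconds / runs * 1000))
```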
Code:
def c(df):
    # per-group counts using np.in1d membership tests
    return df.groupby('id').apply(lambda x: pd.Series(dict(
        new_col1=(x['foo'].notnull()).sum(),
        new_col2=np.in1d(x['bar'],'P').sum(),
        new_col3=np.in1d(x['bar'],'C').sum(),
        new_col4=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum(),
        new_col5=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, not yet paid'])).sum(),
        new_col6=(np.in1d(x['Status'],['Approved, paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum(),
    )))

def d(df):
    # same counts using plain == comparisons
    # note: x['foo'] != np.nan is always True, so new_col1 here counts every row
    return df.groupby('id').apply(lambda x: pd.Series(dict(
        new_col1=(x['foo'] != np.nan).sum(),
        new_col2=(x['bar'] == 'P').sum(),
        new_col3=(x['bar'] == 'C').sum(),
        new_col4=((x['Status']=='Approved, not yet paid') & (x['Current_Status']=='Approved, paid')).sum(),
        new_col5=((x['Status']=='Approved, not yet paid') & (x['Current_Status']=='Approved, not yet paid')).sum(),
        new_col6=((x['Status']=='Approved, paid') & (x['Current_Status']=='Approved, paid')).sum(),
    )))
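One detail worth noting in d: it tests for missing values with x['foo'] != np.nan. NaN never compares equal to anything, including itself, so that comparison is True for every row. A small sketch with a made-up Series (the values are hypothetical) shows the difference from notnull:

```python
import numpy as np
import pandas as pd

foo = pd.Series([23.0, np.nan, 84.0])

always_true = (foo != np.nan).sum()   # counts every row, missing or not
non_null = foo.notnull().sum()        # counts only the non-missing values

print(always_true, non_null)
```

On data without missing values, such as the test DataFrame, both expressions give the same count, which is why the difference is easy to overlook.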
Test DataFrame:
    id  foo  bar                  Status          Current_Status
 0   1   23    C          Approved, paid          Approved, paid
 1   1   63    P  Approved, not yet paid          Approved, paid
 2   1   84    P          Approved, paid          Approved, paid
 3   1  125    P  Approved, not yet paid  Approved, not yet paid
 4   1   12    C          Approved, paid          Approved, paid
 5   2   23    C          Approved, paid          Approved, paid
 6   2   63    P  Approved, not yet paid          Approved, paid
 7   2   84    P          Approved, paid          Approved, paid
 8   2  125    P  Approved, not yet paid  Approved, not yet paid
 9   2  216    P  Approved, not yet paid          Approved, paid
10   2   12    C          Approved, paid          Approved, paid
Can you add some sample data, say 5-6 rows? – jezrael
Just did. I'm pretty sure the "Out[15]" syntax is invalid, so please ignore it! – nonegiven72