Pandas DataFrame apply efficiency

I am adding a column that holds a status depending on whether a matching value exists in another DataFrame. My current code works:

df1['NewColumn'] = df1['ComparisonColumn'].apply(lambda x: 'Match' if any(df2.ComparisonColumn == x) else ('' if x is None else 'Missing'))

I think the line is ugly, and I get the impression that it is inefficient. Can you suggest a better way to do this comparison?

You can use np.where, isin, and isnull.


2017-05-30 user3535074


Create some dummy data:

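import numpy as np
import pandas as pd

np.random.seed(123)
# Use a float column so that NaN can be stored without changing the dtype
df = pd.DataFrame({'ComparisonColumn': np.random.randint(10, 20, 20).astype(float)})
df.iloc[4] = np.nan  # create missing data
df2 = pd.DataFrame({'ComparisonColumn': np.random.randint(15, 30, 20)})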

Do the matching with np.where:

df['NewColumn'] = np.where(df.ComparisonColumn.isin(df2.ComparisonColumn),'Matched',np.where(df.ComparisonColumn.isnull(),'Missing',''))

Output:

    ComparisonColumn NewColumn
0               12.0
1               12.0
2               16.0   Matched
3               11.0
4                NaN   Missing
5               19.0   Matched
6               16.0   Matched
7               11.0
8               10.0
9               11.0
10              19.0   Matched
11              10.0
12              10.0
13              19.0   Matched
14              13.0
15              14.0
16              10.0
17              10.0
18              14.0
19              11.0
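Note that the labels above differ from the ones in the question, which used 'Match', '' and 'Missing'. As a minimal sketch of one way to get the question's labels, assuming that a null value should give '' and an unmatched value should give 'Missing', np.select can express the three cases in order. The sample frames here are made up for illustration; only the column names come from the thread.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frames; only the column names come from the thread.
df1 = pd.DataFrame({'ComparisonColumn': [12.0, 16.0, np.nan, 19.0]})
df2 = pd.DataFrame({'ComparisonColumn': [16, 19, 25]})

col = df1['ComparisonColumn']
conditions = [
    col.isin(df2['ComparisonColumn']).to_numpy(),  # value also exists in df2
    col.isnull().to_numpy(),                       # value is NaN or None
]
choices = ['Match', '']

# Conditions are checked in order; rows that meet neither get the default.
df1['NewColumn'] = np.select(conditions, choices, default='Missing')
print(df1)
```

With more than two or three cases, np.select tends to read more easily than nested np.where calls, although both approaches should give the same result here.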



2017-05-30 13:38:24

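Thank you very much for this. I have implemented it, and it is a little faster and definitely clearer. Can you comment on why it is faster? Something missing from my original post is that the comparison may be a text comparison. It seems interesting to do a text comparison with numpy. – user3535074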
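On the text comparison point, here is a small sketch suggesting that the same isin / isnull pattern also works on a column of strings. The frame names `left` and `right` and the values in them are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical text data; names and values are made up for illustration.
left = pd.DataFrame({'ComparisonColumn': ['apple', 'pear', None, 'kiwi']})
right = pd.DataFrame({'ComparisonColumn': ['kiwi', 'apple', 'plum']})

col = left['ComparisonColumn']
# Same nesting as in the answer: matched first, then missing, else empty.
left['NewColumn'] = np.where(
    col.isin(right['ComparisonColumn']), 'Matched',
    np.where(col.isnull(), 'Missing', ''))
print(left)
```

The membership test is done by pandas' isin, and np.where only picks a label from the resulting boolean masks, so numpy is not comparing the strings itself in this sketch.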

@user3535074 Yes, apply is a somewhat slow operation, whereas with the isin function control stays inside Pandas, which uses Numpy where the comparison is being done. –
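To see the difference on your own machine, a rough timing sketch along these lines may help. The series `a` and `b`, their sizes, and the helper names `with_apply` and `with_isin` are assumptions made for illustration, and the timings will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: two integer series with partly overlapping values.
rng = np.random.RandomState(0)
a = pd.Series(rng.randint(0, 1000, 1000))
b = pd.Series(rng.randint(500, 1500, 1000))

def with_apply():
    # One Python-level call per row, each comparing x against all of b.
    return a.apply(lambda x: any(b == x))

def with_isin():
    # A single vectorized membership test.
    return a.isin(b)

# Both approaches should flag exactly the same rows.
assert (with_apply() == with_isin()).all()

print('apply:', timeit.timeit(with_apply, number=1))
print('isin: ', timeit.timeit(with_isin, number=1))
```

The apply version does work proportional to len(a) * len(b) in the Python interpreter, while isin performs the whole lookup in one call, which is the likely source of the speedup.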
