Identify specific words in pandas columns

I have a TSV file like the following:

id     ingredients             recipe
code1  egg, butter             beat eggs. add unsalted butter
code2  tim tam, butter         beat tim tam. add butter
code3  coffee, sugar           add coffee and sugar and mix
code4  sugar, fresh goat milk  beat sugar and milk together

I want to delete an entry if either its ingredients column or its recipe column contains any of the words below.

mylist = ['tim tam', 'unsalted butter', 'fresh goat milk']

My output should look like this:

id     ingredients    recipe
code3  coffee, sugar  add coffee and sugar and mix

Is there a way to do this in pandas? Please help me!

Source

2017-11-27 Anonymous

Can you show us the work you have done so far? – louisfischer

Use str.contains, joining the words with '|' to build a regular expression:

mylist = ['tim tam','unsalted butter','fresh goat milk'] 
df[~(df.ingredients.str.contains('|'.join(mylist)) | 
    df.recipe.str.contains('|'.join(mylist)))]

Output:

      id    ingredients                        recipe
2  code3  coffee, sugar  add coffee and sugar and mix
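
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the question's table inline instead of reading the TSV file, and it makes two additions that are not part of the answer above: each word is passed through re.escape so that regex metacharacters in a search term would be matched literally, and na=False is used so that missing values count as "no match". Both are assumptions about what is wanted, not requirements.

```python
import re

import pandas as pd

# Sample data rebuilt from the question (the real data would come from the TSV file).
df = pd.DataFrame({
    "id": ["code1", "code2", "code3", "code4"],
    "ingredients": ["egg, butter", "tim tam, butter",
                    "coffee, sugar", "sugar, fresh goat milk"],
    "recipe": ["beat eggs. add unsalted butter", "beat tim tam. add butter",
               "add coffee and sugar and mix", "beat sugar and milk together"],
})

mylist = ['tim tam', 'unsalted butter', 'fresh goat milk']

# Escape each word so characters such as "." or "+" are treated literally,
# then join the alternatives with "|".
pattern = "|".join(re.escape(word) for word in mylist)

# True where either column contains one of the words; NaN counts as no match.
mask = (df["ingredients"].str.contains(pattern, na=False)
        | df["recipe"].str.contains(pattern, na=False))

result = df[~mask]
print(result)
```

On the sample data this keeps only the code3 row, the same as the output above.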

Source

2017-11-27 15:30:54

A faster solution is to concatenate the two columns first and then check the values with contains:

df = df[~(df['ingredients'] + df['recipe']).str.contains('|'.join(mylist))] 
print (df) 
      id    ingredients                        recipe
2  code3  coffee, sugar  add coffee and sugar and mix
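
One caveat with the plain + concatenation: the two strings are glued together with nothing in between, so a search term could in principle match across the boundary between ingredients and recipe, and a NaN in either column makes the combined value NaN. The sketch below is one possible variant, not the answer's own code: it fills missing values with an empty string and joins the columns with a newline separator through Series.str.cat. The choice of "\n" as separator is an assumption that none of the search terms contains a newline.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "id": ["code1", "code2", "code3", "code4"],
    "ingredients": ["egg, butter", "tim tam, butter",
                    "coffee, sugar", "sugar, fresh goat milk"],
    "recipe": ["beat eggs. add unsalted butter", "beat tim tam. add butter",
               "add coffee and sugar and mix", "beat sugar and milk together"],
})
mylist = ['tim tam', 'unsalted butter', 'fresh goat milk']

# Join the two columns row by row with a separator that no search term contains,
# so a term cannot match across the column boundary.
combined = df["ingredients"].fillna("").str.cat(df["recipe"].fillna(""), sep="\n")

# A single contains() call over the combined text, then invert the mask.
filtered = df[~combined.str.contains('|'.join(mylist))]
print(filtered)
```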

Another solution:

Use contains on both columns, chain the two masks with |, then invert the result with ~:

m1 = df['ingredients'].str.contains('|'.join(mylist)) 
m2 = df['recipe'].str.contains('|'.join(mylist)) 
m = m1 | m2 
print (m) 
0     True
1     True
2    False
3     True
dtype: bool

df = df[~m] 
print (df) 
      id    ingredients                        recipe
2  code3  coffee, sugar  add coffee and sugar and mix
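
If more than two columns need checking, the same mask idea can be written once for a list of columns. This is a sketch rather than part of the answer: the cols list is a name introduced here for illustration, and na=False is an added assumption so that missing values count as "no match".

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "id": ["code1", "code2", "code3", "code4"],
    "ingredients": ["egg, butter", "tim tam, butter",
                    "coffee, sugar", "sugar, fresh goat milk"],
    "recipe": ["beat eggs. add unsalted butter", "beat tim tam. add butter",
               "add coffee and sugar and mix", "beat sugar and milk together"],
})
mylist = ['tim tam', 'unsalted butter', 'fresh goat milk']
pattern = '|'.join(mylist)

# Hypothetical list of columns to search; extend it as needed.
cols = ["ingredients", "recipe"]

# Apply contains() to each column, then mark a row if any column matched.
m = df[cols].apply(lambda s: s.str.contains(pattern, na=False)).any(axis=1)

kept = df[~m]
print(kept)
```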

Timings:

#[40000 rows x 3 columns] 
df = pd.concat([df]*10000).reset_index(drop=True) 

In [358]: %timeit df[~(df['ingredients'] + df['recipe']).str.contains('|'.join(mylist))] 
10 loops, best of 3: 47.8 ms per loop 

In [359]: %timeit df[~(df['ingredients'].str.contains('|'.join(mylist))|df['recipe'].str.contains('|'.join(mylist)))] 
10 loops, best of 3: 78.2 ms per loop
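
The %timeit lines above only work inside IPython. A rough way to repeat the comparison in a plain Python script is the standard-library timeit module, as sketched below. The repetition factor (2000) and the number of runs (5) are arbitrary choices made here to keep the script quick, and the timings will differ by machine and pandas version, so no expected figures are shown.

```python
import timeit

import pandas as pd

# Sample data rebuilt from the question, repeated to get a larger frame.
base = pd.DataFrame({
    "id": ["code1", "code2", "code3", "code4"],
    "ingredients": ["egg, butter", "tim tam, butter",
                    "coffee, sugar", "sugar, fresh goat milk"],
    "recipe": ["beat eggs. add unsalted butter", "beat tim tam. add butter",
               "add coffee and sugar and mix", "beat sugar and milk together"],
})
mylist = ['tim tam', 'unsalted butter', 'fresh goat milk']
pattern = '|'.join(mylist)
big = pd.concat([base] * 2000, ignore_index=True)


def filter_combined():
    # Concatenate the columns first, then call contains() once.
    return big[~(big['ingredients'] + big['recipe']).str.contains(pattern)]


def filter_separate():
    # Call contains() on each column and combine the two masks.
    return big[~(big['ingredients'].str.contains(pattern)
                 | big['recipe'].str.contains(pattern))]


# Total seconds for 5 runs of each approach.
t_combined = timeit.timeit(filter_combined, number=5)
t_separate = timeit.timeit(filter_separate, number=5)
print(f"combined: {t_combined:.4f} s, separate: {t_separate:.4f} s")
```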

Source

2017-11-27 15:30:52 jezrael
