2016-06-18 45 views
4

How to speed up a pandas dataframe str.contains search

I am searching for a substring, or several substrings, in a dataframe with 4 million rows.

df[df.col.str.contains('Donald',case=True,na=False)] 

or

df[df.col.str.contains('Donald|Trump|Dump',case=True,na=False)] 

The dataframe (df) looks like the following, except that it has 4 million rows of strings:

df = pd.DataFrame({'col': ["very definition of the American success story, continually setting the standards of excellence in business, real estate and entertainment.", 
         "The myriad vulgarities of Donald Trump—examples of which are retailed daily on Web sites and front pages these days—are not news to those of us who have", 
         "While a fearful nation watched the terrorists attack again, striking the cafés of Paris and the conference rooms of San Bernardino"]}) 

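Are there any tips for making this string search faster? For example, sorting the dataframe first, using a particular indexing method, changing the column names to numbers, or dropping "na=False" from the query? Even a speedup of a few milliseconds would be very helpful!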

Answers

3

If the number of substrings is small, it may be faster to search for them one at a time, because you can then pass the regex=False argument to contains.

I tested this on a sample DataFrame of about 6000 rows with two sample substrings: blah.contains("foo", regex=False) | blah.contains("bar", regex=False) was about twice as fast as blah.contains("foo|bar"). You will need to test with your own data to see how it scales.
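A minimal sketch of that approach, in case it helps. The sample Series `s` and the `substrings` list are made up for illustration (the None entry just stands in for a missing value); only `str.contains` and its `regex`, `case` and `na` arguments come from pandas itself.

```python
import pandas as pd

# Hypothetical sample data; the real column would hold about 4 million strings.
s = pd.Series([
    "very definition of the American success story",
    "The myriad vulgarities of Donald Trump",
    "While a fearful nation watched the terrorists attack again",
    None,  # missing value, handled by na=False below
])

substrings = ["Donald", "Trump", "Dump"]  # assumed search terms

# Single regex alternation, as in the question.
regex_mask = s.str.contains("|".join(substrings), case=True, na=False)

# One literal (non-regex) search per substring, OR-ed together.
literal_mask = pd.Series(False, index=s.index)
for sub in substrings:
    literal_mask = literal_mask | s.str.contains(sub, regex=False, na=False)

print(s[literal_mask])
```

Both masks select the same rows here. Which one is faster depends on the number of substrings and on the data, so it is worth timing both on a realistic sample.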

2

You can convert the column to a list. Searching a list is much faster than applying string methods to a Series.

Sample code:

import pandas as pd

df = pd.DataFrame({'col': ["very definition of the American success story, continually setting the standards of excellence in business, real estate and entertainment.",
         "The myriad vulgarities of Donald Trump—examples of which are retailed daily on Web sites and front pages these days—are not news to those of us who have",
         "While a fearful nation watched the terrorists attack again, striking the cafés of Paris and the conference rooms of San Bernardino"]})


def first_way():
    # Vectorized pandas string method
    df["new"] = df["col"].str.contains('Donald', case=True, na=False)

print("First_way: ")
# %timeit is an IPython magic, so run this in IPython or Jupyter
%timeit for x in range(10): first_way()
print(df)

# Rebuild the DataFrame so the second test starts from the same state
df = pd.DataFrame({'col': ["very definition of the American success story, continually setting the standards of excellence in business, real estate and entertainment.",
         "The myriad vulgarities of Donald Trump—examples of which are retailed daily on Web sites and front pages these days—are not news to those of us who have",
         "While a fearful nation watched the terrorists attack again, striking the cafés of Paris and the conference rooms of San Bernardino"]})


def second_way():
    # Convert the column to a plain list and use Python's "in" operator.
    # Note: this raises TypeError if the column contains NaN values.
    listed = df["col"].tolist()
    df["new"] = ["Donald" in n for n in listed]

print("Second way: ")
%timeit for x in range(10): second_way()
print(df)

Results:

First_way: 
100 loops, best of 3: 2.77 ms per loop 
               col new 
0 very definition of the American success story,... False 
1 The myriad vulgarities of Donald Trump—example... True 
2 While a fearful nation watched the terrorists ... False 
Second way: 
1000 loops, best of 3: 1.79 ms per loop 
               col new 
0 very definition of the American success story,... False 
1 The myriad vulgarities of Donald Trump—example... True 
2 While a fearful nation watched the terrorists ... False 
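The list approach above only checks a single substring. A rough sketch of how it might be extended to several substrings while still tolerating missing values follows. The helper `contains_any`, the `substrings` tuple and the shortened sample rows are hypothetical names and data introduced for illustration, and this variant has not been timed.

```python
import pandas as pd

# Hypothetical sample data, including a missing value.
df = pd.DataFrame({"col": [
    "very definition of the American success story",
    "The myriad vulgarities of Donald Trump",
    "While a fearful nation watched the terrorists attack again",
    None,
]})

substrings = ("Donald", "Trump", "Dump")  # assumed search terms


def contains_any(text, needles):
    # Non-string values (None/NaN) count as "no match", mirroring na=False.
    return isinstance(text, str) and any(n in text for n in needles)


# Same idea as second_way(): work on a plain Python list, not the Series.
df["new"] = [contains_any(t, substrings) for t in df["col"].tolist()]

print(df[df["new"]])
```

The `isinstance` check avoids the TypeError that a bare `"Donald" in n` raises on NaN. The extra function call per row adds some overhead, so it should be benchmarked against `str.contains` on real data before relying on it.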