Counting words in strings in pandas

I have a pandas DataFrame containing queries and their counts for a given time period, and I want to turn it into a count of unique words. For example, if the DataFrame contains:

query          count
foo bar        10
super          8
foo            4
super foo bar  2

I would like to get the following DataFrame. For example, the word "foo" appears exactly 16 times in the table.

word   count
foo    16
bar    12
super  10

The function below works, but it hardly seems like the best way to do this, and it ignores the count for each row.

import re
from collections import Counter

def _words(df):
    return Counter(re.findall(r'\w+', ' '.join(df['query'])))
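For reference, here is a minimal sketch of how the Counter approach could take each row's count into account. The name `weighted_words` and the sample DataFrame are illustrative assumptions, not part of the original post. Note that a word repeated within a single query would be counted once per occurrence here.

```python
import re
from collections import Counter

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

def weighted_words(df):
    """Add each row's count to every word found in that row's query."""
    totals = Counter()
    for query, count in zip(df["query"], df["count"]):
        for word in re.findall(r"\w+", query):
            totals[word] += count
    return totals

totals = weighted_words(df)
print(totals)
```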

Any help would be appreciated.

Thanks!

Source

2017-10-03 Seano314

Option 1

df['query'].str.get_dummies(sep=' ').T.dot(df['count']) 

bar  12 
foo  16 
super 10 
dtype: int64
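To see why this works, it may help to look at the intermediate indicator matrix. The sketch below rebuilds the sample data from the question (an assumption for illustration) and splits Option 1 into two steps. One caveat worth knowing: `str.get_dummies` records only whether a word is present in a row, so a word repeated inside one query is counted once for that row.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

# One row per query, one 0/1 column per distinct word.
indicators = df["query"].str.get_dummies(sep=" ")
#    bar  foo  super
# 0    1    1      0
# 1    0    0      1
# 2    0    1      0
# 3    1    1      1

# After transposing there is one row per word; the dot product with the
# counts adds up the counts of exactly those rows that contain the word.
word_counts = indicators.T.dot(df["count"])
print(word_counts)

# A repeated word still yields a single indicator of 1 for that row.
repeated = pd.Series(["foo foo"]).str.get_dummies(sep=" ")
print(repeated)
```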

Option 2

df['query'].str.get_dummies(sep=' ').mul(df['count'], axis=0).sum() 

bar  12 
foo  16 
super 10 
dtype: int64

Option 3
numpy.bincount + pd.factorize
This also highlights the use of cytoolz.mapcat, which maps a function over the inputs and returns an iterator that concatenates the results. Cool!

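import pandas as pd, numpy as np, cytoolz

q = df['query'].values
c = df['count'].values

# flatten all words, then map each distinct word to an integer code
f, u = pd.factorize(list(cytoolz.mapcat(str.split, q.tolist())))
# words per row = number of spaces + 1 (np.char is the public alias of np.core.defchararray)
l = np.char.count(q.astype(str), ' ') + 1

# repeat each row's count once per word and sum the weights per word code
pd.Series(np.bincount(f, c.repeat(l)).astype(int), u)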

foo  16 
bar  12 
super 10 
dtype: int64
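If cytoolz is not available, the same factorize + bincount idea can be sketched with plain list comprehensions. This is a variant for illustration, not the answer's original code; the variable names and the rebuilt sample DataFrame are assumptions.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

# Split every query into words and remember how many words each row has.
words_per_row = [q.split() for q in df["query"]]
lengths = np.array([len(ws) for ws in words_per_row])
flat_words = np.array([w for ws in words_per_row for w in ws], dtype=object)

# factorize maps each word to an integer code, in order of first appearance.
codes, uniques = pd.factorize(flat_words)

# Repeat each row's count once per word in that row, then let bincount
# sum those weights for every word code.
weights = df["count"].to_numpy().repeat(lengths)
word_counts = pd.Series(
    np.bincount(codes, weights=weights).astype(int), index=uniques
)
print(word_counts)
```

Because `np.bincount` with `weights` returns floats, the result is cast back to `int`, as in the original answer.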

Option 4
An absurd use of things... just use Option 1.

pd.DataFrame(dict(
    query=' '.join(df['query']).split(), 
    count=df['count'].repeat(df['query'].str.count(' ') + 1) 
)).groupby('query')['count'].sum() 

query 
bar  12 
foo  16 
super 10 
Name: count, dtype: int64

Source

2017-10-03 20:58:57 piRSquared

"Option 1" is pure beauty! – MaxU

Taking notes :) – Vaishali

Wow, thanks for all the detailed answers! Option 1 is great. Many thanks – Seano314

Just another alternative, using melt + groupby + sum:

(df['query'].str.split(expand=True).assign(count=df['count'])
           .melt('count').groupby('value')['count'].sum())

value 
bar  12 
foo  16 
super 10 
Name: count, dtype: int64
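On newer pandas versions, a similar reshape can be sketched with `explode` instead of `melt`. This is a swapped-in technique, not something proposed in the thread, and it assumes pandas 0.25 or later, where `DataFrame.explode` exists. The `word` column name and the rebuilt sample data are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

word_counts = (
    df.assign(word=df["query"].str.split())  # a list of words per row
      .explode("word")                       # one row per (word, count) pair
      .groupby("word")["count"]
      .sum()
)
print(word_counts)
```

Unlike the `get_dummies` options, this counts a word once for every time it occurs within a query.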

Source

2017-10-03 21:18:38
