Change values according to value_counts()

I have the following pandas DataFrame:

import pandas as pd 
from pandas import Series, DataFrame 

data = DataFrame({'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'], 
       'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'], 
       'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})

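I want to change the values in columns Qu1, Qu2 and Qu3 according to value_counts(), depending on whether a value's count is greater than or equal to some number. For example, for column Qu1: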

>>> pd.value_counts(data.Qu1) >= 2 
cheese  True 
potato  True 
banana  True 
apple  False 
egg  False

I want to keep the values cheese, potato and banana, because each of them appears at least twice.

The values apple and egg should be replaced with the value other.

For column Qu2 nothing should change:

>>> pd.value_counts(data.Qu2) >= 2 
banana  True 
apple  True 
sausage True

The final result should look like the attached test_data:

test_data = DataFrame({'Qu1': ['other', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'other'], 
        'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'], 
        'Qu3': ['other', 'potato', 'other', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'other']})

Thanks!

Source

2016-05-15 Toren

I would create a DataFrame of the same shape in which each entry is the value count of the corresponding value:

data.apply(lambda x: x.map(x.value_counts())) 
Out[229]: 
   Qu1  Qu2  Qu3
0    1    2    1
1    2    4    3
2    3    3    1
3    2    3    3
4    3    3    3
5    2    2    3
6    3    4    3
7    2    4    3
8    1    4    1

Then use the result in df.where to return "other" wherever the corresponding count is smaller than 2:

data.where(data.apply(lambda x: x.map(x.value_counts()))>=2, "other") 

      Qu1      Qu2     Qu3
0   other  sausage   other
1  potato   banana  potato
2  cheese    apple   other
3  banana    apple  cheese
4  cheese    apple  cheese
5  banana  sausage  potato
6  cheese   banana  cheese
7  potato   banana  potato
8   other   banana   other
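
The two steps above can be wrapped in a small helper so that the threshold and the replacement label become parameters. This is only a sketch: the function name replace_rare_values and its parameters min_count and fill are hypothetical names chosen for illustration, not part of pandas or of the answer above.

```python
import pandas as pd

data = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg'],
})


def replace_rare_values(frame, min_count=2, fill='other'):
    """Hypothetical helper: replace values seen fewer than min_count times in their column."""
    # Same shape as frame; each cell holds how often its value occurs in that column.
    counts = frame.apply(lambda col: col.map(col.value_counts()))
    # Keep cells whose count reaches the threshold, substitute fill everywhere else.
    return frame.where(counts >= min_count, fill)


result = replace_rare_values(data)
print(result)
```

Raising min_count to 3 would additionally replace potato and banana in Qu1, since each appears only twice there.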

Source

2016-05-15 15:57:25 ayhan

Quite a bit more elegant and faster than my approach (with '.replace()')! – Stefan

@StefanJansen Thank you. :) In my experience, '.replace()' is generally slower than '.map()', so I tend to use map when both are possible. I still think the apply-map-value_counts combination may be repetitive, but I couldn't find a better alternative. – ayhan

Thanks! Elegant solution. How does '.where() >= 2' work? – Toren
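
On the question of how the >= 2 part works: the comparison is applied to the DataFrame of counts, not to .where() itself. It produces a boolean DataFrame of the same shape, and where() then keeps each entry whose mask value is True and substitutes "other" where it is False. A minimal sketch on a single small column (the names small, counts and mask are made up for this illustration):

```python
import pandas as pd

small = pd.DataFrame({'Qu1': ['apple', 'cheese', 'cheese']})

# Step 1: replace every value by how often it occurs in its column.
counts = small.apply(lambda col: col.map(col.value_counts()))

# Step 2: the comparison turns the counts into a boolean mask.
mask = counts >= 2
print(mask)

# Step 3: where() keeps entries with a True mask and fills the rest with 'other'.
result = small.where(mask, 'other')
print(result)
```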

You could do the following (df here is the question's DataFrame, called data above):

value_counts = df.apply(lambda x: x.value_counts()) 

         Qu1  Qu2  Qu3
apple    1.0  3.0  1.0
banana   2.0  4.0  NaN
cheese   3.0  NaN  3.0
egg      1.0  NaN  1.0
potato   2.0  NaN  3.0
sausage  NaN  2.0  1.0

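and build a dictionary containing the replacements for each column:

from itertools import cycle

replacements = {}
for col, s in value_counts.items():
    # s holds the counts for one column; entries below 2 are the rare values
    if s[s < 2].any():
        replacements[col] = dict(zip(s[s < 2].index.tolist(), cycle(['other'])))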

replacements 
{'Qu1': {'egg': 'other', 'apple': 'other'}, 'Qu3': {'egg': 'other', 'apple': 'other', 'sausage': 'other'}}

Use the dictionary to replace the values:

df.replace(replacements) 

      Qu1      Qu2     Qu3
0   other  sausage   other
1  potato   banana  potato
2  cheese    apple   other
3  banana    apple  cheese
4  cheese    apple  cheese
5  banana  sausage  potato
6  cheese   banana  cheese
7  potato   banana  potato
8   other   banana   other

or wrap the loop in a dictionary comprehension:

from itertools import cycle 

df.replace({col: dict(zip(s[s < 2].index.tolist(), cycle(['other']))) for col, s in value_counts.items() if s[s < 2].any()})
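
Since every rare value maps to the same label, the inner dictionary can also be built with dict.fromkeys, which avoids zip and cycle. This is a variant sketch rather than the approach shown above; it assumes df holds the question's data, and the name rare is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg'],
})

replacements = {}
for col in df.columns:
    counts = df[col].value_counts()
    # Values that occur fewer than twice in this column.
    rare = counts[counts < 2].index
    if len(rare) > 0:
        # Every rare value is mapped to the same label.
        replacements[col] = dict.fromkeys(rare, 'other')

result = df.replace(replacements)
print(result)
```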

However, this is not only more cumbersome but also slower than using .where. Testing with 3000 columns:

df = pd.concat([df for i in range(1000)], axis=1) 

<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 9 entries, 0 to 8 
Columns: 3000 entries, Qu1 to Qu3 
dtypes: object(3000)

Using .replace():

%%timeit 
value_counts = df.apply(lambda x: x.value_counts()) 
df.replace({col: dict(zip(s[s < 2].index.tolist(), cycle(['other']))) for col, s in value_counts.items() if s[s < 2].any()}) 

1 loop, best of 3: 4.97 s per loop

versus .where():

%%timeit 
df.where(df.apply(lambda x: x.map(x.value_counts()))>=2, "other") 

1 loop, best of 3: 2.01 s per loop
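
%%timeit is an IPython cell magic, so the comparison above cannot be pasted into a plain script. A rough way to repeat it with the standard timeit module is sketched below. The names wide, with_where and with_replace are hypothetical, the frame is deliberately smaller (300 columns) so it runs quickly, and the column names are suffixed to keep them unique, which differs from the benchmark above. Timings depend on the machine and the pandas version, so none are shown.

```python
import timeit
from itertools import cycle

import pandas as pd

data = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg'],
})

# 100 copies side by side; the suffix keeps column names unique (an assumption of this sketch).
wide = pd.concat([data.add_suffix(f'_{i}') for i in range(100)], axis=1)


def with_where(frame):
    # Map each value to its per-column count, then mask with where().
    counts = frame.apply(lambda col: col.map(col.value_counts()))
    return frame.where(counts >= 2, 'other')


def with_replace(frame):
    # Build a per-column replacement dictionary, then call replace().
    value_counts = frame.apply(lambda col: col.value_counts())
    mapping = {col: dict(zip(s[s < 2].index.tolist(), cycle(['other'])))
               for col, s in value_counts.items() if s[s < 2].any()}
    return frame.replace(mapping)


for func in (with_where, with_replace):
    seconds = timeit.timeit(lambda: func(wide), number=3)
    print(f'{func.__name__}: {seconds / 3:.3f} s per call')
```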

Source

2016-05-15 15:01:18 Stefan
