Efficient count distinct across columns of DataFrame, grouped by rows

What is the fastest way (within the limits of sane Pythonicity) to count distinct values, across columns of the same dtype, for each row in a DataFrame?

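Details: I have a DataFrame of categorical outcomes by subject (in rows) by day (in columns), for example a dataset that records which drink each customer ordered on each visit to a store. It is similar to what the following function generates: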

import numpy as np 
import pandas as pd 

def genSampleData(custCount, dayCount, discreteChoices): 
    """generate example dataset""" 
    np.random.seed(123)  
    return pd.concat([ 
       pd.DataFrame({'custId':np.array(range(1,int(custCount)+1))}), 
       pd.DataFrame(
       columns = np.array(['day%d' % x for x in range(1,int(dayCount)+1)]), 
       data = np.random.choice(a=np.array(discreteChoices), 
             size=(int(custCount), int(dayCount)))  
       )], axis=1)

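For example, I would like to know the count of distinct drinks per customer. The datasets in this use case will have many more subjects than days (see testDf below), so I am trying to find the most efficient row-wise operation.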

What I've tried

# notional discrete choice outcome   
drinkOptions, drinkIndex = np.unique(['coffee','tea','juice','soda','water'], 
            return_inverse=True) 

# integer-coded discrete choice outcomes 
d = genSampleData(2,3, drinkIndex) 
d 
# custId day1 day2 day3 
#0  1  1  4  1 
#1  2  3  2  1 

# Count distinct choices per subject -- this is what I want to do efficiently on larger DF 
d.iloc[:,1:].apply(lambda x: len(np.unique(x)), axis=1) 
#0 2 
#1 3 

# Note: I have coded the choices as `int` rather than `str` to speed up comparisons. 
# To reconstruct the choice names, we could do: 
# d.iloc[:,1:] = drinkOptions[d.iloc[:,1:]]

Improving on my original attempt

testDf = genSampleData(100000,3, drinkIndex)

#---- Original attempts ----
%timeit -n20 testDf.iloc[:,1:].apply(lambda x: x.nunique(), axis=1)
# I didn't wait for this to finish -- something more than 5 seconds per loop

%timeit -n20 testDf.iloc[:,1:].apply(lambda x: len(x.unique()), axis=1)
# Also too slow

%timeit -n20 testDf.iloc[:,1:].apply(lambda x: len(np.unique(x)), axis=1)
#20 loops, best of 3: 2.07 s per loop

Note that pandas.DataFrame.apply() accepts the argument raw, which the documentation describes as follows:

"If raw=True the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance."

This cut the runtime by more than half:

%timeit -n20 testDf.iloc[:,1:].apply(lambda x: len(np.unique(x)), axis=1, raw=True)
#20 loops, best of 3: 721 ms per loop *best so far*
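To see what raw=True changes, the following minimal sketch records the type of object that apply() hands to the function in each mode. The tiny frame `df`, the helper `count_distinct`, and the list `seen_types` are illustrative names made up for this example, not part of the question's code.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (made-up values, same shape idea as the question's data).
df = pd.DataFrame({"day1": [1, 3], "day2": [4, 2], "day3": [1, 1]})

seen_types = []  # hypothetical helper list: records what apply() passes in

def count_distinct(row):
    seen_types.append(type(row))
    return len(np.unique(row))

as_series = df.apply(count_distinct, axis=1)            # each row arrives as a pandas Series
as_arrays = df.apply(count_distinct, axis=1, raw=True)  # each row arrives as a NumPy ndarray

print(as_series.tolist(), as_arrays.tolist())
print(set(seen_types))
```

Both calls return the same counts; the difference is only that the raw=True path skips building a Series object for every row, which is plausibly where the saved time comes from.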

I was surprised that a pure numpy solution, which would seem to be equivalent to the above with raw=True, was actually a bit slower:

%timeit -n20 np.apply_along_axis(lambda x: len(np.unique(x)), axis=1, arr = testDf.iloc[:,1:].values)
#20 loops, best of 3: 1.04 s per loop

Finally, I also tried transposing the data in order to do a column-wise count distinct, which I thought might be more efficient (at least for DataFrame.apply()), but there didn't seem to be a meaningful difference.

%timeit -n20 testDf.iloc[:,1:].T.apply(lambda x: len(np.unique(x)), raw=True)
#20 loops, best of 3: 712 ms per loop *best so far*

%timeit -n20 np.apply_along_axis(lambda x: len(np.unique(x)), axis=0, arr = testDf.iloc[:,1:].values.T)
# 20 loops, best of 3: 1.13 s per loop

So far my best solution is a strange mix of df.apply and len(np.unique()), but what else should I try?

Source

2016-08-04 C8H10N4O2

Is the day count representative? It seems to affect the performance difference greatly. – ayhan

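@ayhan Interesting... For my particular use case the day count is representative, but if something else works better on wide datasets, that would be worth noting for other users. – C8H10N4O2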

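Actually it is the opposite. With a small number of columns, comparing each column against the others is much faster. I have posted the results as an answer. – ayhan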

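My understanding is that nunique is optimized for large Series. Here you have only 3 days, and comparing each column against the others seems to be faster. You lose that edge as you add more columns, of course: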

testDf = genSampleData(100000,3, drinkIndex) 
days = testDf.columns[1:] 

%timeit testDf.iloc[:, 1:].stack().groupby(level=0).nunique() 
10 loops, best of 3: 46.8 ms per loop 

%timeit pd.melt(testDf, id_vars ='custId').groupby('custId').value.nunique() 
10 loops, best of 3: 47.6 ms per loop 

%%timeit 
testDf['nunique'] = 1 
for col1, col2 in zip(days, days[1:]): 
    testDf['nunique'] += ~((testDf[[col2]].values == testDf.loc[:, 'day1':col1].values)).any(axis=1) 
100 loops, best of 3: 3.83 ms per loop

For different numbers of columns (in the same order: stack().groupby(), pd.melt().groupby() and the loop):

10 columns: 143ms, 161ms, 30.9ms 
50 columns: 749ms, 968ms, 635ms 
100 columns: 1.52s, 2.11s, 2.33s
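The column-against-column loop above can be sketched as a self-contained function that works on the underlying NumPy array and leaves the input frame untouched. The name `rowwise_nunique_pairwise` and the random sample frame are assumptions made for illustration; the sketch also assumes no missing values, since NaN never compares equal to itself and would be counted as a new value each time.

```python
import numpy as np
import pandas as pd

def rowwise_nunique_pairwise(frame):
    """Count distinct values per row by comparing each column with the columns to its left."""
    values = frame.to_numpy()
    counts = np.ones(len(frame), dtype=int)  # the first column always contributes one value
    for j in range(1, values.shape[1]):
        # True where column j repeats a value already present in columns 0..j-1
        repeated = (values[:, [j]] == values[:, :j]).any(axis=1)
        counts += ~repeated                   # add 1 only where column j is new
    return pd.Series(counts, index=frame.index)

# Check against the apply-based baseline from the question on random data.
rng = np.random.default_rng(0)
sample = pd.DataFrame(rng.integers(0, 5, size=(1000, 3)),
                      columns=["day1", "day2", "day3"])
baseline = sample.apply(lambda row: len(np.unique(row)), axis=1, raw=True)
print((rowwise_nunique_pairwise(sample) == baseline).all())
```

The Python loop runs once per column, and each pass compares against all earlier columns, so the work grows roughly quadratically with the number of columns. That is consistent with the timings above, where the loop wins at 3 and 10 columns but falls behind at 100.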

Source

2016-08-04 15:49:30 ayhan

Wow, 'for' loops for the win? – C8H10N4O2

Yes, because that loop runs only a few times. I have added timings for more columns too. – ayhan

Great answer! +1, I am still spinning my wheels trying to come up with a better solution. – piRSquared

pandas.melt with DataFrame.groupby and groupby.SeriesGroupBy.nunique seems to blow the other solutions away:

%timeit -n20 pd.melt(testDf, id_vars ='custId').groupby('custId').value.nunique() 
#20 loops, best of 3: 67.3 ms per loop

Source

2016-08-04 14:22:36 C8H10N4O2

Note that this groups by a column value rather than by row number. If you do not have a unique row ID variable, a fair benchmark should account for the (small) cost of creating one. – C8H10N4O2

In all the other solutions, the `apply` loop was happening in Python. Here `Groupby.nunique` uses some tricks (see [here](https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L2896)) to do everything with vectorized operations. – chrisb
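To make the contrast with a Python-level `apply` loop concrete, here is a minimal sketch of one fully vectorized way to count distinct values per row: sort each row, then count the positions where a value differs from its left neighbour. This technique is not proposed anywhere in this thread and is not claimed to be what pandas does internally; the function name `rowwise_nunique_sorted` is made up for illustration, and the sketch assumes at least one column and no missing values.

```python
import numpy as np
import pandas as pd

def rowwise_nunique_sorted(frame):
    """Count distinct values per row with whole-array operations only."""
    values = np.sort(frame.to_numpy(), axis=1)   # equal values become adjacent within each row
    changes = values[:, 1:] != values[:, :-1]    # True where a new value starts
    return pd.Series(1 + changes.sum(axis=1), index=frame.index)

demo = pd.DataFrame([[1, 4, 1], [3, 2, 1], [2, 2, 2]],
                    columns=["day1", "day2", "day3"])
print(rowwise_nunique_sorted(demo).tolist())  # [2, 3, 1]
```

No Python function is called per row here; the sort and the comparison each run once over the whole array.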

You don't need custId. I'd stack, then groupby:

testDf.iloc[:, 1:].stack().groupby(level=0).nunique()
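As a small check of how the reshaping works, the sketch below applies both the stack() and the melt() variants to the two-customer example from the question. The frame `df` is typed in by hand here rather than generated, and the names `by_row` and `by_cust` are illustrative.

```python
import pandas as pd

# The two-row example from the question, entered by hand.
df = pd.DataFrame({
    "custId": [1, 2],
    "day1": [1, 3],
    "day2": [4, 2],
    "day3": [1, 1],
})

# stack() moves the day columns into one long Series indexed by (row, column);
# grouping on index level 0 then counts distinct values per original row.
by_row = df.drop(columns="custId").stack().groupby(level=0).nunique()

# melt() builds a similar long layout but keeps custId as an explicit grouping key.
by_cust = pd.melt(df, id_vars="custId").groupby("custId")["value"].nunique()

print(by_row.tolist())   # [2, 3]
print(by_cust.tolist())  # [2, 3]
```

The stack() version is keyed by the row index (0, 1) and the melt() version by custId (1, 2), so the two results line up only when each customer occupies exactly one row.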

Source

2016-08-04 14:45:26 piRSquared
