2016-10-06 19 views
1

私はCSVから定義されたデータフレームを持っており、基本的な要約統計量を計算したいと思います。平均、分散、...すべてのモデルの列車部分。パンダファジィグループ概要統計

モデル番号を挿入してそれをグループ化するとうまくいくが、良い解決策ではないようだ。 (トレーニングの場合のみ)は、カウンターのためにgroup_by modelNameが機能しないため、モデルごとにサマリー統計を取得するにはどうすればよいですか。

df.groupby(['modelName', 'typeOfRun'])['kappa'].mean() 

又は

df[df.typeOfRun != 'validation'].describe() 

所望の結果をもたらしません。 pct

AUC_R,Accuracy,Error rate,False negative rate,False positive rate,Lift value,Precision J,Precision N,Rate of negative predictions,Rate of positive predictions,Sensitivity (true positives rate),Specificity (true negatives rate),f1_R,kappa,modelName,typeOfRun 
0.7747622323007851,0.7182416731216111,0.28175832687838887,0.16519823788546256,0.28527729751296715,2.769918376242967,0.08117369886485329,0.9930703132218424,0.029305447973147433,0.3013813581203202,0.8348017621145375,0.7147227024870328,0.8312130234716368,0.09987857210248623,00_testing_1-training,training 
0.7688154033277225,0.7295055512522592,0.27049444874774076,0.1894273127753304,0.27294188056922464,2.807689674786938,0.08228060368921185,0.9921956531603068,0.029305447973147433,0.28869739220242707,0.8105726872246696,0.7270581194307754,0.8391825769931881,0.10159217699431862,00_testing_2-training,training 
0.7653761718477654,0.7217918925897238,0.2782081074102763,0.1883259911894273,0.2809216651150419,2.737743031677203,0.08023078597866318,0.9921552436003304,0.029305447973147433,0.29647560030983733,0.8116740088105727,0.7190783348849581,0.8338281219878937,0.09791120175612114,00_testing_3-training,training 
0.7666987721022418,0.7202566535628756,0.2797433464371244,0.18396711202466598,0.2826353437708505,2.7358921138891255,0.08018987022168358,0.9923159476282464,0.02931031885891585,0.2982693958700465,0.816032887975334,0.7173646562291496,0.8327314318650539,0.097878484924986,00_testing-validation,validation 
0.7776426005660843,0.7300542215336948,0.2699457784663052,0.17180616740088106,0.2729086314669504,2.8639238514789174,0.08392857142857142,0.9929168180167091,0.029305447973147433,0.28918151303898787,0.8281938325991189,0.7270913685330496,0.8394625719769673,0.10476961017159536,01_otherSet_1-training,training 
0.7691501646636157,0.737412858249419,0.26258714175058095,0.197136563876652,0.2645631067961165,2.8639098209585327,0.08392816025788626,0.9919723742039644,0.029305447973147433,0.2803382390911438,0.802863436123348,0.7354368932038835,0.8446557452170924,0.1044486077353842,01_otherSet_2-training,training 
0.770174515310113,0.7342176607281178,0.2657823392718823,0.19162995594713655,0.26802101343263735,2.847815513920855,0.08345650938032974,0.9921582766235522,0.029305447973147433,0.283856183836819,0.8083700440528634,0.7319789865673627,0.8424375777288816,0.10367514449353035,01_otherSet_3-training,training 
0.7676347850606817,0.7317488289428102,0.26825117105718976,0.19424460431654678,0.2704858255620898,2.8156062097690264,0.08252631578947368,0.9920241385858671,0.02931031885891585,0.2861747473378218,0.8057553956834532,0.7295141744379102,0.8407546494992847,0.10196584743637081,01_otherSet-validation,validation 

答えて

1

IIUCあなたはDataFrameGroupBy.describeを使用することができます。

print (df.groupby(['modelName', 'typeOfRun']).describe()) 

              f1_R  kappa 
modelName    typeOfRun        
00_testing-validation validation count 1.000000 1.000000 
            mean 0.832731 0.097878 
            std   NaN  NaN 
            min 0.832731 0.097878 
            25% 0.832731 0.097878 
            50% 0.832731 0.097878 
            75% 0.832731 0.097878 
            max 0.832731 0.097878 
00_testing_1-training training count 1.000000 1.000000 
            mean 0.831213 0.099879 
            std   NaN  NaN 
            min 0.831213 0.099879 
            25% 0.831213 0.099879 
            50% 0.831213 0.099879 
            75% 0.831213 0.099879 
            max 0.831213 0.099879 
00_testing_2-training training count 1.000000 1.000000 
            mean 0.839183 0.101592 
            std   NaN  NaN 
... 
...         

あなたはsplitによってSeriesによってgroupbyを作成し、str[0]によって、リストの最初の項目を選択することができます

print (df.modelName.str.split('_').str[0]) 
0 00 
1 00 
2 00 
3 00 
4 01 
5 01 
6 01 
7 01 
Name: modelName, dtype: object 

print (df.groupby([df.modelName.str.split('_').str[0]]).describe()) 
        AUC_R Accuracy Error;rate False;negative;rate \ 
modelName                
00  count 4.000000 4.000000 4.000000    4.000000 
      mean 0.768913 0.722449 0.277551    0.181730 
      std 0.004149 0.004924 0.004924    0.011270 
      min 0.765376 0.718242 0.270494    0.165198 
      25% 0.766368 0.719753 0.276280    0.179275 
      50% 0.767757 0.721024 0.278976    0.186147 
      75% 0.770302 0.723720 0.280247    0.188601 
      max 0.774762 0.729506 0.281758    0.189427 
01  count 4.000000 4.000000 4.000000    4.000000 
      mean 0.771151 0.733358 0.266642    0.188704 
      std 0.004452 0.003198 0.003198    0.011488 
      min 0.767635 0.730054 0.262587    0.171806 
      25% 0.768771 0.731325 0.264984    0.186674 
      50% 0.769662 0.732983 0.267017    0.192937 
      75% 0.772042 0.735016 0.268675    0.194968 
      max 0.777643 0.737413 0.269946    0.197137 
      ... 
      ... 
+0

をほぼあり。しかし、私は厳密にはmodelNameでグループ化することは望ましくない。むしろ、私はモデル名の最初の部分(定数です)だけでグループ化を実行したいと思います。 –

+0

うん、 'modelby'だけで' groupby'が必要です。 'print(df.groupby(['modelName'])。describe())'?それとももっと説明できますか? – jezrael

+0

"train"データだけがdfにあるようにデータをフィルタリングします。次に、モデル名ごとにグループを実行します。しかし、名前が「00_testing_1-training」、「00_testing_2-training」のようにカウントされているので、別のカウンターを無視したいです。 –