How to keep the group of duplicates with the smaller mean value in Python 3.x?

Hi, I'm new to Python, and a friend recommended asking for help on Stack Overflow, so I decided to give it a shot. I'm currently using Python 3.x.

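I have a dataset of more than 100k rows in a csv file with no column headers. I have loaded this data into a pandas DataFrame. Because the document is confidential I can't show the real data here, but the following is an example of the data and columns: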

("id", "name", "number", "time", "text_id", "text", "text") 

1 | apple | 12 | 123 | 2 | abc | abc 

1 | apple | 12 | 222 | 2 | abc | abc 

2 | orange | 32 | 123 | 2 | abc | abc 

2 | orange | 11 | 123 | 2 | abc | abc 

3 | apple | 12 | 333 | 2 | abc | abc 

3 | apple | 12 | 443 | 2 | abc | abc 

3 | apple | 12 | 553 | 2 | abc | abc

As you can see from the name column, there are two duplicate clusters of "apple", but they have different IDs.

My question is: how do I drop the entire cluster (all of its rows) that has the higher mean value, based on "time"?

Example: if (cluster with ID 1).mean(time) < (cluster with ID 3).mean(time), then delete all rows in the cluster with ID 3.

Desired output:

1 | apple | 12 | 123 | 2 | abc | abc

1 | apple | 12 | 222 | 2 | abc | abc

2 | orange | 32 | 123 | 2 | abc | abc

2 | orange | 11 | 123 | 2 | abc | abc
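The per-cluster means behind this example can be checked with a short sketch. It rebuilds the sample rows from the question in memory; renaming the last column to "text1" is an assumption made here only so that the column names are unique.

```python
import pandas as pd

# Sample rows copied from the question. The last column is renamed "text1"
# (an assumption) so that every column name is unique.
rows = [
    (1, "apple", 12, 123, 2, "abc", "abc"),
    (1, "apple", 12, 222, 2, "abc", "abc"),
    (2, "orange", 32, 123, 2, "abc", "abc"),
    (2, "orange", 11, 123, 2, "abc", "abc"),
    (3, "apple", 12, 333, 2, "abc", "abc"),
    (3, "apple", 12, 443, 2, "abc", "abc"),
    (3, "apple", 12, 553, 2, "abc", "abc"),
]
columns = ["id", "name", "number", "time", "text_id", "text", "text1"]
df = pd.DataFrame(rows, columns=columns)

# Mean of "time" for each (name, id) cluster.
cluster_means = df.groupby(["name", "id"])["time"].mean()
print(cluster_means)
```

For the sample rows, the "apple" cluster with ID 1 has a mean time of 172.5 and the one with ID 3 has a mean of 443.0, so the cluster with ID 3 is the one to drop.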

I need all the help I can get and I'm running out of time. Thanks in advance!

Source

2017-11-04 Anonymous

Try the following:

import pandas as pd

# The csv file has no header row, so assign the column names manually.
df = pd.read_csv('filename.csv', header=None)
df.columns = ['id', 'name', 'number', 'time', 'text_id', 'text', 'text']

print(df)

for eachname in df['name'].unique():
    eachname_df = df.loc[df['name'] == eachname]
    grouped_df = eachname_df.groupby(['id', 'name'])
    avg_name = grouped_df['time'].mean()

    # Drop every cluster whose mean time is not the smallest one for this name.
    for _, b in grouped_df:
        if b['time'].mean() != avg_name.min():
            # Index.get_values() was removed from pandas; drop by index directly.
            df = df.drop(b.index)

print(df)


Result: 
    id name number time text_id text text 
0 1 apple  12 123  2 abc abc 
1 1 apple  12 222  2 abc abc 
2 2 orange  32 123  2 abc abc 
3 2 orange  11 123  2 abc abc 
4 3 apple  12 333  2 abc abc 
5 3 apple  12 443  2 abc abc 
6 3 apple  12 553  2 abc abc 

    id name number time text_id text text 
0 1 apple  12 123  2 abc abc 
1 1 apple  12 222  2 abc abc 
2 2 orange  32 123  2 abc abc 
3 2 orange  11 123  2 abc abc
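For 100k+ rows the nested loops above may be slow. A loop-free variant of the same idea, sketched here with `groupby(...).transform`, is one possible alternative and not part of the original answer. The small in-memory frame is an assumption standing in for the csv file, trimmed to the three columns that matter.

```python
import pandas as pd

# Stand-in for the csv data (assumption): only the relevant columns.
df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 3, 3, 3],
        "name": ["apple", "apple", "orange", "orange", "apple", "apple", "apple"],
        "time": [123, 222, 123, 123, 333, 443, 553],
    }
)

# Mean time of the (name, id) cluster that each row belongs to.
cluster_mean = df.groupby(["name", "id"])["time"].transform("mean")

# Smallest cluster mean among all clusters that share the same name.
best_mean = cluster_mean.groupby(df["name"]).transform("min")

# Keep only the rows whose cluster has the smallest mean for its name.
result = df[cluster_mean == best_mean]
print(result)
```

Note that if two clusters with the same name tie for the smallest mean, this sketch keeps both of them.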

Source

2017-11-04 16:24:38

You can use groupby and apply to first get the rows to remove. After that, you can use take to get the final result.

import pandas as pd

## select the rows whose time is higher than the group mean
def my_func(df):
    return df[df['time'] > df['time'].mean()]

## get the rows to be removed
df1 = df.groupby(by='name', group_keys=False).apply(my_func)

## take only the rows we want (take is positional, so this assumes the default 0..n-1 index)
index_to_keep = set(range(df.shape[0])) - set(df1.index)
df2 = df.take(sorted(index_to_keep))

Example (I took the use of take from this answer):

## df 
id name number time text_id text text1 
0 1 apple  12 123  2 abc abc 
1 1 apple  12 222  2 abc abc 
2 2 orange  32 123  2 abc abc 
3 2 orange  11 123  2 abc abc 
4 3 apple  12 333  2 abc abc 
5 3 apple  12 444  2 abc abc 
6 3 apple  12 553  2 abc abc 

df1 = df.groupby(by='name', group_keys=False).apply(my_func) 

## df1 
id name number time text_id text text1 
5 3 apple  12 444  2 abc abc 
6 3 apple  12 553  2 abc abc 

index_to_keep = set(range(df.shape[0])) - set(df1.index) 
df2 = df.take(list(index_to_keep)) 

#index_to_keep 
{0, 1, 2, 3, 4} 

# df2 
id name number time text_id text text1 
0 1 apple  12 123  2 abc abc 
1 1 apple  12 222  2 abc abc 
2 2 orange  32 123  2 abc abc 
3 2 orange  11 123  2 abc abc 
4 3 apple  12 333  2 abc abc
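The same row-level filter can also be written without apply and take by building a boolean mask. This is only a sketch of an equivalent formulation, not the answer's own code; the frame below is an assumption that copies the name and time columns from the example above.

```python
import pandas as pd

# Name and time columns copied from the example above (assumed stand-in data).
df = pd.DataFrame(
    {
        "name": ["apple", "apple", "orange", "orange", "apple", "apple", "apple"],
        "time": [123, 222, 123, 123, 333, 444, 553],
    }
)

# True where a row's time is above the mean time of its name group.
above_mean = df["time"] > df.groupby("name")["time"].transform("mean")

# Keep the rows that are not above their group mean.
df2 = df[~above_mean]
print(df2)
```

Because the mask is aligned on the index rather than on row positions, it does not depend on the DataFrame having the default 0..n-1 index.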

P.S. Adjust this to what you need.

Source

2017-11-04 09:04:25 SSC

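Hi @SCC, thank you for the reply, but what I'm looking for is that index_to_keep should be {0, 1, 2, 3}, because row 4 belongs to the cluster with ID 3. If (cluster with ID 1).mean(time) < (cluster with ID 3).mean(time), then all rows of the cluster with ID 3 should be deleted. –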

If you want the cluster with the higher mean value to be dropped, you can adjust the condition in 'my_func()', for example by changing '>' to '>=' in my example. – SSC
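To get the cluster-level behaviour asked for in the comment above (index_to_keep equal to {0, 1, 2, 3}), one option is to compare the means of whole (name, id) clusters instead of individual rows. The following is a minimal sketch under that reading of the requirement; the frame and the variable names `means`, `winners` and `row_keys` are illustrative assumptions.

```python
import pandas as pd

# id, name and time columns copied from the example above (assumed stand-in data).
df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 3, 3, 3],
        "name": ["apple", "apple", "orange", "orange", "apple", "apple", "apple"],
        "time": [123, 222, 123, 123, 333, 444, 553],
    }
)

# Mean time per (name, id) cluster, indexed by a (name, id) MultiIndex.
means = df.groupby(["name", "id"])["time"].mean()

# For each name, the (name, id) label of the cluster with the smallest mean.
winners = means.groupby(level="name").idxmin()

# Keep the rows whose (name, id) pair is one of the winning clusters.
row_keys = pd.MultiIndex.from_frame(df[["name", "id"]])
keep = row_keys.isin(winners.tolist())
df2 = df[keep]
print(df2)
```

Unlike a comparison against the minimum mean, `idxmin` picks exactly one cluster per name, so a tie between two clusters keeps only the first of them.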
