2017-05-01 11 views
1

How to groupby an ndarray?

I have a DataFrame (just an example):

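import numpy as np
import pandas as pd

# Build one row per i; "vector" holds an ndarray, "gp" is the parity of i
D = pd.DataFrame({i: {"name": str(i),
                      "vector": np.arange(i + i % 4, i + i % 4 + 10),
                      "sq": i ** 2,
                      "gp": i % 2} for i in range(10)}).T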

   gp name  sq  vector
0   0    0   0  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1   1    1   1  [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
2   0    2   4  [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
3   1    3   9  [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
4   0    4  16  [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
5   1    5  25  [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
6   0    6  36  [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
7   1    7  49  [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
8   0    8  64  [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
9   1    9  81  [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

and I want to group by the column vector and the column gp. How do I do this? Grouping on these columns fails with:

TypeError: unhashable type: 'numpy.ndarray'

Answers

5

from dfply import * 
D >>\ 
    groupby(X.vector, X.gp) >>\ 
    summarize(b=X.sq.sum()) 

This results in the TypeError above. I think you first need to convert the column vector to tuples in pandas:

print(D['sq'].groupby([D['vector'].apply(tuple), D['gp']]).sum().reset_index()) 
   vector                                    gp   sq
0  (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)             0    0
1  (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)           1    1
2  (4, 5, 6, 7, 8, 9, 10, 11, 12, 13)         0   20
3  (6, 7, 8, 9, 10, 11, 12, 13, 14, 15)       1   34
4  (8, 9, 10, 11, 12, 13, 14, 15, 16, 17)     0  100
5  (10, 11, 12, 13, 14, 15, 16, 17, 18, 19)   1  130
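The conversion matters because groupby keys have to be hashable, and an ndarray is not. A minimal sketch of that difference, independent of pandas (the names `vec`, `key` and `same_key` are illustrative only):

```python
import numpy as np

vec = np.arange(3)

# ndarray is mutable and has no usable __hash__, so hash() raises TypeError
try:
    hash(vec)
    ndarray_hashable = True
except TypeError as exc:
    ndarray_hashable = False
    message = str(exc)

# A tuple built from the same values is hashable and compares by value,
# so two equal vectors end up as the same grouping key
key = tuple(vec)
same_key = tuple(np.arange(3))
keys_match = (key == same_key) and (hash(key) == hash(same_key))
```

Two rows whose vectors hold equal values therefore land in the same group once the column holds tuples.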

Another solution is to convert the column first:

D['vector'] = D['vector'].apply(tuple) 
print(D.groupby(['vector','gp'])['sq'].sum().reset_index()) 
   vector                                    gp   sq
0  (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)             0    0
1  (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)           1    1
2  (4, 5, 6, 7, 8, 9, 10, 11, 12, 13)         0   20
3  (6, 7, 8, 9, 10, 11, 12, 13, 14, 15)       1   34
4  (8, 9, 10, 11, 12, 13, 14, 15, 16, 17)     0  100
5  (10, 11, 12, 13, 14, 15, 16, 17, 18, 19)   1  130

And if necessary, convert back to arrays at the end:

D['vector'] = D['vector'].apply(tuple) 
df = D.groupby(['vector','gp'])['sq'].sum().reset_index() 
df['vector'] = df['vector'].apply(np.array) 
print (df) 
   vector                                    gp   sq
0  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]             0    0
1  [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]           1    1
2  [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]         0   20
3  [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]       1   34
4  [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]     0  100
5  [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]   1  130

print (type(df['vector'].iat[0])) 
<class 'numpy.ndarray'> 
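For reference, the whole round trip can be sketched on a small stand-alone frame. The three-row `df` below is made up for illustration and is not the D from the question; `assign` is used here instead of overwriting the column, so the original frame keeps its arrays:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two rows share the same vector and gp
df = pd.DataFrame({
    "vector": [np.arange(3), np.arange(3), np.arange(1, 4)],
    "gp": [0, 0, 1],
    "sq": [1, 4, 9],
})

# 1. Work on a copy whose vector column holds hashable tuples
keyed = df.assign(vector=df["vector"].apply(tuple))

# 2. Group and aggregate as usual
out = keyed.groupby(["vector", "gp"])["sq"].sum().reset_index()

# 3. Turn the tuple keys back into ndarrays
out["vector"] = out["vector"].apply(np.array)
```

After this, `out` has one row per distinct (vector, gp) pair and `df["vector"]` is left untouched.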

I tried using your code and it works for me:

from dfply import * 

D['vector'] = D['vector'].apply(tuple) 
a = D >> groupby(X.vector, X.gp) >> summarize(b=X.sq.sum()) 
a['vector'] = a['vector'].apply(np.array) 
print (a) 
   gp  vector                                      b
0   0  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]              0
1   1  [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]            1
2   0  [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]         20
3   1  [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]       34
4   0  [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]    100
5   1  [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]  130
+1 for getting it there. – piRSquared

0

D.groupby([D.vector.apply(str), D.gp]).sq.sum().reset_index() 
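One caveat with keying on `str`, sketched below under the assumption that NumPy's print options are at their defaults: arrays with more than 1000 elements are printed in summarised form with `...`, so two different long vectors can produce the same string and be merged into one group. The 10-element vectors in the question are not affected. The arrays `a` and `b` are made up for illustration:

```python
import numpy as np

a = np.zeros(2000, dtype=int)
b = a.copy()
b[1000] = 1  # differs from a only in the middle

# With default print options both arrays print as "[0 0 0 ... 0 0 0]"
same_str = str(a) == str(b)

# Keys built from the full contents still tell them apart
same_tuple = tuple(a) == tuple(b)
same_bytes = a.tobytes() == b.tobytes()
```

`tobytes()` is a cheaper hashable key for long vectors, though it drops shape and dtype, so it is only safe when every array in the column has the same shape and dtype.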
4

A slightly odd approach: lists (like ndarrays) aren't hashable, but tuples are. We want to group by a tuple version of the vector column. I'd use a list comprehension.

D.groupby([[tuple(x) for x in D.vector], 'gp']).sq.sum() 

                                          gp
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)             0    0
(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)           1    1
(4, 5, 6, 7, 8, 9, 10, 11, 12, 13)         0   20
(6, 7, 8, 9, 10, 11, 12, 13, 14, 15)       1   34
(8, 9, 10, 11, 12, 13, 14, 15, 16, 17)     0  100
(10, 11, 12, 13, 14, 15, 16, 17, 18, 19)   1  130
Name: sq, dtype: int64

To get back to the original form, here is one of many ways that works for me:

d1 = D.groupby([[tuple(x) for x in D.vector], 'gp']).sq.sum() 
d1.reset_index('gp').rename(index=list).rename_axis('vector').reset_index() 

   vector                                    gp   sq
0  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]             0    0
1  [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]           1    1
2  [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]         0   20
3  [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]       1   34
4  [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]     0  100
5  [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]   1  130
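A possible variation on the same idea, offered as a sketch rather than a replacement: wrapping the tuples in a named Series gives the index level its name up front, so the `rename_axis` step is not needed. The toy `df` and the name `key` are assumptions for illustration, not the D from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two rows share the same vector and gp
df = pd.DataFrame({
    "vector": [np.arange(3), np.arange(3), np.arange(1, 4)],
    "gp": [0, 0, 1],
    "sq": [1, 4, 9],
})

# Named Series of tuple keys, aligned with df by index
key = pd.Series([tuple(v) for v in df["vector"]], index=df.index, name="vector")

# Group by the external key plus the existing gp column
summed = df.groupby([key, "gp"])["sq"].sum()

# Flatten the index and turn the tuple keys back into lists
result = summed.reset_index()
result["vector"] = result["vector"].apply(list)
```

The frame itself is never modified; only the grouping key is converted.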