2017-09-12 4 views
0

Multi-index pandas dataframe

I was wondering how to get a MultiIndex on a dataframe, based on the lists in another column that group its elements.

This is probably better shown with an example, so here is a script that shows what I have and what I would like:

def ungroup_column(df, column, split_column = None): 
    ''' 
    # Summary 
     Takes a dataframe column that contains lists and spreads the items in the list over many rows 
     Similar to pandas.melt(), but acts on lists within the column 

    # Example 

     input dataframe: 

       farm_id animals 
      0 1  [pig, sheep, dog] 
      1 2  [duck] 
      2 3  [pig, horse] 
      3 4  [sheep, horse] 


     output dataframe: 

       farm_id animals 
      0 1  pig 
      0 1  sheep 
      0 1  dog 
      1 2  duck 
      2 3  pig 
      2 3  horse 
      3 4  sheep 
      3 4  horse 

    # Arguments 

     df: (pandas.DataFrame) 
      dataframe to act upon 

     column: (String) 
      name of the column which contains lists to separate 

     split_column: (String) 
      column to be added to the dataframe containing the split items that were in the list 
      If this is not given, the values will be written over the original column 
    ''' 
    if split_column is None: 
     split_column = column 

    # split column into multiple columns (one col for each item in list) for every row 
    # then transpose it to make the lists go down the rows 
    list_split_matrix = df[column].apply(pd.Series).T 

    # Now the columns of `list_split_matrix` (they're just integers) 
    # are the indices of the rows in `df` - i.e. `df_row_idx` 
    # so this melt concats each column on top of each other 
    melted_df = pd.melt(list_split_matrix, var_name = 'df_row_idx', value_name = split_column).dropna().set_index('df_row_idx') 

    if split_column == column:
        # overwriting: drop the original list column before joining the split values
        df = df.drop(column, axis=1)
    return df.join(melted_df)

from IPython.display import display 
train_df.index 
from utils import * 
play_df = train_df 
sent_idx = play_df.groupby('pmid')['sentence'].apply(lambda row: range(0, len(list(row)))) #set_index(['pmid', range(0, len())]) 
play_df.set_index('pmid') 

import pandas as pd 
# `nlp` is a spaCy pipeline; loading it is not shown in this script
doc_texts = ['Here is a sentence. And Another. Yet another sentence.', 
             'Different Document here. With some other sentences.'] 
playing_df = pd.DataFrame({'doc': [nlp(doc) for doc in doc_texts], 
                           'sentences': [[s for s in nlp(doc).sents] for doc in doc_texts]}) 
display(playing_df) 
display(ungroup_column(playing_df, 'sentences')) 

The output of this is as follows:

doc sentences 
0 (Here, is, a, sentence, ., And, Another, ., Ye... [(Here, is, a, sentence, .), (And, Another, .)... 
1 (Different, Document, here, ., With, some, oth... [(Different, Document, here, .), (With, some, ... 
doc sentences 
0 (Here, is, a, sentence, ., And, Another, ., Ye... (Here, is, a, sentence, .) 
0 (Here, is, a, sentence, ., And, Another, ., Ye... (And, Another, .) 
0 (Here, is, a, sentence, ., And, Another, ., Ye... (Yet, another, sentence, .) 
1 (Different, Document, here, ., With, some, oth... (Different, Document, here, .) 
1 (Different, Document, here, ., With, some, oth... (With, some, other, sentences, .) 

However, what I would really like is an index for the 'sentences' column, like this:

doc_idx sent_idx  document           sentence 
0   0   (Here, is, a, sentence, ., And, Another, ., Ye... (Here, is, a, sentence, .) 
      1   (Here, is, a, sentence, ., And, Another, ., Ye... (And, Another, .) 
      2   (Here, is, a, sentence, ., And, Another, ., Ye... (Yet, another, sentence, .) 
1   0   (Different, Document, here, ., With, some, oth... (Different, Document, here, .) 
      1   (Different, Document, here, ., With, some, oth... (With, some, other, sentences, .) 
+1

Can you check [this very nice MaxU solution](https://stackoverflow.com/a/40449726/2901002)? – jezrael

+0

What's nlp(doc).sents? The nltk sentence tokenizer? – Dark

+0

@Bharath, yes, it is the sentence tokenizer from spaCy – chase

Answers

1

Based on your second output, you can reset the index, set a new index based on the cumcount of the current index, and then rename the axis, i.e.

new_df = ungroup_column(playing_df, 'sentences').reset_index() 
new_df['sent_idx'] = new_df.groupby('index').cumcount() 
new_df.set_index(['index','sent_idx']).rename_axis(['doc_idx','sent_idx']) 

Output:

 
                   doc  sents 
doc_idx sent_idx              
0  0   [Here, is, a, sentence, ., And, Another, ., Ye...  Here is a sentence. 
     1   [Here, is, a, sentence, ., And, Another, ., Ye...  And Another. 
     2   [Here, is, a, sentence, ., And, Another, ., Ye...  Yet another sentence. 
1  0   [Different, Document, here, ., With, some, oth...  Different Document here. 
     1   [Different, Document, here, ., With, some, oth...  With some other sentences. 
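
The reset_index / cumcount / set_index steps above can be tried on their own with a small frame that does not need spaCy or nltk. This is only a sketch: the toy data and the names `flat` and `out` are made up for illustration, and the repeated index labels stand in for what `ungroup_column` returns.

```python
import pandas as pd

# Hypothetical toy data: one row per sentence, with the original
# document row label (0 or 1) repeated in the index.
flat = pd.DataFrame(
    {
        "doc": ["doc A", "doc A", "doc A", "doc B", "doc B"],
        "sentence": [
            "Here is a sentence.",
            "And Another.",
            "Yet another sentence.",
            "Different Document here.",
            "With some other sentences.",
        ],
    },
    index=[0, 0, 0, 1, 1],
)

# The unnamed index becomes a regular column called "index".
out = flat.reset_index()

# cumcount numbers the rows inside each group, starting at 0.
out["sent_idx"] = out.groupby("index").cumcount()

# Use both columns as a two-level index and give the levels readable names.
out = out.set_index(["index", "sent_idx"]).rename_axis(["doc_idx", "sent_idx"])
print(out)
```

Note that `set_index` and `rename_axis` return new objects, so the result has to be assigned (as done here) if you want to keep it.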

Instead of applying pd.Series, you can use np.concatenate to expand the column. (I used nltk to tokenize the words and sentences.)

import nltk 
import numpy as np 
import pandas as pd 
doc_texts = ['Here is a sentence. And Another. Yet another sentence.', 
     'Different Document here. With some other sentences.'] 
playing_df = pd.DataFrame({'doc':[nltk.word_tokenize(doc) for doc in doc_texts], 
         'sents':[nltk.sent_tokenize(doc) for doc in doc_texts]}) 

s = playing_df['sents'] 
i = np.arange(len(playing_df)).repeat(s.str.len()) 

new_df = playing_df.iloc[i, :-1].assign(**{'sents': np.concatenate(s.values)}).reset_index() 

new_df['sent_idx'] = new_df.groupby('index').cumcount() 
new_df.set_index(['index','sent_idx']).rename_axis(['doc_idx','sent_idx']) 

Hope it helps.
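
On pandas 0.25 or newer, `DataFrame.explode` is another way to spread a list column over rows; it was not part of the approach above and is shown here only as a possible alternative. The sketch below assumes the sentences are already split into plain Python lists (hard-coded here in place of a tokenizer), and the names `df`, `exploded` and `result` are illustrative.

```python
import pandas as pd

# Sentences are hard-coded here; in practice they would come from a tokenizer.
df = pd.DataFrame(
    {
        "doc": [
            "Here is a sentence. And Another. Yet another sentence.",
            "Different Document here. With some other sentences.",
        ],
        "sents": [
            ["Here is a sentence.", "And Another.", "Yet another sentence."],
            ["Different Document here.", "With some other sentences."],
        ],
    }
)

# explode gives one row per list item and repeats the original index label.
exploded = df.explode("sents")

# Number the sentences within each original row (index level 0).
exploded["sent_idx"] = exploded.groupby(level=0).cumcount().to_numpy()

# Append sent_idx to the existing index and name both levels.
result = exploded.set_index("sent_idx", append=True).rename_axis(
    ["doc_idx", "sent_idx"]
)
print(result)
```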

+0

Thank you! This works. After looking at the [pandas multiindexing documentation](https://pandas.pydata.org/pandas-docs/stable/advanced.html), I was also wondering whether you think there is a more appropriate way to handle the multiindex, since I noticed that the 'document' level is not repeated the way it is after the `ungroup_column` function I applied here. – chase

+0

@chase Glad to help. – Dark
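
On the outer level not being repeated: as far as I know this is only how pandas prints a MultiIndex by default (repeated outer labels are left blank), and the values are still stored for every row. A minimal sketch, assuming a hand-built frame with the same two index levels (the data and variable names are made up):

```python
import pandas as pd

# Hypothetical frame with the same index layout as the answer's output.
idx = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)], names=["doc_idx", "sent_idx"]
)
df = pd.DataFrame(
    {
        "sentence": [
            "Here is a sentence.",
            "And Another.",
            "Yet another sentence.",
            "Different Document here.",
            "With some other sentences.",
        ]
    },
    index=idx,
)

# Default display: each doc_idx label is printed once per group.
print(df)

# Turning off the display option prints the outer label on every row.
with pd.option_context("display.multi_sparse", False):
    print(df)

# The outer level is present for every row regardless of how it is printed.
doc_level = df.index.get_level_values("doc_idx").tolist()

# Selecting on the outer level returns all sentences of that document.
first_doc = df.loc[0]
```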
