2017-06-16 20 views
1

Modify a pandas DataFrame by row names

First, I don't think the title describes the question well; please edit it or suggest a better one.

I am reading a CSV file in the following format:

"sample","module","status","tot.seq","seq.length","pct.gc","pct.dup" 
"ERR435952_cleaned_1","Basic Statistics","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per base sequence quality","FAIL","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per tile sequence quality","FAIL","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per sequence quality scores","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per base sequence content","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per sequence GC content","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Per base N content","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Sequence Length Distribution","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Sequence Duplication Levels","WARN","15529112","62",47,41.66 
"ERR435952_cleaned_1","Overrepresented sequences","WARN","15529112","62",47,41.66 
"ERR435952_cleaned_1","Adapter Content","PASS","15529112","62",47,41.66 
"ERR435952_cleaned_1","Kmer Content","FAIL","15529112","62",47,41.66 
"ERR435952_cleaned_2","Basic Statistics","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per base sequence quality","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per tile sequence quality","WARN","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per sequence quality scores","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per base sequence content","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per sequence GC content","WARN","15529112","62",48,42.44 
"ERR435952_cleaned_2","Per base N content","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Sequence Length Distribution","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Sequence Duplication Levels","WARN","15529112","62",48,42.44 
"ERR435952_cleaned_2","Overrepresented sequences","WARN","15529112","62",48,42.44 
"ERR435952_cleaned_2","Adapter Content","PASS","15529112","62",48,42.44 
"ERR435952_cleaned_2","Kmer Content","FAIL","15529112","62",48,42.44 

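I want to convert it into a wide table, with one row per sample and one column per module (plus the total read count, tot.seq), so that I can build a simple heatmap from the PASS/FAIL/WARN values.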

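I can do this by counting rows (the values for each module appear at a fixed spacing), but that is not clean at all, and I am not sure it is efficient for large datasets. Is there a way to map the values by name rather than by position (i, i + N, ...)?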
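For anyone who wants to reproduce the problem, here is a minimal sketch of loading data in the format above. It assumes the text is available as a string; `csv_text` is an illustrative name and holds only a four-row excerpt of the file, and with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical excerpt of the CSV shown above (two modules, two samples).
csv_text = '''"sample","module","status","tot.seq","seq.length","pct.gc","pct.dup"
"ERR435952_cleaned_1","Basic Statistics","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Kmer Content","FAIL","15529112","62",47,41.66
"ERR435952_cleaned_2","Basic Statistics","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Kmer Content","FAIL","15529112","62",48,42.44
'''

# Long format: one row per (sample, module) pair.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
print(df["module"].unique())
```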

Answers

2

Use set_index + unstack, then add reset_index to turn the index levels back into columns, and rename_axis to remove the columns name (module):

# Move the 'module' values into columns, one row per (sample, tot.seq)
df = (df.set_index(['sample', 'tot.seq', 'module'])['status']
        .unstack()
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
                sample   tot.seq  Adapter Content  Basic Statistics  \
0  ERR435952_cleaned_1  15529112             PASS              PASS
1  ERR435952_cleaned_2  15529112             PASS              PASS

   Kmer Content  Overrepresented sequences  Per base N content  \
0          FAIL                       WARN                PASS
1          FAIL                       WARN                PASS

   Per base sequence content  Per base sequence quality  Per sequence GC content  \
0                       PASS                       FAIL                     PASS
1                       PASS                       PASS                     WARN

   Per sequence quality scores  Per tile sequence quality  \
0                         PASS                       FAIL
1                         PASS                       WARN

   Sequence Duplication Levels  Sequence Length Distribution
0                         WARN                          PASS
1                         WARN                          PASS
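Since the goal is a heatmap, the status strings in the wide table still have to become numbers. The following is only a sketch of one way to do that: the `status_codes` mapping is an assumed encoding (any ordered set of numbers would do), and `wide` is a hypothetical two-module excerpt of the table printed above.

```python
import pandas as pd

# Hypothetical excerpt of the wide table (only two of the module columns).
wide = pd.DataFrame({
    "sample": ["ERR435952_cleaned_1", "ERR435952_cleaned_2"],
    "tot.seq": [15529112, 15529112],
    "Basic Statistics": ["PASS", "PASS"],
    "Per base sequence quality": ["FAIL", "PASS"],
})

# Assumed encoding, chosen only for illustration.
status_codes = {"FAIL": 0, "WARN": 1, "PASS": 2}

# Keep only the module columns, indexed by sample.
modules = wide.set_index("sample").drop(columns="tot.seq")

# Translate every status string to its numeric code, column by column.
heat = modules.apply(lambda col: col.map(status_codes))
print(heat)
```

The resulting numeric frame can then be handed to a plotting function such as matplotlib's `imshow` or seaborn's `heatmap`, with tot.seq kept aside for annotation.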

However, if you get:

ValueError: Index contains duplicate entries, cannot reshape

then you have duplicates and need to aggregate the data:

print(df)
                sample                       module  status   tot.seq  \
0  ERR435952_cleaned_1             Basic Statistics    PASS  15529112
1  ERR435952_cleaned_1    Per base sequence quality    FAIL  15529112
2  ERR435952_cleaned_1    Per base sequence quality    FAIL  15529112
3  ERR435952_cleaned_1  Per sequence quality scores    PASS  15529112

   seq.length  pct.gc  pct.dup
0          62      47    41.66
1          62      47    41.66
2          62      47    41.66
3          62      47    41.66

# Join the statuses of duplicated (sample, tot.seq, module) rows into one string
df1 = (df.pivot_table(index=['sample', 'tot.seq'], columns='module',
                      values='status', aggfunc=', '.join)
         .reset_index()
         .rename_axis(None, axis=1))
print(df1)
                sample   tot.seq  Basic Statistics  Per base sequence quality  \
0  ERR435952_cleaned_1  15529112              PASS                 FAIL, FAIL

   Per sequence quality scores
0                         PASS

# Alternative, applied to the same long-format df: join per group, then unstack
df2 = (df.groupby(['sample', 'tot.seq', 'module'])['status']
         .apply(', '.join).unstack()
         .reset_index()
         .rename_axis(None, axis=1))
print(df2)

                sample   tot.seq  Basic Statistics  Per base sequence quality  \
0  ERR435952_cleaned_1  15529112              PASS                 FAIL, FAIL

   Per sequence quality scores
0                         PASS
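Before aggregating, it can help to see which rows actually trigger the "Index contains duplicate entries" error. This is a rough sketch, assuming the same key columns as above; the small frame rebuilds the duplicated sample, and `keys`, `dupes` and `deduped` are illustrative names.

```python
import pandas as pd

# Rebuild the small long-format sample with one repeated (sample, module) pair.
df = pd.DataFrame({
    "sample": ["ERR435952_cleaned_1"] * 4,
    "module": ["Basic Statistics", "Per base sequence quality",
               "Per base sequence quality", "Per sequence quality scores"],
    "status": ["PASS", "FAIL", "FAIL", "PASS"],
    "tot.seq": [15529112] * 4,
})

keys = ["sample", "tot.seq", "module"]

# keep=False flags every member of a duplicated group, not only the repeats.
dupes = df[df.duplicated(subset=keys, keep=False)]
print(dupes)

# Assumption: the repeated rows carry the same status, so keeping the first
# one loses nothing and unstack works again without joining strings.
deduped = df.drop_duplicates(subset=keys)
wide = deduped.set_index(keys)["status"].unstack()
print(wide)
```

If the repeated rows can disagree on status, dropping duplicates would hide that, and the joining approach shown earlier is the safer choice.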

Thanks! In my original question I forgot to include the read count (tot.seq), which is the same value repeated for every module of a sample. How can I add it only once? – Siddharth
