2017-03-28 6 views

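Moving columns to rows

I am trying to stack all the columns of a data frame underneath one another. This means the first column (named location below) is repeated once for each of the other columns.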

  location     160095-T_S2_L001_R1_001.bam 160096-N_S4_L001_R1_001.bam  160094-T_S12_L001_R1_001.bam 160095-N_S1_L001_R1_001.bam 
      1:1-100000 NA NA NA NA 
      1:100001-200000 2 2 4 1 
      1:200001-300000 1 NA NA NA 
      1:300001-400000 3 3 3 3 
      2:1-100000 NA NA NA NA 
      2:100001-200000 1 1 NA NA 

so that it looks like this:

  location sample_id number 
      1:1-100000     160095-T_S2_L001_R1_001.bam NA 
      1:100001-200000    160095-T_S2_L001_R1_001.bam 2 
      1:200001-300000    160095-T_S2_L001_R1_001.bam 1 
      1:300001-400000    160095-T_S2_L001_R1_001.bam 3 
      2:1-100000     160095-T_S2_L001_R1_001.bam NA 
      2:100001-200000    160095-T_S2_L001_R1_001.bam 1 
      1:1-100000 160096-N_S4_L001_R1_001.bam NA 
      1:100001-200000 160096-N_S4_L001_R1_001.bam 2 
      1:200001-300000 160096-N_S4_L001_R1_001.bam NA 
      1:300001-400000 160096-N_S4_L001_R1_001.bam 3 
      2:1-100000 160096-N_S4_L001_R1_001.bam NA 
      2:100001-200000 160096-N_S4_L001_R1_001.bam 1 
      1:1-100000 160094-T_S12_L001_R1_001.bam NA 
      1:100001-200000 160094-T_S12_L001_R1_001.bam 4 
      1:200001-300000 160094-T_S12_L001_R1_001.bam NA 
      1:300001-400000 160094-T_S12_L001_R1_001.bam 3 
      2:1-100000 160094-T_S12_L001_R1_001.bam NA 
      2:100001-200000 160094-T_S12_L001_R1_001.bam NA 
      1:1-100000 160095-N_S1_L001_R1_001.bam NA 
      1:100001-200000 160095-N_S1_L001_R1_001.bam 1 
      1:200001-300000 160095-N_S1_L001_R1_001.bam NA 
      1:300001-400000 160095-N_S1_L001_R1_001.bam 3 
      2:1-100000 160095-N_S1_L001_R1_001.bam NA 
      2:100001-200000 160095-N_S1_L001_R1_001.bam NA 

I tried transposing with t(dataframe), but that simply transposes the entire data frame rather than stacking the columns I want.

I would also like to split the location column, first at the colon and then at the dash, into three separate columns:

  chromosome start stop sample_id number 
      1 1 100000     160095-T_S2_L001_R1_001.bam NA 
      1 100001 200000     160095-T_S2_L001_R1_001.bam 2 
      1 200001 300000     160095-T_S2_L001_R1_001.bam 1 
      1 300001 400000     160095-T_S2_L001_R1_001.bam 3 
      2 1 100000     160095-T_S2_L001_R1_001.bam NA 
      2 100001 200000     160095-T_S2_L001_R1_001.bam 1 
      1 1 100000 160096-N_S4_L001_R1_001.bam NA 
      1 100001 200000 160096-N_S4_L001_R1_001.bam 2 
      1 200001 300000 160096-N_S4_L001_R1_001.bam NA 
      1 300001 400000 160096-N_S4_L001_R1_001.bam 3 
      2 1 100000 160096-N_S4_L001_R1_001.bam NA 
      2 100001 200000 160096-N_S4_L001_R1_001.bam 1 
      1 1 100000 160094-T_S12_L001_R1_001.bam NA 
      1 100001 200000 160094-T_S12_L001_R1_001.bam 4 
      1 200001 300000 160094-T_S12_L001_R1_001.bam NA 
      1 300001 400000 160094-T_S12_L001_R1_001.bam 3 
      2 1 100000 160094-T_S12_L001_R1_001.bam NA 
      2 100001 200000 160094-T_S12_L001_R1_001.bam NA 
      1 1 100000 160095-N_S1_L001_R1_001.bam NA 
      1 100001 200000 160095-N_S1_L001_R1_001.bam 1 
      1 200001 300000 160095-N_S1_L001_R1_001.bam NA 
      1 300001 400000 160095-N_S1_L001_R1_001.bam 3 
      2 1 100000 160095-N_S1_L001_R1_001.bam NA 
      2 100001 200000 160095-N_S1_L001_R1_001.bam NA 

This looks like a wide-to-long reshape. Search for that and see whether it solves your problem. – lmo

Answer


Here is a solution using base R. The following code creates the data.frame from your example, so that others can use it:

d <- structure(list(location = c("1:1-100000", "1:100001-200000", 
    "1:200001-300000", "1:300001-400000", "2:1-100000", "2:100001-200000" 
), `160095-T_S2_L001_R1_001.bam` = c(NA, 2L, 1L, 3L, NA, 1L), 
    `160096-N_S4_L001_R1_001.bam` = c(NA, 2L, NA, 3L, NA, 1L), 
    `160094-T_S12_L001_R1_001.bam` = c(NA, 4L, NA, 3L, NA, NA 
), `160095-N_S1_L001_R1_001.bam` = c(NA, 1L, NA, 3L, NA, 
    NA)), .Names = c("location", "160095-T_S2_L001_R1_001.bam", 
     "160096-N_S4_L001_R1_001.bam", "160094-T_S12_L001_R1_001.bam", 
     "160095-N_S1_L001_R1_001.bam"), class = "data.frame", row.names = c(NA, 
     -6L)) 

First, use reshape to put the data in long format:

long <- reshape(d, varying=2:5, v.names="number", timevar="sample_id", 
    times=names(d)[2:5], direction="long") 

This function is not very intuitive; in my experience it takes a fair amount of experimentation to get it right.

> head(long) 
            location     sample_id number id 
1.160095-T_S2_L001_R1_001.bam  1:1-100000 160095-T_S2_L001_R1_001.bam  NA 1 
2.160095-T_S2_L001_R1_001.bam 1:100001-200000 160095-T_S2_L001_R1_001.bam  2 2 
3.160095-T_S2_L001_R1_001.bam 1:200001-300000 160095-T_S2_L001_R1_001.bam  1 3 
4.160095-T_S2_L001_R1_001.bam 1:300001-400000 160095-T_S2_L001_R1_001.bam  3 4 
5.160095-T_S2_L001_R1_001.bam  2:1-100000 160095-T_S2_L001_R1_001.bam  NA 5 
6.160095-T_S2_L001_R1_001.bam 2:100001-200000 160095-T_S2_L001_R1_001.bam  1 6 
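
To make the reshape arguments easier to follow, here is a minimal sketch on a toy data frame. The names wide, key, s1, s2 and toy_long are made up for illustration and are not part of the question's data.

```r
# Toy wide data frame (hypothetical names): one key column, two sample columns
wide <- data.frame(key = c("a", "b"),
                   s1 = c(1L, NA),
                   s2 = c(3L, 4L),
                   stringsAsFactors = FALSE)

toy_long <- reshape(wide,
                    varying   = 2:3,               # columns to stack
                    v.names   = "number",          # name of the stacked value column
                    timevar   = "sample_id",       # column recording where each value came from
                    times     = names(wide)[2:3],  # labels written into sample_id
                    direction = "long")

rownames(toy_long) <- NULL  # drop the "1.s1"-style row names that reshape generates
print(toy_long)
```

Each original row appears once per stacked column, and reshape appends an id column that records the original row number.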

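Next, use strsplit with a regular expression that splits on the colon and the dash to break each location string into three parts, and bind the pieces together row by row. The result is a character matrix, but the values need to be numeric, so change the mode of the matrix.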

splt <- do.call(rbind, strsplit(long$location, "(:|-|\\s+)")) 
mode(splt) <- "numeric" 

colnames(splt) <- c("chromosome", "start", "stop") 

> head(splt) 
    chromosome start stop 
[1,]   1  1 100000 
[2,]   1 100001 200000 
[3,]   1 200001 300000 
[4,]   1 300001 400000 
[5,]   2  1 100000 
[6,]   2 100001 200000 
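
As a small illustration of what the split does, the sketch below applies the same idea to one string and then to two. The names parts and m are made up for this example, and the pattern is simplified to the colon and dash only.

```r
# strsplit returns a list with one character vector per input string
parts <- strsplit("1:100001-200000", "(:|-)")
print(parts[[1]])

# rbind-ing the list elements stacks them into a character matrix,
# one row per input string
m <- do.call(rbind, strsplit(c("1:1-100000", "2:100001-200000"), "(:|-)"))

# Converting the mode turns every cell into a number while keeping the matrix shape
mode(m) <- "numeric"
print(m)
```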

In the final step, build a data.frame containing all the required fields:

result <- data.frame(splt, long[c("sample_id","number")], row.names = NULL) 

> head(result) 
    chromosome start stop     sample_id number 
1   1  1 100000 160095-T_S2_L001_R1_001.bam  NA 
2   1 100001 200000 160095-T_S2_L001_R1_001.bam  2 
3   1 200001 300000 160095-T_S2_L001_R1_001.bam  1 
4   1 300001 400000 160095-T_S2_L001_R1_001.bam  3 
5   2  1 100000 160095-T_S2_L001_R1_001.bam  NA 
6   2 100001 200000 160095-T_S2_L001_R1_001.bam  1
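
One caveat, which is an assumption on my part because the example data only has chromosomes 1 and 2: if the real data contains non-numeric chromosome names such as "X", setting the mode of the whole matrix to numeric would turn them into NA. A possible variation is to convert only the start and stop columns. The names loc, pieces and coords below are hypothetical.

```r
# Hypothetical locations, including a non-numeric chromosome name
loc <- c("1:1-100000", "X:100001-200000")

# Same split as above, giving a character matrix with three columns
pieces <- do.call(rbind, strsplit(loc, "(:|-)"))

# Keep the chromosome as character and convert only the coordinates
coords <- data.frame(chromosome = pieces[, 1],
                     start = as.integer(pieces[, 2]),
                     stop  = as.integer(pieces[, 3]),
                     stringsAsFactors = FALSE)
print(coords)
```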