dplyr: calculate the percentage of the levels within many columns of a data.frame and convert it to wide format

Data

df <- data.frame(id=c(rep("site1", 3), rep("site2", 8), rep("site3", 9), rep("site4", 15)), 
       major_rock = c("greywacke", "mudstone", "gravel", "greywacke", "gravel", "mudstone", "gravel", "mudstone", "mudstone", 
           "conglomerate", "gravel", "mudstone", "greywacke","conglomerate", "gravel", "gravel", "greywacke","gravel", 
           "greywacke", "gravel", "mudstone", "greywacke", "gravel", "gravel", "gravel", "conglomerate", "greywacke", 
           "coquina", "gravel", "gravel", "greywacke", "gravel", "mudstone","mudstone", "gravel"), 
       minor_rock = c("sandstone mudstone basalt chert limestone", "limestone", "sand silt clay", "sandstone mudstone basalt chert limestone", 
           "sand silt clay", "sandstone conglomerate coquina tephra", NA, "limestone", "mudstone sandstone coquina limestone", 
           "sandstone mudstone limestone", "sand loess silt", "sandstone conglomerate coquina tephra", "sandstone mudstone basalt chert limestone", 
           "sandstone mudstone limestone", "sand loess silt", "loess silt sand", "sandstone mudstone conglomerate chert limestone basalt", 
           "sand silt clay", "sandstone mudstone conglomerate", "loess sand silt", "sandstone conglomerate coquina tephra", "sandstone mudstone basalt chert limestone", 
           "sand loess silt", "sand silt clay", "loess silt sand", "sandstone mudstone limestone", "sandstone mudstone conglomerate chert limestone basalt", 
           "limestone", "loess sand silt", NA, "sandstone mudstone conglomerate", "sandstone siltstone mudstone limestone silt lignite", "limestone", 
           "mudstone sandstone coquina limestone", "mudstone tephra loess"), 
       area_ha = c(1066.68, 7.59, 3.41, 4434.76, 393.16, 361.69, 306.75, 124.93, 95.84, 9.3, 8.45, 4565.89, 2600.44, 2198.52,  
          2131.71, 2050.09, 1640.47, 657.09, 296.73, 178.12, 10403.53, 8389.2, 8304.08, 3853.36, 2476.36, 2451.25,  
          1640.47, 1023.02, 532.94, 385.68, 296.73, 132.45, 124.93, 109.12, 4.87))


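For another analysis, I need to prepare df so that it has only one row per site. In the final data frame, df_fin, each site should have the percentage of every level of major_rock and minor_rock, and the column names (variables) should be the levels of major_rock and minor_rock. I did this separately for each variable (major_rock and minor_rock) and then combined the results, as shown below.

For minor_rock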

df_minor_rock <- df %>% 
    dplyr::select(-major_rock) %>% 
    dplyr::group_by(id, minor_rock) %>% 
    dplyr::summarise(total_area = sum(area_ha)) %>% 
    dplyr::group_by(id) %>% 
    dplyr::mutate(percent_minor = total_area/sum(total_area) * 100) %>% 
    dplyr::select(-total_area) %>% 
    tidyr::spread(minor_rock, percent_minor)

> df_minor_rock
Source: local data frame [4 x 15]
Groups: id [4]

      id limestone `loess sand silt` `loess silt sand` `mudstone sandstone coquina limestone` `mudstone tephra loess` `sand loess silt`
* <fctr>     <dbl>             <dbl>             <dbl>                                  <dbl>                   <dbl>             <dbl>
1  site1 0.7042907                NA                NA                                     NA                      NA                NA
2  site2 2.1784240                NA                NA                              1.6711771                      NA          0.147344
3  site3        NA          1.091484         12.562550                                     NA                      NA         13.062701
4  site4 2.8607214          1.328100          6.171154                              0.2719299              0.01213617         20.693984
# ... with 8 more variables: `sand silt clay` <dbl>, `sandstone conglomerate coquina tephra` <dbl>,
#   `sandstone mudstone basalt chert limestone` <dbl>, `sandstone mudstone conglomerate` <dbl>,
#   `sandstone mudstone conglomerate chert limestone basalt` <dbl>, `sandstone mudstone limestone` <dbl>,
#   `sandstone siltstone mudstone limestone silt lignite` <dbl>, `<NA>` <dbl>

For major_rock

library(tidyverse)

df_major_rock <- df %>% 
    dplyr::select(-minor_rock) %>% 
    dplyr::group_by(id, major_rock) %>% 
    dplyr::summarise(total_area = sum(area_ha)) %>% 
    dplyr::group_by(id) %>% 
    dplyr::mutate(percent_major = total_area/sum(total_area) * 100) %>% 
    dplyr::select(-total_area) %>% 
    tidyr::spread(major_rock, percent_major)

> df_major_rock
Source: local data frame [4 x 6]
Groups: id [4]

      id conglomerate  coquina     gravel greywacke   mudstone
* <fctr>        <dbl>    <dbl>      <dbl>     <dbl>      <dbl>
1  site1           NA       NA  0.3164205  98.97929  0.7042907
2  site2    0.1621656       NA 12.3517842  77.32960 10.1564462
3  site3   13.4720995       NA 30.7432536  27.80577 27.9788787
4  site4    6.1085791 2.549393 39.0992422  25.73366 26.5091274
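As a quick hand check of the percentages above, the site1 row can be recomputed with base R alone. This is only a sketch: the `site1` data frame below is a copy of the three site1 records from df, and `area_by_rock` is a name introduced here for illustration.

```r
# Hand check of the site1 row, using only base R.
# The three site1 records are copied from df above.
site1 <- data.frame(
    major_rock = c("greywacke", "mudstone", "gravel"),
    area_ha    = c(1066.68, 7.59, 3.41)
)

# Total area per rock type, then each type's share of the site total.
area_by_rock  <- tapply(site1$area_ha, site1$major_rock, sum)
percent_major <- area_by_rock / sum(area_by_rock) * 100

round(percent_major, 4)
```

The rounded values should agree with the gravel, greywacke and mudstone entries in the site1 row of df_major_rock.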

Then I joined the two data.frames (df_major_rock and df_minor_rock), so that the final data.frame, df_fin, has only 4 observations (one row per site) and its variables are the levels of major_rock and minor_rock.

df_fin <- df_major_rock %>% 
    dplyr::right_join(., df_minor_rock, by="id")

Question

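df_fin is exactly what I want. However, this reproducible example shows only two variables (major_rock and minor_rock), so I had to create two separate data.frames to get the percentage of each variable's levels and then join them to obtain the final output, df_fin. My real data has more variables besides major_rock and minor_rock, and I want the percentage of their levels for each site as well. I think there must be a simpler or shorter approach than mine. Any suggestions would be highly appreciated.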

Source

2017-05-01 aelwan

You can use data.table::dcast to spread the data into columns. You can then use rowSums to calculate the percentages in one step. There may be a better way to do this, but I wrapped this approach in a loop over each column:

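library(dplyr)  # for left_join()

df_fin <- data.frame(id = unique(df$id)) 
myColumns <- setdiff(colnames(df)[-1], "area_ha")  # every column except id and area_ha

for (name in myColumns){ 
    # Spread one column to wide format, summing area_ha per site and level.
    dcastFormula <- as.formula(paste0("id ~ ", name)) 
    # Convert explicitly: newer data.table versions may not accept a plain data.frame.
    tempdf <- as.data.frame(data.table::dcast(data.table::as.data.table(df), dcastFormula, 
                                              fun.aggregate = sum, value.var = "area_ha")) 
    # Turn each row of summed areas into percentages of the row total.
    tempdf[, -1] <- tempdf[, -1] / rowSums(tempdf[, -1], na.rm = TRUE) * 100 
    df_fin <- left_join(df_fin, tempdf, by = "id") 
}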
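To see what a single pass of the loop produces, the same "spread, then divide by the row total" idea can be sketched in base R. Note that this swaps in `xtabs` and `prop.table` for `dcast` and `rowSums`, so it is an illustration of the idea rather than the answer's code. The `toy` data frame is just the site1 and site2 rows of df, with only the major_rock column.

```r
# Base R illustration of one loop iteration (xtabs instead of dcast).
# toy holds the site1 and site2 records of df, major_rock only.
toy <- data.frame(
    id = c(rep("site1", 3), rep("site2", 8)),
    major_rock = c("greywacke", "mudstone", "gravel", "greywacke", "gravel", "mudstone",
                   "gravel", "mudstone", "mudstone", "conglomerate", "gravel"),
    area_ha = c(1066.68, 7.59, 3.41, 4434.76, 393.16, 361.69,
                306.75, 124.93, 95.84, 9.3, 8.45)
)

# Step 1: one row per site, one column per level, cells = summed area.
wide_area <- xtabs(area_ha ~ id + major_rock, data = toy)

# Step 2: divide every row by its total to get percentages.
wide_pct <- prop.table(wide_area, margin = 1) * 100

round(wide_pct, 2)
```

Levels that do not occur at a site come out as 0 here, whereas the `tidyr::spread` version in the question leaves them as NA.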

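As always, there are probably several other ways to do this, but this is one example that is a bit simpler than your starting point. You will also need to adapt it depending on your other columns and how you want to aggregate them.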
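One such alternative, sketched here under the assumption that tidyr 1.0 or later and dplyr 1.0 or later are available, is to stack all the categorical columns with `pivot_longer`, compute the percentages once, and spread back with `pivot_wider`. This is not from the original answer. The `toy` data frame is just the first seven rows of df, and the names `rock_cols`, `variable`, `level` and `toy_fin` are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# First seven rows of df (site1 and part of site2).
toy <- data.frame(
    id         = c(rep("site1", 3), rep("site2", 4)),
    major_rock = c("greywacke", "mudstone", "gravel",
                   "greywacke", "gravel", "mudstone", "gravel"),
    minor_rock = c("sandstone mudstone basalt chert limestone", "limestone", "sand silt clay",
                   "sandstone mudstone basalt chert limestone", "sand silt clay",
                   "sandstone conglomerate coquina tephra", NA),
    area_ha    = c(1066.68, 7.59, 3.41, 4434.76, 393.16, 361.69, 306.75)
)

# Every column except the site id and the area is treated as categorical.
rock_cols <- setdiff(names(toy), c("id", "area_ha"))

toy_fin <- toy %>%
    # Stack the categorical columns: one row per (record, variable).
    pivot_longer(all_of(rock_cols), names_to = "variable", values_to = "level") %>%
    # Give missing levels an explicit label so they get a readable column name.
    mutate(level = replace_na(as.character(level), "missing")) %>%
    group_by(id, variable, level) %>%
    summarise(total_area = sum(area_ha), .groups = "drop") %>%
    # Percentage of each level within its site and variable.
    group_by(id, variable) %>%
    mutate(percent = total_area / sum(total_area) * 100) %>%
    ungroup() %>%
    select(-total_area) %>%
    # One row per site; column names are "<variable>: <level>".
    pivot_wider(names_from = c(variable, level), values_from = percent, names_sep = ": ")
```

Prefixing each column with its variable name avoids clashes when two variables share a level name. Levels absent from a site are NA by default; passing `values_fill = 0` to `pivot_wider` would turn them into zeros instead.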

Source

2017-05-01 19:08:57

Many thanks for your time and help, Ian. – aelwan

I think that in the last line of the function, MyExample should be tempdf. – aelwan

Thank you. I'll edit it. –
