2017-08-22
2

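dplyr: summing NA cases in summarise

I don't understand what I am doing wrong when summarising a column that contains both values and NA. I have read that you can count cases by summarising with sum(), and that you can count NA cases with sum(is.na(variable)).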

In fact, I can reproduce that behaviour with a test tibble:

library(dplyr)

df <- tibble(x = c(rep("a", 5), rep("b", 5)), y = c(NA, NA, 1, 1, NA, 1, 1, 1, NA, NA))

df %>%
    group_by(x) %>%
    summarise(one = sum(y, na.rm = TRUE),
              na = sum(is.na(y)))

And this is the expected result:

# A tibble: 2 x 3
  x       one    na
  <chr> <dbl> <int>
1 a         2     3
2 b         3     2

For some reason, I cannot reproduce this result with my own data:

mydata <- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Amphibians", 
"Birds", "Mammals", "Reptiles", "Plants"), class = "factor"), 
    Scenario = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 
    1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("Present", 
    "RCP 4.5", "RCP 8.5"), class = "factor"), year = c(1940, 
    1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 
    1940, 1940, 1940, 1940, 1940, 1940, 1940), random = c("obs", 
    "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs", 
    "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs"), 
    species = c("Allobates fratisenescus", "Allobates fratisenescus", 
    "Allobates fratisenescus", "Allobates juanii", "Allobates juanii", 
    "Allobates juanii", "Allobates kingsburyi", "Allobates kingsburyi", 
    "Allobates kingsburyi", "Adelophryne adiastola", "Adelophryne adiastola", 
    "Adelophryne adiastola", "Adelophryne gutturosa", "Adelophryne gutturosa", 
    "Adelophryne gutturosa", "Adelphobates quinquevittatus", 
    "Adelphobates quinquevittatus", "Adelphobates quinquevittatus" 
    ), Endemic = c(1, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA)), row.names = c(NA, -18L), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), vars = "species", indices = list(
    9:11, 12:14, 15:17, 0:2, 3:5, 6:8), group_sizes = c(3L, 3L, 
3L, 3L, 3L, 3L), biggest_group_size = 3L, labels = structure(list(
    species = c("Adelophryne adiastola", "Adelophryne gutturosa", 
    "Adelphobates quinquevittatus", "Allobates fratisenescus", 
    "Allobates juanii", "Allobates kingsburyi")), row.names = c(NA, 
-6L), class = "data.frame", vars = "species", .Names = "species"), .Names = c("Group", 
"Scenario", "year", "random", "species", "Endemic")) 

(My data has millions of rows; only a part of it is reproduced here.)

Testsum <- mydata %>% 
    group_by(Group, Scenario, year, random) %>% 
    summarise(All = n(), 
      Endemic = sum(Endemic, na.rm = T), 
      noEndemic = sum(is.na(Endemic))) 

# A tibble: 3 x 7 
# Groups: Group, Scenario, year [?] 
     Group Scenario year random All Endemic noEndemic 
     <fctr> <fctr> <dbl> <chr> <int> <dbl>  <int> 
1 Amphibians Present 1940 obs  6  3   0 
2 Amphibians RCP 4.5 1940 obs  6  3   0 
3 Amphibians RCP 8.5 1940 obs  6  3   0 

I expected noEndemic to be 3 in every case, not 0, because Endemic is NA for 3 of the species! I double-checked that the column is numeric:

Test3$Endemic %>% class 
[1] "numeric" 

Clearly there is something very stupid that I am not seeing, after hours of messing around. Is it obvious to any of you? Thanks!

Answer

4

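This happens because you assigned the summary to a column that is also named Endemic. summarise() evaluates its expressions in order, so by the time sum(is.na(Endemic)) runs, Endemic already refers to the new per-group sum (a single non-NA value) rather than the original column, and the NA count comes out as 0. Use a new column name instead, and rename it afterwards if needed: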

mydata %>% 
    group_by(Group, Scenario, year, random) %>% 
    summarise(All = n(), 
       EndemicS = sum(Endemic, na.rm = TRUE), 
       noEndemic = sum(is.na(Endemic))) %>% 
    rename(Endemic = EndemicS) 
# A tibble: 3 x 7 
# Groups: Group, Scenario, year [3] 
#  Group Scenario year random All Endemic noEndemic 
#  <fctr> <fctr> <dbl> <chr> <int> <dbl>  <int> 
#1 Amphibians Present 1940 obs  6  3   3 
#2 Amphibians RCP 4.5 1940 obs  6  3   3 
#3 Amphibians RCP 8.5 1940 obs  6  3   3 
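
The effect can also be seen on the small test tibble from the question. The following is a minimal sketch, not part of the original answer: the names `overwritten` and `reordered` are made up for illustration, and computing the NA count before overwriting the column is shown only as one possible alternative to renaming.

```r
library(dplyr)

df <- tibble(x = c(rep("a", 5), rep("b", 5)),
             y = c(NA, NA, 1, 1, NA, 1, 1, 1, NA, NA))

# y is overwritten first, so is.na(y) sees the one-value sum per group,
# not the original column, and the NA count is 0.
overwritten <- df %>%
  group_by(x) %>%
  summarise(y = sum(y, na.rm = TRUE),
            na = sum(is.na(y)))

# Counting the NAs before y is overwritten gives the intended counts.
reordered <- df %>%
  group_by(x) %>%
  summarise(na = sum(is.na(y)),
            y = sum(y, na.rm = TRUE))
```

Here `overwritten$na` is 0 for both groups, while `reordered$na` is 3 for group "a" and 2 for group "b". Relying on argument order is fragile, so using a distinct name as shown above is probably the safer habit.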