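How do I compute aggregates on a dataset that first needs to be grouped?

I have a dataset in which measurements were taken by sensors (1 to 16). I would like the mean of value for each sensor in each sequence. Not every sequence goes straight from 16 back to 1 (there are sometimes stray measurements that need to be removed). Note: this is a small, fake dataset.

The dataset (it can also be read with the script below):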

# To read with rio 
# library("devtools") 
# install_github("leeper/rio") 
library("rio") 
df <- import("https://gist.githubusercontent.com/karthik/ad2874e5b5c5f3af73ad89d14b26a913/raw/f435317539bc56a09b248a0ef193db21b7176eee/small.csv")

My first attempt:

library(dplyr) 
# Assigning groups to the data 
df$diff <- c(df$sensor[2:nrow(df)], 0) - df$sensor 
# There is sometimes a sensor reading between 16 and 1. This removes those rows. 
df2 <- df[-which(df$diff < 0 & df$sensor != 16),] 

# end is now where the last 16 was 
end <- which(df2$diff < 0) 
# Start begins with 1, then adds 1 to the position of every last 16 sensor 
# reading to get the next 1 
start <- c(1, head(end, -1) + 1)
# Now combine both into a data.frame 
positions <- data_frame(start, end) 
# Add unique groups 
positions$group <- 1:nrow(positions) 
df2$group <- NA 

# Yes this is a horrible loop and 
# super inefficient on the full dataset. 
for (i in 1:nrow(positions)) { 
    df2[positions[i,]$start:positions[i, ]$end, ]$group <- 
    positions[i,]$group 
}

Now the aggregation is easy to do with dplyr:

df3 <- df2 %>% 
    group_by(sensor,group) %>% 
    summarise(mean_value = mean(value)) 
head(df3)

What I want:

Source: local data frame [6 x 3]
Groups: sensor [4]

  sensor group mean_value
   (int) (int)      (dbl)
1      1     2 0.07285933
2      2     2 0.06993007
3      3     1 0.04845651
4      3     2 0.03976837
5      4     1 0.06033732
6      4     2 0.06480888

What would be a good way to do this?

Source

2016-06-17 Maiasaura

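Instead of creating the positions data frame and the intermediate data frame df2, and then adding the grouping variable with a for loop, you can do everything in dplyr vocabulary. With a combination of cumsum and lag, you can add the grouping variable inside mutate. That makes for a much simpler procedure: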

library(dplyr)

df %>% 
    mutate(differ = lead(sensor) - sensor) %>% 
    filter(!(differ < 0 & sensor != 16)) %>% 
    mutate(grp = cumsum(lag(differ,default=0) < 0) + 1) %>% 
    group_by(sensor, grp) %>% 
    summarise(mean_val = mean(value))

which gives:

Source: local data frame [30 x 3]
Groups: sensor [?]

   sensor   grp   mean_val
    (int) (dbl)      (dbl)
1       1     2 0.07285933
2       2     2 0.06993007
3       3     1 0.04845651
4       3     2 0.03976837
5       4     1 0.06033732
6       4     2 0.06480888
7       5     1 0.03276722
8       5     2 0.05005240
9       6     1 0.06967405
10      6     2 0.06484712
..    ...   ...        ...

Note: I used differ rather than diff as the variable name, because the latter is a function (and it is better not to give columns the names of functions).

You can also use the data.table package for this:
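
To see how the cumsum/lag step turns sensor readings into group ids, here is a minimal sketch on a made-up data frame. The toy values and the names toy and grouped are my own assumptions for illustration, not part of the question's data; the sequence is shortened to sensors 1, 2 and 16, with one stray reading (sensor 5) after the first 16.

```r
library(dplyr)

# Hypothetical toy data: three passes over sensors 1, 2, 16,
# plus one stray reading (sensor 5) between the first 16 and the next 1.
toy <- data.frame(
  sensor = c(1, 2, 16, 5, 1, 2, 16, 1, 2, 16),
  value  = c(10, 20, 30, 99, 11, 21, 31, 12, 22, 32)
)

grouped <- toy %>%
  # differ < 0 marks a row whose successor has a lower sensor number
  mutate(differ = lead(sensor) - sensor) %>%
  # a drop that does not start at sensor 16 is a stray reading: remove it
  filter(!(differ < 0 & sensor != 16)) %>%
  # every row that follows a drop starts a new sequence
  mutate(grp = cumsum(lag(differ, default = 0) < 0) + 1)

print(grouped)
# grp is 1 1 1 2 2 2 3 3 3 and the sensor 5 row is gone
```

One thing to watch: lead() returns NA for the last row, so if the data does not end on sensor 16, the filter condition evaluates to NA for that row and filter() drops it. Passing a default to lead() should avoid that if the final reading matters.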

library(data.table) 
setDT(df)[, differ := shift(sensor, type='lead') - sensor 
      ][!(differ < 0 & sensor != 16) 
      ][, grp := cumsum(shift(differ,fill=0) < 0) + 1 
       ][, .(mean_val = mean(value)), .(sensor,grp)]

setDT(df) converts your data frame to a data.table.
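
As a small illustration of what that conversion means in practice, the sketch below uses a hypothetical three-row data frame (toy is my own name, not from the question). Both setDT and := work by reference, so no assignment with <- is needed and the original object itself is changed.

```r
library(data.table)

# Hypothetical toy data frame
toy <- data.frame(sensor = c(1L, 2L, 16L), value = c(0.1, 0.2, 0.3))

setDT(toy)     # converts toy in place; no assignment needed
class(toy)     # now includes "data.table"

# := adds the column by reference, again without assignment
toy[, differ := shift(sensor, type = "lead") - sensor]
names(toy)     # "sensor" "value" "differ"
```

If you need to keep the original data frame unchanged, as.data.table(df) returns a converted copy instead.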

Source

2016-06-17 21:37:10 Jaap
