Finding the maximum value from the combination of two tables (for-loop is too slow)

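I have a data table, "the.data". The first column identifies the measuring instrument, and the remaining columns hold the measurements (together with the hour). I also have groups of instruments, and I need the sum for each group at each hour. For example, group 1 (g1) consists of instruments 1 and 2.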

g1 <- c(1,2) 
g2 <- c(4,3,1) 
g3 <- c(1,5,2) 
g4 <- c(2,4) 
g5 <- c(5,3,1,2,6) 
groups <- c("g1","g2","g3","g4","g5")

The vectors above define the groups of instruments. The data looks like this:

instrument <- c(1,2,3,4,5,1,2,3,4,5) 
hour <- c(1,1,1,1,1,2,2,2,2,2) 
da <- c(12,14,11,14,10,19,15,16,13,11) 
db <- c(21,23,22,29,28,26,24,27,26,22) 
the.data <- data.frame(instrument,hour,da,db)

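For each group and each data type (da, db), I need the maximum of the hourly sums and the hour at which it occurs.

G1, hour 1: sum(da) = 12 + 14 = 26
G1, hour 2: sum(da) = 19 + 15 = 34

So for G1 and da, the answer is hour 2 with a value of 34. I did this with a for-loop inside a for-loop, but it takes too long (I interrupted it after several hours). The problem is that the.data has about 100,000 rows and there are about 5,000 groups, each containing 2 to 50 instruments.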
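The worked example for G1 can be reproduced in a few lines of base R. This is only a minimal sketch on the sample data from the question; the names `in_g1` and `sums_g1` are introduced here for illustration.

```r
# Sample data from the question (da only)
instrument <- c(1,2,3,4,5,1,2,3,4,5)
hour <- c(1,1,1,1,1,2,2,2,2,2)
da <- c(12,14,11,14,10,19,15,16,13,11)
the.data <- data.frame(instrument, hour, da)
g1 <- c(1,2)

# Keep only the rows whose instrument belongs to g1
in_g1 <- the.data$instrument %in% g1

# Sum da per hour for those rows
sums_g1 <- tapply(the.data$da[in_g1], the.data$hour[in_g1], sum)

max(sums_g1)                          # 34
names(sums_g1)[which.max(sums_g1)]    # "2", i.e. hour 2
```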

What would be a good way to do this?

Thanks to all contributors on Stack Overflow.

Update: the sample now uses five groups.

/Chris

Source

2012-04-23 Chris

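The group loop needs to stay, or at best be replaced by something like lapply(). The hour loop, however, can be replaced completely by reshaping the data into an instrument x hour matrix and then using vectorized algebra. For example: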

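library(reshape2) 

groups = list(g1, g3) 

# Wide format: one row per instrument, one column per hour, filled with da 
the.data.a = dcast(the.data[,1:3], instrument ~ hour, value.var = "da") 

# For each group, sum every hour column over the group's rows, 
# then take the largest sum and its position 
> sapply(groups, function(x) data.frame(max = max(colSums(the.data.a[x, -1])), 
             ind = which.max(colSums(the.data.a[x, -1])))) 
    [,1] [,2] 
max 34 45 
ind 2 2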

Source

2012-04-23 22:06:20

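This runs with two groups, but throws an error with five groups. –

@DWin That is because only instruments 1 to 5 are in the sample data. The other groups refer to instruments that do not exist. –

Thanks for the quick and very good answer. Reading this, I noticed that I had left some instruments out of "the.data" and fixed that, but my real data does have missing measurements (by hour, not by instrument). – Chris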
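One way to sidestep the error discussed in these comments is to look rows up by instrument id rather than by row position, and to skip ids that are absent from the data. The sketch below uses base R's tapply() in place of dcast(); `da.mat` and `group_max` are hypothetical names, and treating missing instruments or hours as contributing nothing to the sum is an assumption, not something the thread settles.

```r
# Sample data from the question (da only)
instrument <- c(1,2,3,4,5,1,2,3,4,5)
hour <- c(1,1,1,1,1,2,2,2,2,2)
da <- c(12,14,11,14,10,19,15,16,13,11)
the.data <- data.frame(instrument, hour, da)
groups <- list(g1 = c(1,2), g2 = c(4,3,1), g3 = c(1,5,2),
               g4 = c(2,4), g5 = c(5,3,1,2,6))

# instrument x hour matrix of da sums; row names are instrument ids,
# column names are hours, and missing combinations become NA
da.mat <- with(the.data, tapply(da, list(instrument, hour), sum))

# Hypothetical helper: largest hourly sum for one group and the hour it occurs in
group_max <- function(ids) {
  ids <- as.character(ids)
  ids <- ids[ids %in% rownames(da.mat)]      # drop instruments not in the data
  byHour <- colSums(da.mat[ids, , drop = FALSE], na.rm = TRUE)
  c(max.sum = max(byHour),
    max.hour = as.numeric(names(byHour)[which.max(byHour)]))
}

# One row per group, columns max.sum and max.hour
out <- t(vapply(groups, group_max, c(max.sum = 0, max.hour = 0)))
out
```

With this data, g5 refers to instrument 6, which has no rows, and it is simply ignored. Ties (as in g4, where both hours sum to 28) resolve to the first hour because which.max() returns the first maximum.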

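Here is one approach using Hadley's plyr and reshape2. First, add boolean columns to the.data indicating whether each instrument belongs to each group. Then melt the data into long format, drop the rows you do not need, and run the group-by operation with either ddply or data.table.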

#add boolean columns 
the.data <- transform(the.data, 
         g1 = instrument %in% g1, 
         g2 = instrument %in% g2, 
         g3 = instrument %in% g3, 
         g4 = instrument %in% g4, 
         g5 = instrument %in% g5 
        ) 

#load library 
library(reshape2) 
#melt into long format 
the.data.m <- melt(the.data, id.vars = 1:4) 
#subset out data that has FALSE for the groupings 
the.data.m <- subset(the.data.m, value == TRUE) 

#load plyr and data.table 
library(plyr) 
library(data.table) 

#plyr way 
ddply(the.data.m, c("variable", "hour"), summarize, out = sum(da)) 
#data.table way 
dt <- data.table(the.data.m) 
dt[, list(out = sum(da)), by = "variable, hour"]

Run a quick benchmark to see which is faster:

library(rbenchmark) 
f1 <- function() ddply(the.data.m, c("variable", "hour"), summarize, out = sum(da)) 
f2 <- function() dt[, list(out = sum(da)), by = "variable, hour"] 

> benchmark(f1(), f2(), replications=1000, order="elapsed", columns = c("test", "elapsed", "relative")) 
    test elapsed relative 
2 f2() 3.44 1.000000 
1 f1() 6.82 1.982558

So data.table is about twice as fast in this example. Your mileage may vary.

And just to show that it gives the right values:

> dt[, list(out = sum(da)), by = "variable, hour"] 
     variable hour out 
[1,]  g1 1 26 
[2,]  g1 2 34 
[3,]  g2 1 25 
[4,]  g2 2 29 

...

Source

2012-04-23 22:19:49 Chase

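I don't think your code handles the max and which.max selection yet. –

@Dwin - doh, you're right! I misread that and glossed over it earlier. Will update in a bit. Thanks. – Chase

You appear to need about 5,000 groups, and you did not provide your code (or a programmatic way to generate the groups). Here is one way: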
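The max / which.max step raised in the comment could be added along the following lines. This is a base R sketch of the same long-format idea, not the answerer's code: it builds the group membership with stack() instead of hand-written boolean columns, and the names `membership`, `long`, `sums` and `best_da` are illustrative. Note that the joined table has one row per data row and containing group, so it can become large when instruments belong to many groups.

```r
# Sample data from the question
instrument <- c(1,2,3,4,5,1,2,3,4,5)
hour <- c(1,1,1,1,1,2,2,2,2,2)
da <- c(12,14,11,14,10,19,15,16,13,11)
db <- c(21,23,22,29,28,26,24,27,26,22)
the.data <- data.frame(instrument, hour, da, db)
groups <- list(g1 = c(1,2), g2 = c(4,3,1), g3 = c(1,5,2),
               g4 = c(2,4), g5 = c(5,3,1,2,6))

# Long membership table: one row per (instrument, group) pair
membership <- stack(groups)
names(membership) <- c("instrument", "group")

# Attach the group to every matching data row (unmatched instruments drop out)
long <- merge(the.data, membership, by = "instrument")

# Sum da and db per group and hour
sums <- aggregate(cbind(da, db) ~ group + hour, data = long, FUN = sum)

# Per group, keep the row with the largest da sum (first one on ties)
best_da <- do.call(rbind, lapply(split(sums, sums$group), function(s) {
  s[which.max(s$da), c("group", "hour", "da")]
}))
best_da
```

The same selection with `s$db` in place of `s$da` gives the result for the second data column.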

groups <- list(g1,g2,g3,g4,g5) 
gmax <- list() 
# The "da" results 
for(gitem in seq_along(groups)) { 
     gmax[[gitem]] <- with(subset(the.data , instrument %in% groups[[gitem]]), 
           tapply(da , hour, sum)) } 
damat <- matrix(c(sapply(gmax, which.max), 
        sapply(gmax, max)) , ncol=2) 

# The "db" results 
for(gitem in seq_along(groups)) { 
     gmax[[gitem]] <- with(subset(the.data , instrument %in% groups[[gitem]]), 
           tapply(db , hour, sum)) } 
dbmat <- matrix(c(sapply(gmax, which.max), 
        sapply(gmax, max)) , ncol=2) 

#-------- 
> damat 
    [,1] [,2] 
[1,] 2 34 
[2,] 2 29 
[3,] 2 45 
[4,] 1 14 
[5,] 2 42 
> dbmat 
    [,1] [,2] 
[1,] 2 50 
[2,] 2 53 
[3,] 1 72 
[4,] 1 29 
[5,] 1 73
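The two nearly identical loops above could be folded into one helper that takes the column name as an argument. The following is a sketch only; `group_hour_max` is a hypothetical name, and it is run here on the sample data as printed in the question, so its numbers need not match the matrices above if the sample differed when that answer was written.

```r
# Sample data as printed in the question
instrument <- c(1,2,3,4,5,1,2,3,4,5)
hour <- c(1,1,1,1,1,2,2,2,2,2)
da <- c(12,14,11,14,10,19,15,16,13,11)
db <- c(21,23,22,29,28,26,24,27,26,22)
the.data <- data.frame(instrument, hour, da, db)
groups <- list(g1 = c(1,2), g2 = c(4,3,1), g3 = c(1,5,2),
               g4 = c(2,4), g5 = c(5,3,1,2,6))

# Hypothetical helper: for each group, the hour with the largest sum of
# `column` and that sum. Assumes every group matches at least one row.
group_hour_max <- function(data, groups, column) {
  t(vapply(groups, function(ids) {
    rows <- data$instrument %in% ids
    byHour <- tapply(data[[column]][rows], data$hour[rows], sum)
    c(max.hour = as.numeric(names(byHour)[which.max(byHour)]),
      max.sum = max(byHour))
  }, c(max.hour = 0, max.sum = 0)))
}

damat <- group_hour_max(the.data, groups, "da")
dbmat <- group_hour_max(the.data, groups, "db")
```

Each result has one row per group, with the hour in the first column and the sum in the second, mirroring the layout of damat and dbmat above.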

Source

2012-04-23 22:20:48

This is a slightly modified version of John Colby's answer, with some sample data. It may be a more effective use of R.

set.seed(21) 
instrument <- sample(100, 1e5, TRUE) 
hour <- sample(24, 1e5, TRUE) 
da <- trunc(runif(1e5)*10) 
db <- trunc(runif(1e5)*10) 
the.data <- data.frame(instrument,hour,da,db) 
groups <- replicate(5000, sample(100, sample(50,1))) 
names(groups) <- paste("g",1:length(groups),sep="") 

library(reshape2) 
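system.time({ 
# Sum da per instrument (rows) and hour (columns) 
the.data.a <- dcast(the.data[,1:3], instrument ~ hour, sum) 
out <- t(sapply(groups, function(i) { 
    # Row i is instrument i, assuming every instrument 1..100 occurs in the sample 
    byHour <- colSums(the.data.a[i,-1]) 
    c(max(byHour), which.max(byHour)) 
})) 
# Column order follows c(max, which.max) above 
colnames(out) <- c("max.sum","max.hour") 
}) 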
# Using da as value column: use value.var to override. 
# user system elapsed 
# 3.80 0.00 3.81
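If the group loop itself becomes the bottleneck, one further option is to express group membership as a 0/1 matrix and obtain all group x hour sums with a single matrix product. This is an alternative sketch rather than anything proposed in the thread; the sizes are reduced, the names `n_instr`, `ih`, `member`, `gh` and `out` are illustrative, and it assumes hours are numbered 1 to 24 so that a column index equals the hour.

```r
set.seed(21)
n_instr <- 100
the.data <- data.frame(instrument = sample(n_instr, 1e4, TRUE),
                       hour = sample(24, 1e4, TRUE),
                       da = trunc(runif(1e4) * 10))
groups <- replicate(200, sample(n_instr, sample(2:50, 1)), simplify = FALSE)
names(groups) <- paste0("g", seq_along(groups))

# instrument x hour matrix of da sums; factors with fixed levels guarantee
# a row for every instrument and a column for every hour
ih <- with(the.data, tapply(da,
                            list(factor(instrument, levels = seq_len(n_instr)),
                                 factor(hour, levels = 1:24)),
                            sum))
ih[is.na(ih)] <- 0                      # empty cells contribute nothing

# group x instrument membership matrix (1 if the instrument is in the group)
member <- t(vapply(groups,
                   function(ids) as.numeric(seq_len(n_instr) %in% ids),
                   numeric(n_instr)))

# group x hour sums in one step
gh <- member %*% ih

out <- data.frame(max.sum = apply(gh, 1, max),
                  max.hour = max.col(gh, ties.method = "first"))
head(out)
```

The membership matrix has one cell per group and instrument, so its memory use grows with the product of the two counts.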

Source

2012-04-23 23:21:00

Nice example, Josh! I'm always amazed at how fast we can get these things. –
