Simple way [R]

I have two data frames: df_workingFile and df_groupIDs.

df_workingFile:

ID | GroupID | Sales | Date 
v | a1  | 1 | 2011 
w | a1  | 3 | 2010 
x | b1  | 8 | 2007 
y | b1  | 3 | 2006 
z | c3  | 2 | 2006

df_groupIDs:

GroupID | numIDs | MaxSales 
a1  | 2  | 3  
b1  | 2  | 8  
c3  | 1  | 2

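For df_groupIDs, I want to get the ID and date of the event with the maximum sales in each group. For example, group "a1" has two events in df_workingFile, "v" and "w". I want to identify that event "w" has the maximum sales value and bring its information into df_groupIDs. The final output should look like this: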

GroupID | numIDs | MaxSales | ID | Date 
a1  | 2  | 3  | w | 2010 
b1  | 2  | 8  | x | 2007 
c3  | 1  | 2  | z | 2006

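Now here is the problem: I wrote code that does this, but it is very inefficient and takes forever on datasets of 50-100K rows. I need help figuring out how to rewrite the code so it is more efficient. Here is what I currently have: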

i = 1 
for (groupID in df_groupIDs$GroupID) { 

    # rows of df_workingFile that belong to the current group 
    groupEvents <- subset(df_workingFile, df_workingFile$GroupID == groupID) 
    # position of the first row whose Sales equals the group's MaxSales 
    index <- match(df_groupIDs$MaxSales[i], groupEvents$Sales) 
    df_groupIDs$ID[i] = groupEvents$ID[index] 
    df_groupIDs$Date[i] = groupEvents$Date[index] 

    i = i+1 
}

Source

2017-08-04 NBC

Using dplyr:

library(dplyr) 

df_workingFile %>% 
    group_by(GroupID) %>%  # for each group id 
    arrange(desc(Sales)) %>% # sort by Sales (descending) 
    slice(1) %>%    # keep the top row 
    inner_join(df_groupIDs, by = "GroupID") %>% # join to df_groupIDs 
    select(GroupID, numIDs, MaxSales, ID, Date) 
    # keep the columns you want in the order you want
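
For readers without dplyr, the same three steps (sort by Sales descending, keep the first row per group, join to df_groupIDs) can be sketched in base R. This is only an illustration, not part of the original answer: it swaps in `order()`, `duplicated()` and `merge()` for the dplyr verbs, rebuilds the sample data from the question, and the names `sorted`, `top_rows` and `result` are made up for the example.

```r
# Sample data copied from the question
df_workingFile <- data.frame(
  ID      = c("v", "w", "x", "y", "z"),
  GroupID = c("a1", "a1", "b1", "b1", "c3"),
  Sales   = c(1, 3, 8, 3, 2),
  Date    = c(2011, 2010, 2007, 2006, 2006),
  stringsAsFactors = FALSE
)
df_groupIDs <- data.frame(
  GroupID  = c("a1", "b1", "c3"),
  numIDs   = c(2, 2, 1),
  MaxSales = c(3, 8, 2),
  stringsAsFactors = FALSE
)

# Sort all rows by Sales, largest first
sorted <- df_workingFile[order(-df_workingFile$Sales), ]

# Keep only the first row seen for each GroupID, i.e. its highest-Sales row
top_rows <- sorted[!duplicated(sorted$GroupID), c("GroupID", "ID", "Date")]

# Join back to df_groupIDs on GroupID
result <- merge(df_groupIDs, top_rows, by = "GroupID")
print(result)
```

Because the work is done with whole-column operations instead of a per-group loop, this should scale to the 50-100K rows mentioned in the question, though I have not benchmarked it.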

Another simple method, if Sales is an integer (so that you can rely on testing equality against the MaxSales column):

inner_join(df_groupIDs, df_workingFile, 
      by = c("GroupID" = "GroupID", "MaxSales" = "Sales"))
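
One thing worth knowing about joining on both keys: if two events in a group share the maximum Sales value, each of them matches and the group appears twice in the result. Here is a small sketch of that, using base R's `merge()` as a stand-in for `inner_join()` so it runs without extra packages. The extra event "u" and the names `joined` and `tied` are invented for the illustration.

```r
# Sample data copied from the question, with integer Sales
df_workingFile <- data.frame(
  ID      = c("v", "w", "x", "y", "z"),
  GroupID = c("a1", "a1", "b1", "b1", "c3"),
  Sales   = c(1L, 3L, 8L, 3L, 2L),
  Date    = c(2011L, 2010L, 2007L, 2006L, 2006L),
  stringsAsFactors = FALSE
)
df_groupIDs <- data.frame(
  GroupID  = c("a1", "b1", "c3"),
  numIDs   = c(2L, 2L, 1L),
  MaxSales = c(3L, 8L, 2L),
  stringsAsFactors = FALSE
)

# Join on GroupID and on MaxSales == Sales (base R equivalent of the inner_join)
joined <- merge(df_groupIDs, df_workingFile,
                by.x = c("GroupID", "MaxSales"),
                by.y = c("GroupID", "Sales"))
print(joined)

# Hypothetical extra event "u" that ties with "w" for the maximum in group a1
tied <- rbind(df_workingFile,
              data.frame(ID = "u", GroupID = "a1", Sales = 3L, Date = 2012L,
                         stringsAsFactors = FALSE))
joined_tied <- merge(df_groupIDs, tied,
                     by.x = c("GroupID", "MaxSales"),
                     by.y = c("GroupID", "Sales"))
print(nrow(joined_tied))  # one more row than `joined`, because of the tie
```

If you need exactly one row per group, you would have to decide how to break ties (for example by Date) before or after the join.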

Source

2017-08-04 23:48:58 Gregor

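This relies on the fact that in SQLite, when max is used, the other columns are automatically taken from the row on which the maximum occurs.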

library(sqldf) 

sqldf("select g.GroupID, g.numIDs, max(w.Sales) MaxSales, w.ID, w.Date 
     from df_groupIDs g left join df_workingFile w using(GroupID) 
     group by GroupID")

giving:

GroupID numIDs MaxSales ID Date 
1  a1  2  3 w 2010 
2  b1  2  8 x 2007 
3  c3  1  2 z 2006

Note: The two input data frames, in reproducible form, are:

Lines1 <- " 
ID | GroupID | Sales | Date 
v | a1  | 1 | 2011 
w | a1  | 3 | 2010 
x | b1  | 8 | 2007 
y | b1  | 3 | 2006 
z | c3  | 2 | 2006" 
df_workingFile <- read.table(text = Lines1, header = TRUE, sep = "|", strip.white = TRUE) 

Lines2 <- " 
GroupID | numIDs | MaxSales 
a1  | 2  | 3  
b1  | 2  | 8  
c3  | 1  | 2"  

df_groupIDs <- read.table(text = Lines2, header = TRUE, sep = "|", strip.white = TRUE)

Source

2017-08-05 00:02:46
