2017-08-08

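Adding multiple columns to a data frame using a lookup table

I have a data table and want to modify it using a lookup table. I want to loop over the code columns in the data and, for each one, add a corresponding new value column, matching the code column's name and its values against the right rows of the lookup table's field column.

I have tried using lapply with left_join, but I cannot work out how to use the data column name to reference the correct values in the lookup's field column. I have considered whether the lookup table would be better in wide format, so that I would at least have matching column names, but I still cannot produce a workable function.

Example data and desired output:

Data (EDIT: the real data contains many more code columns):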

structure(list(id = 1:10, datayear = c(2007L, 2007L, 2007L, 2007L, 
2007L, 2008L, 2008L, 2008L, 2008L, 2008L), nationalitycode = c(1L, 
1L, 1L, 2L, 3L, 5L, 4L, 3L, 2L, 1L), subjectcode = c(2L, 5L, 
5L, 5L, 2L, 5L, 4L, 2L, 1L, 4L)), .Names = c("id", "datayear", 
"nationalitycode", "subjectcode"), class = "data.frame", row.names = c(NA, 
-10L)) 

   id datayear nationalitycode subjectcode
1   1     2007               1           2
2   2     2007               1           5
3   3     2007               1           5
4   4     2007               2           5
5   5     2007               3           2
6   6     2008               5           5
7   7     2008               4           4
8   8     2008               3           2
9   9     2008               2           1
10 10     2008               1           4

Lookup table:

structure(list(datayear = c(2007L, 2007L, 2007L, 2007L, 2007L, 
2007L, 2007L, 2007L, 2007L, 2007L, 2008L, 2008L, 2008L, 2008L, 
2008L, 2008L, 2008L, 2008L, 2008L, 2008L), field = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L), .Label = c("nationalitycode", "subjectcode"), class = "factor"), 
    code = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 
    3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L), lookupvalue = structure(c(10L, 
    16L, 9L, 4L, 5L, 2L, 7L, 13L, 1L, 14L, 5L, 16L, 4L, 6L, 11L, 
    17L, 3L, 15L, 8L, 12L), .Label = c("Algebra", "Art", "Beekeeping", 
    "Chinese", "English", "French", "Geography", "H.E.", "Indian", 
    "Irish", "Italian", "Latin", "Maths", "P.E.", "Rivetting", 
    "Scottish", "Sewing"), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L), .Names = c("datayear", "field", "code", "lookupvalue")) 

   datayear           field code lookupvalue
1      2007 nationalitycode    1       Irish
2      2007 nationalitycode    2    Scottish
3      2007 nationalitycode    3      Indian
4      2007 nationalitycode    4     Chinese
5      2007 nationalitycode    5     English
6      2007     subjectcode    1         Art
7      2007     subjectcode    2   Geography
8      2007     subjectcode    3       Maths
9      2007     subjectcode    4     Algebra
10     2007     subjectcode    5        P.E.
11     2008 nationalitycode    1     English
12     2008 nationalitycode    2    Scottish
13     2008 nationalitycode    3     Chinese
14     2008 nationalitycode    4      French
15     2008 nationalitycode    5     Italian
16     2008     subjectcode    1      Sewing
17     2008     subjectcode    2  Beekeeping
18     2008     subjectcode    3   Rivetting
19     2008     subjectcode    4        H.E.
20     2008     subjectcode    5       Latin

Desired output:

   id datayear nationalitycode subjectcode nationalityvalue subjectvalue
1   1     2007               1           2            Irish    Geography
2   2     2007               1           5            Irish         P.E.
3   3     2007               1           5            Irish         P.E.
4   4     2007               2           5         Scottish         P.E.
5   5     2007               3           2           Indian    Geography
6   6     2008               5           5          Italian        Latin
7   7     2008               4           4           French         H.E.
8   8     2008               3           2          Chinese   Beekeeping
9   9     2008               2           1         Scottish       Sewing
10 10     2008               1           4          English         H.E.

Thanks for any help!

Answers

1

The trick is to join against the appropriate subset of the lookup table, that is, to filter the lookup table on the right field value before each join.

library(dplyr) 

dt1 = structure(list(id = 1:10, datayear = c(2007L, 2007L, 2007L, 2007L, 
2007L, 2008L, 2008L, 2008L, 2008L, 2008L), nationalitycode = c(1L, 
1L, 1L, 2L, 3L, 5L, 4L, 3L, 2L, 1L), subjectcode = c(2L, 5L, 
5L, 5L, 2L, 5L, 4L, 2L, 1L, 4L)), .Names = c("id", "datayear", 
"nationalitycode", "subjectcode"), class = "data.frame", row.names = c(NA, -10L)) 


dt2 = structure(list(datayear = c(2007L, 2007L, 2007L, 2007L, 2007L, 
2007L, 2007L, 2007L, 2007L, 2007L, 2008L, 2008L, 2008L, 2008L, 
2008L, 2008L, 2008L, 2008L, 2008L, 2008L), field = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L), .Label = c("nationalitycode", "subjectcode"), class = "factor"), 
code = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 
3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L), lookupvalue = structure(c(10L, 
16L, 9L, 4L, 5L, 2L, 7L, 13L, 1L, 14L, 5L, 16L, 4L, 6L, 11L, 
17L, 3L, 15L, 8L, 12L), .Label = c("Algebra", "Art", "Beekeeping", 
"Chinese", "English", "French", "Geography", "H.E.", "Indian", 
"Irish", "Italian", "Latin", "Maths", "P.E.", "Rivetting", 
"Scottish", "Sewing"), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L), .Names = c("datayear", "field", "code", "lookupvalue")) 


dt1 %>% 
    left_join(dt2 %>% filter(field == "nationalitycode"), by=c("datayear"="datayear","nationalitycode"="code")) %>% 
    left_join(dt2 %>% filter(field == "subjectcode"), by=c("datayear"="datayear","subjectcode"="code")) %>% 
    rename(nationalityvalue = lookupvalue.x, 
     subjectvalue = lookupvalue.y) %>% 
    select(-field.x, -field.y) 

#    id datayear nationalitycode subjectcode nationalityvalue subjectvalue
# 1   1     2007               1           2            Irish    Geography
# 2   2     2007               1           5            Irish         P.E.
# 3   3     2007               1           5            Irish         P.E.
# 4   4     2007               2           5         Scottish         P.E.
# 5   5     2007               3           2           Indian    Geography
# 6   6     2008               5           5          Italian        Latin
# 7   7     2008               4           4           French         H.E.
# 8   8     2008               3           2          Chinese   Beekeeping
# 9   9     2008               2           1         Scottish       Sewing
# 10 10     2008               1           4          English         H.E.

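For a more general case using a loop, you need to reshape the lookup table and handle the column names. This process automatically detects the unique fields present in the lookup table and performs the joins one after another in a for loop.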

library(dplyr) 
library(tidyr) 

dt1 = structure(list(id = 1:10, datayear = c(2007L, 2007L, 2007L, 2007L, 
2007L, 2008L, 2008L, 2008L, 2008L, 2008L), nationalitycode = c(1L, 
1L, 1L, 2L, 3L, 5L, 4L, 3L, 2L, 1L), subjectcode = c(2L, 5L, 
5L, 5L, 2L, 5L, 4L, 2L, 1L, 4L)), .Names = c("id", "datayear", 
"nationalitycode", "subjectcode"), class = "data.frame", row.names = c(NA, -10L)) 


dt2 = structure(list(datayear = c(2007L, 2007L, 2007L, 2007L, 2007L, 
2007L, 2007L, 2007L, 2007L, 2007L, 2008L, 2008L, 2008L, 2008L, 
2008L, 2008L, 2008L, 2008L, 2008L, 2008L), field = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L), .Label = c("nationalitycode", "subjectcode"), class = "factor"), 
code = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 
3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L), lookupvalue = structure(c(10L, 
16L, 9L, 4L, 5L, 2L, 7L, 13L, 1L, 14L, 5L, 16L, 4L, 6L, 11L, 
17L, 3L, 15L, 8L, 12L), .Label = c("Algebra", "Art", "Beekeeping", 
"Chinese", "English", "French", "Geography", "H.E.", "Indian", 
"Irish", "Italian", "Latin", "Maths", "P.E.", "Rivetting", 
"Scottish", "Sewing"), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L), .Names = c("datayear", "field", "code", "lookupvalue")) 


# reshape your lookup data 
dt2 %>% 
    spread(field, code) -> dt2_reshaped 

# start dataset (to join every field you have) 
dt_temp = dt1 

# for every field you have do the join 
for (fld in as.character(unique(dt2$field))) { 

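    # select_() is deprecated; all_of() (dplyr >= 1.0.0) selects the column named by the string in fld 
    dt_temp %>% left_join(dt2_reshaped %>% select(datayear, lookupvalue, all_of(fld)), by=c("datayear",fld)) -> dt_temp 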
    names(dt_temp)[names(dt_temp) == "lookupvalue" ] = gsub("code","value",fld) 

} 


dt_temp 

#    id datayear nationalitycode subjectcode nationalityvalue subjectvalue
# 1   1     2007               1           2            Irish    Geography
# 2   2     2007               1           5            Irish         P.E.
# 3   3     2007               1           5            Irish         P.E.
# 4   4     2007               2           5         Scottish         P.E.
# 5   5     2007               3           2           Indian    Geography
# 6   6     2008               5           5          Italian        Latin
# 7   7     2008               4           4           French         H.E.
# 8   8     2008               3           2          Chinese   Beekeeping
# 9   9     2008               2           1         Scottish       Sewing
# 10 10     2008               1           4          English         H.E.
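
When there are dozens of code columns, one alternative to joining in a loop is to reshape the data to long format, join once, and reshape back. The following is only a sketch under some assumptions: it uses small stand-in tables with the same column names as dt1 and dt2, it assumes tidyr >= 1.0.0 (for pivot_longer and pivot_wider), that id is unique per row, that every code column name ends in "code", and that field is stored as character (convert a factor with as.character first).

```r
library(dplyr)
library(tidyr)

# Small stand-ins for dt1 and dt2 (same column names, fewer rows)
dt1 <- data.frame(id = 1:3,
                  datayear = c(2007L, 2007L, 2008L),
                  nationalitycode = c(1L, 2L, 1L),
                  subjectcode = c(2L, 5L, 4L))
dt2 <- data.frame(datayear = c(2007L, 2007L, 2007L, 2007L, 2008L, 2008L),
                  field = c("nationalitycode", "nationalitycode",
                            "subjectcode", "subjectcode",
                            "nationalitycode", "subjectcode"),
                  code = c(1L, 2L, 2L, 5L, 1L, 4L),
                  lookupvalue = c("Irish", "Scottish", "Geography",
                                  "P.E.", "English", "H.E."),
                  stringsAsFactors = FALSE)

values_wide <- dt1 %>%
  # one row per (id, code column): the column name goes into `field`
  pivot_longer(ends_with("code"), names_to = "field", values_to = "code") %>%
  # a single join now covers every code column
  left_join(dt2, by = c("datayear", "field", "code")) %>%
  # nationalitycode -> nationalityvalue, subjectcode -> subjectvalue
  mutate(field = sub("code$", "value", field)) %>%
  select(id, field, lookupvalue) %>%
  # back to one row per id, one column per looked-up value
  pivot_wider(names_from = field, values_from = lookupvalue)

# attach the new value columns to the original data
result <- left_join(dt1, values_wide, by = "id")
result
```

The number of joins stays at one regardless of how many code columns exist, at the cost of two reshaping steps.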
+0

Thanks @AntoniosK. Is there a way to loop this across the many columns in my real data? I should have made clear that there are dozens of code columns to deal with. That is why I was trying to use lapply in the first place.

+0

@peter_w The `data.table` solution I provided can easily be extended to a list of columns. Just write a function that wraps the merge step. Subset and rename a few columns and you should be good to go.

1

With data.table and merge this is simple and, importantly, clear, assuming X is your first data.frame and LU is the second (the lookup table).

library(data.table) 

# Convert the data.frames into data.tables 
setDT(X) 
setDT(LU) 

# Join the tables on datayear and the appropriate code, for the 
# nationality data only. 
X1 <- merge(X, LU[field == "nationalitycode"], 
      by.x=c("datayear", "nationalitycode"), 
      by.y=c("datayear", "code")) 

# Now join the resulting table by subjectcode. 
X2 <- merge(X1, LU[field == "subjectcode"], 
      by.x=c("datayear", "subjectcode"), 
      by.y=c("datayear", "code")) 

# Now subset the data.table to the columns you want, set the key 
# (order) by id, and rename some columns. 
M <- X2[, c("id", "datayear", "nationalitycode", "subjectcode", 
      "lookupvalue.x", "lookupvalue.y"), with=FALSE] 
setkey(M, "id") 
setnames(M, c("lookupvalue.x", "lookupvalue.y"), 
     c("nationalityvalue", "subjectvalue")) 

M 
#     id datayear nationalitycode subjectcode nationalityvalue subjectvalue
#  1:  1     2007               1           2            Irish    Geography
#  2:  2     2007               1           5            Irish         P.E.
#  3:  3     2007               1           5            Irish         P.E.
#  4:  4     2007               2           5         Scottish         P.E.
#  5:  5     2007               3           2           Indian    Geography
#  6:  6     2008               5           5          Italian        Latin
#  7:  7     2008               4           4           French         H.E.
#  8:  8     2008               3           2          Chinese   Beekeeping
#  9:  9     2008               2           1         Scottish       Sewing
# 10: 10     2008               1           4          English         H.E.

There are a few things you could do to shorten this, but I think this version makes what is going on fairly clear.
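
One possible shortening, sketched here as an assumption rather than as the approach used above, is a data.table update join: it adds each value column to X by reference, so no subsetting or renaming is needed afterwards. The tables below are small stand-ins with the same column names as X and LU. Unlike merge with its defaults, rows of X without a match are kept (with NA) and the row order of X is unchanged.

```r
library(data.table)

# Small stand-ins for X and LU (same column names, fewer rows)
X <- data.table(id = 1:3,
                datayear = c(2007L, 2007L, 2008L),
                nationalitycode = c(1L, 2L, 1L),
                subjectcode = c(2L, 5L, 4L))
LU <- data.table(datayear = c(2007L, 2007L, 2007L, 2007L, 2008L, 2008L),
                 field = c("nationalitycode", "nationalitycode",
                           "subjectcode", "subjectcode",
                           "nationalitycode", "subjectcode"),
                 code = c(1L, 2L, 2L, 5L, 1L, 4L),
                 lookupvalue = c("Irish", "Scottish", "Geography",
                                 "P.E.", "English", "H.E."))

# Update join: match X to the nationality rows of LU on year and code,
# then assign the matched lookupvalue (prefixed i.) as a new column of X.
X[LU[field == "nationalitycode"],
  on = .(datayear, nationalitycode = code),
  nationalityvalue := i.lookupvalue]

# Same again for the subject codes.
X[LU[field == "subjectcode"],
  on = .(datayear, subjectcode = code),
  subjectvalue := i.lookupvalue]

X
```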

EDIT: Here is a function that should help you get started.

merge_fn <- function(column, data=X, lookup=LU) 
{ 
    value_nm <- paste0(gsub("code", "", column), 
         "value") 

    # use the lookup argument rather than the global LU 
    X1 <- merge(data, lookup[field == column], 
       by.x=c("datayear", column), 
       by.y=c("datayear", "code")) 

    setnames(X1, "lookupvalue", value_nm) 
    X1[, !"field", with=FALSE] 
} 

M <- merge_fn("subjectcode", data=merge_fn("nationalitycode")) 
setkey(M, "id") 
M 
#     datayear subjectcode nationalitycode id nationalityvalue subjectvalue
#  1:     2007           2               1  1            Irish    Geography
#  2:     2007           5               1  2            Irish         P.E.
#  3:     2007           5               1  3            Irish         P.E.
#  4:     2007           5               2  4         Scottish         P.E.
#  5:     2007           2               3  5           Indian    Geography
#  6:     2008           5               5  6          Italian        Latin
#  7:     2008           4               4  7           French         H.E.
#  8:     2008           2               3  8          Chinese   Beekeeping
#  9:     2008           1               2  9         Scottish       Sewing
# 10:     2008           4               1 10          English         H.E.
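
To apply the same step across many code columns without nesting the calls by hand, one option is to fold a merge helper over the column names with Reduce. This is a sketch under assumptions: add_lookup is a hypothetical helper that mirrors merge_fn with the data argument first, the tables are small stand-ins for X and LU, and all.x = TRUE is added so that rows without a match are kept, which differs from the default inner join used above.

```r
library(data.table)

# Small stand-ins for X and LU (same column names, fewer rows)
X <- data.table(id = 1:3,
                datayear = c(2007L, 2007L, 2008L),
                nationalitycode = c(1L, 2L, 1L),
                subjectcode = c(2L, 5L, 4L))
LU <- data.table(datayear = c(2007L, 2007L, 2007L, 2007L, 2008L, 2008L),
                 field = c("nationalitycode", "nationalitycode",
                           "subjectcode", "subjectcode",
                           "nationalitycode", "subjectcode"),
                 code = c(1L, 2L, 2L, 5L, 1L, 4L),
                 lookupvalue = c("Irish", "Scottish", "Geography",
                                 "P.E.", "English", "H.E."))

# Hypothetical helper (not from the answer above): adds one value column.
add_lookup <- function(data, column, lookup = LU) {
  value_nm <- sub("code$", "value", column)
  out <- merge(data,
               lookup[field == column, .(datayear, code, lookupvalue)],
               by.x = c("datayear", column),
               by.y = c("datayear", "code"),
               all.x = TRUE)  # assumption: keep rows that have no match
  setnames(out, "lookupvalue", value_nm)
  out
}

# Every field named in the lookup table is treated as a code column.
code_cols <- as.character(unique(LU$field))

# Reduce calls add_lookup(X, code_cols[1]), then feeds the result into
# add_lookup(result, code_cols[2]), and so on.
M <- Reduce(add_lookup, code_cols, X)
setkey(M, id)
M
```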
+0

Thanks @Jason Moragn, I am not familiar with the data.table package, but I am looking into this.

+0

@peter_w I have added a function that should help you get started.
