Create a mapping table of duplicated IDs / keys

I have a statistical routine that does not like exact duplicate rows (ignoring the ID), because they result in null distances.

So I first detect the duplicates and remove them, apply my routine, and afterwards merge back the records I set aside.

For the sake of the example, assume I use rownames as the ID / key.

I have found the following way to achieve my result in base R:

data <- data.frame(x=c(1,1,1,2,2,3),y=c(1,1,1,4,4,3)) 

# check duplicates and get their ID -- cf. https://stackoverflow.com/questions/12495345/find-indices-of-duplicated-rows 
dup1 <- duplicated(data) 
dupID <- rownames(data)[dup1 | duplicated(data[nrow(data):1, ])[nrow(data):1]] 

# keep only those records that do have duplicates, to avoid running the following steps on all rows
datadup <- data[dupID,] 

# "hash" row 
rowhash <- apply(datadup, 1, paste, collapse="_") 

idmaps <- split(rownames(datadup),rowhash) 
idmaptable <- do.call("rbind",lapply(idmaps,function(vec)data.frame(mappedid=vec[1],otherids=vec[-1],stringsAsFactors = FALSE)))

This gives me what I want, i.e. the de-duplicated data (easy) and the mapping table.

> (data <- data[!dup1,])
  x y
1 1 1
4 2 4
6 3 3
> idmaptable
      mappedid otherids
1_1.1        1        2
1_1.2        1        3
2_4          4        5

I wonder whether there is a simpler or more efficient way (data.table / dplyr accepted). Any alternatives to suggest?
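For context, the full round trip described above (set duplicates aside, run the routine, copy the results back) might look roughly like the sketch below. This is only an illustration under assumptions: the "routine" is a placeholder (`rowSums`), the names `res`, `extra` and `full` are made up here, and the final ordering assumes the row names are numeric strings as in the toy data.

```r
data <- data.frame(x = c(1, 1, 1, 2, 2, 3), y = c(1, 1, 1, 4, 4, 3))

# Flag every row that has at least one exact copy elsewhere
isdup <- duplicated(data) | duplicated(data, fromLast = TRUE)
datadup <- data[isdup, ]

# Mapping table: first row name of each group -> remaining row names
key <- do.call(paste, c(datadup, sep = "_"))
idmaptable <- do.call(rbind, lapply(split(rownames(datadup), key), function(ids)
  data.frame(mappedid = ids[1], otherids = ids[-1], stringsAsFactors = FALSE)))

# De-duplicate, then run a placeholder routine (hypothetical: row sums)
dedup <- data[!duplicated(data), ]
res <- data.frame(id = rownames(dedup),
                  score = unname(rowSums(dedup)),
                  stringsAsFactors = FALSE)

# Copy each kept row's result to the rows that were set aside
extra <- data.frame(id = idmaptable$otherids,
                    score = res$score[match(idmaptable$mappedid, res$id)],
                    stringsAsFactors = FALSE)

# Assumption: row names are numeric strings, so they can be sorted as integers
full <- rbind(res, extra)
full <- full[order(as.integer(full$id)), ]
```

Here `match()` looks up, for each set-aside row, the result of the row it was mapped to, so every original row ends up with a score again.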

Source

2017-08-03 Eric Lecoutre

...

library(data.table) 
setDT(data) 

# tag groups of dupes 
data[, g := .GRP, by=x:y] 

# do whatever analysis 
f = function(DT) Reduce(`+`, DT) 
resDT = unique(data, by="g")[, res := f(.SD), .SDcols = x:y][] 

# "update join" the results back to the main table if needed 
data[resDT, on=.(g), res := i.res ]

The OP skipped the middle part of the example (actually using the de-duplicated data), so I just made up f.

Source

2017-08-03 14:30:40 Frank

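Thanks! It is impressive how concise this is. I will validate it and rewrite part of my code to use 'data.table'. What if I need another way to specify the 'by' columns? I have a global ID column (set as the key) that I first need to leave out of the process, since my duplicate-mapping process obviously has to work without this ID column.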

@Eric. You can do 'cols = setdiff(names(data), "ID")' and pass cols as in 'by = cols' or '.SDcols = cols'. The various options for passing these arguments are covered in '?data.table'. There are a lot of them. They are also listed under "Specifying columns" in my notes at http://franknarf1.github.io/r-tutorial/_book/tables.html#program-tables – Frank
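As a rough sketch of that suggestion, assuming a table with an explicit `ID` column (the toy table `DT` and the name `idmap` are invented here, not taken from the thread), the grouping columns can be passed as a character vector, and the ID mapping can be derived from the group tag:

```r
library(data.table)

# Toy data with an explicit ID column (hypothetical, not from the original post)
DT <- data.table(ID = letters[1:6],
                 x = c(1, 1, 1, 2, 2, 3),
                 y = c(1, 1, 1, 4, 4, 3))

# Every column except the ID takes part in duplicate detection
cols <- setdiff(names(DT), "ID")

# Tag each group of identical rows; `by` accepts a character vector
DT[, g := .GRP, by = cols]

# For groups with more than one row, map the first ID to the remaining IDs;
# groups of size one return NULL and are dropped
idmap <- DT[, if (.N > 1L) .(mappedid = ID[1L], otherids = ID[-1L]), by = g]
```

Because `cols` is computed before `g` is added, the group tag itself never becomes part of the duplicate check.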

A solution using the tidyverse. I do not usually store information in row names, so I created ID and ID2 to store it. Of course, you can change these according to your needs.

library(tidyverse) 

idmaptable <- data %>% 
    rowid_to_column() %>% 
    group_by(x, y) %>% 
    filter(n() > 1) %>% 
    unite(ID, x, y) %>% 
    mutate(ID2 = 1:n()) %>% 
    group_by(ID) %>% 
    mutate(ID_type = ifelse(row_number() == 1, "mappedid", "otherids")) %>% 
    spread(ID_type, rowid) %>% 
    fill(mappedid) %>% 
    drop_na(otherids) %>% 
    mutate(ID2 = 1:n()) 

idmaptable
# A tibble: 3 x 4
# Groups:   ID [2]
  ID      ID2 mappedid otherids
  <chr> <int>    <int>    <int>
1 1_1       1        1        2
2 1_1       2        1        3
3 2_4       1        4        5
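If only the two mapping columns are needed, a shorter dplyr pipeline might also do. This is a sketch that is not part of the original answer; it assumes row position is an acceptable ID, and the names `rowid` and `idmap` are made up for illustration:

```r
library(dplyr)

data <- data.frame(x = c(1, 1, 1, 2, 2, 3), y = c(1, 1, 1, 4, 4, 3))

idmap <- data %>%
  mutate(rowid = row_number()) %>%      # position-based ID (assumption)
  group_by(x, y) %>%
  filter(n() > 1) %>%                   # keep only rows that have duplicates
  mutate(mappedid = first(rowid)) %>%   # first row of each group is the kept one
  filter(rowid != mappedid) %>%         # the remaining rows are the "others"
  ungroup() %>%
  select(mappedid, otherids = rowid)
```

It avoids the reshape step (`spread` / `fill`) by computing the kept ID per group directly.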

Source

2017-08-03 13:58:41 www

Thanks. Nice as an exercise! I will validate the data.table option, since I intend to use that package.

The manipulation is tricky, in the sense that there are many steps and the logic is not that easy to read / decompose / understand.

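Thanks for your comment. Whether it is tricky or not depends on how the user feels about it. In my solution, each step is a function that does one thing and one thing only. If you know what each function does, you can "read it out loud". To me, concise approaches are sometimes too compact. – www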

Some improvements to your base R solution:

df <- data[duplicated(data)|duplicated(data, fromLast = TRUE),] 

do.call(rbind, lapply(split(rownames(df), 
       do.call(paste, c(df, sep = '_'))), function(i) 
                data.frame(mapped = i[1], 
                  others = i[-1], 
                  stringsAsFactors = FALSE)))

which gives:

      mapped others
1_1.1      1      2
1_1.2      1      3
2_4        4      5

And of course,

unique(data) 

  x y
1 1 1
4 2 4
6 3 3

Source

2017-08-03 14:54:42 Sotos

It is indeed shorter.
