Levenshteinを計算するR

のQWERTYエラーを許容するユーザ入力の企業名とFortune 1000のリストとの間のRのLevenshtein距離を計算したいが、QWERTYの誤植がある。たとえば、Mcdimldesは、McDonaldsから2の距離を持つ必要があります。iがoの次にあり、mがnの次にあるからです。Levenshteinを計算するR

実装でこの別の試みがあったが、Python (click here).すべての助けに感謝します。

問題を明確にするために追加の詳細を追加する必要があるかどうかお知らせください。

出典

2017-05-12 Jonathan Rauscher

をadist機能やRecordLinkageパッケージをチェックしてください。両方とも、Damerau-Levenshteinの距離に基づいて編集距離を計算できます。 – Curious

たぶん、あなたは、この上で何かを構築することができます

## from the link in the linked python answer: 
# txt <- "'q': {'x':0, 'y':0}, 'w': {'x':1, 'y':0}, 'e': {'x':2, 'y':0}, 'r': {'x':3, 'y':0}, 't': {'x':4, 'y':0}, 'y': {'x':5, 'y':0}, 'u': {'x':6, 'y':0}, 'i': {'x':7, 'y':0}, 'o': {'x':8, 'y':0}, 'p': {'x':9, 'y':0}, 'a': {'x':0, 'y':1},'z': {'x':0, 'y':2},'s': {'x':1, 'y':1},'x': {'x':1, 'y':2},'d': {'x':2, 'y':1},'c': {'x':2, 'y':2}, 'f': {'x':3, 'y':1}, 'b': {'x':4, 'y':2}, 'm': {'x':5, 'y':2}, 'j': {'x':6, 'y':1}, 'g': {'x':4, 'y':1}, 'h': {'x':5, 'y':1}, 'j': {'x':6, 'y':1}, 'k': {'x':7, 'y':1}, 'l': {'x':8, 'y':1}, 'v': {'x':3, 'y':2}, 'n': {'x':5, 'y':2}" 
# txt <- strsplit(txt, "\\},\\s?")[[1]] 
# m <- t(sapply(regmatches(txt, regexec("'(.)':\\s*\\{'x':(\\d+),\\s*'y':(\\d+).*", txt)), "[", -1)) 
# m <- matrix(as.numeric(m[,-1]), ncol=2, dimnames = list(m[,1],c("x","y"))) 
# dput(m) 
m <- structure(c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 1, 1, 2, 2, 3, 
    4, 5, 6, 4, 5, 6, 7, 8, 3, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 
    2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2), .Dim = c(27L, 
    2L), .Dimnames = list(c("q", "w", "e", "r", "t", "y", "u", "i", 
    "o", "p", "a", "z", "s", "x", "d", "c", "f", "b", "m", "j", "g", 
    "h", "j", "k", "l", "v", "n"), c("x", "y"))) 
m["m", ] <- c(6,2) # 5,2 seems wrong... 

f <- function(a, b) { 
    posis <- lapply(strsplit(c(a, b), "", T), function(x) m[x,,drop=F]) 
    d <- abs(posis[[1]]-posis[[2]]) 
    idx <- which(rowSums(d>1)==0) 
    if (length(idx)>0) rownames(posis[[1]])[idx] <- rownames(posis[[2]])[idx] 
    paste(rownames(posis[[1]]), collapse="") 
} 
a <- tolower("Mcdimldes") # make it case-insensitive 
b <- tolower("McDonalds") 
adist(a,b) # regular distance 
# [1,] 4 
newa <- f(a, b) # replace possible typo chars 
adist(newa,b) # new dist is 2 - as requested 
#  [,1] 
# [1,] 2

行列でキーボードレイアウト：

keyb <- sweep(m, 2, c(1, -1), "*") 
plot(keyb, type = "n") 
text(keyb, rownames(keyb)) 
grid()

出典

2017-05-12 23:42:23 lukeA

答えて

関連する問題