Efficiently find all n-grams that contain a specific word

I want to generate, from a document, all n-grams that contain a specific word.

Example:

document: i am 50 years old, my son is 20 years old 
word: years 
n: 2

Output:

[(50, years), (years, old), (20, years), (years, old)]

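I know we can generate all possible n-grams and then filter out the ones that do not contain the word, but I was wondering whether there is a more efficient way. I was planning to use PySpark to generate them.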

Source

2017-08-01 ace allen

Take a look at itertools. – perigon

Hi! More efficient than what? What are you doing currently? – arturomp
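
The itertools suggestion above could look roughly like the following. This is only a sketch of one possible reading of that comment: the helper names `sliding_ngrams` and `ngrams_with_word` are made up for illustration, and the approach still scans every n-gram. It only avoids materializing them all at once.

```python
from itertools import islice, tee


def sliding_ngrams(tokens, n):
    """Lazily yield n-grams as tuples from any iterable of tokens."""
    iterators = tee(tokens, n)
    # Advance the k-th copy by k positions so zip lines up consecutive tokens.
    shifted = (islice(it, k, None) for k, it in enumerate(iterators))
    return zip(*shifted)


def ngrams_with_word(tokens, word, n):
    """Keep only the n-grams that contain `word` (hypothetical helper)."""
    return (gram for gram in sliding_ngrams(tokens, n) if word in gram)


tokens = "i am 50 years old, my son is 20 years old".split()
pairs = list(ngrams_with_word(tokens, "years", 2))
print(pairs)
```

Because the tokens come from a plain whitespace split, punctuation stays attached, so the second pair contains 'old,' rather than 'old'.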

from nltk.util import ngrams

DOC = 'i am 50 years old, my son is 20 years old'


def ngram_filter(doc, word, n):
    # Split on whitespace; punctuation stays attached to tokens (e.g. 'old,').
    tokens = doc.split()
    # Generate every n-gram, then keep only those that contain the word.
    all_ngrams = ngrams(tokens, n)
    return [x for x in all_ngrams if word in x]


print(ngram_filter(DOC, 'years', 2))
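
If building every n-gram is the concern, one alternative is to look only at the windows around each occurrence of the word. The sketch below is an illustration, not part of the original answer: the function name `ngrams_containing` is hypothetical, and the regex tokenizer (`\w+`) is an assumption chosen so that 'old,' becomes 'old', matching the output shown in the question.

```python
import re


def ngrams_containing(tokens, word, n):
    """Yield each n-gram (as a tuple) that contains `word`, in document order."""
    last_start = -1              # start index of the last window already emitted
    max_start = len(tokens) - n  # last valid window start
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        # Windows covering position i start somewhere in [i - n + 1, i];
        # skip starts already emitted for an earlier occurrence.
        lo = max(i - n + 1, last_start + 1, 0)
        hi = min(i, max_start)
        for start in range(lo, hi + 1):
            yield tuple(tokens[start:start + n])
            last_start = start


doc = "i am 50 years old, my son is 20 years old"
tokens = re.findall(r"\w+", doc)  # assumed tokenizer: drops punctuation
result = list(ngrams_containing(tokens, "years", 2))
print(result)
# [('50', 'years'), ('years', 'old'), ('20', 'years'), ('years', 'old')]
```

Each occurrence of the word produces at most n slices, so the work spent building n-grams depends on how often the word appears rather than on the total number of n-grams in the document. The tokens are still scanned once to find the occurrences.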

Source

2017-08-02 00:19:02 Stefanus
