Get contiguous substrings in Python

Given an n-gram of words, I want to get the contiguous substring patterns going from "start to end" and from "end to start".

For example, for the 4-gram computer supported machine translation, I need to get the following substrings:

Start to end: computer supported, computer supported machine

End to start: machine translation, supported machine translation

For the 3-gram natural language processing, I need to get natural language and language processing.

I have really large n-grams, so I would like to know the simplest way to do this!


2017-12-08 Anonymous

You can use … if you don't want a list. – Galen

You can split the n-gram into a list of grams, then join slices of that list (see Understanding Python's slice notation):

ngram = "computer supported machine translation" 
grams = ngram.split(" ") 

# Start to end 
for c in range(2, len(grams)): 
    print(" ".join(grams[:c])) 

# End to start 
for c in range(2, len(grams)): 
    print(" ".join(grams[-c:]))
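If you would rather collect the results than print them, the same split/slice/join idea can be wrapped in a small function. This is only a sketch: the name prefixes_and_suffixes and the min_words parameter are made up for illustration, and it assumes the words are separated by single spaces and that min_words is at least 1.

```python
def prefixes_and_suffixes(ngram, min_words=2):
    """Return (prefixes, suffixes) of an n-gram, excluding the full n-gram.

    Hypothetical helper: assumes single-space separators and min_words >= 1.
    """
    grams = ngram.split(" ")
    # Substring lengths in words: min_words up to one less than the full length
    sizes = range(min_words, len(grams))
    # Start to end: the first c words
    prefixes = [" ".join(grams[:c]) for c in sizes]
    # End to start: the last c words
    suffixes = [" ".join(grams[-c:]) for c in sizes]
    return prefixes, suffixes


prefixes, suffixes = prefixes_and_suffixes("computer supported machine translation")
print(prefixes)
print(suffixes)
```

Keeping the two directions in separate lists makes it easy to check them against the expected substrings in the question.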


2017-12-08 04:17:28 Galen

You should use a function and simply pass the minimum number of grams as a parameter.

Some parts of the code are borrowed from @Galen:

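def count_grams(gram, sentence):
    # gram is the smallest number of words to include in a substring
    grams = sentence.split(" ")

    words = []
    # Start to end: the first i words, never the whole sentence
    for i in range(gram, len(grams)):
        words.append([" ".join(grams[:i])])
    # End to start: the last j words, never the whole sentence
    for j in range(gram, len(grams)):
        words.append([" ".join(grams[-j:])])

    return words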



print(count_grams(2,'computer supported machine translation')) 
print(count_grams(2,'natural language processing'))

Output:

[['computer supported'], ['computer supported machine'], ['machine translation'], ['supported machine translation']] 
[['natural language'], ['language processing']]

The fastest or most efficient approach probably depends on how the input is stored before it is processed and on how the output is stored after the " ".join().
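As one illustration of how storage affects the choice: if each n-gram is already held as a single string, the substrings can be sliced straight out of it at the space positions, without building a word list or calling " ".join(). The sketch below is an alternative technique, not something from the answers above; the name edge_substrings and the min_words parameter are hypothetical, and it assumes exactly one space between words, no leading or trailing spaces, and min_words of at least 1. Whether it is actually faster would need to be measured on your data.

```python
def edge_substrings(ngram, min_words=2):
    """Lazily yield prefixes, then suffixes, by slicing the original string.

    Hypothetical helper: assumes words are separated by exactly one space,
    with no leading or trailing spaces, and min_words >= 1.
    """
    # Index of every space that separates two words
    spaces = [i for i, ch in enumerate(ngram) if ch == " "]
    n_words = len(spaces) + 1
    # Start to end: a prefix of k words stops just before the k-th space
    for k in range(min_words, n_words):
        yield ngram[:spaces[k - 1]]
    # End to start: a suffix of k words begins right after the (n_words - k)-th space
    for k in range(min_words, n_words):
        yield ngram[spaces[n_words - k - 1] + 1:]


for substring in edge_substrings("computer supported machine translation"):
    print(substring)
```

Because it is a generator, nothing is stored unless the caller keeps the results, which may matter when there are many large n-grams.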


2017-12-08 05:16:00
