Python nltk: counting word and phrase frequency

I am using NLTK and trying to get phrase counts up to a certain length for a particular document, along with the frequency of each phrase. I tokenize the string to get a list of tokens.

from nltk.util import ngrams 
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.collocations import * 


data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a", "test", "this", "is", "this", "is", "real", "not", "a", "test"] 

bigrams = ngrams(data, 2) 

bigrams_c = {} 
for b in bigrams:
    if b not in bigrams_c:
        bigrams_c[b] = 1
    else:
        bigrams_c[b] += 1

The code above gives output like this:

(('is', 'this'), 1) 
(('test', 'this'), 2) 
(('a', 'test'), 3) 
(('this', 'is'), 4) 
(('is', 'not'), 1) 
(('real', 'not'), 2) 
(('is', 'real'), 2) 
(('not', 'a'), 3)

which is part of what I am looking for.

My question is: is there a more convenient way to do this for phrases up to length 4 or 5, without duplicating this code just to change the n-gram size?

Source

2016-11-18 user1610950

Since you tagged this nltk, here is how to do it using nltk's own methods, which have a few more features than the ones in the standard Python collections module.

from nltk import ngrams, FreqDist 
all_counts = dict() 
for size in 2, 3, 4, 5: 
    all_counts[size] = FreqDist(ngrams(data, size))

Each element of the dictionary all_counts is a dictionary of n-gram frequencies. For example, you can get the five most common trigrams like this:

all_counts[3].most_common(5)
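As a side note to the loop above, NLTK also provides `nltk.util.everygrams`, which yields the n-grams of every length in a range in a single pass. The following is a minimal sketch and not part of the original answer: it assumes the same `data` list as in the question, and it puts all phrase lengths into one `FreqDist` instead of one per size, so the phrase length has to be recovered with `len()` on each key.

```python
from nltk import FreqDist
from nltk.util import everygrams

data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a",
        "test", "this", "is", "this", "is", "real", "not", "a", "test"]

# One distribution holding every 2-, 3-, 4- and 5-gram of the token list.
counts = FreqDist(everygrams(data, min_len=2, max_len=5))

# Split the counts back out by phrase length when needed.
trigram_counts = {gram: n for gram, n in counts.items() if len(gram) == 3}

print(counts[("this", "is")])                # 4, as in the question's output
print(trigram_counts[("not", "a", "test")])  # 3
```

Keeping one `FreqDist` per size, as in the answer, is more convenient when you mostly query one length at a time. The single combined distribution is handy when you want the most frequent phrases regardless of length.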

Source

2016-11-19 13:22:26 alexis

Holy smokes, this is much better than what I wrote previously. Thanks a lot, great answer! – user1610950

Yep, don't run this loop. Use collections.Counter(bigrams) or pandas.Series(bigrams).value_counts() to compute the counts in a one-liner.
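The `collections.Counter` suggestion can be sketched with the standard library alone. In the example below, `simple_ngrams` is a hypothetical helper standing in for `nltk.util.ngrams` so that the snippet runs without NLTK installed, and `data` is assumed to be the token list from the question.

```python
from collections import Counter

def simple_ngrams(tokens, n):
    """Yield consecutive n-token tuples (hypothetical stand-in for nltk.util.ngrams)."""
    return zip(*(tokens[i:] for i in range(n)))

data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a",
        "test", "this", "is", "this", "is", "real", "not", "a", "test"]

# The one-liner from the answer: count bigrams without an explicit loop.
bigram_counts = Counter(simple_ngrams(data, 2))

# The same idea extended to phrase lengths 2 through 5.
all_counts = {n: Counter(simple_ngrams(data, n)) for n in range(2, 6)}

print(bigram_counts.most_common(1))  # [(('this', 'is'), 4)]
print(all_counts[3][("not", "a", "test")])  # 3
```

`Counter` covers plain counting and `most_common`. NLTK's `FreqDist` is a `Counter` subclass, so the two are interchangeable for this purpose.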

Source

2016-11-18 04:14:01 maxymoo
