How to count words in a corpus document

I want to know the best way to count words in a document. I have my own "corp.txt" corpus set up, and I want to know how frequently "students, trust, ayre" occur in the file "corp.txt". What could I use?

....
>>> full = nltk.Text(mycorpus.words('FullReport.txt'))
>>> fdist = FreqDist(full)
>>> fdist
<FreqDist with 34133 outcomes>
# How would I calculate how frequently the words
# "students, trust, ayre" occur in full?

Thanks, Ray

Source

2011-11-15 Ray Hmar

Either works; one of them is provided by the standard Python library. Note that the keys of a FreqDist or Counter object are case-sensitive, so you may also want to lowercase your tokens. Are you sure you aren't thinking of NLTK? –

Judging by your name, I take it you already know that "students trust ayre". Anyway, I would go with 'FreqDist': 'fdist = FreqDist(); for word in tokenize.whitespace(sent): fdist.inc(word.lower())'. You can check the doc [here](http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html). – aayoubi

I have edited the answer. Please check it again. Thanks –

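Most people would just use a defaultdict (with a default value of 0), along the following lines. Each time you see a word, increment its value by one.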

from collections import defaultdict

# `words` is any iterable of tokens, e.g. mycorpus.words('FullReport.txt')
total = 0
count = defaultdict(int)  # missing keys start at 0
for word in words:
    total += 1
    count[word] += 1

# Now you can determine the frequency by dividing each count by total
for word, ct in count.items():
    print('Frequency of %s: %f%%' % (word, 100.0 * ct / total))
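For the three words in the question, the same counts can be looked up directly instead of printing every word. A minimal sketch, assuming a hypothetical token list `words` that stands in for the corpus tokens (the sample sentence is invented for illustration):

```python
from collections import defaultdict

# Hypothetical sample tokens; in practice these would come from the corpus.
words = "the students trust ayre and the students trust each other".split()

count = defaultdict(int)
for word in words:
    count[word.lower()] += 1  # lowercase so 'Students' and 'students' match

total = len(words)
targets = ['students', 'trust', 'ayre']

# Use .get() so a word absent from the corpus is not inserted as a side effect.
frequencies = {w: count.get(w, 0) / float(total) for w in targets}

for w in targets:
    print('%s: count=%d, frequency=%.3f' % (w, count.get(w, 0), frequencies[w]))
```

Looking words up with `.get()` rather than `count[w]` keeps the dictionary limited to words that were actually seen.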

Source

2011-11-15 16:01:42

You mean 'defaultdict(int)' -- 'defaultdict' takes a callable. – kindall

Ah, thanks. –

@Chris How about using 'Counter'? – alvas

You're almost there! You can index the FreqDist with the words you are interested in. Try the following:

print(fdist['students'])
print(fdist['ayre'])
print(fdist['full'])

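This gives you the count, or number of occurrences, of each word. You said "how frequently", though, and frequency is different from the number of occurrences. You can get it like this:

print(fdist.freq('students'))
print(fdist.freq('ayre'))
print(fdist.freq('full'))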
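The difference between the two calls can be illustrated without NLTK. A minimal sketch, assuming that a frequency is the word's count divided by the total number of tokens (the total that `fdist.N()` reports); `tokens` is an invented sample and `relative_freq` is a hypothetical helper, not part of NLTK:

```python
from collections import Counter

# Invented sample tokens for illustration.
tokens = ['students', 'trust', 'ayre', 'students', 'full', 'report',
          'students', 'trust']

counts = Counter(tokens)
n = sum(counts.values())  # total number of tokens


def relative_freq(word):
    """Count of `word` divided by the total token count (hypothetical helper)."""
    return counts[word] / float(n) if n else 0.0


print(counts['students'])         # number of occurrences
print(relative_freq('students'))  # share of all tokens
```

The count is an integer that grows with corpus size, while the relative frequency stays between 0 and 1, which makes it easier to compare across documents of different lengths.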

Source

2012-07-11 18:41:29 Spaceghost

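I would recommend looking into collections.Counter. Especially for large amounts of text, this does the trick and is limited only by the available memory. On a computer with 12 Gb of RAM, it counted 30 billion tokens in a year and a half. Pseudocode (the variable Words will in practice be some reference to a file or similar):

from collections import Counter
my_counter = Counter()
for word in Words:
    # Increment per token; my_counter.update(word) would count
    # the individual characters of each word instead.
    my_counter[word] += 1

Once this is done, the words are in a dictionary, my_counter, which can then be written to disk or stored elsewhere (SQLite, for example).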
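As one way of storing the finished counter elsewhere, here is a minimal sketch that writes it to SQLite with the standard sqlite3 module. The table name `word_counts`, its columns, the sample counter, and the in-memory database are all assumptions made for illustration; a real run would pass a file path to `sqlite3.connect`.

```python
import sqlite3
from collections import Counter

# Invented sample counter; in practice this is the my_counter built above.
my_counter = Counter(['students', 'trust', 'ayre', 'students'])

# ':memory:' keeps the example self-contained; use a file path to persist.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE word_counts (word TEXT PRIMARY KEY, n INTEGER)')
conn.executemany('INSERT INTO word_counts (word, n) VALUES (?, ?)',
                 my_counter.items())
conn.commit()

# Read one count back to check the round trip.
row = conn.execute('SELECT n FROM word_counts WHERE word = ?',
                   ('students',)).fetchone()
print(row[0])
conn.close()
```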

Source

2014-04-06 19:06:14

You can read a file, tokenize it, and put the individual tokens into a FreqDist object in NLTK; see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html

from nltk.probability import FreqDist 
from nltk import word_tokenize 

# Creates a test file for reading. 
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!" 
with open('test.txt', 'w') as fout: 
    fout.write(doc) 

# Reads a file into FreqDist object. 
fdist = FreqDist() 
with open('test.txt', 'r') as fin: 
    for word in word_tokenize(fin.read()): 
        # FreqDist.inc() was removed in NLTK 3; increment the count directly.
        fdist[word] += 1

print("'blah' occurred", fdist['blah'], "times")

[out]:

'blah' occurred 3 times

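Alternatively, you can use the native Counter object from collections, which counts in the same way; see https://docs.python.org/2/library/collections.html. Note that this version lowercases the text before tokenizing, so 'Blah' is counted together with 'blah'.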

from collections import Counter 
from nltk import word_tokenize 

# Creates a test file for reading. 
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!" 
with open('test.txt', 'w') as fout: 
    fout.write(doc) 

# Reads a file into a Counter object.
fdist = Counter()
with open('test.txt', 'r') as fin:
    fdist.update(word_tokenize(fin.read().lower()))

print("'blah' occurred", fdist['blah'], "times")
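Because the keys are case-sensitive, lowercasing the text changes the result for this sample. A minimal sketch using only the standard library; `simple_tokenize` is a hypothetical regex-based stand-in for `word_tokenize`, so its tokens may differ from NLTK's on other input:

```python
import re
from collections import Counter

doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"


def simple_tokenize(text):
    """Rough tokenizer: runs of word characters or apostrophes (not word_tokenize)."""
    return re.findall(r"[\w']+", text)


case_sensitive = Counter(simple_tokenize(doc))
case_insensitive = Counter(simple_tokenize(doc.lower()))

# 'blah' and 'Blah' are separate keys unless the text is lowercased first.
print(case_sensitive['blah'], case_sensitive['Blah'])
print(case_insensitive['blah'])
```

Whether to lowercase depends on the corpus: it merges sentence-initial capitals with their lowercase forms, but it also merges proper nouns with common words.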

Source

2014-04-07 05:10:15 alvas
