Finding the most frequent words in a file

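I have a file and I want to find the 10 most frequent words in it. I remove stop words and punctuation and put the result in a list. Each line contains a Persian sentence, a tab, and then an English word. The problem is that the code below returns one word for each line: for example, if the file has 12 lines, it returns 12 words. I think there is a problem with the indentation. How can I fix it?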

. 
. 
. 
def train():
    RemStopWords(file1, file2)  # the function for removing stop words and punctuation at the start of the code
    for line in witoutStops:
        line = line.strip().split("\t")
        words = line[0].split()
        uniques = []
        q = []
        for word in words:
            if word not in uniques:
                uniques.append(word)
        counts = []
        for unique in uniques:
            count = 0
            for word in words:
                if word == unique:
                    count += 1
            counts.append((count, unique))
            counts.sort()
            counts.reverse()
            for i in range(min(10, len(counts))):
                count, word = counts[i]
            print('%s %d' % (word, count))
            #q.append(word)
            #print (q)

Source

2017-02-13 sara.t

You can use collections.Counter for this.

from collections import Counter

def train():
    RemStopWords(file1, file2)  # the function for removing stop words and punctuation at the start of the code
    counter = Counter()  # one counter shared by all lines
    for line in withoutStops:
        line = line.strip().split("\t")
        words = line[0].split()
        counter.update(words)
    top10 = [word[0] for word in counter.most_common(10)]
    print(top10)
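
To make the difference from the code in the question concrete: there, `uniques` and `counts` are rebuilt inside the `for line` loop, so each line is counted and printed on its own. A single `Counter` created before the loop accumulates over all lines instead. Below is a minimal sketch, assuming `lines` is a hypothetical stand-in for `withoutStops` (a list of "sentence, tab, word" strings); the function name `top_words` and the sample data are made up for illustration.

```python
from collections import Counter

def top_words(lines, n=10):
    """Return the n most common words across all lines as (word, count) pairs."""
    counter = Counter()  # created once, outside the loop
    for line in lines:
        sentence = line.strip().split("\t")[0]  # text before the first tab
        counter.update(sentence.split())
    return counter.most_common(n)

# Hypothetical sample data, not taken from the question's file
lines = ["a b a\tx", "b a c\ty", "a d\tz"]
pairs = top_words(lines, n=2)
words = [word for word, count in pairs]  # just the words, as a list
print(pairs)
print(words)
```

`most_common` returns `(word, count)` tuples, so keeping only the first element of each tuple gives the plain list of words.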

Source

2017-02-13 18:55:12

It works, but I need the words themselves. How can I add the words to a list? Thanks –

I have added that to my answer, sara –

EDIT: The answer by ルイジアナハシェック is a simpler and more elegant way to do this and gives the same output, so you should definitely check it out! Here is a simple way to do this :)

import operator  # we will use this later for sorting dictionaries

def train():
    # assuming this returns the string of the text
    textWithoutStops = RemStopWords(file1, file2)

    # dictionary where words are keys and the number of times they appear are values
    wordCount = {}
    for word in textWithoutStops.split(' '):  # convert string to list, using spaces as separators
        if word not in wordCount:
            wordCount[word] = 1
        else:
            wordCount[word] += 1

    # we sort from less to more frequent
    sortedWordCount = sorted(wordCount.items(), key=operator.itemgetter(1))
    # and reverse the list so it goes from more to less frequent
    sortedWordCount = sortedWordCount[::-1]

    # we take only the first 10, if it has more than 10
    if len(sortedWordCount) > 10:
        sortedWordCount = sortedWordCount[:10]

    # Here we go, a list containing tuples with the structure: (word, count)
    return sortedWordCount

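For example, if the file contains the text of your question:

"I have a file and I want to find the 10 most frequent words in it. I remove stop words and punctuation and put the result in a list. Each line contains a Persian sentence, a tab, and then an English word. The problem is that the code below returns one word for each line. For example, if the number of lines is 12, it returns 12 words. I think there is a problem with the indentation. How can I fix it?"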

The output would look like this:

[('the', 5), ('I', 4), ('a', 4), ('and', 4), ('in', 2), ('of', 2), ('then', 2), ('returns', 2), ('words', 2), ('fix', 1)]
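
The sort, reverse, and slice steps in the function above can also be written as one expression. This is only a sketch, not a replacement for the code above: `top_n` is a hypothetical helper name and the counts are made up. Note that with `reverse=True` words with equal counts keep their original relative order, whereas reversing with `[::-1]` flips it, so ties may come out in a different order.

```python
import operator

def top_n(word_count, n=10):
    """Return up to n (word, count) pairs, most frequent first."""
    ranked = sorted(word_count.items(), key=operator.itemgetter(1), reverse=True)
    return ranked[:n]  # slicing is safe even when there are fewer than n items

# Hypothetical counts for illustration
word_count = {"the": 5, "fix": 1, "I": 4, "of": 2}
print(top_n(word_count, n=3))
```

Because slicing never raises on a short list, the explicit `len(...) > 10` check is not needed in this form.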

Note: to open a text file and read all of its content into a string, you can do the following (maybe you already do):

with open(file, 'r') as f: 
    text = f.read()
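
Since the file in the question contains Persian text, it may help to pass the encoding explicitly and to process the file line by line. Below is a minimal sketch, assuming a UTF-8 file in which each line is a sentence, a tab, and then a word, as the question describes. The function name `count_first_column` and the `path` parameter are hypothetical, and stop-word removal (the job of `RemStopWords`) is left out.

```python
from collections import Counter

def count_first_column(path, n=10):
    """Count the words found before the first tab on each line of a file."""
    counter = Counter()
    # Assumption: the file is UTF-8 encoded
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            sentence = line.rstrip("\n").split("\t")[0]
            counter.update(sentence.split())
    return counter.most_common(n)
```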

Hope this helps.

Source

2017-02-13 19:07:03 mikelsr
