Finding the frequency of words, excluding stop words, across multiple CSV files

I would like to count word occurrences across multiple CSV files. First I want to show the 10 most frequent words with stop words included, and then the same list without stop words.

This is my code:

import nltk 
nltk.download("stopwords") 


from nltk.corpus import stopwords 


myfile = sc.textFile('./Sacramento*.csv') 


counts = myfile.flatMap(lambda line: line.split(",")).map(lambda word: (word, 1)).reduceByKey(lambda v1,v2: v1 + v2) 


sorted_counts = counts.map(lambda (a, b): (b, a)).sortByKey(0, 1).map(lambda (a, b): (b, a)) 


first_ten = sorted_counts.take(10) 


first_ten 
Out[7]: 
[(u'Residential', 917), 
(u'2', 677), 
(u'CA', 597), 
(u'3', 545), 
(u'SACRAMENTO', 439), 
(u'ours', 388), 
(u'0', 387), 
(u'4', 277), 
(u'Mon May 19 00:00:00 EDT 2008', 268), 
(u'Fri May 16 00:00:00 EDT 2008', 264)] 


cachedStopWords = stopwords.words("english") 


result_ll = counts.map(lambda (a, b): (b, a)).sortByKey(0, 
1).map(lambda (a, b): (b, a)) 


print [i for i in result_ll.take(10) if i not in cachedStopWords]

However, the output still contains a stop word: "ours" is a stop word.

[(u'Residential', 917), (u'2', 677), (u'CA', 597), (u'3', 545), (u'SACRAMENTO', 439), (u'ours', 388), (u'0', 387), (u'4', 277), (u'Mon May 19 00:00:00 EDT 2008', 268), (u'Fri May 16 00:00:00 EDT 2008', 264)]

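How do I need to change my code so that the output no longer contains the stop word "ours"?

Source

2017-02-13 Filip Dzuroska

You have an error in the last line. It should be

print [i for i in result_ll.take(10) if i[0] not in cachedStopWords]

because i is a (word, count) tuple and i[0] holds the actual word.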

Source

2017-02-13 18:39:28 Mariusz
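
To make the difference concrete, here is a minimal sketch in plain Python 3 (no Spark or NLTK needed, unlike the Python 2 code in the question). It compares the two membership tests on (word, count) pairs, and then shows one possible way to drop stop words before taking the top entries, so that the result still has the requested number of words. The names stop_words, pairs and top_words, as well as the sample data, are hypothetical and only for illustration; stop_words is a tiny stand-in for stopwords.words("english").

```python
from collections import Counter

# Hypothetical stand-in for stopwords.words("english"); the real list is longer.
stop_words = {"ours", "the", "a", "of"}

# Made-up (word, count) pairs shaped like the result of result_ll.take(10).
pairs = [("Residential", 917), ("CA", 597), ("SACRAMENTO", 439), ("ours", 388)]

# Testing the whole tuple: a tuple is never equal to a stop-word string,
# so nothing gets removed.
unfiltered = [p for p in pairs if p not in stop_words]

# Testing p[0], the word itself: the stop word is removed.
filtered = [p for p in pairs if p[0] not in stop_words]

print(unfiltered)
print(filtered)


def top_words(lines, stop_words, n=10):
    """Count comma-separated fields, drop stop words, then take the top n."""
    counts = Counter(field for line in lines for field in line.split(","))
    kept = Counter({w: c for w, c in counts.items() if w not in stop_words})
    return kept.most_common(n)


# Made-up input lines, assumed to look like rows of the CSV files.
print(top_words(["a,b,ours", "b,ours,ours"], {"ours"}, n=2))
```

Filtering after take(10) can return fewer than ten entries, because the stop words are removed from a list that is already cut to ten. Filtering first avoids that; with an RDD, the analogous step would presumably be a filter on counts before sorting.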
