How do I add stopwords to the nltk list?

I have the following code. I need to add words to the nltk stopword list, but after I run this the words are not added to the list. How do I add stopwords to the nltk list?

from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer 
import string 
stop = set(stopwords.words('english'))  
new_words = open("stopwords_en.txt", "r") 
new_stopwords = stop.union(new_word) 
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer() 
def clean(doc): 
    stop_free = " ".join([i for i in doc.lower().split() if i not in new_stopwords])  
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude) 
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split()) 
    return normalized 
doc_clean = [clean(doc).split() for doc in emails_body_text]

Source

2017-09-21 Vrushab Jain

Please fix the indentation in your code. The way you have it makes no sense. – alexis

Surely 'new_stopwords = stop.union(new_word)' should read 'new_stopwords = stop.union(new_words)'? Also, 'new_words = open("stopwords_en.txt", "r")' returns a file object, so you are adding the file object, not its contents, to the stopword list. You want something like 'new_words = open("stopwords_en.txt", "r").readlines()', surely? –

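Don't do it blindly. Read in your new list of stopwords, inspect it to check that it is right, and only then add it to the other stopword list. Start with the code suggested by @greg_data, but you will need to strip the newlines, and who knows what else is in your stopwords file?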

This might do it, for example:

new_words = open("stopwords_en.txt", "r").read().split() 
new_stopwords = stop.union(new_words)
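
To make the "inspect it first" advice concrete, here is a minimal sketch of reading the file, looking at the result, and only then merging. The helper name `load_extra_stopwords`, the demo file contents, and the three-word `stop` set are all assumptions made for illustration; in the real script `stop` would be `set(stopwords.words('english'))` and the path would be `stopwords_en.txt`.

```python
import os
import tempfile


def load_extra_stopwords(path):
    """Hypothetical helper: read whitespace-separated words, lowercased, as a set."""
    with open(path, "r", encoding="utf-8") as f:
        # read().split() splits on any whitespace, so newlines and blank lines vanish
        return {word.lower() for word in f.read().split()}


# Stand-in for set(stopwords.words('english')), so the sketch runs without NLTK data.
stop = {"the", "a", "and"}

# Write a small made-up stopword file purely for demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Regards\nthanks  \n\nfwd\n")
    demo_path = tmp.name

# For comparison: readlines() keeps the trailing newline on every entry,
# so those strings would never match a token like "regards".
with open(demo_path, "r", encoding="utf-8") as f:
    raw_lines = f.readlines()

new_words = load_extra_stopwords(demo_path)
print(sorted(new_words))  # inspect the list before trusting it

new_stopwords = stop.union(new_words)
os.remove(demo_path)
```

Lowercasing is a choice made here because clean() lowercases the document before filtering; drop it if your stopword file is already lowercase.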

PS. Don't keep splitting and re-joining your document. Tokenize once and then work with the list of tokens.
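
A rough sketch of what "tokenize once" could look like follows. The function name `clean_tokens` and its `normalize` parameter are assumptions for illustration; in the question's script you would pass `lemma.lemmatize` as `normalize` and `new_stopwords` as the stopword set. Note one deliberate difference from the original clean(): punctuation is stripped before the stopword check, so a token like "the," is also filtered out.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tokens(doc, stopwords_set, normalize=lambda word: word):
    """Hypothetical replacement for clean(): split once, then filter the token list."""
    tokens = doc.lower().split()  # the only tokenization step
    result = []
    for token in tokens:
        token = token.translate(_PUNCT_TABLE)  # strip punctuation from the token
        if token and token not in stopwords_set:
            result.append(normalize(token))
    return result


sample = clean_tokens("The cats, and the dog!", {"the", "and"})
print(sample)
```

Because the function already returns a list of tokens, the final step becomes `doc_clean = [clean_tokens(doc, new_stopwords, lemma.lemmatize) for doc in emails_body_text]`, with no trailing `.split()`.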

Source

2017-09-21 13:29:38 alexis
