Removing stop words using NLTK in Python

I am using NLTK to remove stop words from the elements of a list. Here is my code snippet. The problem is that it does not only remove stop words; it also removes characters from other words.

dict1 = {}
for ctr, row in enumerate(cur.fetchall()):
    list1 = [row[0], row[1], row[2], row[3], row[4]]
    dict1[row[0]] = list1
    print ctr+1, "\n", dict1[row[0]][2]
    list2 = [w for w in dict1[row[0]][3] if not w in stopwords.words('english')]
    print list2

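For example, the 'i' is removed from the word 'orientation', and letters matching other stop words are removed as well. On top of that, list2 ends up holding characters instead of words, i.e. ['O', 'r', 'e', 'n', 'n', ' ', 'f', ' ', '3', ' ', 'r', 'e', 'r', 'n', '\n', '\n', 'O', 'r', 'e', 'n', 'n', 'r', 'e', 'r', 'e', 'r', 'p', 'l', ...................... whereas I want ['Orientation', '.............. ......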

2016-07-08 Yash Goel

Try tokenizing your text into words first – galaxyan

What is cur in your code? Please post more of the surrounding code. –

First, make sure that list1 is a list of words, not an array of characters. Here is a code snippet that you can probably adapt.

from nltk import word_tokenize 
from nltk.corpus import stopwords 

english_stopwords = stopwords.words('english') # get english stop words 

# test document 
document = '''A moody child and wildly wise 
Pursued the game with joyful eyes 
''' 

# first tokenize your document to a list of words 
words = word_tokenize(document) 
print(words) 

# then remove all stop words
content = [w for w in words if w.lower() not in english_stopwords] 
print(content)

The output looks like this:

['A', 'moody', 'child', 'and', 'wildly', 'wise', 'Pursued', 'the', 'game', 'with', 'joyful', 'eyes'] 
['moody', 'child', 'wildly', 'wise', 'Pursued', 'game', 'joyful', 'eyes']

2016-07-08 20:14:09

First, the way you build list1 looks a little peculiar to me. I think there is a more Pythonic solution:

list1 = row[:5]

Then, is there a reason you access row[3] through dict1[row[0]][3] rather than using row[3] directly?

Finally, assuming row is a list of strings, building list2 from row[3] iterates over every character rather than every word. That may be why you are stripping out 'i' and 'a' (and a few other letters).
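To make the cause concrete, here is a minimal sketch of the difference between iterating over a string and iterating over its words. The names `STOP_WORDS` and `text` are assumptions for illustration only: `STOP_WORDS` is a small hand-written stand-in for `stopwords.words('english')` (the real list also contains single letters such as 'i', 'a', 'o', 's' and 't'), and `text` stands in for `row[3]`.

```python
# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is
# much longer but likewise contains single-letter entries.
STOP_WORDS = {"i", "a", "o", "s", "t", "of", "the"}

# Hypothetical stand-in for row[3]
text = "Orientation of the group"

# Iterating over a str yields one character at a time, so every character
# that happens to equal a single-letter stop word is dropped.
by_char = [w for w in text if w not in STOP_WORDS]

# Splitting first yields whole words, so only real stop words are dropped.
by_word = [w for w in text.split() if w.lower() not in STOP_WORDS]

print(by_char)  # ['O', 'r', 'e', 'n', 'n', ' ', 'f', ' ', 'h', 'e', ' ', 'g', 'r', 'u', 'p']
print(by_word)  # ['Orientation', 'group']
```

The first list has the same shape as the output in the question: 'Orientation' loses its 'i', 't', 'a' and 'o' and is stored letter by letter.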

The correct comprehension would look like this:

list2 = [w for w in row[3].split(' ') if w not in stopwords]

You need to split your string apart somehow, probably around spaces. That takes something from:

'Hello, this is row3'

to:

['Hello,', 'this', 'is', 'row3']

Iterating over that gives you full words rather than individual characters.
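Note that splitting on spaces leaves punctuation attached ('Hello,' above). As a rough sketch, assuming a simple regular expression is good enough for your text, you could extract runs of word characters instead. `STOP_WORDS` here is a hypothetical hand-written stand-in for `stopwords.words('english')`; NLTK's `word_tokenize`, used in the other answer, is the more thorough option.

```python
import re

# Hypothetical stand-in for stopwords.words('english')
STOP_WORDS = {"this", "is"}

text = 'Hello, this is row3'

# Plain split keeps the comma glued to the first word
split_tokens = text.split(' ')

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped
# (caveat: it also splits contractions such as "don't" into "don" and "t")
regex_tokens = re.findall(r"\w+", text)

filtered = [w for w in regex_tokens if w.lower() not in STOP_WORDS]

print(split_tokens)  # ['Hello,', 'this', 'is', 'row3']
print(regex_tokens)  # ['Hello', 'this', 'is', 'row3']
print(filtered)      # ['Hello', 'row3']
```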

2016-07-08 20:28:41 dashiell

TypeError: argument of type 'LazyCorpusLoader' is not iterable –
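That error most likely comes from testing membership against the `stopwords` corpus object itself rather than against a word list: `nltk.corpus.stopwords` is a corpus loader, and the words are obtained with `stopwords.words('english')`, as in the question and the first answer. The sketch below reproduces the mechanism without needing NLTK installed; `CorpusReaderStandIn` and its short word list are hypothetical and only mimic the relevant behaviour.

```python
class CorpusReaderStandIn:
    """Hypothetical stand-in for nltk.corpus.stopwords: it offers words(),
    but supports neither iteration nor the `in` operator."""

    def words(self, language):
        # Tiny made-up list; the real corpus is much longer
        return ["i", "me", "my", "this", "is"]


stopwords = CorpusReaderStandIn()
tokens = 'Hello, this is row3'.split(' ')

# `w not in stopwords` fails because the object is not a container
try:
    [w for w in tokens if w not in stopwords]
    raised = False
except TypeError:
    raised = True

# Fetch the word list once, then test membership against it
english_stopwords = set(stopwords.words('english'))
list2 = [w for w in tokens if w not in english_stopwords]

print(raised)  # True
print(list2)   # ['Hello,', 'row3']
```

With real NLTK the last two statements stay the same; building the set once also avoids calling `stopwords.words('english')` for every token.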
