2017-03-27

Gensim memory-friendly corpus error

I am using Python 3.5. Based on the Gensim samples, I created a project and added this code to it:

class MyCorpus(object):
    def __iter__(self):
        for line in open('files/2/mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())


corpus_memory_friendly = MyCorpus() # doesn't load the corpus into memory! 
print(corpus_memory_friendly) 

However, after running it, I get the following error in my PyCharm console:

Traceback (most recent call last): 
    File "D:/Python-Workspace(s)/GensimSamples/2.Gensim_CorpusStreaming.py", line 31, in <module> 
    for vector in corpus_memory_friendly: # load one vector into memory at a time 
    File "D:/Python-Workspace(s)/GensimSamples/2.Gensim_CorpusStreaming.py", line 17, in __iter__ 
    yield dictionary.doc2bow(line.lower().split()) 
AttributeError: module 'gensim.corpora.dictionary' has no attribute 'doc2bow' 

How can I solve this problem?

Answer

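You just need to prepare the dictionary beforehand and make it available to the MyCorpus class. A sample class for creating a memory-friendly corpus might look like this (the output shown after it omits the logging information):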

import logging
from pprint import pprint

from gensim import corpora

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


class MyCorpus(object):
    def __init__(self, text_file='text_corpus.txt', dictionary=None):
        """
        Checks whether a dictionary has been given as a parameter.
        If no dictionary has been given, creates one and saves it to disk.
        """
        self.file_name = text_file
        if dictionary is None:
            self.prepare_dictionary()
        else:
            self.dictionary = dictionary

    def __iter__(self):
        with open(self.file_name) as corpus_file:
            for line in corpus_file:
                # assume there's one document per line, tokens separated by whitespace
                yield self.dictionary.doc2bow(line.lower().split())

    def prepare_dictionary(self):
        # Stop words; these could also be loaded from a file.
        stop_list = set('for a of the and to in'.split())

        # Build the dictionary from the text file using Gensim's Dictionary class.
        with open(self.file_name) as corpus_file:
            self.dictionary = corpora.Dictionary(line.lower().split() for line in corpus_file)

        # Collect the ids of the tokens that are in the stop list.
        stop_ids = [self.dictionary.token2id[stop_word] for stop_word in stop_list
                    if stop_word in self.dictionary.token2id]

        # Collect the ids of the tokens that appear in only one document.
        once_ids = [token_id for token_id, doc_freq in self.dictionary.dfs.items() if doc_freq == 1]

        # Remove the unwanted tokens using the collected ids.
        self.dictionary.filter_tokens(stop_ids + once_ids)

        # Save the dictionary to disk for later use.
        self.dictionary.save('dictionary.dict')


my_memory_friendly_corpus = MyCorpus()

# Saving the corpus:
# corpora.MmCorpus.serialize('corpus.mm', my_memory_friendly_corpus)

# To load the saved corpus:
# corpus = corpora.MmCorpus('corpus.mm')

print('\t:::The dictionary::::')
pprint(my_memory_friendly_corpus.dictionary.token2id)
print(my_memory_friendly_corpus)
print('\n\t:::The corpus::::')
for vector in my_memory_friendly_corpus:
    print(vector)

Output:

:::The dictionary:::: 
{'computer': 2, 
'eps': 8, 
'graph': 10, 
'human': 0, 
'interface': 1, 
'minors': 11, 
'response': 6, 
'survey': 3, 
'system': 5, 
'time': 7, 
'trees': 9, 
'user': 4} 
<__main__.MyCorpus object at 0x7fe0e9ac5c18> 

    :::The corpus:::: 
[(0, 1), (1, 1), (2, 1)] 
[(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)] 
[(1, 1), (4, 1), (5, 1), (8, 1)] 
[(0, 1), (5, 2), (8, 1)] 
[(4, 1), (6, 1), (7, 1)] 
[(9, 1)] 
[(9, 1), (10, 1)] 
[(9, 1), (10, 1), (11, 1)] 
[(3, 1), (10, 1), (11, 1)] 
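
Each printed vector is a list of (token id, count) pairs, where the ids come from the dictionary printed above it. As a small sketch that uses only the mapping shown in that output (`decode` is a hypothetical helper name, not part of Gensim), the fourth vector can be read back as tokens like this:

```python
# token2id mapping copied from the dictionary output above.
token2id = {
    'computer': 2, 'eps': 8, 'graph': 10, 'human': 0,
    'interface': 1, 'minors': 11, 'response': 6, 'survey': 3,
    'system': 5, 'time': 7, 'trees': 9, 'user': 4,
}

# Invert the mapping so token ids can be looked up.
id2token = {token_id: token for token, token_id in token2id.items()}


def decode(vector):
    """Turn [(token_id, count), ...] into {token: count} (hypothetical helper)."""
    return {id2token[token_id]: count for token_id, count in vector}


# Fourth vector from the corpus output above.
decoded = decode([(0, 1), (5, 2), (8, 1)])
print(decoded)  # {'human': 1, 'system': 2, 'eps': 1}
```

Tokens that were filtered out of the dictionary (stop words and tokens that occur in only one document) simply do not appear in the vectors.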

As I was also very new to both Gensim and Python, I ran into similar kinds of problems. This mailing list really helped me learn Gensim.