Text processing - Word2Vec training after phrase detection (bigram model)

I want to train a word2vec model that covers more n-grams than usual, not just single words. As far as I can tell, the Phrases class in gensim.models.phrases can find the phrases I want, and it is possible to apply it to a corpus and pass the result to the word2vec training function.

First, I did exactly what the sample code in the gensim documentation does:

import os

import gensim
from nltk.tokenize import word_tokenize


class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # Yield one tokenized line at a time from every file in the directory.
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield word_tokenize(line)


sentences = MySentences('sentences_directory')

bigram = gensim.models.Phrases(sentences)

model = gensim.models.Word2Vec(bigram['sentences'], size=300, window=5, workers=8)

The model is created, but it gives no good results in evaluation, and I get this warning:

WARNING : train() called with an empty iterator (if not intended, be sure to provide a corpus that offers restartable iteration = an iterable)
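The warning mentions "restartable iteration". Word2Vec reads its corpus more than once (a vocabulary scan, then the training passes), so whatever is passed in has to produce the data again on each pass. The sketch below uses plain Python with hypothetical names (token_generator, TokenIterable) and no gensim at all. It only illustrates the difference between a one-shot generator and a restartable iterable, and is not a diagnosis of the code above.

```python
def token_generator(lines):
    """A generator function: the object it returns can be consumed only once."""
    for line in lines:
        yield line.split()


class TokenIterable:
    """Restartable: every call to __iter__ starts a fresh pass over the data."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()


lines = ["new york is big", "i love new york"]

one_shot = token_generator(lines)
first_pass = list(one_shot)    # consumes the generator
second_pass = list(one_shot)   # already exhausted, so nothing is left

restartable = TokenIterable(lines)
pass_a = list(restartable)
pass_b = list(restartable)     # __iter__ runs again, so the data is back

print(len(first_pass), len(second_pass))  # 2 0
print(len(pass_a), len(pass_b))           # 2 2
```

Separately, note that the Word2Vec call above indexes the phrase model with the string literal 'sentences' rather than the sentences variable. That is worth checking independently of the warning.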

I searched for it, found https://groups.google.com/forum/#!topic/gensim/XWQ8fPMFSi0, and changed my code:

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield word_tokenize(line)


class PhraseItertor(object):
    def __init__(self, my_phraser, data):
        self.my_phraser, self.data = my_phraser, data

    def __iter__(self):
        yield self.my_phraser[self.data]


sentences = MySentences('sentences_directory')

bigram_transformer = gensim.models.Phrases(sentences)

bigram = gensim.models.phrases.Phraser(bigram_transformer)

corpus = PhraseItertor(bigram, sentences)

model = gensim.models.Word2Vec(corpus, size=300, window=5, workers=8)

I get this error:

Traceback (most recent call last): 
    File "/home/fatemeh/Desktop/Thesis/bigramModeler.py", line 36, in <module> 
    model = gensim.models.Word2Vec(corpus, size=300, window=5, workers=8) 
    File "/home/fatemeh/.local/lib/python3.4/site-packages/gensim/models/word2vec.py", line 478, in init 
    self.build_vocab(sentences, trim_rule=trim_rule) 
    File "/home/fatemeh/.local/lib/python3.4/site-packages/gensim/models/word2vec.py", line 553, in build_vocab 
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey 
    File "/home/fatemeh/.local/lib/python3.4/site-packages/gensim/models/word2vec.py", line 575, in scan_vocab 
    vocab[word] += 1 
TypeError: unhashable type: 'list'
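One plausible reading of this traceback: because __iter__ in PhraseItertor contains a single yield, iterating over the wrapper produces exactly one item, the whole transformed corpus. The vocabulary scan then treats that item as one "sentence" and each token list inside it as a "word", and a list cannot be used as a dictionary key. The sketch below reproduces that pattern in plain Python. FakePhraser, SingleYield and count_words are hypothetical stand-ins, not gensim code.

```python
from collections import defaultdict


class FakePhraser:
    """Hypothetical stand-in for a phrase model; it returns the corpus unchanged."""

    def __getitem__(self, corpus):
        return corpus


class SingleYield:
    """Mirrors the wrapper in the question: __iter__ yields the whole corpus once."""

    def __init__(self, phraser, data):
        self.phraser, self.data = phraser, data

    def __iter__(self):
        yield self.phraser[self.data]


def count_words(sentences):
    """Simplified vocabulary scan: count every word of every sentence."""
    vocab = defaultdict(int)
    for sentence in sentences:
        for word in sentence:
            vocab[word] += 1
    return vocab


data = [["new", "york"], ["big", "city"]]

items = list(SingleYield(FakePhraser(), data))
print(len(items))  # 1: the entire corpus arrives as a single item

try:
    count_words(SingleYield(FakePhraser(), data))
    error_message = None
except TypeError as exc:
    # Each "word" here is a token list, which cannot be a dict key.
    error_message = str(exc)
print(error_message)
```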

Now I want to know what is wrong with my code.

Source

2017-09-26 Mahmood Kohansal

I asked my question on the Gensim Google Group, and Mr. Gordon Mohr answered me:

You typically wouldn't want an __iter__() method to do a single yield. It should return an iterator object (ready to return multiple objects via next() or a StopIteration exception). One way to effect an iterator is to use yield to have the method treated as a 'generator', but that would typically require the yield to be inside a loop.

But I now see that my example code in the thread you reference does the wrong thing with its __iter__() return line: it should not be returning the raw phrasifier, but one that has already been started-as-an-iterator, by use of the iter() built-in method. That is, the example there should have read:
class PhrasingIterable(object):
    def __init__(self, phrasifier, texts):
        self.phrasifier, self.texts = phrasifier, texts

    def __iter__(self):
        # Return a fresh iterator over the transformed corpus on every call.
        return iter(self.phrasifier[self.texts])
Making a similar change in your variation may resolve the TypeError: iter() returned non-iterator of type 'TransformedCorpus' error.
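To see the effect of returning iter(...) from __iter__(), here is a minimal sketch that runs without gensim. PairJoiner is a hypothetical stand-in for a trained phraser: it joins a single hard-coded word pair and, when indexed with a corpus, returns a one-shot generator. The wrapper still yields one sentence per item and can be iterated repeatedly, because each call to __iter__() indexes the phraser again.

```python
class PairJoiner:
    """Hypothetical stand-in for a trained phraser: joins one known word pair."""

    def __init__(self, pair):
        self.pair = pair

    def _join(self, tokens):
        out, i = [], 0
        while i < len(tokens):
            if tuple(tokens[i:i + 2]) == self.pair:
                out.append("_".join(self.pair))
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    def __getitem__(self, corpus):
        # A one-shot generator over the transformed sentences.
        return (self._join(tokens) for tokens in corpus)


class PhrasingIterable:
    """Restartable wrapper: each __iter__ call builds a new iterator."""

    def __init__(self, phrasifier, texts):
        self.phrasifier, self.texts = phrasifier, texts

    def __iter__(self):
        return iter(self.phrasifier[self.texts])


texts = [["i", "love", "new", "york"], ["new", "york", "is", "big"]]
corpus = PhrasingIterable(PairJoiner(("new", "york")), texts)

first = list(corpus)
second = list(corpus)  # a second pass sees the same sentences again

print(first)
print(first == second)
```

With gensim, an object of this shape (wrapping a real Phraser and a restartable sentence source such as MySentences) would be what gets passed to Word2Vec in place of corpus. That part is an assumption based on the thread and is not exercised by the sketch.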


Source

2017-10-02 07:08:45
