Tensorflow vocabulary processor

I am following the wildml blog on text classification using tensorflow. I am not able to understand the purpose of max_document_length in the code statement:

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

Also, how can I extract the vocabulary from vocab_processor?


2016-11-17 Nitin

I am trying to follow the same tutorial, but there are a few things I don't understand. Maybe you could [take a look at my question](http://stackoverflow.com/questions/41665109/trying-to-understand-cnns-for-nlp-tutorial-using-tensorflow) and help me out? – displayname

I have figured out how to extract the vocabulary from the VocabularyProcessor object. This worked perfectly for me.

import numpy as np 
from tensorflow.contrib import learn 

x_text = ['This is a cat','This must be boy', 'This is a a dog'] 
max_document_length = max([len(x.split(" ")) for x in x_text]) 

## Create the VocabularyProcessor object, setting the max length of the documents.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length) 

## Transform the documents using the vocabulary. 
x = np.array(list(vocab_processor.fit_transform(x_text)))  

## Extract word:id mapping from the object. 
vocab_dict = vocab_processor.vocabulary_._mapping 

## Sort the vocabulary dictionary by value (id).
## Both statements perform the same task; the commented-out one needs `import operator`.
#sorted_vocab = sorted(vocab_dict.items(), key=operator.itemgetter(1))
sorted_vocab = sorted(vocab_dict.items(), key = lambda x : x[1])

## Treat the ids as indexes into a list and create a list of words in ascending order of id:
## the word with id i goes at index i of the list.
vocabulary = list(list(zip(*sorted_vocab))[0]) 

print(vocabulary) 
print(x)
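
For readers who cannot run tf.contrib, here is a minimal sketch of the same sort-by-id step in plain Python. The mapping below is hand-written for illustration (an assumption, not output captured from VocabularyProcessor), with id 0 assumed to stand for the unknown-word token.

```python
# Hand-written word:id mapping standing in for vocab_processor.vocabulary_._mapping.
# The ids are assumed for illustration; id 0 is assumed to be the unknown-word token.
vocab_dict = {'<UNK>': 0, 'This': 1, 'is': 2, 'a': 3, 'cat': 4}

# Sort the (word, id) pairs by id, then keep only the words.
sorted_vocab = sorted(vocab_dict.items(), key=lambda item: item[1])
vocabulary = [word for word, _ in sorted_vocab]

# Because the word with id i sits at index i, a row of ids decodes by plain indexing.
row = [1, 2, 3, 4, 0]
decoded = [vocabulary[i] for i in row]

print(vocabulary)
print(decoded)
```

Since the list index equals the id, the same list can be used to turn any row of x back into words.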


2016-11-22 12:17:20 Nitin

If you look at vocab_dict, you can see that "This" is indexed as 1, "is" as 2, and so on. I would like to pass my own index, for example a frequency-based one. Do you know how to do this? – user1930402

"not able to understand the purpose of max_document_length"

The VocabularyProcessor maps your text documents into vectors, and you need these vectors to be of a consistent length.

Your input data records may not all be the same length. For example, if you are working with sentences for sentiment analysis, they will be of various lengths.

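You provide this parameter to the VocabularyProcessor so that it can adjust the length of the output vectors.

According to the documentation, max_document_length is the maximum length of documents: if documents are longer they will be trimmed, if shorter they will be padded.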

Check out the source code:

def transform(self, raw_documents):
    """Transform documents to word-id matrix.

    Convert words to ids with vocabulary fitted with fit or the one
    provided in the constructor.

    Args:
        raw_documents: An iterable which yield either str or unicode.

    Yields:
        x: iterable, [n_samples, max_document_length]. Word-id matrix.
    """
    for tokens in self._tokenizer(raw_documents):
        word_ids = np.zeros(self.max_document_length, np.int64)
        for idx, token in enumerate(tokens):
            if idx >= self.max_document_length:
                break
            word_ids[idx] = self.vocabulary_.get(token)
        yield word_ids

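Note the line word_ids = np.zeros(self.max_document_length, np.int64).

Each row in raw_documents is mapped to a vector of length max_document_length: the vector starts as all zeros, so shorter documents keep trailing zeros as padding, and the loop stops once idx reaches max_document_length, so longer documents are cut off.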
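
The trimming and padding can be sketched with NumPy alone. The helper name pad_or_trim and the sample id lists are hypothetical and chosen for illustration; the loop simply mirrors the body of transform, applied to ids that have already been looked up.

```python
import numpy as np

def pad_or_trim(token_ids, max_document_length):
    """Hypothetical helper: copy ids into a zero-filled vector of fixed length."""
    # Start from all zeros so that unused positions act as padding.
    word_ids = np.zeros(max_document_length, np.int64)
    for idx, token_id in enumerate(token_ids):
        # Stop once the vector is full; the remaining ids are dropped.
        if idx >= max_document_length:
            break
        word_ids[idx] = token_id
    return word_ids

short_doc = [1, 2, 3]           # shorter than the limit: padded with zeros
long_doc = [1, 2, 3, 4, 5, 6]   # longer than the limit: trimmed

padded = pad_or_trim(short_doc, 4)
trimmed = pad_or_trim(long_doc, 4)

print(padded)
print(trimmed)
```

Both results have length 4, which is what lets the documents be stacked into a single [n_samples, max_document_length] matrix.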


2017-12-28 19:54:28
