Running out of memory when using Tokenizer in keras.preprocessing.text

I want to build an RNN model with Keras to classify sentences, but I run out of memory when using Tokenizer from keras.preprocessing.text.

I tried the following code:

from keras.preprocessing.text import Tokenizer

docs = []
with open('all_dga.txt', 'r') as f:
    for line in f:
        dga_domain, _ = line.split(' ')
        docs.append(dga_domain)

t = Tokenizer()
t.fit_on_texts(docs)
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)

But I got a MemoryError. It seems I cannot load all the data into memory. This is the output:

Traceback (most recent call last):
  File "test.py", line 11, in <module>
    encoded_docs = t.texts_to_matrix(docs, mode='count')
  File "/home/yurzho/anaconda3/envs/deepdga/lib/python3.6/site-packages/keras/preprocessing/text.py", line 273, in texts_to_matrix
    return self.sequences_to_matrix(sequences, mode=mode)
  File "/home/yurzho/anaconda3/envs/deepdga/lib/python3.6/site-packages/keras/preprocessing/text.py", line 303, in sequences_to_matrix
    x = np.zeros((len(sequences), num_words))
MemoryError

If anyone is familiar with Keras, please tell me how to preprocess this dataset.

Thanks!

Source

2017-12-21 yuren zhong

Since your error occurred at t.texts_to_matrix(docs, mode='count'), you seem to have no problem fitting the documents to build the vocabulary with t.fit_on_texts(docs).
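To see why that call is the one that fails: the traceback shows sequences_to_matrix allocating np.zeros((len(sequences), num_words)), a dense array with one row per document and one column per vocabulary entry, and np.zeros defaults to 8-byte floats. A rough estimate of its size can be sketched as below. The helper name dense_matrix_bytes and the example sizes are hypothetical; the question does not say how large all_dga.txt or its vocabulary is.

```python
# np.zeros defaults to float64, which takes 8 bytes per element.
BYTES_PER_FLOAT64 = 8


def dense_matrix_bytes(num_docs, num_words):
    """Bytes needed for a dense (num_docs, num_words) float64 matrix."""
    return num_docs * num_words * BYTES_PER_FLOAT64


# Hypothetical sizes, chosen only to illustrate the arithmetic.
size = dense_matrix_bytes(1_000_000, 50_000)
print(size / 1e9, "GB")  # prints: 400.0 GB
```

The size grows with the product of the two dimensions, so a large corpus with a large vocabulary can exceed available RAM even when the raw text file is small.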

So you can convert the documents in batches:

from keras.preprocessing.text import Tokenizer

t = Tokenizer()

# First pass: fit the tokenizer line by line to build the vocabulary.
with open('/Users/liling.tan/test.txt') as fin:
    for line in fin:
        t.fit_on_texts([line])  # Treat each line as one document.

M = []

# Second pass: convert the lines into count vectors, line by line.
with open('/Users/liling.tan/test.txt') as fin:
    for line in fin:
        m = t.texts_to_matrix([line], mode='count')[0]
        M.append(m)

However, M still accumulates every row, so if your computer cannot hold that amount of data in memory, you will run into a MemoryError at some point later on.
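One way to avoid holding every row at once is to yield small batches from a generator and consume each batch before reading the next. The following is only a minimal sketch: it assumes tokenizer is an already fitted Tokenizer (such as t above), and the function name iter_count_batches and the batch_size default are hypothetical.

```python
from itertools import islice


def iter_count_batches(path, tokenizer, batch_size=1024):
    """Yield one count matrix per batch of lines read from `path`.

    `tokenizer` is assumed to be an already fitted object that provides
    texts_to_matrix(texts, mode=...), such as a Keras Tokenizer.
    """
    with open(path) as fin:
        while True:
            # Read at most batch_size lines; an empty list means end of file.
            lines = [line.strip() for line in islice(fin, batch_size)]
            if not lines:
                break
            # Only batch_size rows are materialised at any one time.
            yield tokenizer.texts_to_matrix(lines, mode='count')
```

Each yielded matrix has at most batch_size rows, so peak memory depends on the batch size and vocabulary size rather than on the number of lines in the file. The caller has to use or store each batch (for example, train on it or write it to disk) before asking for the next one.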

Source

2017-12-22 05:46:04 alvas
