2017-08-04 6 views

Keras autoencoder resource exhausted error

I have a dataset with 25,000 samples and 24,995 features. I am training a Keras autoencoder model on this data and running into an OOM error. Some details of the model follow. The input matrix is:

Input matrix shape : (25000, 24995) 

This input matrix is split into training and test data, and the test data is used as the validation set:

Train Matrix shape : (18750, 24995) 
Test Matrix shape : (6250, 24995) 

The code for the model is:

from keras.layers import Input, Dense
from keras.models import Model  # needed for the Model(...) call below
input_layer = Input(shape=(train_matrix.shape[1],)) 

encoding_hlayer1_dims = 12500 
encoding_hlayer1 = Dense(encoding_hlayer1_dims, activation='relu', trainable=True, name="layer1")(input_layer) 

decoding_hlayer1 = Dense(train_matrix.shape[1], activation='relu')(encoding_hlayer1) 

autoencoder = Model(input_layer, decoding_hlayer1) 
autoencoder.compile(optimizer='adam', loss='binary_crossentropy') 

The code to train the model is:

## Train 
history = autoencoder.fit(train_matrix.toarray(), train_matrix.toarray(), 
       epochs=50, 
       batch_size=64, 
       shuffle=True, 
       validation_data=(test_matrix.toarray(), test_matrix.toarray())) 

The model summary is:

Layer (type)     Output Shape    Param # 
================================================================= 
input_2 (InputLayer)   (None, 24995)    0   
_________________________________________________________________ 
layer1 (Dense)    (None, 12500)    312450000 
_________________________________________________________________ 
dense_1 (Dense)    (None, 24995)    312462495 
================================================================= 
Total params: 624,912,495 
Trainable params: 624,912,495 
Non-trainable params: 0 

As soon as training starts, I get the following error:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[24995,12500] 
    [[Node: mul_3 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](beta_1/read, Variable/read)]] 

I am using Nvidia Tesla K40c GPUs with 12 GB each. As far as I understand, the model should fit in memory, since 25000 * 12500 * 2 = 0.625 GB. Also, the input matrix dtype is numpy.float32.

Can someone point out exactly what I am doing wrong here?

Update: full error log

Train on 18750 samples, validate on 6250 samples 
Epoch 1/100 


ResourceExhaustedErrorTraceback (most recent call last) 
<ipython-input-8-503b20168fa5> in <module>() 
     6     batch_size=4096, 
     7     shuffle=True, 
----> 8     validation_data=(test_matrix.toarray(), test_matrix.toarray())) 
     9 #  autoencoder.save("/tmp/Models/sae_models/epochs_" + str(epochs) + ".model", include_optimizer=True) 
    10 

/usr/local/lib/python2.7/dist-packages/keras/engine/training.pyc in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, **kwargs) 
    1428        val_f=val_f, val_ins=val_ins, shuffle=shuffle, 
    1429        callback_metrics=callback_metrics, 
-> 1430        initial_epoch=initial_epoch) 
    1431 
    1432  def evaluate(self, x, y, batch_size=32, verbose=1, sample_weight=None): 

/usr/local/lib/python2.7/dist-packages/keras/engine/training.pyc in _fit_loop(self, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch) 
    1077     batch_logs['size'] = len(batch_ids) 
    1078     callbacks.on_batch_begin(batch_index, batch_logs) 
-> 1079     outs = f(ins_batch) 
    1080     if not isinstance(outs, list): 
    1081      outs = [outs] 

/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.pyc in __call__(self, inputs) 
    2263     value = (indices, sparse_coo.data, sparse_coo.shape) 
    2264    feed_dict[tensor] = value 
-> 2265   session = get_session() 
    2266   updated = session.run(self.outputs + [self.updates_op], 
    2267        feed_dict=feed_dict, 

/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.pyc in get_session() 
    166  if not _MANUAL_VAR_INIT: 
    167   with session.graph.as_default(): 
--> 168    _initialize_variables() 
    169  return session 
    170 

/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.pyc in _initialize_variables() 
    339  if uninitialized_variables: 
    340   sess = get_session() 
--> 341   sess.run(tf.variables_initializer(uninitialized_variables)) 
    342 
    343 

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata) 
    787  try: 
    788  result = self._run(None, fetches, feed_dict, options_ptr, 
--> 789       run_metadata_ptr) 
    790  if run_metadata: 
    791   proto_data = tf_session.TF_GetBuffer(run_metadata_ptr) 

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata) 
    995  if final_fetches or final_targets: 
    996  results = self._do_run(handle, final_targets, final_fetches, 
--> 997        feed_dict_string, options, run_metadata) 
    998  else: 
    999  results = [] 

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata) 
    1130  if handle is None: 
    1131  return self._do_call(_run_fn, self._session, feed_dict, fetch_list, 
-> 1132       target_list, options, run_metadata) 
    1133  else: 
    1134  return self._do_call(_prun_fn, self._session, handle, feed_dict, 

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_call(self, fn, *args) 
    1150   except KeyError: 
    1151   pass 
-> 1152  raise type(e)(node_def, op, message) 
    1153 
    1154 def _extend_graph(self): 

ResourceExhaustedError: OOM when allocating tensor with shape[24995,12500] 
    [[Node: layer1/kernel/Assign = Assign[T=DT_FLOAT, _class=["loc:@layer1/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](layer1/kernel, layer1/random_uniform)]] 

Caused by op u'layer1/kernel/Assign', defined at: 
    File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main 
    "__main__", fname, loader, pkg_name) 
    File "/usr/lib/python2.7/runpy.py", line 72, in _run_code 
    exec code in run_globals 
    File "/usr/local/lib/python2.7/dist-packages/ipykernel_launcher.py", line 16, in <module> 
    app.launch_new_instance() 
    File "/usr/local/lib/python2.7/dist-packages/traitlets/config/application.py", line 658, in launch_instance 
    app.start() 
    File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelapp.py", line 477, in start 
    ioloop.IOLoop.instance().start() 
    File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 177, in start 
    super(ZMQIOLoop, self).start() 
    File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 888, in start 
    handler_func(fd_obj, events) 
    File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 277, in null_wrapper 
    return fn(*args, **kwargs) 
    File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events 
    self._handle_recv() 
    File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv 
    self._run_callback(callback, msg) 
    File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback 
    callback(*args, **kwargs) 
    File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 277, in null_wrapper 
    return fn(*args, **kwargs) 
    File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher 
    return self.dispatch_shell(stream, msg) 
    File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell 
    handler(stream, idents, msg) 
    File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request 
    user_expressions, allow_stdin) 
    File "/usr/local/lib/python2.7/dist-packages/ipykernel/ipkernel.py", line 196, in do_execute 
    res = shell.run_cell(code, store_history=store_history, silent=silent) 
    File "/usr/local/lib/python2.7/dist-packages/ipykernel/zmqshell.py", line 533, in run_cell 
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs) 
    File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell 
    interactivity=interactivity, compiler=compiler, result=result) 
    File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2822, in run_ast_nodes 
    if self.run_code(code, result): 
    File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code 
    exec(code_obj, self.user_global_ns, self.user_ns) 
    File "<ipython-input-4-ee2fe8e92d7c>", line 4, in <module> 
    encoding_hlayer1 = Dense(encoding_hlayer1_dims, activation='relu', trainable=True, name="layer1")(input_layer) 
    File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 569, in __call__ 
    self.build(input_shapes[0]) 
    File "/usr/local/lib/python2.7/dist-packages/keras/layers/core.py", line 825, in build 
    constraint=self.kernel_constraint) 
    File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 87, in wrapper 
    return func(*args, **kwargs) 
    File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 391, in add_weight 
    weight = K.variable(initializer(shape), dtype=dtype, name=name) 
    File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 321, in variable 
    v = tf.Variable(value, dtype=_convert_string_dtype(dtype), name=name) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 200, in __init__ 
    expected_shape=expected_shape) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 309, in _init_from_args 
    validate_shape=validate_shape).op 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/state_ops.py", line 271, in assign 
    validate_shape=validate_shape) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_state_ops.py", line 45, in assign 
    use_locking=use_locking, name=name) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op 
    op_def=op_def) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op 
    original_op=self._default_original_op, op_def=op_def) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__ 
    self._traceback = _extract_stack() 

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[24995,12500] 
    [[Node: layer1/kernel/Assign = Assign[T=DT_FLOAT, _class=["loc:@layer1/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](layer1/kernel, layer1/random_uniform)]] 

Could you show the full error message? Also, how do you run this script: from the console, or in a Jupyter notebook? –

'decoding_hlayer1_dims' is not used. –

@MarcinMożejko I have added the full error message. Also, I run it in a Jupyter notebook. – user1683894

Answers

Going by your own code, the total number of parameters is 624,912,495. Storing the weights alone as float32 takes 624912495 * 4 / 1024**3 ≈ 2.33 GB, not the 0.625 GB you calculated.
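To check the figures above, here is a small sketch that recomputes the parameter counts from the layer sizes in the question. The helper `dense_params` is my own name, not a Keras function, and the result is in units of 1024**3 bytes (GiB), matching the formula above.

```python
def dense_params(n_in, n_out):
    """Parameters in a fully connected layer: a weight matrix plus a bias vector."""
    return n_in * n_out + n_out

n_features = 24995  # input width, from the question
n_hidden = 12500    # encoding_hlayer1_dims, from the question

encoder_params = dense_params(n_features, n_hidden)
decoder_params = dense_params(n_hidden, n_features)
total_params = encoder_params + decoder_params

BYTES_PER_FLOAT32 = 4
weights_gib = total_params * BYTES_PER_FLOAT32 / 1024.0 ** 3

print("encoder:", encoder_params)
print("decoder:", decoder_params)
print("total:  ", total_params)
print("weights: %.2f GiB" % weights_gib)
```

The per-layer counts should match the `Param #` column of the model summary in the question.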

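On top of this, memory is needed for the initializer's output and for at least three more copies used by the optimizer: one for the first-order moment, one for the second-order moment, and one for the actual update. Note that every time you write a + b, the result needs memory of its own, and that allocation may be hidden from you.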

Taken together, total memory usage goes well beyond 12 GB, which is why you run out of memory.
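A rough way to put numbers on this is sketched below. It assumes one full float32 copy of all parameters each for the weights, the gradients, and the two Adam moment estimates, and it treats 12 GiB as the card size, which is more than a K40c actually exposes to TensorFlow. The real footprint depends on the TensorFlow version and on how many temporaries are alive at once, so read it as a lower bound, not a measurement.

```python
TOTAL_PARAMS = 624912495       # from the model summary in the question
KERNEL_SHAPE = (24995, 12500)  # tensor shape named in the OOM message
BYTES_PER_FLOAT32 = 4
GIB = 1024.0 ** 3

def gib(n_values):
    """Size in GiB of n_values float32 numbers."""
    return n_values * BYTES_PER_FLOAT32 / GIB

# Assumption: each entry is one full-size float32 copy of all parameters.
persistent_copies = ["weights", "gradients",
                     "Adam first moment", "Adam second moment"]
persistent_gib = len(persistent_copies) * gib(TOTAL_PARAMS)

# One temporary the size of a single kernel, for example the initializer's
# random values or an intermediate product inside the optimizer update.
kernel_temp_gib = gib(KERNEL_SHAPE[0] * KERNEL_SHAPE[1])

ASSUMED_GPU_GIB = 12.0  # nominal size; usable memory is lower in practice
headroom_gib = ASSUMED_GPU_GIB - persistent_gib

print("persistent buffers: %.2f GiB" % persistent_gib)
print("one kernel-sized temporary: %.2f GiB" % kernel_temp_gib)
print("headroom on an assumed 12 GiB card: %.2f GiB" % headroom_gib)
```

Under these assumptions the remaining headroom only covers a couple of kernel-sized temporaries. The first error in the question is raised on exactly such a tensor, a `Mul` involving `beta_1` with shape [24995,12500], which looks like an intermediate of the Adam update.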

You could also try the SGD optimizer, which uses less memory, but you may still run out.
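To see how much the optimizer choice could save, here is a back-of-the-envelope comparison. The slot counts are assumptions about how many parameter-sized state buffers each optimizer keeps (none for plain SGD, one velocity buffer with momentum, two moment estimates for Adam), and temporaries are ignored.

```python
TOTAL_PARAMS = 624912495  # from the model summary in the question
BYTES_PER_FLOAT32 = 4
GIB = 1024.0 ** 3

# Assumed number of extra parameter-sized state buffers per optimizer.
OPTIMIZER_SLOTS = {
    "sgd": 0,           # plain SGD keeps no per-parameter state
    "sgd+momentum": 1,  # one velocity buffer
    "adam": 2,          # first and second moment estimates
}

def persistent_gib(optimizer):
    """Weights + gradients + optimizer state, in GiB, ignoring temporaries."""
    copies = 2 + OPTIMIZER_SLOTS[optimizer]
    return copies * TOTAL_PARAMS * BYTES_PER_FLOAT32 / GIB

for name in ("sgd", "sgd+momentum", "adam"):
    print("%-13s %.2f GiB" % (name, persistent_gib(name)))
```

Even in the best case the weights and gradients alone take several GiB, so reducing the hidden layer size is likely to help more than switching optimizers.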

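That was helpful information. I have 2 GPUs, both K40c with 12 GB each, so in theory I have 24 GB of memory and the model should fit. Why am I still hitting the memory problem? Is the model not being distributed across the GPUs, or do I have to specify that in the code? – user1683894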
