
Running the training script produces a memory error on a GPU-optimized Ubuntu machine. The error seems suspicious, since the machine has plenty of memory to run the algorithm. Here is the error:

TensorFlow: Ran out of memory trying to allocate 16.0KiB

Memory state at the time of the error:

$ free -m 
       total  used  free  shared buff/cache available 
Mem:   15038   190  6580   8  8267  14670 
Swap:    0   0   0 

Console output:

$ python ./train.py --run --continue 
Using TensorFlow backend. 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally 
Loading data.. 
Number of categories: 2 
Number of samples 425 
/home/ubuntu/DeepClassificationBot-master/data.py:134: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future 
    val = np.random.choice(dataset_indx, size=number_of_samples) 
/home/ubuntu/DeepClassificationBot-master/data.py:127: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future 
    train = np.random.choice(dataset_indx, size=number_of_samples) 
Building and Compiling model.. 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: GRID K520 
major: 3 minor: 0 memoryClockRate (GHz) 0.797 
pciBusID 0000:00:03.0 
Total memory: 3.94GiB 
Free memory: 3.91GiB 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0) 
Training.. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (512): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1024): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2048): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4096): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8192): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (16384):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (32768):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (65536):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (131072): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (262144): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (524288): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1048576): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2097152): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4194304): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8388608): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (16777216): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (33554432): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (67108864): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (134217728):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (268435456):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:656] Bin for 16.0KiB was 16.0KiB, Chunk State: 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702580000 of size 6912 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702581b00 of size 6912 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702583600 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702583700 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x702583800 of size 147456 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7025a7800 of size 147456 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7025cb800 of size 256 
....... Very long list of chunks 
I tensorflow/core/common_runtime/bfc_allocator.cc:689]  Summary of in-use Chunks by size: 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 115 Chunks of size 256 totalling 28.8KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 34 Chunks of size 512 totalling 17.0KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 21 Chunks of size 1024 totalling 21.0KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 42 Chunks of size 2048 totalling 84.0KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 6912 totalling 47.2KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 42 Chunks of size 16384 totalling 672.0KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 5 Chunks of size 32768 totalling 160.0KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 147456 totalling 1008.0KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 294912 totalling 1.97MiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 589824 totalling 3.94MiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 1179648 totalling 7.88MiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 14 Chunks of size 2359296 totalling 31.50MiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7 Chunks of size 4718592 totalling 31.50MiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 35 Chunks of size 9437184 totalling 315.00MiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 5 Chunks of size 67108864 totalling 320.00MiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 6 Chunks of size 411041792 totalling 2.30GiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 663988224 totalling 633.23MiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] Sum Total of in-use chunks: 3.61GiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats: 
Limit:     3878682624 
InUse:     3878682624 
MaxInUse:    3878682624 
NumAllocs:      362 
MaxAllocSize:   663988224 

W tensorflow/core/common_runtime/bfc_allocator.cc:270] **********************************************************************************************xxxxxx 
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 16.0KiB. See logs for memory state. 
W tensorflow/core/framework/op_kernel.cc:930] Internal: Dst tensor is not initialized. 
E tensorflow/core/common_runtime/executor.cc:334] Executor failed to create kernel. Internal: Dst tensor is not initialized. 
    [[Node: Variable_91/initial_value = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [4096] values: 0 0 0...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]] 
Traceback (most recent call last): 
    File "./train.py", line 154, in <module> 
    run(extract=extract_mode, cont=continue_) 
    File "./train.py", line 104, in run 
    sample_weight=None) 
    File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 405, in fit 
    sample_weight=sample_weight) 
    File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1031, in fit 
    self._make_train_function() 
    File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 658, in _make_train_function 
    training_updates = self.optimizer.get_updates(trainable_weights, self.constraints, self.total_loss) 
    File "/usr/local/lib/python2.7/dist-packages/keras/optimizers.py", line 314, in get_updates 
    vs = [K.variable(np.zeros(K.get_value(p).shape)) for p in params] 
    File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 78, in variable 
    get_session().run(v.initializer) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 710, in run 
    run_metadata_ptr) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 908, in _run 
    feed_dict_string, options, run_metadata) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 958, in _do_run 
    target_list, options, run_metadata) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 978, in _do_call 
    raise type(e)(node_def, op, message) 
tensorflow.python.framework.errors.InternalError: Dst tensor is not initialized. 
    [[Node: Variable_91/initial_value = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [4096] values: 0 0 0...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]] 
Caused by op u'Variable_91/initial_value', defined at: 
    File "./train.py", line 154, in <module> 
    run(extract=extract_mode, cont=continue_) 
    File "./train.py", line 104, in run 
    sample_weight=None) 
    File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 405, in fit 
    sample_weight=sample_weight) 
    File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1031, in fit 
    self._make_train_function() 
    File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 658, in _make_train_function 
    training_updates = self.optimizer.get_updates(trainable_weights, self.constraints, self.total_loss) 
    File "/usr/local/lib/python2.7/dist-packages/keras/optimizers.py", line 314, in get_updates 
    vs = [K.variable(np.zeros(K.get_value(p).shape)) for p in params] 
    File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 75, in variable 
    v = tf.Variable(np.asarray(value, dtype=dtype), name=name) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 211, in __init__ 
    dtype=dtype) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 289, in _init_from_args 
    dtype=dtype) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 628, in convert_to_tensor 
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 180, in _constant_tensor_conversion_function 
    return constant(v, dtype=dtype, name=name) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 167, in constant 
    attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0] 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2317, in create_op 
    original_op=self._default_original_op, op_def=op_def) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1239, in __init__ 
    self._traceback = _extract_stack() 

This works, but still issues memory warnings:

$ python deploy.py --URL http://www.example.com/image.jpg 
Using TensorFlow backend. 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: GRID K520 
major: 3 minor: 0 memoryClockRate (GHz) 0.797 
pciBusID 0000:00:03.0 
Total memory: 3.94GiB 
Free memory: 3.91GiB 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0) 
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
1/1 [==============================] - 2s 
______________________________________________ 
Image Name: image.jpg 
Categories: 
1. shrek 100.00% 
2. darth vader 0.00% 

I suspect the out-of-memory error is related to GPU memory. The K520 GPU you are running on has 4GB of memory, minus various overheads, and it looks like about 3.6GB is allocated. So you are most likely out of (GPU) memory. –


Yes, that makes sense. Still, I'm not convinced it should be using 3.6GB of GPU memory; this training algorithm was able to run on a machine with only 1.5GB of GPU memory. I'd like to know where the extra memory is going and how to limit it so the error can be avoided. –


The easiest way I know of is to edit IsEnabled in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/log_memory.cc so that it always returns true, and rerun. That prints every tensor allocation together with the name of the operation responsible for it. –
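
(As a rough Python-side alternative to that C++ edit, per-op memory usage can also be collected through session run metadata. This is only a sketch, assuming a TensorFlow 0.x/1.x graph and pre-existing `sess`, `fetches`, and `feed_dict` objects from the training script:)

import tensorflow as tf

# Ask the runtime to record full step statistics for this run.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

sess.run(fetches, feed_dict=feed_dict,
         options=run_options, run_metadata=run_metadata)

# Print how many bytes each op allocated on each device.
for dev_stats in run_metadata.step_stats.dev_stats:
    for node_stats in dev_stats.node_stats:
        for mem in node_stats.memory:
            if mem.total_bytes:
                print(dev_stats.device, node_stats.node_name,
                      mem.allocator_name, mem.total_bytes)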

Answers


I posted this answer on another question, but thought I would share it here as well. I don't know the size of your model, but in my case I was running a fairly small model and was somewhat surprised that such a small model was giving OOM errors.

Specifically, I ran into out-of-memory errors when training a small CNN on a GTX 970. Somewhat by accident, I discovered that telling TensorFlow to allocate GPU memory as needed (instead of up front) resolved all my problems. This can be done with the following Python code:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory as needed, not all up front
sess = tf.Session(config=config)

Previously, TensorFlow would pre-allocate ~90% of GPU memory. For some unknown reason, this would later result in out-of-memory errors when I increased the size of the network. With the code above, I no longer get OOM errors.
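
(Since the question uses Keras and asks how to limit GPU memory, the same ConfigProto can also cap the fraction of GPU memory TensorFlow claims, and the configured session can be handed to Keras. A sketch, assuming the TensorFlow backend and the Keras/TF APIs of that era; the 0.4 fraction is just an example value:)

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Optionally cap TensorFlow at a fraction of total GPU memory (here 40%).
config.gpu_options.per_process_gpu_memory_fraction = 0.4

sess = tf.Session(config=config)
K.set_session(sess)  # make Keras use this session instead of creating its own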
