
I am training an LSTM-RNN on some music data in TensorFlow and I am running into a GPU memory allocation problem that I don't understand: I get an OOM even though it looks like there should still be enough VRAM available. Some background: I am working on Ubuntu Gnome 16.04, using a GTX 1060 6GB, an Intel Xeon E3-1231V3 and 8 GB of RAM. Here is the first part of the error message, the part I can make sense of; I will add the whole error message again at the end for anyone who might ask for it in order to help: Tensorflow OOM on GPU

I tensorflow/core/common_runtime/bfc_allocator.cc:696] 8 Chunks of size 256 totalling 2.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1280 totalling 1.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 44288 totalling 216.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 56064 totalling 273.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 154350080 totalling 588.80MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 813400064 totalling 2.27GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1612612352 totalling 1.50GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 4.35GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:     5484118016
InUse:     4670717952
MaxInUse:    5484118016
NumAllocs:      29
MaxAllocSize:   1612612352

W tensorflow/core/common_runtime/bfc_allocator.cc:274] *********************___________*__***************************************************xxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 775.72MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[14525,14000]

So I can read that a maximum of 5484118016 bytes can be allocated, 4670717952 bytes are already in use, and another 775.72MiB = 775720000 bytes are to be allocated. 5484118016 bytes - 4670717952 bytes - 775720000 bytes = 37680064 bytes, according to my calculator. So there should still be about 37 MB of free VRAM left even after making room for the new tensor it wants to put there. This seems quite plausible to me, since TensorFlow presumably (I guess?) will not try to allocate more VRAM than is still available and would just keep the rest of the data in RAM or something like that.
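
As a side note on where the 775.72MiB figure comes from: the tensor in the OOM message has shape [14525,14000], and assuming it is float32 (4 bytes per element, which is the TensorFlow default), its size works out to roughly that amount. A minimal check:

rows, cols = 14525, 14000
bytes_per_element = 4  # float32 is an assumption, but it is the TF default dtype

total_bytes = rows * cols * bytes_per_element
print(total_bytes)                       # 813400000
print(total_bytes / (1024 ** 2), "MiB")  # ~775.7 MiB, matching the log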

Now it seems there is some big mistake in my thinking, and I would be very grateful if someone could explain to me what that mistake is. The obvious way around my problem is to make my batches a bit smaller, since at around 1.5 GB each they are probably just too big. Still, I would like to know what the actual problem is.

edit: I found something that told me to try:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'
with tf.Session(config=config) as s:
    ...  # build and run the graph as before

which still does not work, but since the TensorFlow documentation lacks any explanation of what

gpu_options.allocator_type = 'BFC' 

actually does, I would like to ask you.
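
For reference, the other memory-related gpu_options fields in the TF 1.x ConfigProto are allow_growth and per_process_gpu_memory_fraction. A minimal sketch of how they would be set (this is just the API surface, not a confirmed fix for this particular OOM):

import tensorflow as tf

config = tf.ConfigProto()
# Grow the GPU memory pool on demand instead of reserving
# (nearly) all VRAM up front.
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of total VRAM this process may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.8

with tf.Session(config=config) as sess:
    ...  # build and run the graph as usual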

Adding the rest of the error message for anyone interested:

Sorry for the long copy/paste, but maybe somebody needs/wants to see it.

Thank you very much in advance, Leon

(gputensorflow) [email protected]:~/Tensorflow$ python Netzwerk_v0.5.1_gamma.py 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX 1060 6GB 
major: 6 minor: 1 memoryClockRate (GHz) 1.7335 
pciBusID 0000:01:00.0 
Total memory: 5.93GiB 
Free memory: 5.40GiB 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0) 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2048): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4096): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8192): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16384):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (32768):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (65536):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (131072): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (262144): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (524288): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1048576): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2097152): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4194304): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8388608): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16777216): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (33554432): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (67108864): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (134217728):  Total Chunks: 1, Chunks in use: 0 147.20MiB allocated for chunks. 147.20MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (268435456):  Total Chunks: 1, Chunks in use: 0 628.52MiB allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 775.72MiB was 256.00MiB, Chunk State: 
I tensorflow/core/common_runtime/bfc_allocator.cc:666] Size: 628.52MiB | Requested Size: 0B | in_use: 0, prev: Size: 147.20MiB | Requested Size: 147.20MiB | in_use: 1, next: Size: 54.8KiB | Requested Size: 54.7KiB | in_use: 1 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208000000 of size 1280 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208000500 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208000600 of size 56064 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1020800e100 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1020800e200 of size 44288 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208018f00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208019000 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208019100 of size 813400064 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102387d1100 of size 56064 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102387dec00 of size 154350080 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b11e00 of size 44288 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b1cb00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b1cc00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b1cd00 of size 154350080 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102722d4d00 of size 56064 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b615a00 of size 44288 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b620700 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b620800 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b620900 of size 813400064 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102abdd8900 of size 813400064 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102dc590900 of size 56064 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102dc59e400 of size 56064 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102dc5abf00 of size 154350080 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102e58df100 of size 154350080 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102eec12300 of size 44288 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102eec1d000 of size 44288 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102eec27d00 of size 1612612352 
I tensorflow/core/common_runtime/bfc_allocator.cc:687] Free at 0x1024ae4ff00 of size 659049984 
I tensorflow/core/common_runtime/bfc_allocator.cc:687] Free at 0x102722e2800 of size 154350080 
I tensorflow/core/common_runtime/bfc_allocator.cc:693]  Summary of in-use Chunks by size: 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 8 Chunks of size 256 totalling 2.0KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1280 totalling 1.2KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 44288 totalling 216.2KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 56064 totalling 273.8KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 154350080 totalling 588.80MiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 813400064 totalling 2.27GiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1612612352 totalling 1.50GiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 4.35GiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats: 
Limit:     5484118016 
InUse:     4670717952 
MaxInUse:    5484118016 
NumAllocs:      29 
MaxAllocSize:   1612612352 

W tensorflow/core/common_runtime/bfc_allocator.cc:274] *********************___________*__***************************************************xxxxxxxxxxxxxx 
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 775.72MiB. See logs for memory state. 
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[14525,14000] 
Traceback (most recent call last): 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call 
    return fn(*args) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn 
    status, run_metadata) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/contextlib.py", line 66, in __exit__ 
    next(self.gen) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status 
    pywrap_tensorflow.TF_GetCode(status)) 
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[14525,14000] 
    [[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]] 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
    File "Netzwerk_v0.5.1_gamma.py", line 171, in <module> 
    session.run(tf.global_variables_initializer()) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 767, in run 
    run_metadata_ptr) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 965, in _run 
    feed_dict_string, options, run_metadata) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run 
    target_list, options, run_metadata) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call 
    raise type(e)(node_def, op, message) 
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[14525,14000] 
    [[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]] 

Caused by op 'rnn/basic_lstm_cell/weights/Initializer/random_uniform', defined at: 
    File "Netzwerk_v0.5.1_gamma.py", line 94, in <module> 
    initial_state=initial_state, time_major=False)  # time_major = FALSE currently 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 545, in dynamic_rnn 
    dtype=dtype) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 712, in _dynamic_rnn_loop 
    swap_memory=swap_memory) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2626, in while_loop 
    result = context.BuildLoop(cond, body, loop_vars, shape_invariants) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2459, in BuildLoop 
    pred, body, original_loop_vars, loop_vars, shape_invariants) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2409, in _BuildLoop 
    body_result = body(*packed_vars_for_body) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 697, in _time_step 
    (output, new_state) = call_cell() 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 683, in <lambda> 
    call_cell = lambda: cell(input_t, state) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py", line 179, in __call__ 
    concat = _linear([inputs, h], 4 * self._num_units, True, scope=scope) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py", line 747, in _linear 
    "weights", [total_arg_size, output_size], dtype=dtype) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 988, in get_variable 
    custom_getter=custom_getter) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 890, in get_variable 
    custom_getter=custom_getter) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 348, in get_variable 
    validate_shape=validate_shape) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 333, in _true_getter 
    caching_device=caching_device, validate_shape=validate_shape) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 684, in _get_single_variable 
    validate_shape=validate_shape) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 226, in __init__ 
    expected_shape=expected_shape) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 303, in _init_from_args 
    initial_value(), name="initial_value", dtype=dtype) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 673, in <lambda> 
    shape.as_list(), dtype=dtype, partition_info=partition_info) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/init_ops.py", line 360, in __call__ 
    dtype, seed=self.seed) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/random_ops.py", line 246, in random_uniform 
    return math_ops.add(rnd * (maxval - minval), minval, name=name) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 73, in add 
    result = _op_def_lib.apply_op("Add", x=x, y=y, name=name) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op 
    op_def=op_def) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op 
    original_op=self._default_original_op, op_def=op_def) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__ 
    self._traceback = _extract_stack() 

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[14525,14000] 
    [[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]] 

I ran into this recently and hit a resource-exhausted problem during training. I followed https://github.com/tensorflow/tensorflow/issues/4735 and solved it by reducing the validation batch size. – RyanLiu

Answers

3

Try taking a look at this:

Be careful not to run the evaluation and training binary on the same GPU or else you might run out of memory. Consider running the evaluation on a separate GPU if available or suspending the training binary while running the evaluation on the same GPU.

https://www.tensorflow.org/tutorials/deep_cnn
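
One common way to follow that advice with two processes is to restrict which GPU each process is allowed to see. A minimal sketch using the CUDA_VISIBLE_DEVICES environment variable (the index 1 is illustrative and assumes a second GPU exists):

import os

# Must be set before TensorFlow initializes CUDA, i.e. before
# "import tensorflow". GPU index 1 is just an example.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf  # this process now only sees the selected GPU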

1

I solved this problem by reducing batch_size to 52; the only way to reduce memory usage here is to reduce the batch_size.

The usable batch_size depends on your graphics card: the amount of VRAM, cache, etc.

Please refer to this: Another Stack Overflow Link

0

When you hit OOM on a GPU, I believe changing the batch size is the right solution to try first.

Different GPUs may need different batch sizes based on the GPU memory you have.

Recently I faced a similar kind of problem and tweaked a lot of things to do different kinds of experiments.

Here is the link to the question (some tricks are included there as well).

However, with a reduced batch size you may find that training becomes slower. So if you have multiple GPUs, you can use them. To check your GPUs, you can type on the terminal:

nvidia-smi 

This will show you the necessary information about your GPUs.
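
If more than one GPU shows up there, part of the graph can be pinned to a specific device. A minimal sketch with tf.device (the "/gpu:1" string assumes a second GPU is present, and the ops are placeholders):

import tensorflow as tf

# Place these ops on the second GPU; "/gpu:1" only exists if
# nvidia-smi actually lists a second device.
with tf.device("/gpu:1"):
    a = tf.constant([1.0, 2.0, 3.0])
    b = a * 2.0

# allow_soft_placement falls back to an available device instead of
# failing if /gpu:1 does not exist.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(b))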