This repository was archived by the owner on Dec 9, 2024. It is now read-only.
Error running in replicated mode #111
Open
Description
System information:
OS platform: Ubuntu 16.04
TensorFlow: installed from source
Python version: Python 2.7.5
- Run with the command:
python tf_cnn_benchmarks.py --num_batches 100 --display_every 1 --num_gus 8 --model resnet50 --batch_size 64 --data_name imagenet --data_dir /root/imagenet_data --xla True --variable_update replicated --local_parameter_device gpu
I got the following error:
1 images/sec: 314.7 +/- 0.0 (jitter = 0.0) 9.717
2 images/sec: 314.9 +/- 0.2 (jitter = 0.3) 9.874
3 images/sec: 315.5 +/- 0.5 (jitter = 0.7) 9.294
4 images/sec: 315.2 +/- 0.5 (jitter = 0.7) 9.784
5 images/sec: 315.1 +/- 0.4 (jitter = 0.6) 8.846
6 images/sec: 315.0 +/- 0.3 (jitter = 0.5) 8.822
7 images/sec: 315.1 +/- 0.3 (jitter = 0.6) 8.449
8 images/sec: 315.1 +/- 0.3 (jitter = 0.5) 8.233
9 images/sec: 314.9 +/- 0.3 (jitter = 0.6) 8.213
10 images/sec: 314.9 +/- 0.2 (jitter = 0.5) 8.291
11 images/sec: 315.0 +/- 0.2 (jitter = 0.5) 8.054
12 images/sec: 315.1 +/- 0.2 (jitter = 0.8) 8.295
13 images/sec: 315.2 +/- 0.2 (jitter = 0.9) 8.510
14 images/sec: 315.2 +/- 0.2 (jitter = 0.9) 8.074
15 images/sec: 315.3 +/- 0.2 (jitter = 0.8) 8.225
16 images/sec: 315.4 +/- 0.2 (jitter = 0.9) 8.041
17 images/sec: 315.4 +/- 0.2 (jitter = 0.9) 8.122
18 images/sec: 315.2 +/- 0.2 (jitter = 0.9) 8.068
19 images/sec: 315.2 +/- 0.2 (jitter = 0.8) 8.036
20 images/sec: 315.2 +/- 0.2 (jitter = 0.9) 8.120
21 images/sec: 315.3 +/- 0.2 (jitter = 1.0) 8.074
22 images/sec: 315.3 +/- 0.2 (jitter = 1.2) 8.101
23 images/sec: 315.4 +/- 0.2 (jitter = 1.2) 8.182
24 images/sec: 315.4 +/- 0.2 (jitter = 1.2) 8.302
25 images/sec: 315.5 +/- 0.2 (jitter = 1.3) 7.991
26 images/sec: 315.5 +/- 0.2 (jitter = 1.2) 8.184
27 images/sec: 315.4 +/- 0.2 (jitter = 1.3) 8.307
28 images/sec: 315.4 +/- 0.2 (jitter = 1.2) 8.022
29 images/sec: 315.4 +/- 0.2 (jitter = 1.3) 8.061
30 images/sec: 315.4 +/- 0.2 (jitter = 1.3) 7.962
31 images/sec: 315.4 +/- 0.2 (jitter = 1.2) 8.218
32 images/sec: 315.4 +/- 0.2 (jitter = 1.3) 7.944
33 images/sec: 315.4 +/- 0.2 (jitter = 1.3) 8.070
34 images/sec: 315.3 +/- 0.2 (jitter = 1.3) 7.977
35 images/sec: 315.3 +/- 0.2 (jitter = 1.3) 7.940
36 images/sec: 315.3 +/- 0.2 (jitter = 1.3) 7.910
37 images/sec: 315.3 +/- 0.2 (jitter = 1.3) 6808459.000
38 images/sec: 315.3 +/- 0.2 (jitter = 1.3) 9828381.000
39 images/sec: 315.2 +/- 0.2 (jitter = 1.3) 9444037.000
40 images/sec: 315.2 +/- 0.2 (jitter = 1.4) 11600396.000
Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 47, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "tf_cnn_benchmarks.py", line 43, in main
    bench.run()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1097, in run
    return self._benchmark_cnn()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1332, in _benchmark_cnn
    fetch_summary)
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 584, in benchmark_one_step
    results = sess.run(fetches, options=run_options, run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Retval[0] does not have value
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 467, in run
    global_step_val, = self.sess.run([self.global_step_op])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1053, in _run
    raise RuntimeError('Attempted to use a closed Session.')
RuntimeError: Attempted to use a closed Session.
- Run with the command:
python tf_cnn_benchmarks.py --num_batches 100 --display_every 1 --num_gus 8 --model resnet50 --batch_size 64 --data_name imagenet --data_dir /root/imagenet_data --xla True --variable_update replicated --all_reduce_spec nccl --local_parameter_device gpu
I got the following error:
Step Img/sec loss
1 images/sec: 763.6 +/- 0.0 (jitter = 0.0) nan
2 images/sec: 761.4 +/- 1.6 (jitter = 3.3) nan
3 images/sec: 757.3 +/- 3.5 (jitter = 6.6) nan
4 images/sec: 755.0 +/- 3.3 (jitter = 8.1) nan
5 images/sec: 756.0 +/- 2.8 (jitter = 6.6) nan
6 images/sec: 756.6 +/- 2.4 (jitter = 3.5) nan
7 images/sec: 755.3 +/- 2.4 (jitter = 6.6) nan
8 images/sec: 756.8 +/- 2.5 (jitter = 9.1) nan
9 images/sec: 756.6 +/- 2.2 (jitter = 6.6) nan
10 images/sec: 756.5 +/- 2.0 (jitter = 6.6) nan
11 images/sec: 757.3 +/- 2.0 (jitter = 6.6) nan
12 images/sec: 757.8 +/- 1.9 (jitter = 6.2) nan
2018-01-05 19:50:07.566284: E tensorflow/stream_executor/cuda/cuda_dnn.cc:2456] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED
2018-01-05 19:50:07.566345: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566369: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566374: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566378: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566382: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566387: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566391: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566395: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566401: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566409: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566430: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566437: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566442: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566448: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566453: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566459: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566464: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
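In both runs the loss goes bad before the crash: in the first log it jumps from about 7.9 to 6808459.000 at step 37, and in the second it is nan from step 1. A minimal sketch for locating that point in a saved log might look like the following. The function name `first_bad_step`, the `jump_factor` threshold, and the regular expression are assumptions for illustration; the pattern is inferred from the log lines quoted above, not taken from the benchmark source.

```python
import math
import re

# Matches progress lines such as:
#   "37 images/sec: 315.3 +/- 0.2 (jitter = 1.3) 6808459.000"
# Inferred from the excerpts in this report (assumption, not the official format).
LINE_RE = re.compile(
    r"^(\d+)\s+images/sec:\s+[\d.]+\s+\+/-\s+[\d.]+\s+\(jitter = [\d.]+\)\s+(\S+)$"
)


def first_bad_step(lines, jump_factor=100.0):
    """Return (step, loss) for the first non-finite loss, or the first loss
    more than jump_factor times the previous one; return None if none found.

    jump_factor is a hypothetical threshold chosen for illustration.
    """
    prev = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip headers, tracebacks, and CUDA error lines
        step, loss = int(m.group(1)), float(m.group(2))
        if not math.isfinite(loss):
            return step, loss
        if prev is not None and prev > 0 and loss > jump_factor * prev:
            return step, loss
        prev = loss
    return None
```

Feeding the first log through `first_bad_step` would point at step 37, a few steps before the `InvalidArgumentError`, which suggests the numerical problem precedes the crash rather than being caused by it.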