This repository was archived by the owner on Dec 9, 2024. It is now read-only.

Error running in replicated mode #111

Open
@Agoniii

System information:
OS Platform: ubuntu 16.04
TensorFlow: installed from source
Python version: Python 2.7.5

  1. Run with the command:
    python tf_cnn_benchmarks.py --num_batches 100 --display_every 1 --num_gus 8 --model resnet50 --batch_size 64 --data_name imagenet --data_dir /root/imagenet_data --xla True --variable_update replicated --local_parameter_device gpu

I got the following error:

1	images/sec: 314.7 +/- 0.0 (jitter = 0.0)	9.717
2	images/sec: 314.9 +/- 0.2 (jitter = 0.3)	9.874
3	images/sec: 315.5 +/- 0.5 (jitter = 0.7)	9.294
4	images/sec: 315.2 +/- 0.5 (jitter = 0.7)	9.784
5	images/sec: 315.1 +/- 0.4 (jitter = 0.6)	8.846
6	images/sec: 315.0 +/- 0.3 (jitter = 0.5)	8.822
7	images/sec: 315.1 +/- 0.3 (jitter = 0.6)	8.449
8	images/sec: 315.1 +/- 0.3 (jitter = 0.5)	8.233
9	images/sec: 314.9 +/- 0.3 (jitter = 0.6)	8.213
10	images/sec: 314.9 +/- 0.2 (jitter = 0.5)	8.291
11	images/sec: 315.0 +/- 0.2 (jitter = 0.5)	8.054
12	images/sec: 315.1 +/- 0.2 (jitter = 0.8)	8.295
13	images/sec: 315.2 +/- 0.2 (jitter = 0.9)	8.510
14	images/sec: 315.2 +/- 0.2 (jitter = 0.9)	8.074
15	images/sec: 315.3 +/- 0.2 (jitter = 0.8)	8.225
16	images/sec: 315.4 +/- 0.2 (jitter = 0.9)	8.041
17	images/sec: 315.4 +/- 0.2 (jitter = 0.9)	8.122
18	images/sec: 315.2 +/- 0.2 (jitter = 0.9)	8.068
19	images/sec: 315.2 +/- 0.2 (jitter = 0.8)	8.036
20	images/sec: 315.2 +/- 0.2 (jitter = 0.9)	8.120
21	images/sec: 315.3 +/- 0.2 (jitter = 1.0)	8.074
22	images/sec: 315.3 +/- 0.2 (jitter = 1.2)	8.101
23	images/sec: 315.4 +/- 0.2 (jitter = 1.2)	8.182
24	images/sec: 315.4 +/- 0.2 (jitter = 1.2)	8.302
25	images/sec: 315.5 +/- 0.2 (jitter = 1.3)	7.991
26	images/sec: 315.5 +/- 0.2 (jitter = 1.2)	8.184
27	images/sec: 315.4 +/- 0.2 (jitter = 1.3)	8.307
28	images/sec: 315.4 +/- 0.2 (jitter = 1.2)	8.022
29	images/sec: 315.4 +/- 0.2 (jitter = 1.3)	8.061
30	images/sec: 315.4 +/- 0.2 (jitter = 1.3)	7.962
31	images/sec: 315.4 +/- 0.2 (jitter = 1.2)	8.218
32	images/sec: 315.4 +/- 0.2 (jitter = 1.3)	7.944
33	images/sec: 315.4 +/- 0.2 (jitter = 1.3)	8.070
34	images/sec: 315.3 +/- 0.2 (jitter = 1.3)	7.977
35	images/sec: 315.3 +/- 0.2 (jitter = 1.3)	7.940
36	images/sec: 315.3 +/- 0.2 (jitter = 1.3)	7.910
37	images/sec: 315.3 +/- 0.2 (jitter = 1.3)	6808459.000
38	images/sec: 315.3 +/- 0.2 (jitter = 1.3)	9828381.000
39	images/sec: 315.2 +/- 0.2 (jitter = 1.3)	9444037.000
40	images/sec: 315.2 +/- 0.2 (jitter = 1.4)	11600396.000
Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 47, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "tf_cnn_benchmarks.py", line 43, in main
    bench.run()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1097, in run
    return self._benchmark_cnn()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1332, in _benchmark_cnn
    fetch_summary)
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 584, in benchmark_one_step
    results = sess.run(fetches, options=run_options, run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Retval[0] does not have value
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 467, in run
    global_step_val, = self.sess.run([self.global_step_op])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1053, in _run
    raise RuntimeError('Attempted to use a closed Session.')
RuntimeError: Attempted to use a closed Session.

  2. Run with the command:
    python tf_cnn_benchmarks.py --num_batches 100 --display_every 1 --num_gus 8 --model resnet50 --batch_size 64 --data_name imagenet --data_dir /root/imagenet_data --xla True --variable_update replicated --all_reduce_spec nccl --local_parameter_device gpu

I got the following error:

Step	Img/sec	loss
1	images/sec: 763.6 +/- 0.0 (jitter = 0.0)	nan
2	images/sec: 761.4 +/- 1.6 (jitter = 3.3)	nan
3	images/sec: 757.3 +/- 3.5 (jitter = 6.6)	nan
4	images/sec: 755.0 +/- 3.3 (jitter = 8.1)	nan
5	images/sec: 756.0 +/- 2.8 (jitter = 6.6)	nan
6	images/sec: 756.6 +/- 2.4 (jitter = 3.5)	nan
7	images/sec: 755.3 +/- 2.4 (jitter = 6.6)	nan
8	images/sec: 756.8 +/- 2.5 (jitter = 9.1)	nan
9	images/sec: 756.6 +/- 2.2 (jitter = 6.6)	nan
10	images/sec: 756.5 +/- 2.0 (jitter = 6.6)	nan
11	images/sec: 757.3 +/- 2.0 (jitter = 6.6)	nan
12	images/sec: 757.8 +/- 1.9 (jitter = 6.2)	nan
2018-01-05 19:50:07.566284: E tensorflow/stream_executor/cuda/cuda_dnn.cc:2456] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED
2018-01-05 19:50:07.566345: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566369: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566374: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566378: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566382: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566387: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566391: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566395: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566401: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566409: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566430: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566437: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566442: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566448: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566453: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566459: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566464: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
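
In both runs the loss column goes bad before the crash: it jumps from about 8 to several million at step 37 in the first run, and it is nan from step 1 in the second. A minimal sketch for finding that step automatically in a saved log follows. It assumes the step-line layout shown above and Python 3; the function name `first_bad_step` and the `blowup_factor` threshold are hypothetical and not part of tf_cnn_benchmarks.

```python
import math
import re

# Matches lines such as:
#   37<TAB>images/sec: 315.3 +/- 0.2 (jitter = 1.3)<TAB>6808459.000
# Group 1 is the step number, group 2 is the loss column.
STEP_LINE = re.compile(r"^(\d+)\s+images/sec:.*\)\s+(\S+)\s*$")


def first_bad_step(log_lines, blowup_factor=1000.0):
    """Return (step, loss) for the first suspicious step, or None.

    A step is suspicious when its loss is not finite (nan or inf) or when
    it exceeds blowup_factor times the first finite loss seen in the log.
    The threshold is an arbitrary illustration, not a tuned value.
    """
    baseline = None
    for line in log_lines:
        match = STEP_LINE.match(line)
        if match is None:
            continue  # header, traceback, or CUDA error line
        step = int(match.group(1))
        loss = float(match.group(2))  # float("nan") parses fine
        if not math.isfinite(loss):
            return step, loss
        if baseline is None:
            baseline = loss
        elif loss > blowup_factor * baseline:
            return step, loss
    return None


if __name__ == "__main__":
    sample = [
        "36\timages/sec: 315.3 +/- 0.2 (jitter = 1.3)\t7.910",
        "37\timages/sec: 315.3 +/- 0.2 (jitter = 1.3)\t6808459.000",
    ]
    print(first_bad_step(sample))
```

Running this over the full output of each command would report the first step worth inspecting, which may help narrow down whether the divergence comes from the replicated variable update, XLA, or the all-reduce setting.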
