[Problem Log] CUDA_ERROR_LAUNCH_TIMEOUT when training with tensorflow-gpu

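TL;DR

How to approach the problem:

  • Go through your installation from start to finish and confirm that every piece is in place: the C++ build tools, CUDA, the cuDNN configuration, and the tensorflow-gpu package itself.
  • Check that the versions of Python, CUDA, cuDNN, and TensorFlow match and are compatible with each other.
  • Does the error appear only when running one particular script?
    • No: move on to the next step.
    • Yes: check whether the code is simply too heavy for your machine.
      You can run the simple example from the TensorFlow website as a sanity check: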
      >>> import tensorflow as tf
      >>> hello = tf.constant('Hello, TensorFlow!')
      >>> sess = tf.Session()
      >>> print(sess.run(hello))
      
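  • The most reliable fix so far: build TensorFlow from source.
    Building from source tailors TensorFlow to your machine's hardware so that it gets the most out of it. It also enables instruction sets such as AVX for additional speedup, which helps with compute efficiency to some extent.
    References: Build from source on Windows
    Speed up TensorFlow inference by compiling it from source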
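
The version check in the second step can be sketched in a few lines of plain Python. This is only an illustration: `EXPECTED` is an assumed compatibility row that you should replace with the row for your release from the official "tested build configurations" table, and the Python version in `installed` is hypothetical because the post does not state it. `matches` and `check_stack` are made-up helper names.

```python
# Assumed compatibility row; verify it against the official table
# for your TensorFlow release before relying on it.
EXPECTED = {
    "tensorflow": ("1.11",),
    "python": ("3.5", "3.6"),
    "cuda": ("9",),
    "cudnn": ("7",),
}


def matches(version, prefix):
    """True if `version` begins with the dot-separated parts of `prefix`."""
    v_parts = version.split(".")
    p_parts = prefix.split(".")
    return v_parts[:len(p_parts)] == p_parts


def check_stack(installed, expected):
    """Return a list of human-readable mismatches (empty if all match)."""
    problems = []
    for name, accepted in expected.items():
        found = installed.get(name)
        if found is None:
            problems.append("%s: version unknown" % name)
        elif not any(matches(found, a) for a in accepted):
            problems.append(
                "%s: found %s, expected one of %s"
                % (name, found, ", ".join(accepted))
            )
    return problems


# Versions from the Environment section; the Python version is hypothetical.
installed = {
    "tensorflow": "1.11.0",
    "python": "3.6.5",
    "cuda": "9.0",
    "cudnn": "7.3",
}

for line in check_stack(installed, EXPECTED) or ["all versions match"]:
    print(line)
```

Comparing dot-separated parts instead of raw string prefixes avoids treating, say, "90.1" as a match for "9".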

Problem log

> python .\0042_demo.py
...
Extracting MNIST_data\t10k-images-idx3-ubyte.gz
Extracting MNIST_data\t10k-labels-idx1-ubyte.gz
...
2018-09-28 14:56:04.341923: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with
1409 MB memory) -> physical GPU (device: 0, name: GeForce 940MX, pci bus id: 0000:01:00.0, compute capability: 5.0)
Iter 0, Test Accuracy 0.9493 Training Accuracy 0.9581636
2018-09-28 14:56:20.996376: E tensorflow/stream_executor/cuda/cuda_driver.cc:1000] could not wait stream on event: CUDA_ERROR_LAUNCH_TIMEOUT: the launch timed out and was terminated
2018-09-28 14:56:20.996373: E tensorflow/stream_executor/cuda/cuda_driver.cc:1130] failed to enqueue async memcpy from host to device: CUDA_ERROR_LAUNCH_TIMEOUT:
the launch timed out and was terminated; GPU dst: 0000000402DD1100; host src: 000001196EC8CB80; size: 313600=0x4c900
2018-09-28 14:56:20.996423: E tensorflow/stream_executor/cuda/cuda_driver.cc:1000] could not wait stream on event: CUDA_ERROR_LAUNCH_TIMEOUT: the launch timed out and was terminated
2018-09-28 14:56:21.015902: I tensorflow/stream_executor/stream.cc:4986] [stream=0000011977EA22B0,impl=00000119008317F0] did not memcpy host-to-device; source: 0000011966E1FC00
2018-09-28 14:56:21.093012: E tensorflow/stream_executor/stream.cc:325] Error recording event in stream: error recording CUDA event on stream 000001197FBFD2C0: CUDA_ERROR_LAUNCH_TIMEOUT: the launch timed out and was terminated; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2018-09-28 14:56:21.103354: I tensorflow/stream_executor/stream.cc:4986] [stream=0000011977EA22B0,impl=00000119008317F0] did not memcpy host-to-device; source: 000001196EBA6180
2018-09-28 14:56:21.128940: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_TIMEOUT:
the launch timed out and was terminated
2018-09-28 14:56:21.179315: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1

Environment

  • Win 10
  • GeForce 940MX
  • CUDA 9.0
  • CuDNN 7.3 for CUDA9.0
  • Tensorflow 1.11.0
  • Visual Studio 2010

Others who ran into the same problem

#1060

https://github.com/tensorflow/tensorflow/issues/1060

The recommended solution there is:

From a different issue #2810, we've found some problems with 940M cuda driver. The problem was solved by:
#2810 (comment)

  1. Build from source while explicitly setting 5.0 build target in "configure".
  2. Or install the latest graphics driver 367.27.
    Not sure whether it is related. But it is worth trying.

#8517

https://github.com/tensorflow/tensorflow/issues/8517

Than you poxvoculi, it occurs every time I run the program.
Actually, this issue does not occur on the TensorFlow built from source. But it does occur on pip version.
BTW, I think it only happens on multi-gpu system.

cudaErrorLaunchTimeout

This indicates that the device kernel took too long to execute. This can only occur if timeouts are enabled - see the device property kernelExecTimeoutEnabled for more information. The device cannot be used until cudaThreadExit() is called. All existing device memory allocations are invalid and must be reconstructed if the program is to continue using CUDA.
Source: todayq's CSDN blog, full text at https://blog.csdn.net/dan1900/article/details/17411203?utm_source=copy

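Current workaround: disabling the GPU TDR (did not help)

After some searching on Baidu, it turned out that the culprit was Windows' Timeout Detection and Recovery (TDR) feature for the graphics card. To disable TDR, create a DWORD value named TdrLevel under HKLM\System\CurrentControlSet\Control\GraphicsDrivers and set it to 0.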
https://answers.microsoft.com/zh-hans/windows/forum/windows_7-hardware/win7%E4%B8%AD%E5%A6%82%E4%BD%95%E9%85%8D%E7%BD%AE/69384e71-5075-4afe-a437-372425c0a3bb?auth=1
Source: qq_32464407's CSDN blog, full text at https://blog.csdn.net/qq_32464407/article/details/79164305?utm_source=copy
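
The registry change quoted above can also be written out as a .reg file instead of being typed into the registry editor by hand. The sketch below only builds the file's text. The helper name `tdr_reg_file` is made up for illustration, and the key path and value name are the ones quoted above. Importing the resulting file requires administrator rights, and TDR settings typically take effect only after a restart.

```python
def tdr_reg_file(tdr_level=0):
    """Build the text of a .reg file that sets TdrLevel (0 disables TDR)."""
    key = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers"
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[%s]" % key,
        # DWORD values are written as eight hexadecimal digits.
        '"TdrLevel"=dword:%08x' % tdr_level,
        "",
    ]
    return "\r\n".join(lines)


print(tdr_reg_file())
```

As noted in the heading, this setting did not resolve the timeout on the machine described here.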

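So after all that debugging, the cause was simply that my computer is not powerful enough.

  • The fixes suggested online, including building from source and updating the graphics driver, all aim to squeeze out more performance and push the bottleneck back.
  • I reduced the number of hidden-layer neurons from 2000 to 200, and everything ran perfectly.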
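
To get a feel for how much that change shrinks the model, here is a rough parameter count. The post does not give the exact architecture, so this assumes a single fully connected hidden layer between MNIST's 784 inputs and 10 outputs. `dense_params` is a hypothetical helper.

```python
def dense_params(layer_sizes):
    """Weights plus biases of a stack of fully connected layers."""
    pairs = zip(layer_sizes, layer_sizes[1:])
    return sum(n_in * n_out + n_out for n_in, n_out in pairs)


# Assumed architecture: 784 inputs -> hidden layer -> 10 outputs.
wide = dense_params([784, 2000, 10])
narrow = dense_params([784, 200, 10])

print(wide)    # 1590010
print(narrow)  # 159010
print(round(wide / narrow, 1))  # 10.0
```

Under that assumption the smaller network has about a tenth of the weights, so each training step asks the GPU for roughly a tenth of the multiply-adds. That is consistent with the kernels now finishing before the timeout.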

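Device-related code

  • Run on the CPU: tf.device() pins operations to a local or remote device
import tensorflow as tf
with tf.device('/cpu:0'):
  # operations created inside this block are placed on the CPU
  hello = tf.constant('Hello, TensorFlow!')
  • Log the device each operation runs on: pass a config argument to Session()
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
  print(sess.run(hello))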
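
If you want to rule out the GPU entirely without editing the model code, another common approach is to hide the CUDA devices through the CUDA_VISIBLE_DEVICES environment variable before TensorFlow is imported. The post itself does not use this approach, so treat it as a sketch. `hide_gpus` is a hypothetical helper, and the value "-1" is a widely used convention for "no devices".

```python
import os


def hide_gpus():
    """Hide all CUDA devices from libraries imported after this call."""
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


hide_gpus()

# Import TensorFlow only after the variable is set, for example:
# import tensorflow as tf
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

If the script then runs to completion on the CPU, the timeout is tied to the GPU workload and not to the model code itself.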