
P40 resnet 50 multi card random fail #11225

Closed

guochaorong opened this issue Jun 6, 2018 · 7 comments

@guochaorong (Contributor) commented Jun 6, 2018

Testing the resnet model on a single P40 machine: the 8-GPU run fails to start. (Runs with 5, 6, 7, or 8 GPUs all fail.)

run model resnet50
cudaid: 1,2,3,4,5,6,7
CUDA_VISIBLE_DEVICES: 1,2,3,4,5,6,7

----------- Configuration Arguments -----------
batch_size: 64
data_format: NCHW
data_set: flowers
device: GPU
gpu_id: 4,5,6,7
infer_only: False
iterations: 80
log_dir: ./
model: resnet_imagenet
pass_num: 5
skip_batch_num: 5
use_cprof: False
use_fake_data: True
use_nvprof: False

*** Aborted at 1528264637 (unix time) try "date -d @1528264637" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x50) received by PID 46576 (TID 0x7f13dd421700) from PID 80; stack trace: ***
@ 0x7f13dcc447e0 (unknown)
@ 0x7f131320b7d3 commFree()
@ 0x7f131320f82d ncclCommInitAll
@ 0x7f13ba1ef075 paddle::platform::NCCLContextMap::NCCLContextMap()
@ 0x7f13ba1eae01 paddle::framework::ParallelExecutor::ParallelExecutor()
@ 0x7f13ba186195 ZZN8pybind1112cpp_function10initializeIZNS_6detail4initIIRKSt6vectorIN5boost7variantIN6paddle8platform9CUDAPlaceENS8_8CPUPlaceENS8_15CUDAPinnedPlaceENS5_6detail7variant5void_ESE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_EESaISF_EERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEESS_RKNS7_9framework11ProgramDescERKSsPNST_5ScopeERS4_IS10_SaIS10_EERKNST_7details17ExecutionStrategyERKNS14_13BuildStrategyEmmEE7executeINS_6class_INST_16ParallelExecutorEIEEEIELi0EEEvRT_DpRKT0_EUlPS1E_SJ_SS_SS_SW_SY_S10_S13_S17_S1A_mmE_vIS1M_SJ_SS_SS_SW_SY_S10_S13_S17_S1A_mmEINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENKUlRNS2_13function_callEE1_clES23
@ 0x7f13ba1862ee ZZN8pybind1112cpp_function10initializeIZNS_6detail4initIIRKSt6vectorIN5boost7variantIN6paddle8platform9CUDAPlaceENS8_8CPUPlaceENS8_15CUDAPinnedPlaceENS5_6detail7variant5void_ESE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_EESaISF_EERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEESS_RKNS7_9framework11ProgramDescERKSsPNST_5ScopeERS4_IS10_SaIS10_EERKNST_7details17ExecutionStrategyERKNS14_13BuildStrategyEmmEE7executeINS_6class_INST_16ParallelExecutorEIEEEIELi0EEEvRT_DpRKT0_EUlPS1E_SJ_SS_SS_SW_SY_S10_S13_S17_S1A_mmE_vIS1M_SJ_SS_SS_SW_SY_S10_S13_S17_S1A_mmEINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS2_13function_callEE1_4_FUNES23
@ 0x7f13ba149b74 pybind11::cpp_function::dispatcher()
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dceaaf6f instancemethod_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dceee7fe slot_tp_init
@ 0x7f13dceed468 type_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dcf31877 PyEval_EvalFrameEx
@ 0x7f13dcf34120 PyEval_EvalCodeEx
@ 0x7f13dcec026d function_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dceaaf6f instancemethod_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dceee7fe slot_tp_init
@ 0x7f13dceed468 type_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dcf31877 PyEval_EvalFrameEx
@ 0x7f13dcf34120 PyEval_EvalCodeEx
@ 0x7f13dcf32491 PyEval_EvalFrameEx
@ 0x7f13dcf34120 PyEval_EvalCodeEx
@ 0x7f13dcf34232 PyEval_EvalCode
@ 0x7f13dcf4e61c run_mod
@ 0x7f13dcf4e6f0 PyRun_FileExFlags
@ 0x7f13dcf4fbfc PyRun_SimpleFileExFlags
@ 0x7f13dcf614bc Py_Main
./run.xsh: line 22: 46576 Segmentation fault

Model code:
PaddlePaddle/paddle-ce-latest-kpis@a2d1273

nccl:
continuous_evaluation]# ls -ltr /chaorong/lib/libnccl*
-rwxrwxrwx 1 root root 232842694 Feb 22 22:00 /chaorong/lib/libnccl_static.a
-rwxrwxrwx 1 root root 227911007 Feb 22 22:00 /chaorong/lib/libnccl.so.2.1.15
lrwxrwxrwx 1 root root 17 Feb 22 22:00 /chaorong/lib/libnccl.so.2 -> libnccl.so.2.1.15
lrwxrwxrwx 1 root root 12 Feb 22 22:00 /chaorong/lib/libnccl.so -> libnccl.so.2

With 4 GPUs it runs fine:

CUDA_VISIBLE_DEVICES: 4,5,6,7
----------- Configuration Arguments -----------
batch_size: 128
data_format: NCHW
data_set: flowers
device: GPU
gpu_id: 4,5,6,7
infer_only: False
iterations: 80
log_dir: ./
model: resnet_imagenet
pass_num: 5
skip_batch_num: 5
use_cprof: False
use_fake_data: True
use_nvprof: False

Pass: 0, Iter: 0, loss: 6.1183624, acc: 0.0
Pass: 0, Iter: 1, loss: 5.5844965, acc: 0.0
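
For context, the trace above shows the segfault happening during NCCL communicator setup (paddle::platform::NCCLContextMap -> ncclCommInitAll), before any iteration runs. Below is a minimal sketch that exercises the same code path, assuming the 2018-era fluid API; the tiny network is illustrative only, not the actual benchmark model:

import paddle.fluid as fluid

# Any trivial network is enough: the crash happens inside
# ParallelExecutor's constructor, before the first iteration.
image = fluid.layers.data(name="image", shape=[3, 224, 224], dtype="float32")
label = fluid.layers.data(name="label", shape=[1], dtype="int64")
predict = fluid.layers.fc(input=image, size=102, act="softmax")
loss = fluid.layers.mean(fluid.layers.cross_entropy(input=predict, label=label))

fluid.Executor(fluid.CUDAPlace(0)).run(fluid.default_startup_program())

# ParallelExecutor builds one NCCL communicator per visible GPU;
# on this machine, with >= 5 GPUs visible, ncclCommInitAll segfaults here.
train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=loss.name)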

@panyx0718 (Contributor)

single or multi machine?

@panyx0718 (Contributor)

based on this PR?
PaddlePaddle/paddle-ce-latest-kpis#34

@guochaorong (Contributor, Author)

Yes. Yu Yang suggested attaching a PR to make the problem easier to look into.

It is a single machine with P40s.

This morning this NCCL version could run 4 GPUs stably; now it can only run on a single GPU.
It seems Pan is also using this machine at the moment?

| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     69454      C   python                                       265MiB |
|    1     69454      C   python                                       265MiB |
|    2     69454      C   python                                       265MiB |
|    3     69454      C   python                                       265MiB |
|    4     69454      C   python                                       265MiB |
|    5     69454      C   python                                       265MiB |
|    6     69454      C   python                                       265MiB |
|    7     69454      C   python                                       265MiB |
+-----------------------------------------------------------------------------+
[root@yq01-gpu-255-120-24-00 ~]# ll /proc/69454/cwd
lrwxrwxrwx 1 root root 0 Jun 6 11:18 /proc/69454/cwd -> /paddle/Paddle/benchmark/fluid

How should I debug this?
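
One generic way to get more detail is NCCL's own logging: NCCL reads the NCCL_DEBUG environment variable when communicators are created. A minimal sketch, assuming it is set before the ParallelExecutor is constructed (it could equally be exported in run.xsh):

import os

# Must be set before fluid.ParallelExecutor is built, since NCCL
# reads this at ncclCommInitAll time.
os.environ["NCCL_DEBUG"] = "INFO"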

@guochaorong (Contributor, Author) commented Jun 6, 2018

After restarting the docker service, 4 GPUs work again.
With 5 or more GPUs, it fails consistently.

@chengduoZH (Contributor)

I suggest investigating whether this is a docker problem.

@guochaorong (Contributor, Author)

When using 8 GPUs, this line fails: test_exe = fluid.ParallelExecutor(...).
Following Wu Yi's fluid_benchmark.py, where the test pass does not use ParallelExecutor, it works.

Fixed in: PaddlePaddle/paddle-ce-latest-kpis#37
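
A sketch of the shape of that fix, with hypothetical variable names (test_program, train_exe, feeder, data, and loss stand in for the ones in the benchmark script; this is not the literal diff in #37): run the test program on a plain single-device Executor so that only the training side creates NCCL communicators:

import paddle.fluid as fluid

place = fluid.CUDAPlace(0)

# Before: a second ParallelExecutor for the test program also called
# ncclCommInitAll, which is where the >= 5 GPU runs crashed.
# test_exe = fluid.ParallelExecutor(use_cuda=True,
#                                   main_program=test_program,
#                                   share_vars_from=train_exe)

# After: evaluate with a plain Executor instead (hypothetical names).
test_exe = fluid.Executor(place)
loss_val, = test_exe.run(program=test_program,
                         feed=feeder.feed(data),
                         fetch_list=[loss.name])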

@chengduoZH (Contributor)

This issue has been resolved, so I am closing it.
