
P40 resnet 50 multi card random fail #11225

Closed

guochaorong opened this issue Jun 6, 2018 · 7 comments

@guochaorong (Contributor) commented Jun 6, 2018

Testing the resnet model on a single P40 machine: the 8-GPU run fails to start. (Runs with 5, 6, 7, or 8 GPUs all fail.)

run model resnet50
cudaid: 1,2,3,4,5,6,7
CUDA_VISIBLE_DEVICES: 1,2,3,4,5,6,7

----------- Configuration Arguments -----------
batch_size: 64
data_format: NCHW
data_set: flowers
device: GPU
gpu_id: 4,5,6,7
infer_only: False
iterations: 80
log_dir: ./
model: resnet_imagenet
pass_num: 5
skip_batch_num: 5
use_cprof: False
use_fake_data: True
use_nvprof: False

*** Aborted at 1528264637 (unix time) try "date -d @1528264637" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x50) received by PID 46576 (TID 0x7f13dd421700) from PID 80; stack trace: ***
@ 0x7f13dcc447e0 (unknown)
@ 0x7f131320b7d3 commFree()
@ 0x7f131320f82d ncclCommInitAll
@ 0x7f13ba1ef075 paddle::platform::NCCLContextMap::NCCLContextMap()
@ 0x7f13ba1eae01 paddle::framework::ParallelExecutor::ParallelExecutor()
@ 0x7f13ba186195 ZZN8pybind1112cpp_function10initializeIZNS_6detail4initIIRKSt6vectorIN5boost7variantIN6paddle8platform9CUDAPlaceENS8_8CPUPlaceENS8_15CUDAPinnedPlaceENS5_6detail7variant5void_ESE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_EESaISF_EERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEESS_RKNS7_9framework11ProgramDescERKSsPNST_5ScopeERS4_IS10_SaIS10_EERKNST_7details17ExecutionStrategyERKNS14_13BuildStrategyEmmEE7executeINS_6class_INST_16ParallelExecutorEIEEEIELi0EEEvRT_DpRKT0_EUlPS1E_SJ_SS_SS_SW_SY_S10_S13_S17_S1A_mmE_vIS1M_SJ_SS_SS_SW_SY_S10_S13_S17_S1A_mmEINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENKUlRNS2_13function_callEE1_clES23
@ 0x7f13ba1862ee ZZN8pybind1112cpp_function10initializeIZNS_6detail4initIIRKSt6vectorIN5boost7variantIN6paddle8platform9CUDAPlaceENS8_8CPUPlaceENS8_15CUDAPinnedPlaceENS5_6detail7variant5void_ESE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_EESaISF_EERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEESS_RKNS7_9framework11ProgramDescERKSsPNST_5ScopeERS4_IS10_SaIS10_EERKNST_7details17ExecutionStrategyERKNS14_13BuildStrategyEmmEE7executeINS_6class_INST_16ParallelExecutorEIEEEIELi0EEEvRT_DpRKT0_EUlPS1E_SJ_SS_SS_SW_SY_S10_S13_S17_S1A_mmE_vIS1M_SJ_SS_SS_SW_SY_S10_S13_S17_S1A_mmEINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS2_13function_callEE1_4_FUNES23
@ 0x7f13ba149b74 pybind11::cpp_function::dispatcher()
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dceaaf6f instancemethod_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dceee7fe slot_tp_init
@ 0x7f13dceed468 type_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dcf31877 PyEval_EvalFrameEx
@ 0x7f13dcf34120 PyEval_EvalCodeEx
@ 0x7f13dcec026d function_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dceaaf6f instancemethod_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dceee7fe slot_tp_init
@ 0x7f13dceed468 type_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dcf31877 PyEval_EvalFrameEx
@ 0x7f13dcf34120 PyEval_EvalCodeEx
@ 0x7f13dcf32491 PyEval_EvalFrameEx
@ 0x7f13dcf34120 PyEval_EvalCodeEx
@ 0x7f13dcf34232 PyEval_EvalCode
@ 0x7f13dcf4e61c run_mod
@ 0x7f13dcf4e6f0 PyRun_FileExFlags
@ 0x7f13dcf4fbfc PyRun_SimpleFileExFlags
@ 0x7f13dcf614bc Py_Main
./run.xsh: line 22: 46576 Segmentation fault

Model code:
PaddlePaddle/paddle-ce-latest-kpis@a2d1273

nccl:
continuous_evaluation]# ls -ltr /chaorong/lib/libnccl*
-rwxrwxrwx 1 root root 232842694 Feb 22 22:00 /chaorong/lib/libnccl_static.a
-rwxrwxrwx 1 root root 227911007 Feb 22 22:00 /chaorong/lib/libnccl.so.2.1.15
lrwxrwxrwx 1 root root 17 Feb 22 22:00 /chaorong/lib/libnccl.so.2 -> libnccl.so.2.1.15
lrwxrwxrwx 1 root root 12 Feb 22 22:00 /chaorong/lib/libnccl.so -> libnccl.so.2

With 4 GPUs it runs fine:

CUDA_VISIBLE_DEVICES: 4,5,6,7
----------- Configuration Arguments -----------
batch_size: 128
data_format: NCHW
data_set: flowers
device: GPU
gpu_id: 4,5,6,7
infer_only: False
iterations: 80
log_dir: ./
model: resnet_imagenet
pass_num: 5
skip_batch_num: 5
use_cprof: False
use_fake_data: True
use_nvprof: False

Pass: 0, Iter: 0, loss: 6.1183624, acc: 0.0
Pass: 0, Iter: 1, loss: 5.5844965, acc: 0.0
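
For context, the trace above shows the segfault happening during NCCL communicator setup (paddle::platform::NCCLContextMap -> ncclCommInitAll), before any iteration runs. Below is a minimal sketch that exercises the same code path, assuming the 2018-era fluid API; the tiny network is illustrative only, not the actual benchmark model:

import paddle.fluid as fluid

# Any trivial network is enough: the crash happens inside
# ParallelExecutor's constructor, before the first iteration.
image = fluid.layers.data(name="image", shape=[3, 224, 224], dtype="float32")
label = fluid.layers.data(name="label", shape=[1], dtype="int64")
predict = fluid.layers.fc(input=image, size=102, act="softmax")
loss = fluid.layers.mean(fluid.layers.cross_entropy(input=predict, label=label))

fluid.Executor(fluid.CUDAPlace(0)).run(fluid.default_startup_program())

# ParallelExecutor builds one NCCL communicator per visible GPU;
# on this machine, with >= 5 GPUs visible, ncclCommInitAll segfaults here.
train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=loss.name)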

@panyx0718 (Contributor)

single or multi machine?

@panyx0718 (Contributor)

based on this PR?
PaddlePaddle/paddle-ce-latest-kpis#34

@guochaorong (Contributor, Author)

Yes. Yu Yang suggested attaching a PR to make the problem easier to look into.

It is a single machine with P40s.

This morning this NCCL version could run 4 GPUs stably; now it can only run on a single GPU.
It seems Pan is also using this machine at the moment?

| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     69454      C   python                                       265MiB |
|    1     69454      C   python                                       265MiB |
|    2     69454      C   python                                       265MiB |
|    3     69454      C   python                                       265MiB |
|    4     69454      C   python                                       265MiB |
|    5     69454      C   python                                       265MiB |
|    6     69454      C   python                                       265MiB |
|    7     69454      C   python                                       265MiB |
+-----------------------------------------------------------------------------+
[root@yq01-gpu-255-120-24-00 ~]# ll /proc/69454/cwd
lrwxrwxrwx 1 root root 0 Jun 6 11:18 /proc/69454/cwd -> /paddle/Paddle/benchmark/fluid

How should I debug this?
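
One generic way to get more detail is NCCL's own logging: NCCL reads the NCCL_DEBUG environment variable when communicators are created. A minimal sketch, assuming it is set before the ParallelExecutor is constructed (it could equally be exported in run.xsh):

import os

# Must be set before fluid.ParallelExecutor is built, since NCCL
# reads this at ncclCommInitAll time.
os.environ["NCCL_DEBUG"] = "INFO"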

@guochaorong (Contributor, Author) commented Jun 6, 2018

After restarting the docker service, 4 GPUs work again.
With 5 or more GPUs, it fails consistently.

@chengduoZH (Contributor)

I suggest investigating whether this is a docker problem.

@guochaorong (Contributor, Author)

When using 8 GPUs, this line fails: test_exe = fluid.ParallelExecutor(...).
Following Wu Yi's fluid_benchmark.py, where the test pass does not use ParallelExecutor, it works.

Fixed in: PaddlePaddle/paddle-ce-latest-kpis#37
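
A sketch of the shape of that fix, with hypothetical variable names (test_program, train_exe, feeder, data, and loss stand in for the ones in the benchmark script; this is not the literal diff in #37): run the test program on a plain single-device Executor so that only the training side creates NCCL communicators:

import paddle.fluid as fluid

place = fluid.CUDAPlace(0)

# Before: a second ParallelExecutor for the test program also called
# ncclCommInitAll, which is where the >= 5 GPU runs crashed.
# test_exe = fluid.ParallelExecutor(use_cuda=True,
#                                   main_program=test_program,
#                                   share_vars_from=train_exe)

# After: evaluate with a plain Executor instead (hypothetical names).
test_exe = fluid.Executor(place)
loss_val, = test_exe.run(program=test_program,
                         feed=feeder.feed(data),
                         fetch_list=[loss.name])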

@chengduoZH (Contributor)

This issue has been resolved, so I am closing it.
