P40 resnet 50 multi card random fail #11225
Comments
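Single machine or multiple machines?
Is this based on this PR?
Yes. Yu Yang said that attaching a PR makes it easier to look into the problem. This is a single machine with P40s. This morning this NCCL version could run 4 cards stably; now it can only run a single card. | Processes: GPU Memory | How to debug?
After restarting the docker service, 4 cards work again.
I suggest checking whether this is a Docker problem.
When using 8 cards, this line fails: test_exe = fluid.ParallelExecutor
The mission of this issue has completed.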
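On the "how to debug?" question above, one common first step is to turn on NCCL's own logging through the `NCCL_DEBUG` environment variable before the process that creates `fluid.ParallelExecutor` starts. The following is a minimal sketch, not something taken from this thread: the helper name `nccl_debug_env` is hypothetical, and it only builds an environment dictionary that can be passed to a child process.

```python
import os


def nccl_debug_env(base_env=None, level="INFO"):
    """Return a copy of the environment with NCCL logging enabled.

    NCCL_DEBUG is read by the NCCL library at initialization time.
    The helper name and defaults are illustrative assumptions.
    """
    # Copy so the caller's environment (or os.environ) is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = level
    return env


if __name__ == "__main__":
    env = nccl_debug_env({"CUDA_VISIBLE_DEVICES": "4,5,6,7"})
    print(env["NCCL_DEBUG"], env["CUDA_VISIBLE_DEVICES"])
```

The returned dictionary could be handed to `subprocess.run(..., env=env)` when launching the benchmark script, so that NCCL reports its initialization steps around `ncclCommInitAll`.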
Testing on P40: the resnet model fails to run on a single machine with 8 cards. (5, 6, 7, and 8 cards all fail.)
run model resnet50
cudaid
1,2,3,4,5,6,7
CUDA_VISIBLE_DEVICES
1,2,3,4,5,6,7
----------- Configuration Arguments -----------
batch_size: 64
data_format: NCHW
data_set: flowers
device: GPU
gpu_id: 4,5,6,7
infer_only: False
iterations: 80
log_dir: ./
model: resnet_imagenet
pass_num: 5
skip_batch_num: 5
use_cprof: False
use_fake_data: True
use_nvprof: False
*** Aborted at 1528264637 (unix time) try "date -d @1528264637" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x50) received by PID 46576 (TID 0x7f13dd421700) from PID 80; stack trace: ***
@ 0x7f13dcc447e0 (unknown)
@ 0x7f131320b7d3 commFree()
@ 0x7f131320f82d ncclCommInitAll
@ 0x7f13ba1ef075 paddle::platform::NCCLContextMap::NCCLContextMap()
@ 0x7f13ba1eae01 paddle::framework::ParallelExecutor::ParallelExecutor()
@ 0x7f13ba186195 ZZN8pybind1112cpp_function10initializeIZNS_6detail4initIIRKSt6vectorIN5boost7variantIN6paddle8platform9CUDAPlaceENS8_8CPUPlaceENS8_15CUDAPinnedPlaceENS5_6detail7variant5void_ESE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_EESaISF_EERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEESS_RKNS7_9framework11ProgramDescERKSsPNST_5ScopeERS4_IS10_SaIS10_EERKNST_7details17ExecutionStrategyERKNS14_13BuildStrategyEmmEE7executeINS_6class_INST_16ParallelExecutorEIEEEIELi0EEEvRT_DpRKT0_EUlPS1E_SJ_SS_SS_SW_SY_S10_S13_S17_S1A_mmE_vIS1M_SJ_SS_SS_SW_SY_S10_S13_S17_S1A_mmEINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENKUlRNS2_13function_callEE1_clES23
@ 0x7f13ba1862ee ZZN8pybind1112cpp_function10initializeIZNS_6detail4initIIRKSt6vectorIN5boost7variantIN6paddle8platform9CUDAPlaceENS8_8CPUPlaceENS8_15CUDAPinnedPlaceENS5_6detail7variant5void_ESE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_EESaISF_EERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEESS_RKNS7_9framework11ProgramDescERKSsPNST_5ScopeERS4_IS10_SaIS10_EERKNST_7details17ExecutionStrategyERKNS14_13BuildStrategyEmmEE7executeINS_6class_INST_16ParallelExecutorEIEEEIELi0EEEvRT_DpRKT0_EUlPS1E_SJ_SS_SS_SW_SY_S10_S13_S17_S1A_mmE_vIS1M_SJ_SS_SS_SW_SY_S10_S13_S17_S1A_mmEINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS2_13function_callEE1_4_FUNES23
@ 0x7f13ba149b74 pybind11::cpp_function::dispatcher()
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dceaaf6f instancemethod_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dceee7fe slot_tp_init
@ 0x7f13dceed468 type_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dcf31877 PyEval_EvalFrameEx
@ 0x7f13dcf34120 PyEval_EvalCodeEx
@ 0x7f13dcec026d function_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dceaaf6f instancemethod_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dceee7fe slot_tp_init
@ 0x7f13dceed468 type_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dcf31877 PyEval_EvalFrameEx
@ 0x7f13dcf34120 PyEval_EvalCodeEx
@ 0x7f13dcf32491 PyEval_EvalFrameEx
@ 0x7f13dcf34120 PyEval_EvalCodeEx
@ 0x7f13dcf34232 PyEval_EvalCode
@ 0x7f13dcf4e61c run_mod
@ 0x7f13dcf4e6f0 PyRun_FileExFlags
@ 0x7f13dcf4fbfc PyRun_SimpleFileExFlags
@ 0x7f13dcf614bc Py_Main
./run.xsh: line 22: 46576 Segmentation fault
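The failing log above lists seven IDs under CUDA_VISIBLE_DEVICES (1 through 7) but four under gpu_id (4,5,6,7). How the benchmark script interprets gpu_id is not shown in this thread, so the sketch below is only a pre-launch sanity check under that uncertainty; the function names are hypothetical.

```python
def parse_device_list(value):
    """Parse a comma-separated device string such as '4,5,6,7' into ints."""
    return [int(tok) for tok in value.split(",") if tok.strip()]


def check_devices(cuda_visible_devices, gpu_id):
    """Summarize how the two device lists relate to each other.

    This does not know how the training script uses gpu_id; it only
    reports counts and obvious inconsistencies for a human to review.
    """
    visible = parse_device_list(cuda_visible_devices)
    requested = parse_device_list(gpu_id)
    return {
        "visible_count": len(visible),
        "requested_count": len(requested),
        "has_duplicates": len(set(visible)) != len(visible),
        "requested_not_visible": [d for d in requested if d not in visible],
    }


if __name__ == "__main__":
    # Values copied from the failing run in this issue.
    print(check_devices("1,2,3,4,5,6,7", "4,5,6,7"))
```

For the failing run this reports 7 visible devices against 4 requested ones, which is worth confirming against what the launcher script intended.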
Model code:
PaddlePaddle/paddle-ce-latest-kpis@a2d1273
nccl:
continuous_evaluation]# ls -ltr /chaorong/lib/libnccl*
-rwxrwxrwx 1 root root 232842694 Feb 22 22:00 /chaorong/lib/libnccl_static.a
-rwxrwxrwx 1 root root 227911007 Feb 22 22:00 /chaorong/lib/libnccl.so.2.1.15
lrwxrwxrwx 1 root root 17 Feb 22 22:00 /chaorong/lib/libnccl.so.2 -> libnccl.so.2.1.15
lrwxrwxrwx 1 root root 12 Feb 22 22:00 /chaorong/lib/libnccl.so -> libnccl.so.2
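When comparing environments (for example, the morning run that worked against the later one that did not), it can help to record the NCCL version programmatically. A minimal sketch, assuming the version is encoded in the shared library's file name as in the listing above; `parse_nccl_version` is a hypothetical helper, not part of Paddle or NCCL.

```python
import re


def parse_nccl_version(path):
    """Extract (major, minor, patch) from a name like libnccl.so.2.1.15.

    Returns None when the name carries no full three-part version,
    e.g. for the libnccl.so or libnccl.so.2 symlinks.
    """
    match = re.search(r"libnccl\.so\.(\d+)\.(\d+)\.(\d+)$", path)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


if __name__ == "__main__":
    print(parse_nccl_version("/chaorong/lib/libnccl.so.2.1.15"))
```

To start from a symlink such as `/chaorong/lib/libnccl.so`, the path can first be resolved with `os.path.realpath` and the result passed to the helper.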
4 cards run successfully:
CUDA_VISIBLE_DEVICES
4,5,6,7
----------- Configuration Arguments -----------
batch_size: 128
data_format: NCHW
data_set: flowers
device: GPU
gpu_id: 4,5,6,7
infer_only: False
iterations: 80
log_dir: ./
model: resnet_imagenet
pass_num: 5
skip_batch_num: 5
use_cprof: False
use_fake_data: True
use_nvprof: False
Pass: 0, Iter: 0, loss: 6.1183624, acc: 0.0
Pass: 0, Iter: 1, loss: 5.5844965, acc: 0.0