Cifar10 and Resnet Code Compiles But Does Not Run to Completion #26
I just made some changes to Cifar10_model and Wide_ResNet; you may want to pull them from master. As for the hanging problem, I would recommend debugging it from here: put some prints before and after model.train_iter and exchanger.exchange to see where it gets stuck. That is normally how I debug it. My guess is that it gets stuck in the exchanging part. If so, check whether the NCCL collectives work; you can write a toy script like this to test them (see the sketch at the end of this comment). This depends on NCCL and libgpuarray/pygpu being installed correctly. I don't see the hanging problem, even before my last commits. Here is how it runs now:
mahe6562@cop8 8-2 $ nvidia-smi
Tue Aug 8 11:55:23 2017
+------------------------------------------------------+
| NVIDIA-SMI 352.93 Driver Version: 352.93 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:05:00.0 Off | 0 |
| N/A 49C P0 146W / 149W | 2505MiB / 11519MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:06:00.0 Off | 0 |
| N/A 67C P0 147W / 149W | 2485MiB / 11519MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:09:00.0 Off | 0 |
| N/A 46C P0 149W / 149W | 2530MiB / 11519MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:0A:00.0 Off | 0 |
| N/A 37C P8 30W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 0000:84:00.0 Off | 0 |
| N/A 22C P8 26W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 0000:85:00.0 Off | 0 |
| N/A 26C P8 29W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 0000:88:00.0 Off | 0 |
| N/A 21C P8 25W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 0000:89:00.0 Off | 0 |
| N/A 25C P8 28W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 169548 C /opt/sharcnet/python/2.7.10/intel/bin/python 2448MiB |
| 1 169549 C /opt/sharcnet/python/2.7.10/intel/bin/python 2428MiB |
| 2 169550 C /opt/sharcnet/python/2.7.10/intel/bin/python 2473MiB |
+-----------------------------------------------------------------------------+
mahe6562@cop8 8-2 $ top
top - 11:56:23 up 216 days, 46 min, 3 users, load average: 4.06, 3.88, 3.58
Tasks: 599 total, 4 running, 595 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.3%us, 4.7%sy, 0.0%ni, 94.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu2 : 78.1%us, 21.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 76.5%us, 23.5%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 72.8%us, 27.2%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 98957020k total, 37396796k used, 61560224k free, 3371692k buffers
Swap: 0k total, 0k used, 0k free, 23229008k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
169550 mahe6562 20 0 199g 2.3g 143m R 100.2 2.4 69:23.97 python
169548 mahe6562 20 0 199g 2.3g 143m R 99.8 2.4 68:26.97 python
169549 mahe6562 20 0 199g 2.3g 143m R 99.8 2.4 67:17.42 python
4218 nobody 20 0 260m 49m 2088 S 3.3 0.1 3568:01 gmond
67 root 20 0 0 0 0 S 1.7 0.0 2346:01 events/0
68 root 20 0 0 0 0 S 0.3 0.0 1646:14 events/1
172324 mahe6562 20 0 16332 1692 964 R 0.3 0.0 0:00.09 top
1 root 20 0 21452 1232 928 S 0.0 0.0 0:19.65 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.50 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 1:12.79 migration/0
The memory allocation part looks weird to me. I don't have this configured anywhere (like cnmem in .theanorc), and I don't see this in my standard output or error.
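For reference, a standalone check of the NCCL collectives through pygpu could look roughly like the sketch below. This is only a sketch of the kind of toy test mentioned above, not the exact linked script; it assumes mpi4py is available to share the NCCL clique id between processes, that pygpu was built with NCCL support, and that the `pygpu.collectives` names (`GpuCommCliqueId`, `GpuComm`, `all_reduce`) match the installed version.

```python
# toy_nccl_check.py -- hypothetical toy test of the pygpu/NCCL collectives
# (a sketch, not the exact script referenced above; assumes mpi4py, pygpu and NCCL)
# launch one process per GPU, e.g.:  mpirun -n 2 python toy_nccl_check.py
import numpy as np
from mpi4py import MPI
import pygpu
from pygpu import collectives

mpi_comm = MPI.COMM_WORLD
rank = mpi_comm.Get_rank()
size = mpi_comm.Get_size()

# bind each rank to its own GPU
ctx = pygpu.init('cuda%d' % rank)

# rank 0 generates the NCCL unique id; broadcast it so all ranks join one clique
clique_id = collectives.GpuCommCliqueId(context=ctx)
clique_id.comm_id = mpi_comm.bcast(clique_id.comm_id, root=0)
gpu_comm = collectives.GpuComm(clique_id, size, rank)

# all-reduce a small array; every rank should end up with the same sum
src = pygpu.gpuarray.asarray(np.ones(8, dtype='float32') * (rank + 1),
                             context=ctx)
print('rank %d: before all_reduce' % rank)
dest = gpu_comm.all_reduce(src, op='sum')
print('rank %d: after all_reduce ->' % rank, np.asarray(dest))
```

If a toy script along these lines hangs or crashes, the problem is in the NCCL/libgpuarray setup rather than in Cifar10_model or Wide_ResNet; if it completes, the same print-bracketing around model.train_iter and exchanger.exchange inside the worker's training loop should narrow down where the real run gets stuck.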
@hma02: I now have the packages compiled correctly. However, when running the BSP (or EASGD) based Cifar10_model (including ResNet), the behavior on my end still seems odd:
The terminal output stays as above until my terminal session times out, after more than 3 hours at least. I tried using 1 GPU, 2 GPUs, and 3 GPUs and get the same behavior each time.
I checked my devices, and GPU utilization remains at 0% even though 95% of the memory is allocated.
Where do I change the device memory allocation in your code? Could this be due to memory allocation?
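For what it's worth, GPU memory preallocation in Theano itself is usually controlled through THEANO_FLAGS or .theanorc rather than in the model code. A minimal sketch, assuming the gpuarray backend's `gpuarray.preallocate` flag (the older backend used `lib.cnmem` instead); this is generic Theano configuration, not necessarily how this repo assigns memory to its worker processes:

```python
# memory_flags_example.py -- a sketch of how Theano GPU memory preallocation
# is typically configured; generic Theano usage, not this repo's own mechanism.
import os

# must be set before theano is imported; preallocate ~90% of the device memory
# on the gpuarray backend (the old backend used lib.cnmem=0.9 instead)
os.environ['THEANO_FLAGS'] = 'device=cuda0,floatX=float32,gpuarray.preallocate=0.9'

import theano
print(theano.config.device)
```

The equivalent settings can also live in ~/.theanorc (preallocate under a [gpuarray] section), so it is worth checking whether such a file or an exported THEANO_FLAGS variable is already preallocating memory on your machine.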