Cifar10 and Resnet Code Compiles But Does Not Run to Completion #26

Open
Nqabz opened this issue Aug 6, 2017 · 2 comments
Nqabz commented Aug 6, 2017

@hma02: I now have the packages compiled correctly. However, when running the BSP-based (or EASGD-based) Cifar10_model (including ResNet), the behavior still seems odd on my end:

# launch_session.py
from theanompi import BSP
# from theanompi import EASGD

# rule = EASGD()
rule = BSP()
# modelfile: the relative path to the model file
# modelclass: the class name of the model to be imported from that file
rule.init(devices=['cuda0', 'cuda1', 'cuda2'],
          modelfile='theanompi.models.cifar10',
          modelclass='Cifar10_model')
rule.wait()
Using cuDNN version 6021 on context None
Preallocating 10943/11519 Mb (0.950000) on cuda2
Mapped name None to device cuda2: Tesla K80 (0000:08:00.0)
Using cuDNN version 6021 on context None
Preallocating 10943/11519 Mb (0.950000) on cuda0
Mapped name None to device cuda0: Tesla K80 (0000:04:00.0)
Using Theano backend.
Using Theano backend.
Using Theano backend.
rank0: bad list is [], extended to 195
rank0: bad list is [], extended to 39
Cifar10_model
Layer Subtract	 	 in (3, 32, 32, 256) --> out (3, 32, 32, 256)
Layer Crop	 	 in [  3  32  32 256] --> out (3, 28, 28, 256)
Layer Dimshuffle     	 in [  3  28  28 256] --> out (256, 3, 28, 28)
Layer Conv (cudnn) 	 in [256   3  28  28] --> out (256, 64, 24, 24)
Layer Pool	 	 in [256  64  24  24] --> out (256, 64, 12, 12)
Layer Conv (cudnn) 	 in [256  64  12  12] --> out (256, 128, 8, 8)
Layer Pool	 	 in [256 128   8   8] --> out (256, 128, 4, 4)
Layer Conv (cudnn) 	 in [256 128   4   4] --> out (256, 64, 2, 2)
Layer Flatten	 	 in [256  64   2   2] --> out (256, 256)
Layer FC	 	 in [256 256] --> out (256, 256)
Layer Dropout0.5 	 in [256 256] --> out (256, 256)
Layer Softmax	 	 in [256 256] --> out (256, 10)
[64  3  5  5]
[64]
[128  64   5   5]
[128]
[ 64 128   3   3]
[64]
[256 256]
[256]
[256  10]
[10]
model size 0.336 M floats
compiling training function...
compiling validation function...
Compile time: 3.236 s
calculating lr warming up power base: 1.246
learning rate 0.010000 will be used for epoch 0



The terminal output stays as above until my terminal session times out, after more than 3 hours at least. I tried using 1, 2, and 3 GPUs and still get the same behavior.

I checked my devices, and GPU utilization remains at 0% even though 95% of the memory is allocated.

+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:04:00.0     Off |                    0 |
| N/A   47C    P0    58W / 149W |  11081MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:05:00.0     Off |                    0 |
| N/A   38C    P0    72W / 149W |  11122MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:08:00.0     Off |                    0 |
| N/A   45C    P0    62W / 149W |  11122MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:09:00.0     Off |                    0 |
| N/A   29C    P8    30W / 149W |     22MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 0000:85:00.0     Off |                    0 |
| N/A   33C    P8    26W / 149W |     22MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 0000:86:00.0     Off |                    0 |
| N/A   28C    P8    29W / 149W |     22MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 0000:89:00.0     Off |                    0 |
| N/A   34C    P8    25W / 149W |     22MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 0000:8A:00.0     Off |                    0 |
| N/A   26C    P8    29W / 149W |     22MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     10841    C   python                                       11056MiB |
|    1     10842    C   python                                       11097MiB |
|    2     10844    C   python                                       11097MiB |
+-----------------------------------------------------------------------------+

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
22819 root      20   0  665952  24612   5540 R   5.6  0.0   0:00.17 node
   10 root      20   0       0      0      0 S   0.7  0.0 283:05.93 rcu_sched
 3202 root      20   0       0      0      0 S   0.3  0.0   0:00.87 kworker/21:2
 5571 root      20   0       0      0      0 S   0.3  0.0   0:01.34 kworker/6:9

Where do I change the device memory allocation in your code? Could this be due to memory allocation?

@Nqabz Nqabz changed the title Cifar10, Resnet Code Compiles But Does Not Run to Completion Cifar10 and Resnet Code Compiles But Does Not Run to Completion Aug 6, 2017
hma02 (Collaborator) commented Aug 8, 2017

I just made some changes regarding Cifar10_model and Wide_ResNet. You may want to pull it from master.

As for your hanging problem, I would recommend debugging it from here. Put some prints before and after

model.train_iter

and

exchanger.exchange

to see where it gets stuck; this is normally how I debug it. I would guess it gets stuck in the exchanging parts. If so, check whether the NCCL collectives here work; you can write a toy script like this to test them. This depends on NCCL and libgpuarray/pygpu being installed correctly.
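For example, a small wrapper like this makes the hang point visible (the `traced` helper is hypothetical and not part of theanompi; only `model.train_iter` and `exchanger.exchange` are the real call sites from the thread):

```python
import sys
import time

def traced(label, fn, rank=0):
    """Wrap a callable with before/after prints so a hang shows
    exactly which call never returned. Flushing matters here: if
    the process hangs, buffered output may never appear otherwise."""
    def wrapper(*args, **kwargs):
        print('rank%d: entering %s' % (rank, label))
        sys.stdout.flush()
        t0 = time.time()
        result = fn(*args, **kwargs)
        print('rank%d: %s returned after %.3fs' % (rank, label, time.time() - t0))
        sys.stdout.flush()
        return result
    return wrapper

# In the training loop you would wrap the two suspect calls, e.g.:
#   model.train_iter = traced('train_iter', model.train_iter, rank)
#   exchanger.exchange = traced('exchange', exchanger.exchange, rank)
```

If the last line printed on every rank is "entering exchange", the workers are blocked in the collective and the NCCL/pygpu installation is the thing to check.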

I don't have the hanging problem even before my last commits. Here is how it runs now with Wide_ResNet.

mahe6562@cop8 8-2 $ nvidia-smi
Tue Aug  8 11:55:23 2017       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.93     Driver Version: 352.93         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:05:00.0     Off |                    0 |
| N/A   49C    P0   146W / 149W |   2505MiB / 11519MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:06:00.0     Off |                    0 |
| N/A   67C    P0   147W / 149W |   2485MiB / 11519MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:09:00.0     Off |                    0 |
| N/A   46C    P0   149W / 149W |   2530MiB / 11519MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:0A:00.0     Off |                    0 |
| N/A   37C    P8    30W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   22C    P8    26W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 0000:85:00.0     Off |                    0 |
| N/A   26C    P8    29W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 0000:88:00.0     Off |                    0 |
| N/A   21C    P8    25W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 0000:89:00.0     Off |                    0 |
| N/A   25C    P8    28W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0    169548    C   /opt/sharcnet/python/2.7.10/intel/bin/python  2448MiB |
|    1    169549    C   /opt/sharcnet/python/2.7.10/intel/bin/python  2428MiB |
|    2    169550    C   /opt/sharcnet/python/2.7.10/intel/bin/python  2473MiB |
+-----------------------------------------------------------------------------+
mahe6562@cop8 8-2 $ top
top - 11:56:23 up 216 days, 46 min,  3 users,  load average: 4.06, 3.88, 3.58
Tasks: 599 total,   4 running, 595 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.3%us,  4.7%sy,  0.0%ni, 94.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu2  : 78.1%us, 21.9%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 76.5%us, 23.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 72.8%us, 27.2%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  98957020k total, 37396796k used, 61560224k free,  3371692k buffers
Swap:        0k total,        0k used,        0k free, 23229008k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                      
169550 mahe6562  20   0  199g 2.3g 143m R 100.2  2.4  69:23.97 python                                                                                                       
169548 mahe6562  20   0  199g 2.3g 143m R 99.8  2.4  68:26.97 python                                                                                                        
169549 mahe6562  20   0  199g 2.3g 143m R 99.8  2.4  67:17.42 python                                                                                                        
  4218 nobody    20   0  260m  49m 2088 S  3.3  0.1   3568:01 gmond                                                                                                         
    67 root      20   0     0    0    0 S  1.7  0.0   2346:01 events/0                                                                                                      
    68 root      20   0     0    0    0 S  0.3  0.0   1646:14 events/1                                                                                                      
172324 mahe6562  20   0 16332 1692  964 R  0.3  0.0   0:00.09 top                                                                                                           
     1 root      20   0 21452 1232  928 S  0.0  0.0   0:19.65 init                                                                                                          
     2 root      20   0     0    0    0 S  0.0  0.0   0:00.50 kthreadd                                                                                                      
     3 root      RT   0     0    0    0 S  0.0  0.0   1:12.79 migration/0

hma02 (Collaborator) commented Aug 8, 2017

@Nqabz

The memory allocation part looks weird to me. I don't have this configured anywhere (e.g. cnmem in .theanorc), and I don't see it in my standard output and error.
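For reference, a 95% preallocation like the one in your log is usually controlled through Theano's configuration, either in .theanorc or via THEANO_FLAGS. A sketch of what such a .theanorc might look like for the gpuarray backend (values chosen only to match the 0.950000 in your log; adjust or remove as needed):

```ini
# .theanorc -- example only
[global]
device = cuda
floatX = float32

[gpuarray]
# fraction of GPU memory to preallocate at startup;
# 0 disables preallocation entirely
preallocate = 0.95
```

So it may be worth checking whether a .theanorc or THEANO_FLAGS setting on your system is turning this on.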
