Beta: Using CPU-RAM instead of GPU-VRAM for large Mini_batch=32 - 128 #4386
Do you think switching to this higher mini_batch after having already trained the usual way will give added value as well?
@HagegeR I didn't test it well, so just try. In general, yes. You can try to train the first several % of iterations with a large mini_batch,
Could you please explain in more detail what the options mean, or how to work out a good configuration? I'm trying to get this feature going with my custom Gaussian cfg, but I'm not having success so far.
For the Yolov3-spp 416x416 model on an 8 GB GPU with 32 GB of CPU-RAM, try to use: https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-spp.cfg
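A minimal sketch of the relevant [net] settings for that setup, based on the tested configurations listed in the issue description below (treat the exact values as a starting point, not a definitive recipe):

[net]
# 416x416 input, as in the stock yolov3-spp.cfg
width=416
height=416
# mini_batch = batch / subdivisions = 32 images per forward/backward pass
batch=64
subdivisions=2
# keep most layer activations in CPU-RAM instead of GPU-VRAM
optimized_memory=3
# cap on the GPU workspace, in MB
workspace_size_limit_MB=1000
# multi-scale training (random=1) is not supported with this mode
random=0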
What problem did you encounter? What GPU do you use?
Such accuracy can be achieved only if you train with a very large mini_batch size (~1024), not with a small mini_batch size (~32).
I tried mixnet_m_gpu.cfg with the following settings:
I always get the following error:
Could you help find out the cause?
Thank you very much! I will try it now.
By the way, I found the decay value (0.00005) in mixnet_m_gpu.cfg is different from the other cfgs (decay=0.0005), as follows:
Is it a special setting for mixnet_m_gpu.cfg, or just a typo?
I still get an error, as follows:
@erikguo Do you get this error if you disable memory optimization?
Since MixNet is a continuation of EfficientNet, which is itself a continuation of MobileNet (...), EfficientNet uses
After commenting out these lines, training runs very well. With these lines enabled it occasionally runs well, but it crashes in most cases.
It crashes at the first iteration. The crash message is as follows:
My server has 128 GB of memory and 4 x 1080 Ti 11 GB GPUs. Darknet is compiled with GPU=1 CUDNN=1 OPENCV=1 CUDNN_HALF=0.
I just trained 2600 iterations successfully on an RTX 2070 and a Core i7 with 32 GB of CPU-RAM by using this command: and this cfg-file: mixnet_m_gpu.cfg.txt
I use only one GPU for training. The command is as follows:
The batch and subdivisions are as follows:
I have tried the following combination:
Because my images' aspect ratio is about 1:3 (h:w), I set the network size to a rectangle.
@erikguo
@erikguo Try to use workspace_size_limit_MB=8000
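For clarity, a sketch of where that option would go, assuming it sits in the [net] section alongside optimized_memory (as in the issue description below):

[net]
optimized_memory=3
# raise the GPU workspace cap to ~8 GB, as suggested above
workspace_size_limit_MB=8000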
Isn't there a GPU memory leak? After calling free_network, memory is still shown as used in nvidia-smi. Calling it in a loop fills up the GPU and then crashes.
@kossolax Is it related to optimized_memory?
I'm using optimized_memory=0, so it's just related to free_network. Since you changed memory usage a lot, I guess this could be related. Should I start a new issue?
@kossolax Yes, start a new issue and I will investigate it.
@AlexeyAB Hello, I think Cross-Iteration Batch Normalization can achieve a similar result with higher training speed.
@WongKinYiu Hi, I implemented part of CBN - averaging statistics inside one batch. So you can increase accuracy just by increasing the batch size. You can try it on the csresnext50 classifier. Inside 1 batch it will average the values of Mean and Variance.
For using:
or
or
Since we change the weights (conv-weights, biases, scales) only after the whole batch has been processed, if we use averaging inside 1 batch (without cross-iteration) then we will not have problems with statistics obsolescence. Paper: https://arxiv.org/abs/2002.05712v2 I used these formulas:
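A hedged sketch of how this could look in a cfg file; the exact per-layer switch is an assumption here (the question about batch_normalize further down suggests it is selected through that key, e.g. batch_normalize=2), so check the referenced commit for the real name and value:

[net]
# large batch with a small mini_batch: Mean/Variance are averaged across
# the whole batch before the single weight update at the end of the batch
batch=128
subdivisions=32

[convolutional]
# assumption: a value of 2 selects the averaged (CBN-style) statistics,
# while batch_normalize=1 keeps plain per-mini-batch batch normalization
batch_normalize=2
filters=64
size=3
stride=1
pad=1
activation=leaky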
Thank you a lot, I'll give you feedback after I finish training.
I also added a dynamic mini_batch size for training with random=1: c814d56. Just add
So
So even if part of CBN does not work properly, you can still use
Line 191 in c814d56
Also, you can adjust the mini_batch size to your GPU-RAM amount (batch and subdivisions do not necessarily have to be multiples of 2).
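As a sketch, the cfg for this dynamic mini_batch mode might look as follows; the flag name dynamic_minibatch is my reading of commit c814d56 and should be treated as an assumption:

[net]
# let darknet resize the mini_batch when random=1 changes the network resolution
dynamic_minibatch=1
random=1
# batch/subdivisions need not be multiples of 2, e.g. batch=60 with
# subdivisions=3 gives mini_batch=20 (as in the tested configurations below)
batch=64
subdivisions=2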
@AlexeyAB OK, thank you. SpineNet-49-omega will finish training in half an hour.
I tried yolov3-spp.cfg with the following settings:
Could you tell me why?
Now I get this error:
Just tried to run with this on:
and got this error:
I've tried several different values for workspace_size_limit_MB and subdivisions, and all fail with the same message. I was running with a single GPU, and I peaked at about 40 GB / 64 GB memory usage on the CPU.
@WongKinYiu @AlexeyAB @cenit @LukeAI Hi everyone!
As far as my understanding goes, the batch size is the number of samples processed before the weights are updated. I would be very happy with an answer to these questions, and I'm sure I am not alone in not understanding this.
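For reference, the relationship used throughout this thread (and in the tested configurations below) is mini_batch = batch / subdivisions; a small worked example:

[net]
# 64 images are loaded per iteration; weights are updated once after all 64 are processed
batch=64
# the batch is split into 2 chunks, so mini_batch = 64 / 2 = 32 images go through
# the GPU at once - this is what has to fit in VRAM (or, with optimized_memory,
# partly in CPU-RAM)
subdivisions=2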
What parameters can I use with an Nvidia
I tried to use these, but 100+ hours for training is too long. On another PC with a GTX 970 I get ~16-20 hours of training.
@igoriok1994 What are you trying to achieve? What is your end goal or output? It will help with recommending settings.
I want to speed up training without losing mAP :)
@igoriok1994 CPU memory is very slow; in my experience it is 5x+ slower than regular GPU training. The benefit of CPU-memory training is to increase precision (mAP) by increasing the batch size beyond the memory available on your GPU.
@AlexeyAB Do we need to change the batch_normalize setting in every [convolutional] section of the cfg file? The number of convolutional layers is 73.
@AlexeyAB have you seen this implementation for decreasing memory usage allowing larger batches with the same GPU memory? https://github.com/MegEngine/MegEngine/wiki/Reduce-GPU-memory-usage-by-Dynamic-Tensor-Rematerialization |
Higher mini_batch -> higher accuracy mAP/Top1/Top5.
Training on the GPU while using CPU-RAM allows you to significantly increase the size of the mini_batch, by 4x-16x and more.
You can train with a 16x higher mini_batch, but at 5x lower speed; on Yolov3-spp it should give you ~+2-4 mAP.
Use in your cfg-file: optimized_memory=... and workspace_size_limit_MB=... (see the tested values below).
random=1 is not supported.

Tested:
Tested on the model https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-spp.cfg with width=416 height=416 on 8GB_GPU_VRAM + 32GB_CPU_RAM, using:
./darknet detector train data/obj.data yolov3-spp.cfg -map

- default: mini_batch=8 = batch_64 / subdivisions_8, GPU-RAM usage = 6.5 GB, iteration = 3 sec
- optimized_memory=1: mini_batch=8 = batch_64 / subdivisions_8, GPU-RAM usage = 5.8 GB, iteration = 3 sec
- optimized_memory=2 workspace_size_limit_MB=1000: mini_batch=20 = batch_60 / subdivisions_3, GPU-RAM usage = 5.4 GB, iteration = 15 sec
- optimized_memory=3 workspace_size_limit_MB=1000: mini_batch=32 = batch_64 / subdivisions_2, GPU-RAM usage = 4.0 GB, iteration = 15 sec (CPU-RAM usage = 31 GB)

Not well tested yet:
- optimized_memory=3 workspace_size_limit_MB=2000: mini_batch=64 = batch_128 / subdivisions_2, GPU-RAM usage = 7.5 GB, iteration = 15 sec (CPU-RAM usage = 62 GB)
- optimized_memory=3 workspace_size_limit_MB=2000 or 4000: mini_batch=128 = batch_256 / subdivisions_2, GPU-RAM usage = 13.5 GB, iteration = 15 sec (CPU-RAM usage = 124 GB)

For comparison:
- mini_batch=24 - 24 GB VRAM RTX Titan - $2500: https://www.amazon.com/NVIDIA-Titan-RTX-Graphics-Card/dp/B07L8YGDL5
- mini_batch=48 - 48 GB VRAM Quadro RTX 8000 - $5500: https://www.amazon.com/PNY-VCQRTX8000-PB-NVIDIA-Quadro-Graphic/dp/B07NH3HKG9/
- mini_batch=128 - 128 GB RAM - $1700 = RTX 2080 Ti 11 GB - $1100 + $600 CPU-RAM 128 GB (4x32) + with this software solution
- mini_batch=512 - 512 GB RAM - $9200 = 48 GB VRAM Quadro RTX 8000 - $5500 + 512 GB = 2 x (8 x 32 GB) - $2600 + $1100 - CPU AMD EPYC 7401P (32 cores, 16 memory slots, up to 2 TB RAM, 128 PCIe 3.0 lanes) + with this software solution
- mini_batch=512 - 512 GB VRAM (16 x 32 GB Tesla V100) DGX-2 - $400,000: https://www.nvidia.com/en-us/data-center/dgx-2/ + with a synchronized batch normalization technique like https://arxiv.org/abs/1711.07240v4
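Putting the tested rows above together for the 8GB-GPU / 32GB-CPU-RAM case, training is launched the same way as usual and only the cfg changes; roughly:

# yolov3-spp.cfg, [net] section: optimized_memory=3, workspace_size_limit_MB=1000,
# batch=64, subdivisions=2 (mini_batch=32), random=0
./darknet detector train data/obj.data yolov3-spp.cfg -map
# expect ~4.0 GB of GPU-RAM and ~31 GB of CPU-RAM in use,
# and ~15 sec per iteration instead of ~3 sec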
Example of trained model: yolov3-tiny_pan_gaus_giou_scale.cfg.txt
+5 mAP@0.5