Beta: Using CPU-RAM instead of GPU-VRAM for large Mini_batch=32 - 128 #4386

Open · AlexeyAB opened this issue Nov 26, 2019 · 71 comments
Labels: Likely bug (Maybe a bug, maybe not), ToDo (RoadMap)

AlexeyAB (Owner) commented Nov 26, 2019

Higher mini_batch -> higher accuracy (mAP / Top1 / Top5).

Training on the GPU while keeping some arrays in CPU-RAM allows you to increase the mini_batch size significantly, by 4x-16x or more.

You can train YOLOv3-SPP with a 16x larger mini_batch at about 5x lower speed; this should give roughly +2-4 mAP.

Use in your cfg-file:

[net]
batch=64
subdivisions=2
width=416
height=416
optimized_memory=3
workspace_size_limit_MB=1000
  • multi-GPU is not tested
  • random=1 is not supported

Tested:

  • GeForce RTX 2070 - 8 GB VRAM
  • CPU Core i7 6700K - 32 GB RAM

Tested on the model https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-spp.cfg with width=416 height=416 on 8 GB GPU-VRAM + 32 GB CPU-RAM

./darknet detector train data/obj.data yolov3-spp.cfg -map

  • default: mini_batch=8 = batch_64 / subdivisions_8, GPU-RAM-usage=6.5 GB, iteration = 3 sec

  • optimized_memory=1: mini_batch=8 = batch_64 / subdivisions_8, GPU-RAM-usage=5.8 GB, iteration = 3 sec

  • optimized_memory=2 workspace_size_limit_MB=1000: mini_batch=20 = batch_60 / subdivisions_3, GPU-RAM-usage=5.4 GB, iteration = 15 sec

  • optimized_memory=3 workspace_size_limit_MB=1000: mini_batch=32 = batch_64 / subdivisions_2, GPU-RAM-usage=4.0 GB, iteration = 15 sec (CPU-RAM-usage = 31 GB)


Not well tested yet:

  • optimized_memory=3 workspace_size_limit_MB=2000: mini_batch=64 = batch_128 / subdivisions_2, GPU-RAM-usage=7.5 GB, iteration = 15 sec (CPU-RAM-usage = 62 GB)

  • optimized_memory=3 workspace_size_limit_MB=2000 or 4000: mini_batch=128 = batch_256 / subdivisions_2, GPU-RAM-usage=13.5 GB, iteration = 15 sec (CPU-RAM-usage = 124 GB)



Example of trained model: yolov3-tiny_pan_gaus_giou_scale.cfg.txt

mini_batch=32 gives about +5 mAP@0.5 compared to mini_batch=8 (training charts attached).
AlexeyAB added the ToDo (RoadMap) label on Nov 26, 2019
AlexeyAB changed the title from "Using CPU-RAM instead of GPU-VRAM for large Mini_batch=32 - 128" to "Beta: Using CPU-RAM instead of GPU-VRAM for large Mini_batch=32 - 128" on Nov 26, 2019
HagegeR commented Nov 26, 2019

Do you think switching to this higher mini_batch after having already trained the usual way will give added value as well?

AlexeyAB (Owner) commented Nov 26, 2019

@HagegeR I didn't test it thoroughly, so just try.

In general - yes.

You can try to train the first several percent of iterations with a large mini_batch,
then continue training with a small mini_batch for speed,
and then train the last few percent of iterations with a large mini_batch again, for example as sketched below.
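A possible three-stage schedule, sketched with the commands already used in this thread (the backup file name backup/yolov3-spp_last.weights and the exact split of iterations are assumptions; each stage simply restarts from the last saved weights after editing the [net] section as indicated):

# Stage 1 - first few % of iterations with a large mini_batch (CPU-RAM mode):
#   [net]: batch=64 subdivisions=2 optimized_memory=3 workspace_size_limit_MB=1000
./darknet detector train data/obj.data yolov3-spp.cfg -map

# Stage 2 - middle of training with a small mini_batch (fast, GPU-only):
#   [net]: batch=64 subdivisions=8, optimized_memory and workspace_size_limit_MB removed
./darknet detector train data/obj.data yolov3-spp.cfg backup/yolov3-spp_last.weights -map

# Stage 3 - last few % of iterations with a large mini_batch again:
#   [net]: batch=64 subdivisions=2 optimized_memory=3 workspace_size_limit_MB=1000
./darknet detector train data/obj.data yolov3-spp.cfg backup/yolov3-spp_last.weights -map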

LukeAI commented Nov 28, 2019

Could you please explain in more detail what the options mean, or how to work out a good configuration? I'm trying to get this feature going with my custom Gaussian cfg, but I'm not having success so far.
What do these mean?
optimized_memory=3
workspace_size_limit_MB=1000

AlexeyAB (Owner) commented Nov 28, 2019

@LukeAI

The optimized_memory= parameter controls GPU-memory optimization:

  • optimized_memory=0 - no additional memory optimization (default)

  • optimized_memory=1 - optimizes delta_gpu: instead of many per-layer arrays, it allocates two arrays, global_delta_gpu & state_delta_gpu, which are reused by most layers. This does not slow down training, but may work incorrectly with new models added later.

  • optimized_memory=2 - additionally uses CPU-RAM instead of GPU-VRAM for the arrays output_gpu (layer output), activation_input_gpu (activation input) and x_gpu (batch-normalization input) in each layer

  • optimized_memory=3 - additionally uses CPU-RAM instead of GPU-VRAM for the global_delta_gpu & state_delta_gpu arrays

  • workspace_size_limit_MB=1000 - allocates 1000 MB for the cuDNN workspace.

    • If GPU memory is not enough (CUDA out of memory), try reducing this value.
    • If Darknet hangs or crashes with strange errors, try increasing this value.
    • (Use 1000 if you have 32 GB of CPU-RAM and 2000 if you have 64 GB.)
    • If the GPU is lost, try rebooting your PC.

For a YOLOv3-SPP 416x416 model on an 8 GB GPU and 32 GB of CPU-RAM, try: https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-spp.cfg

[net]
batch=64
subdivisions=2
width=416
height=416
optimized_memory=3
workspace_size_limit_MB=1000

I'm trying to get this feature going with my custom gaussian cfg but I'm not having success so far.

What problem did you encounter?

What GPU do you use?
How much CPU-RAM do you have?
Rename your cfg-file to a .txt file and attach it.

AlexeyAB (Owner) commented Nov 30, 2019

Such accuracy:

  • MobileNetv3 - Top1 75.37%
  • MixNet-S - Top1 75.68%
  • EfficientNetB0 - Top1 76.3%

can be achieved only if you train with a very large mini_batch size (~1024).

With a small mini_batch size (~32), instead of 76.3% Top1 we get (#3380 (comment)):

  • Our EfficientNet B0 (224x224) 0.9 BFLOPS - 0.45 B_FMA (16ms / RTX 2070), 4.9M params - 71.3% Top1
  • Official EfficientNetB0 (224x224) 0.78 BFLOPS - 0.39 FMA, 5.3M params - 70.0% Top1

erikguo commented Dec 3, 2019

@AlexeyAB

I tried mixnet_m_gpu.cfg with the following settings:

optimized_memory=2
workspace_size_limit_MB=1000

I always get the following error:

 243 conv    600       1 x 1/ 1      1 x   1 x1200 ->    1 x   1 x 600 0.001 BF
 244 conv   1200       1 x 1/ 1      1 x   1 x 600 ->    1 x   1 x1200 0.001 BF
 245 scale Layer: 241
 246 conv    200/   2  1 x 1/ 1      9 x   3 x1200 ->    9 x   3 x 200 0.006 BF
 247 Shortcut Layer: 231
 248 conv   1536       1 x 1/ 1      9 x   3 x 200 ->    9 x   3 x1536 0.017 BF
 249 avg                             9 x   3 x1536 ->   1536
 250 dropout       p = 0.25                  1536  ->   1536
CUDA status Error: file: ./src/dark_cuda.c : () : line: 423 : build time: Dec  3 2019 - 23:02:36 
CUDA Error: invalid argument
CUDA Error: invalid argument: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

Could you help to find out the cause?

AlexeyAB (Owner) commented Dec 3, 2019

@erikguo I fixed it: 5d0352f

Just tried mixnet_m_gpu.cfg with

[net]
# Training
batch=120
subdivisions=2
optimized_memory=3
workspace_size_limit_MB=1000

erikguo commented Dec 3, 2019

Thank you very much!

I will try now.

erikguo commented Dec 3, 2019

By the way, I noticed the decay value (0.00005) in mixnet_m_gpu.cfg differs from the other cfgs (decay=0.0005):

momentum=0.9
decay=0.00005

Is it a special setting for mixnet_m_gpu.cfg, or just a typo?

@AlexeyAB

erikguo commented Dec 3, 2019

@AlexeyAB

I still get the following error:

Pinned block_id = 3, filled = 99.917603 % 
 241 route  240 238 236 234 	                   ->    9 x   3 x1200 
 242 avg                             9 x   3 x1200 ->   1200
 243 conv    600       1 x 1/ 1      1 x   1 x1200 ->    1 x   1 x 600 0.001 BF
 244 conv   1200       1 x 1/ 1      1 x   1 x 600 ->    1 x   1 x1200 0.001 BF
 245 scale Layer: 241
 246 conv    200/   2  1 x 1/ 1      9 x   3 x1200 ->    9 x   3 x 200 0.006 BF
 247 Shortcut Layer: 231
 248 conv   1536       1 x 1/ 1      9 x   3 x 200 ->    9 x   3 x1536 0.017 BF
 249 avg                             9 x   3 x1536 ->   1536
 250 dropout       p = 0.25                  1536  ->   1536
 251 conv     51       1 x 1/ 1      1 x   1 x1536 ->    1 x   1 x  51 0.000 BF
 252 softmax                                          51

 Pinned block_id = 4, filled = 98.600769 % 
Total BFLOPS 0.592 
 Allocate additional workspace_size = 18.58 MB 
Loading weights from backup_all/mixnet_m_gpu_last.weights...
 seen 64 
Done! Loaded 253 layers from weights-file 
Learning Rate: 0.064, Momentum: 0.9, Decay: 0.0005
304734
Loaded: 0.933879 seconds
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 668 : build time: Dec  3 2019 - 23:02:38 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

AlexeyAB (Owner) commented Dec 3, 2019

@erikguo Do you get this error if you disable memory optimization?
Comment out these lines:

#optimized_memory=3
#workspace_size_limit_MB=1000

By the way, I found the 'Decay' value (0.00005) is different from the other cfg(decay=0.0005)

Since MixNet is a continuation of EfficientNet, which is in turn a continuation of MobileNet (...), the EfficientNet value is used: decay=0.00001 (https://arxiv.org/pdf/1905.11946v2.pdf):

weight decay 1e-5;

erikguo commented Dec 3, 2019

After commenting out these lines, training runs very well. With these lines enabled it only runs well occasionally; in most cases it crashes.

@AlexeyAB

AlexeyAB (Owner) commented Dec 4, 2019

@erikguo

  • How many iterations before crashing?
  • What is the error message?
  • How much CPU-RAM do you have?
  • What GPU do you use?
  • Do you use GPU=1 CUDNN=1 OPENCV=1 CUDNN_HALF=1 ?

erikguo commented Dec 4, 2019

@AlexeyAB

It crashes at the first iteration.

The crash message is as follows:

Pinned block_id = 3, filled = 99.917603 % 
 241 route  240 238 236 234 	                   ->    9 x   3 x1200 
 242 avg                             9 x   3 x1200 ->   1200
 243 conv    600       1 x 1/ 1      1 x   1 x1200 ->    1 x   1 x 600 0.001 BF
 244 conv   1200       1 x 1/ 1      1 x   1 x 600 ->    1 x   1 x1200 0.001 BF
 245 scale Layer: 241
 246 conv    200/   2  1 x 1/ 1      9 x   3 x1200 ->    9 x   3 x 200 0.006 BF
 247 Shortcut Layer: 231
 248 conv   1536       1 x 1/ 1      9 x   3 x 200 ->    9 x   3 x1536 0.017 BF
 249 avg                             9 x   3 x1536 ->   1536
 250 dropout       p = 0.25                  1536  ->   1536
 251 conv     51       1 x 1/ 1      1 x   1 x1536 ->    1 x   1 x  51 0.000 BF
 252 softmax                                          51

 Pinned block_id = 4, filled = 98.600769 % 
Total BFLOPS 0.592 
 Allocate additional workspace_size = 18.58 MB 
Loading weights from backup_all/mixnet_m_gpu_last.weights...
 seen 64 
Done! Loaded 253 layers from weights-file 
Learning Rate: 0.064, Momentum: 0.9, Decay: 5e-05
304734
Loaded: 1.104122 seconds
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 668 : build time: Dec  3 2019 - 23:02:38 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.
Aborted (core dumped)

My server has 128 GB of RAM and 4 x 1080 Ti GPUs with 11 GB each.

Darknet is compiled with GPU=1 CUDNN=1 OPENCV=1 CUDNN_HALF=0

AlexeyAB (Owner) commented Dec 4, 2019

@erikguo

  • Do you use 4 x GPU for training?
  • What command do you use for training?
  • What batch and subdivisions did you set?

I just trained 2600 iterations successfully on an RTX 2070 with a Core i7 and 32 GB of CPU-RAM using this command:
darknet.exe classifier train cfg/imagenet1k_c.data cfg/mixnet_m_gpu.cfg backup/mixnet_m_gpu_last.weights -topk

and this cfg-file: mixnet_m_gpu.cfg.txt

erikguo commented Dec 4, 2019

I use only one GPU for training.

The command is as follows:

darknet classifier train dengdi.data mixnet_m_gpu.cfg backup/mixnet_m_gpu_last.cfg -dont_show 

batch and subdivisions are as follows:

batch=128
subdivisions=2

mixnet_m_gpu_mem.cfg.txt

@AlexeyAB

AlexeyAB (Owner) commented Dec 4, 2019

@erikguo

  • Why do you use height=96 width=288?

  • I successfully ran training with your cfg-file mixnet_m_gpu_mem.cfg.txt on an RTX 2070 with 8 GB VRAM + 32 GB CPU-RAM:
    darknet.exe classifier train cfg/imagenet1k_c.data cfg/mixnet_m_gpu_mem.cfg backup/mixnet_m_gpu_last.weights -topk


(screenshots of the training run attached)

erikguo commented Dec 4, 2019

@AlexeyAB

I have tried the following combinations:

batch=128
subdivisions=2
runs very well

batch=256
subdivisions=2
runs very well

batch=256
subdivisions=1
crashes in the first iteration

batch=512
subdivisions=2
crashes in the first iteration

erikguo commented Dec 4, 2019

Because my images' aspect ratio is about 1:3 (h:w), I set the network size to a rectangle.

AlexeyAB (Owner) commented Dec 4, 2019

@erikguo
Check this combination:
batch=128
subdivisions=1


batch=256
subdivisions=1
crashes in the first iteration

  • Show screenshot of CPU_RAM usage
  • Show screenshot of GPU_RAM usage
  • Show screenshot of the error message

erikguo commented Dec 4, 2019

My OS is Ubuntu 16.04.

This combination crashed twice and is running well once now; execution is not stable:
batch=128
subdivisions=1

(screenshots of CPU-RAM usage, GPU-RAM usage and the error attached)

This combination is bad; it always crashes:
batch=256
subdivisions=1

(screenshots of CPU-RAM usage, GPU-RAM usage and the error attached)

AlexeyAB (Owner) commented Dec 4, 2019

@erikguo Try to use workspace_size_limit_MB=8000

batch=256
subdivisions=1
optimized_memory=3
workspace_size_limit_MB=8000

kossolax commented Jan 22, 2020

Isn't there a GPU memory leak? After calling free_network, nvidia-smi still shows memory in use. Running it in a loop eventually fills the GPU and then crashes.

for (int p = 0; p < 1000; p++) {
    // parse the cfg and (optionally) load the weights for a fresh network
    network subnet = parse_network_cfg(cfgfile);
    if (weightfile) {
        load_weights(&subnet, weightfile);
    }

    *subnet.seen = 0;

    // train for one pass over the training set
    while (*subnet.seen < train_images_num) {
        pthread_join(load_thread, 0);
        train = buffer;
        load_thread = load_data(args);

        float loss = train_network_waitkey(subnet, train, 0);
        free_data(train);
    }

    // switch to batch=1 for mAP validation, then restore the training batch size
    int tmp = subnet.batch;
    set_batch_network(&subnet, 1);
    float map = validate_detector_map(datacfg, cfgfile, weightfile, 0.25, 0.5, 0, subnet.letter_box, &subnet);
    printf("%f", map);
    set_batch_network(&subnet, tmp);

    // free the network; GPU memory should be released here, yet nvidia-smi still shows it in use
    free_network(subnet);
}

AlexeyAB (Owner) commented:

@kossolax Is it related to optimized_memory=3 and GPU processing on CPU-RAM, or just related to free_network()?

kossolax commented:

I'm using optimized_memory=0, so it's just related to free_network. Since you changed the memory handling quite a lot, I guess this could be related. Should I start a new issue?

AlexeyAB (Owner) commented:

@kossolax Yes, start a new issue; I will investigate it.

WongKinYiu (Collaborator) commented:

@AlexeyAB Hello,

I think Cross-Iteration Batch Normalization can achieve a similar result with higher training speed:
https://github.com/Howal/Cross-iterationBatchNorm

AlexeyAB (Owner) commented Feb 21, 2020

@WongKinYiu Hi,

I implemented part of CBN - averaging statistics inside one batch. So you can increase accuracy just by increasing batch= in the cfg-file and setting cbn=1 instead of batch_normalize=1.
So batch=120 subdivisions=4 with CBN should work better than batch=120 subdivisions=4 with BN.
But batch=120 subdivisions=4 with CBN will work worse than batch=120 subdivisions=1 with BN.

I.e. using batch=64 subdivisions=8 with BN, avg mini_batch_size = 8
64/8 = 8

I.e. using batch=64 subdivisions=8 with CBN, avg mini_batch_size = 36
(8+16+24+32+40+48+56+64)/8 = 36

You can try it on Classifier csresnext50


So inside one batch it will average the values of mean and variance.
I.e. if you train with batch=64 subdivisions=16, there will be 16 mini_batches of size 4.

  • For the 1st mini_batch it will use Mean[1] & Variance[1]
  • For the 2nd mini_batch it will use avg(Mean[1], Mean[2]) & avg(Variance[1], Variance[2])
  • For the 3rd mini_batch it will use avg(Mean[1], Mean[2], Mean[3]) & avg(Variance[1], Variance[2], Variance[3])
    ....

To use it, add one of the following to a layer:

[convolutional]
cbn=1
filters=16
size=3
stride=1
pad=1
activation=leaky

or

[convolutional]
batch_normalize=1
cbn=1
filters=16
size=3
stride=1
pad=1
activation=leaky

or

[convolutional]
batch_normalize=2
filters=16
size=3
stride=1
pad=1
activation=leaky

Since we change the weights (conv-weights, biases, scales) only after the whole batch has been processed, averaging inside one batch (without cross-iteration) avoids the problem of stale statistics.
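A minimal numeric sketch of this within-batch averaging (an illustration only, not the darknet implementation; the per-mini_batch statistics for a single channel are made up):

/* Averaging BN statistics inside one batch: after mini_batch k, the layer
   normalizes with the running average of the means/variances seen so far. */
#include <stdio.h>

int main(void) {
    const int subdivisions = 16;           /* batch=64, subdivisions=16 -> mini_batch=4 */
    float mean_avg = 0.f, var_avg = 0.f;   /* averaged statistics used for normalization */

    for (int k = 1; k <= subdivisions; ++k) {
        float mean_k = 0.10f * k;          /* hypothetical mean of mini_batch k     */
        float var_k  = 1.0f + 0.01f * k;   /* hypothetical variance of mini_batch k */

        /* incremental average over mini_batches 1..k of the current batch */
        mean_avg += (mean_k - mean_avg) / k;
        var_avg  += (var_k  - var_avg)  / k;

        printf("mini_batch %2d: normalize with mean=%.3f variance=%.3f\n",
               k, mean_avg, var_avg);
    }
    return 0;
}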

Paper: https://arxiv.org/abs/2002.05712v2

(figure from the paper attached)

I used these formulas:

(screenshots of the formulas from the paper attached)

WongKinYiu (Collaborator) commented:

@AlexeyAB

Thank you a lot, I'll give you feedback after training finishes.

AlexeyAB (Owner) commented Mar 2, 2020

@WongKinYiu

I also added a dynamic mini_batch size for training with random=1: c814d56

Just add dynamic_minibatch=1 to the [net] section:

[net]
batch=64
subdivisions=8
dynamic_minibatch=1
width=416
height=416

...
[yolo]
random=1

So

  • the network resolution will vary from 288x288 to 608x608 due to random=1
  • for 608x608 the mini_batch size = batch/subdivisions = 8
  • for 416x416 the mini_batch size = 0.8 x ((608x608)/(416x416)) x batch/subdivisions = 13
  • for 288x288 the mini_batch size = 0.8 x ((608x608)/(288x288)) x batch/subdivisions = 28

So even if part of CBN does not work properly, you can still use dynamic_minibatch=1 to increase the mini_batch size.

0.8 is just a coefficient to avoid running out of memory at some network resolutions (sometimes cuDNN requires much more memory for a lower resolution than for a higher one), but you can try setting it to 0.9:

int new_dim_b = (int)(dim_b * 0.8);
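A small sketch that reproduces the mini_batch sizes listed above (an illustration of the scaling rule only; the actual darknet code may differ, e.g. in rounding and in how the base resolution is chosen):

#include <stdio.h>

int main(void) {
    const int batch = 64, subdivisions = 8;
    const int base_w = 608, base_h = 608;   /* largest resolution picked by random=1 */
    const float coeff = 0.8f;               /* safety factor against out-of-memory   */
    const int res[][2] = { {608, 608}, {416, 416}, {288, 288} };

    for (int i = 0; i < 3; ++i) {
        int w = res[i][0], h = res[i][1];
        float scale = (float)(base_w * base_h) / (float)(w * h);
        int mini_batch = (int)(coeff * scale * batch / subdivisions);
        /* never go below the configured batch/subdivisions */
        if (mini_batch < batch / subdivisions) mini_batch = batch / subdivisions;
        printf("%dx%d -> mini_batch = %d\n", w, h, mini_batch);
    }
    return 0;
}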


Also, you can adjust the mini_batch size to the amount of GPU-RAM you have (batch and subdivisions do not have to be multiples of 2):
batch / subdivisions = mini_batch_size
64/8 = 8
63/7 = 9
70/7 = 10
66/6 = 11
60/5 = 12
65/5 = 13
70/5 = 14
60/4 = 15
64/4 = 16

WongKinYiu (Collaborator) commented:

@AlexeyAB OK,

Thank you. SpineNet-49-omega will finish training in half an hour;
I will report the result soon.

Answergeng commented:

I tried yolov3-spp.cfg with the following settings:
optimized_memory=3
workspace_size_limit_MB=1000
My CPU-RAM is 64 GB; after loading, 20.9 GB is used,
but it always gets stuck here:

net.optimized_memory = 3
batch = 1, time_steps = 1, train = 0
yolov3-spp
net.optimized_memory = 3
pre_allocate... pinned_ptr = 0000000000000000
pre_allocate: size = 8192 MB, num_of_blocks = 8, block_size = 1024 MB
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
batch = 8, time_steps = 1, train = 1
Pinned block_id = 0, filled = 88.134911 %
Pinned block_id = 1, filled = 96.948578 %
Pinned block_id = 2, filled = 96.949005 %
Pinned block_id = 3, filled = 99.152946 %
Pinned block_id = 4, filled = 99.153809 %
Pinned block_id = 5, filled = 98.830368 %
Pinned block_id = 6, filled = 99.875595 %
Done! Loaded 85 layers from weights-file

Could you tell me why?

Answergeng commented:

Following up on my previous comment above - now I get this error:

CUDA Error: invalid device pointer: No error
Assertion failed: 0, file ....\src\utils.c, line 325

LucasSloan commented:

I just tried to run with this configuration:

batch=64
subdivisions=4
dynamic_minibatch=1
width=960
height=576
optimized_memory=3
workspace_size_limit_MB=8000

and got this error:

CUDA status Error: file: /home/lucas/Development/darknet/src/dark_cuda.c : () : line: 454 : build time: May 18 2020 - 15:30:02 

 CUDA Error: invalid device pointer
CUDA Error: invalid device pointer: Resource temporarily unavailable

I've tried several different values for workspace_size_limit_MB and subdivisions, and all fail with the same message. I was running with a single GPU, and CPU memory usage peaked at about 40 GB of 64 GB.

arnaud-nt2i commented Sep 4, 2020

@WongKinYiu @AlexeyAB @cenit @LukeAI

Hi everyone!
A few simple questions I could not find answers to anywhere else... even on Google Scholar for the second one...

  1. Is it possible to use dynamic_minibatch=1 while using a custom resize of the network, e.g. random=1.34?
    |--> Yes
  2. Is it possible to use dynamic_minibatch=1 and batch_normalize=2 at the same time without messing everything up?
    |--> Yes
  3. How is it possible that the mini_batch parameter has an influence on mAP with a consistent batch size?
    |--> Because batch normalization is done on the mini_batch size and not on the batch size.

As far as my understanding goes, the batch size is the number of samples processed before the weights are updated,
but mini_batch is just a computational trick to avoid loading and processing the whole batch at once, and it should not have an impact...

I would be very happy with an answer to these questions, and I'm sure I am not alone in not understanding this.

igoriok1994 commented:

What parameters can I use with an Nvidia Quadro M1000M (GPU_RAM = 2 GB) and an i7 with CPU_RAM = 64 GB?

(screenshots attached)

###
# Training
batch=64
subdivisions=8

###
width=608
height=608

###
optimized_memory=3
workspace_size_limit_MB=2000
mini_batch=16

I tried these, but 100+ hours for training is too long.


On another PC with a GTX 970 4 GB and an i5 with 16 GB of RAM, using these parameters:

###
# Training
batch=64
subdivisions=16

###
width=608
height=608

I get ~16-20 hours of training.

Classes=5, max iterations= 10000.

igoriok1994 commented:

On my laptop with these settings:

###
# Training
batch=64
subdivisions=32

###
width=608
height=608

### NOT USED ###
# optimized_memory=3
# workspace_size_limit_MB=2000
# mini_batch=16

I get this:

(screenshot attached)

Btw, this is Tiny YOLOv4.

pullmyleg commented:

@igoriok1994 what are you trying to achieve? What is your end goal or output? It will help with recommending settings.

igoriok1994 commented:

@igoriok1994 what are you trying to achieve? What is your end goal or output? It will help with recommending settings.

I want to speed up training without mAP loss :)

pullmyleg commented:

@igoriok1994 CPU memory is very slow - in my experience, 5x+ slower than regular GPU training. The benefit of CPU-memory training is to increase precision (mAP) by increasing the batch size beyond the memory available on your GPU.

nanhui69 commented:

> (quoting @AlexeyAB's CBN explanation above)

Do we need to change the batch_normalize setting in every [convolutional] section of the cfg file? The number of convolutional layers is 73. @AlexeyAB

pullmyleg commented:

@AlexeyAB have you seen this implementation for decreasing memory usage, allowing larger batches with the same GPU memory? https://github.com/MegEngine/MegEngine/wiki/Reduce-GPU-memory-usage-by-Dynamic-Tensor-Rematerialization
