Beta: Using CPU-RAM instead of GPU-VRAM for large Mini_batch=32 - 128 #4386

Open · AlexeyAB opened this issue Nov 26, 2019 · 71 comments
Labels: Likely bug (Maybe a bug, maybe not), ToDo (RoadMap)

AlexeyAB (Owner) commented Nov 26, 2019

Higher mini_batch -> higher accuracy (mAP / Top1 / Top5).

Training on the GPU while keeping some arrays in CPU-RAM allows you to increase the mini_batch size significantly, by 4x-16x or more.

You can train YOLOv3-SPP with a 16x larger mini_batch at about 5x lower speed; this should give roughly +2-4 mAP.

Use in your cfg-file:

[net]
batch=64
subdivisions=2
width=416
height=416
optimized_memory=3
workspace_size_limit_MB=1000
  • multi-GPU is not tested
  • random=1 is not supported

Tested:

  • GeForce RTX 2070 - 8 GB VRAM
  • CPU Core i7 6700K - 32 GB RAM

Tested on the model https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-spp.cfg with width=416 height=416 on 8 GB GPU-VRAM + 32 GB CPU-RAM

./darknet detector train data/obj.data yolov3-spp.cfg -map

  • default: mini_batch=8 = batch_64 / subdivisions_8, GPU-RAM-usage=6.5 GB, iteration = 3 sec

  • optimized_memory=1: mini_batch=8 = batch_64 / subdivisions_8, GPU-RAM-usage=5.8 GB, iteration = 3 sec

  • optimized_memory=2 workspace_size_limit_MB=1000: mini_batch=20 = batch_60 / subdivisions_3, GPU-RAM-usage=5.4 GB, iteration = 15 sec

  • optimized_memory=3 workspace_size_limit_MB=1000: mini_batch=32 = batch_64 / subdivisions_2, GPU-RAM-usage=4.0 GB, iteration = 15 sec (CPU-RAM-usage = 31 GB)


Not well tested yet:

  • optimized_memory=3 workspace_size_limit_MB=2000: mini_batch=64 = batch_128 / subdivisions_2, GPU-RAM-usage=7.5 GB, iteration = 15 sec (CPU-RAM-usage = 62 GB)

  • optimized_memory=3 workspace_size_limit_MB=2000 or 4000: mini_batch=128 = batch_256 / subdivisions_2, GPU-RAM-usage=13.5 GB, iteration = 15 sec (CPU-RAM-usage = 124 GB)



Example of trained model: yolov3-tiny_pan_gaus_giou_scale.cfg.txt

mini_batch=32 gives about +5 mAP@0.5 compared to mini_batch=8 (training charts attached).
AlexeyAB added the ToDo (RoadMap) label on Nov 26, 2019
AlexeyAB changed the title from "Using CPU-RAM instead of GPU-VRAM for large Mini_batch=32 - 128" to "Beta: Using CPU-RAM instead of GPU-VRAM for large Mini_batch=32 - 128" on Nov 26, 2019
HagegeR commented Nov 26, 2019

Do you think switching to this higher mini_batch after having already trained the usual way will give added value as well?

AlexeyAB (Owner) commented Nov 26, 2019

@HagegeR I didn't test it thoroughly, so just try.

In general - yes.

You can try to train the first several percent of iterations with a large mini_batch,
then continue training with a small mini_batch for speed,
and then train the last few percent of iterations with a large mini_batch again, for example as sketched below.
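A possible three-stage schedule, sketched with the commands already used in this thread (the backup file name backup/yolov3-spp_last.weights and the exact split of iterations are assumptions; each stage simply restarts from the last saved weights after editing the [net] section as indicated):

# Stage 1 - first few % of iterations with a large mini_batch (CPU-RAM mode):
#   [net]: batch=64 subdivisions=2 optimized_memory=3 workspace_size_limit_MB=1000
./darknet detector train data/obj.data yolov3-spp.cfg -map

# Stage 2 - middle of training with a small mini_batch (fast, GPU-only):
#   [net]: batch=64 subdivisions=8, optimized_memory and workspace_size_limit_MB removed
./darknet detector train data/obj.data yolov3-spp.cfg backup/yolov3-spp_last.weights -map

# Stage 3 - last few % of iterations with a large mini_batch again:
#   [net]: batch=64 subdivisions=2 optimized_memory=3 workspace_size_limit_MB=1000
./darknet detector train data/obj.data yolov3-spp.cfg backup/yolov3-spp_last.weights -map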

LukeAI commented Nov 28, 2019

Could you please explain in more detail what the options mean, or how to work out a good configuration? I'm trying to get this feature going with my custom Gaussian cfg, but I'm not having success so far.
What do these mean?
optimized_memory=3
workspace_size_limit_MB=1000

AlexeyAB (Owner) commented Nov 28, 2019

@LukeAI

The optimized_memory= parameter controls GPU-memory optimization:

  • optimized_memory=0 - no additional memory optimization (default)

  • optimized_memory=1 - optimizes delta_gpu: instead of many per-layer arrays, it allocates two arrays, global_delta_gpu & state_delta_gpu, which are reused by most layers. This does not slow down training, but may work incorrectly with new models added later.

  • optimized_memory=2 - additionally uses CPU-RAM instead of GPU-VRAM for the arrays output_gpu (layer output), activation_input_gpu (activation input) and x_gpu (batch-normalization input) in each layer

  • optimized_memory=3 - additionally uses CPU-RAM instead of GPU-VRAM for the global_delta_gpu & state_delta_gpu arrays

  • workspace_size_limit_MB=1000 - allocates 1000 MB for the cuDNN workspace.

    • If GPU memory is not enough (CUDA out of memory), try reducing this value.
    • If Darknet hangs or crashes with strange errors, try increasing this value.
    • (Use 1000 if you have 32 GB of CPU-RAM and 2000 if you have 64 GB.)
    • If the GPU is lost, try rebooting your PC.

For a YOLOv3-SPP 416x416 model on an 8 GB GPU and 32 GB of CPU-RAM, try: https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-spp.cfg

[net]
batch=64
subdivisions=2
width=416
height=416
optimized_memory=3
workspace_size_limit_MB=1000

I'm trying to get this feature going with my custom gaussian cfg but I'm not having success so far.

What problem did you encounter?

What GPU do you use?
How much CPU-RAM do you have?
Rename your cfg-file to a .txt file and attach it.

AlexeyAB (Owner) commented Nov 30, 2019

Such accuracy:

  • MobileNetv3 - Top1 75.37%
  • MixNet-S - Top1 75.68%
  • EfficientNetB0 - Top1 76.3%

can be achieved only if you train with a very large mini_batch size (~1024).

With a small mini_batch size (~32), instead of 76.3% Top1 we get (#3380 (comment)):

  • Our EfficientNet B0 (224x224) 0.9 BFLOPS - 0.45 B_FMA (16ms / RTX 2070), 4.9M params - 71.3% Top1
  • Official EfficientNetB0 (224x224) 0.78 BFLOPS - 0.39 FMA, 5.3M params - 70.0% Top1

erikguo commented Dec 3, 2019

@AlexeyAB

I tried mixnet_m_gpu.cfg with the following settings:

optimized_memory=2
workspace_size_limit_MB=1000

I always get the following error:

 243 conv    600       1 x 1/ 1      1 x   1 x1200 ->    1 x   1 x 600 0.001 BF
 244 conv   1200       1 x 1/ 1      1 x   1 x 600 ->    1 x   1 x1200 0.001 BF
 245 scale Layer: 241
 246 conv    200/   2  1 x 1/ 1      9 x   3 x1200 ->    9 x   3 x 200 0.006 BF
 247 Shortcut Layer: 231
 248 conv   1536       1 x 1/ 1      9 x   3 x 200 ->    9 x   3 x1536 0.017 BF
 249 avg                             9 x   3 x1536 ->   1536
 250 dropout       p = 0.25                  1536  ->   1536
CUDA status Error: file: ./src/dark_cuda.c : () : line: 423 : build time: Dec  3 2019 - 23:02:36 
CUDA Error: invalid argument
CUDA Error: invalid argument: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

Could you help to find out the cause?

AlexeyAB (Owner) commented Dec 3, 2019

@erikguo I fixed it: 5d0352f

Just tried mixnet_m_gpu.cfg with

[net]
# Training
batch=120
subdivisions=2
optimized_memory=3
workspace_size_limit_MB=1000

erikguo commented Dec 3, 2019

Thank you very much!

I will try now.

erikguo commented Dec 3, 2019

By the way, I noticed the decay value (0.00005) in mixnet_m_gpu.cfg differs from the other cfgs (decay=0.0005):

momentum=0.9
decay=0.00005

Is it a special setting for mixnet_m_gpu.cfg, or just a typo?

@AlexeyAB

erikguo commented Dec 3, 2019

@AlexeyAB

I still get the following error:

Pinned block_id = 3, filled = 99.917603 % 
 241 route  240 238 236 234 	                   ->    9 x   3 x1200 
 242 avg                             9 x   3 x1200 ->   1200
 243 conv    600       1 x 1/ 1      1 x   1 x1200 ->    1 x   1 x 600 0.001 BF
 244 conv   1200       1 x 1/ 1      1 x   1 x 600 ->    1 x   1 x1200 0.001 BF
 245 scale Layer: 241
 246 conv    200/   2  1 x 1/ 1      9 x   3 x1200 ->    9 x   3 x 200 0.006 BF
 247 Shortcut Layer: 231
 248 conv   1536       1 x 1/ 1      9 x   3 x 200 ->    9 x   3 x1536 0.017 BF
 249 avg                             9 x   3 x1536 ->   1536
 250 dropout       p = 0.25                  1536  ->   1536
 251 conv     51       1 x 1/ 1      1 x   1 x1536 ->    1 x   1 x  51 0.000 BF
 252 softmax                                          51

 Pinned block_id = 4, filled = 98.600769 % 
Total BFLOPS 0.592 
 Allocate additional workspace_size = 18.58 MB 
Loading weights from backup_all/mixnet_m_gpu_last.weights...
 seen 64 
Done! Loaded 253 layers from weights-file 
Learning Rate: 0.064, Momentum: 0.9, Decay: 0.0005
304734
Loaded: 0.933879 seconds
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 668 : build time: Dec  3 2019 - 23:02:38 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

AlexeyAB (Owner) commented Dec 3, 2019

@erikguo Do you get this error if you disable memory optimization?
Comment out these lines:

#optimized_memory=3
#workspace_size_limit_MB=1000

By the way, I found the 'Decay' value (0.00005) is different from the other cfg(decay=0.0005)

Since MixNet is a continuation of EfficientNet, which is in turn a continuation of MobileNet (...), the EfficientNet value is used: decay=0.00001 (https://arxiv.org/pdf/1905.11946v2.pdf):

weight decay 1e-5;

erikguo commented Dec 3, 2019

After commenting out these lines, training runs very well. With these lines enabled it only runs well occasionally; in most cases it crashes.

@AlexeyAB

AlexeyAB (Owner) commented Dec 4, 2019

@erikguo

  • How many iterations before crashing?
  • What is the error message?
  • How much CPU-RAM do you have?
  • What GPU do you use?
  • Do you use GPU=1 CUDNN=1 OPENCV=1 CUDNN_HALF=1 ?

erikguo commented Dec 4, 2019

@AlexeyAB

It crashes at the first iteration.

The crash message is as follows:

Pinned block_id = 3, filled = 99.917603 % 
 241 route  240 238 236 234 	                   ->    9 x   3 x1200 
 242 avg                             9 x   3 x1200 ->   1200
 243 conv    600       1 x 1/ 1      1 x   1 x1200 ->    1 x   1 x 600 0.001 BF
 244 conv   1200       1 x 1/ 1      1 x   1 x 600 ->    1 x   1 x1200 0.001 BF
 245 scale Layer: 241
 246 conv    200/   2  1 x 1/ 1      9 x   3 x1200 ->    9 x   3 x 200 0.006 BF
 247 Shortcut Layer: 231
 248 conv   1536       1 x 1/ 1      9 x   3 x 200 ->    9 x   3 x1536 0.017 BF
 249 avg                             9 x   3 x1536 ->   1536
 250 dropout       p = 0.25                  1536  ->   1536
 251 conv     51       1 x 1/ 1      1 x   1 x1536 ->    1 x   1 x  51 0.000 BF
 252 softmax                                          51

 Pinned block_id = 4, filled = 98.600769 % 
Total BFLOPS 0.592 
 Allocate additional workspace_size = 18.58 MB 
Loading weights from backup_all/mixnet_m_gpu_last.weights...
 seen 64 
Done! Loaded 253 layers from weights-file 
Learning Rate: 0.064, Momentum: 0.9, Decay: 5e-05
304734
Loaded: 1.104122 seconds
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 668 : build time: Dec  3 2019 - 23:02:38 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.
Aborted (core dumped)

My server has 128 GB of RAM and 4 x 1080 Ti GPUs with 11 GB each.

Darknet is compiled with GPU=1 CUDNN=1 OPENCV=1 CUDNN_HALF=0

AlexeyAB (Owner) commented Dec 4, 2019

@erikguo

  • Do you use 4 x GPU for training?
  • What command do you use for training?
  • What batch and subdivisions did you set?

I just trained 2600 iterations successfully on an RTX 2070 with a Core i7 and 32 GB of CPU-RAM using this command:
darknet.exe classifier train cfg/imagenet1k_c.data cfg/mixnet_m_gpu.cfg backup/mixnet_m_gpu_last.weights -topk

and this cfg-file: mixnet_m_gpu.cfg.txt

erikguo commented Dec 4, 2019

I use only one GPU for training.

The command is as follows:

darknet classifier train dengdi.data mixnet_m_gpu.cfg backup/mixnet_m_gpu_last.cfg -dont_show 

batch and subdivisions are as follows:

batch=128
subdivisions=2

mixnet_m_gpu_mem.cfg.txt

@AlexeyAB

AlexeyAB (Owner) commented Dec 4, 2019

@erikguo

  • Why do you use height=96 width=288?

  • I successfully ran training with your cfg-file mixnet_m_gpu_mem.cfg.txt on an RTX 2070 with 8 GB VRAM + 32 GB CPU-RAM:
    darknet.exe classifier train cfg/imagenet1k_c.data cfg/mixnet_m_gpu_mem.cfg backup/mixnet_m_gpu_last.weights -topk


(screenshots of the training run attached)

erikguo commented Dec 4, 2019

@AlexeyAB

I have tried the following combinations:

batch=128
subdivisions=2
runs very well

batch=256
subdivisions=2
runs very well

batch=256
subdivisions=1
crashes in the first iteration

batch=512
subdivisions=2
crashes in the first iteration

erikguo commented Dec 4, 2019

Because my images' aspect ratio is about 1:3 (h:w), I set the network size to a rectangle.

AlexeyAB (Owner) commented Dec 4, 2019

@erikguo
Check this combination:
batch=128
subdivisions=1


batch=256
subdivisions=1
crashes in the first iteration

  • Show screenshot of CPU_RAM usage
  • Show screenshot of GPU_RAM usage
  • Show screenshot of the error message

erikguo commented Dec 4, 2019

My OS is Ubuntu 16.04.

This combination crashed twice and is running well once now; execution is not stable:
batch=128
subdivisions=1

(screenshots of CPU-RAM usage, GPU-RAM usage and the error attached)

This combination is bad; it always crashes:
batch=256
subdivisions=1

(screenshots of CPU-RAM usage, GPU-RAM usage and the error attached)

AlexeyAB (Owner) commented Dec 4, 2019

@erikguo Try to use workspace_size_limit_MB=8000

batch=256
subdivisions=1
optimized_memory=3
workspace_size_limit_MB=8000

kossolax commented Jan 22, 2020

Isn't there a GPU memory leak? After calling free_network, nvidia-smi still shows memory in use. Running it in a loop eventually fills the GPU and then crashes.

for (int p = 0; p < 1000; p++) {
    // parse the cfg and (optionally) load the weights for a fresh network
    network subnet = parse_network_cfg(cfgfile);
    if (weightfile) {
        load_weights(&subnet, weightfile);
    }

    *subnet.seen = 0;

    // train for one pass over the training set
    while (*subnet.seen < train_images_num) {
        pthread_join(load_thread, 0);
        train = buffer;
        load_thread = load_data(args);

        float loss = train_network_waitkey(subnet, train, 0);
        free_data(train);
    }

    // switch to batch=1 for mAP validation, then restore the training batch size
    int tmp = subnet.batch;
    set_batch_network(&subnet, 1);
    float map = validate_detector_map(datacfg, cfgfile, weightfile, 0.25, 0.5, 0, subnet.letter_box, &subnet);
    printf("%f", map);
    set_batch_network(&subnet, tmp);

    // free the network; GPU memory should be released here, yet nvidia-smi still shows it in use
    free_network(subnet);
}

AlexeyAB (Owner) commented:

@kossolax Is it related to optimized_memory=3 and GPU processing on CPU-RAM, or just related to free_network()?

kossolax commented:

I'm using optimized_memory=0, so it's just related to free_network. Since you changed the memory handling quite a lot, I guess this could be related. Should I start a new issue?

AlexeyAB (Owner) commented:

@kossolax Yes, start a new issue; I will investigate it.

WongKinYiu (Collaborator) commented:

@AlexeyAB Hello,

I think Cross-Iteration Batch Normalization can achieve a similar result with higher training speed:
https://github.com/Howal/Cross-iterationBatchNorm

AlexeyAB (Owner) commented Feb 21, 2020

@WongKinYiu Hi,

I implemented part of CBN - averaging statistics inside one batch. So you can increase accuracy just by increasing batch= in the cfg-file and setting cbn=1 instead of batch_normalize=1.
So batch=120 subdivisions=4 with CBN should work better than batch=120 subdivisions=4 with BN.
But batch=120 subdivisions=4 with CBN will work worse than batch=120 subdivisions=1 with BN.

I.e. using batch=64 subdivisions=8 with BN, avg mini_batch_size = 8
64/8 = 8

I.e. using batch=64 subdivisions=8 with CBN, avg mini_batch_size = 36
(8+16+24+32+40+48+56+64)/8 = 36

You can try it on Classifier csresnext50


So inside one batch it will average the values of mean and variance.
I.e. if you train with batch=64 subdivisions=16, there will be 16 mini_batches of size 4.

  • For the 1st mini_batch it will use Mean[1] & Variance[1]
  • For the 2nd mini_batch it will use avg(Mean[1], Mean[2]) & avg(Variance[1], Variance[2])
  • For the 3rd mini_batch it will use avg(Mean[1], Mean[2], Mean[3]) & avg(Variance[1], Variance[2], Variance[3])
    ....

To use it, add one of the following to a layer:

[convolutional]
cbn=1
filters=16
size=3
stride=1
pad=1
activation=leaky

or

[convolutional]
batch_normalize=1
cbn=1
filters=16
size=3
stride=1
pad=1
activation=leaky

or

[convolutional]
batch_normalize=2
filters=16
size=3
stride=1
pad=1
activation=leaky

Since we change the weights (conv-weights, biases, scales) only after the whole batch has been processed, averaging inside one batch (without cross-iteration) avoids the problem of stale statistics.
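A minimal numeric sketch of this within-batch averaging (an illustration only, not the darknet implementation; the per-mini_batch statistics for a single channel are made up):

/* Averaging BN statistics inside one batch: after mini_batch k, the layer
   normalizes with the running average of the means/variances seen so far. */
#include <stdio.h>

int main(void) {
    const int subdivisions = 16;           /* batch=64, subdivisions=16 -> mini_batch=4 */
    float mean_avg = 0.f, var_avg = 0.f;   /* averaged statistics used for normalization */

    for (int k = 1; k <= subdivisions; ++k) {
        float mean_k = 0.10f * k;          /* hypothetical mean of mini_batch k     */
        float var_k  = 1.0f + 0.01f * k;   /* hypothetical variance of mini_batch k */

        /* incremental average over mini_batches 1..k of the current batch */
        mean_avg += (mean_k - mean_avg) / k;
        var_avg  += (var_k  - var_avg)  / k;

        printf("mini_batch %2d: normalize with mean=%.3f variance=%.3f\n",
               k, mean_avg, var_avg);
    }
    return 0;
}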

Paper: https://arxiv.org/abs/2002.05712v2

(figure from the paper attached)

I used these formulas:

(screenshots of the formulas from the paper attached)

WongKinYiu (Collaborator) commented:

@AlexeyAB

Thank you a lot, I'll give you feedback after training finishes.

AlexeyAB (Owner) commented Mar 2, 2020

@WongKinYiu

I also added a dynamic mini_batch size for training with random=1: c814d56

Just add dynamic_minibatch=1 to the [net] section:

[net]
batch=64
subdivisions=8
dynamic_minibatch=1
width=416
height=416

...
[yolo]
random=1

So

  • the network resolution will vary from 288x288 to 608x608 due to random=1
  • for 608x608 the mini_batch size = batch/subdivisions = 8
  • for 416x416 the mini_batch size = 0.8 x ((608x608)/(416x416)) x batch/subdivisions = 13
  • for 288x288 the mini_batch size = 0.8 x ((608x608)/(288x288)) x batch/subdivisions = 28

So even if part of CBN does not work properly, you can still use dynamic_minibatch=1 to increase the mini_batch size.

0.8 is just a coefficient to avoid running out of memory at some network resolutions (sometimes cuDNN requires much more memory for a lower resolution than for a higher one), but you can try setting it to 0.9:

int new_dim_b = (int)(dim_b * 0.8);
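A small sketch that reproduces the mini_batch sizes listed above (an illustration of the scaling rule only; the actual darknet code may differ, e.g. in rounding and in how the base resolution is chosen):

#include <stdio.h>

int main(void) {
    const int batch = 64, subdivisions = 8;
    const int base_w = 608, base_h = 608;   /* largest resolution picked by random=1 */
    const float coeff = 0.8f;               /* safety factor against out-of-memory   */
    const int res[][2] = { {608, 608}, {416, 416}, {288, 288} };

    for (int i = 0; i < 3; ++i) {
        int w = res[i][0], h = res[i][1];
        float scale = (float)(base_w * base_h) / (float)(w * h);
        int mini_batch = (int)(coeff * scale * batch / subdivisions);
        /* never go below the configured batch/subdivisions */
        if (mini_batch < batch / subdivisions) mini_batch = batch / subdivisions;
        printf("%dx%d -> mini_batch = %d\n", w, h, mini_batch);
    }
    return 0;
}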


Also, you can adjust the mini_batch size to the amount of GPU-RAM you have (batch and subdivisions do not have to be multiples of 2):
batch / subdivisions = mini_batch_size
64/8 = 8
63/7 = 9
70/7 = 10
66/6 = 11
60/5 = 12
65/5 = 13
70/5 = 14
60/4 = 15
64/4 = 16

WongKinYiu (Collaborator) commented:

@AlexeyAB OK,

Thank you. SpineNet-49-omega will finish training in half an hour;
I will report the result soon.

Answergeng commented:

I tried yolov3-spp.cfg with the following settings:
optimized_memory=3
workspace_size_limit_MB=1000
My CPU-RAM is 64 GB; after loading, 20.9 GB is used,
but it always gets stuck here:

net.optimized_memory = 3
batch = 1, time_steps = 1, train = 0
yolov3-spp
net.optimized_memory = 3
pre_allocate... pinned_ptr = 0000000000000000
pre_allocate: size = 8192 MB, num_of_blocks = 8, block_size = 1024 MB
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
batch = 8, time_steps = 1, train = 1
Pinned block_id = 0, filled = 88.134911 %
Pinned block_id = 1, filled = 96.948578 %
Pinned block_id = 2, filled = 96.949005 %
Pinned block_id = 3, filled = 99.152946 %
Pinned block_id = 4, filled = 99.153809 %
Pinned block_id = 5, filled = 98.830368 %
Pinned block_id = 6, filled = 99.875595 %
Done! Loaded 85 layers from weights-file

Could you tell me why?

Answergeng commented:

Following up on my previous comment above - now I get this error:

CUDA Error: invalid device pointer: No error
Assertion failed: 0, file ....\src\utils.c, line 325

LucasSloan commented:

I just tried to run with this configuration:

batch=64
subdivisions=4
dynamic_minibatch=1
width=960
height=576
optimized_memory=3
workspace_size_limit_MB=8000

and got this error:

CUDA status Error: file: /home/lucas/Development/darknet/src/dark_cuda.c : () : line: 454 : build time: May 18 2020 - 15:30:02 

 CUDA Error: invalid device pointer
CUDA Error: invalid device pointer: Resource temporarily unavailable

I've tried several different values for workspace_size_limit_MB and subdivisions, and all fail with the same message. I was running with a single GPU, and CPU memory usage peaked at about 40 GB of 64 GB.

arnaud-nt2i commented Sep 4, 2020

@WongKinYiu @AlexeyAB @cenit @LukeAI

Hi everyone!
A few simple questions I could not find answers to anywhere else... even on Google Scholar for the second one...

  1. Is it possible to use dynamic_minibatch=1 while using a custom resize of the network, e.g. random=1.34?
    |--> Yes
  2. Is it possible to use dynamic_minibatch=1 and batch_normalize=2 at the same time without messing everything up?
    |--> Yes
  3. How is it possible that the mini_batch parameter has an influence on mAP with a consistent batch size?
    |--> Because batch normalization is done on the mini_batch size and not on the batch size.

As far as my understanding goes, the batch size is the number of samples processed before the weights are updated,
but mini_batch is just a computational trick to avoid loading and processing the whole batch at once, and it should not have an impact...

I would be very happy with an answer to these questions, and I'm sure I am not alone in not understanding this.

igoriok1994 commented:

What parameters can I use with an Nvidia Quadro M1000M (GPU_RAM = 2 GB) and an i7 with CPU_RAM = 64 GB?

(screenshots attached)

###
# Training
batch=64
subdivisions=8

###
width=608
height=608

###
optimized_memory=3
workspace_size_limit_MB=2000
mini_batch=16

I tried these, but 100+ hours for training is too long.


On another PC with a GTX 970 4 GB and an i5 with 16 GB of RAM, using these parameters:

###
# Training
batch=64
subdivisions=16

###
width=608
height=608

I get ~16-20 hours of training.

Classes=5, max iterations= 10000.

igoriok1994 commented:

On my laptop with these settings:

###
# Training
batch=64
subdivisions=32

###
width=608
height=608

### NOT USED ###
# optimized_memory=3
# workspace_size_limit_MB=2000
# mini_batch=16

I get this:

(screenshot attached)

Btw, this is Tiny YOLOv4.

pullmyleg commented:

@igoriok1994 what are you trying to achieve? What is your end goal or output? It will help with recommending settings.

igoriok1994 commented:

@igoriok1994 what are you trying to achieve? What is your end goal or output? It will help with recommending settings.

I want to speed up training without mAP loss :)

pullmyleg commented:

@igoriok1994 CPU memory is very slow - in my experience, 5x+ slower than regular GPU training. The benefit of CPU-memory training is to increase precision (mAP) by increasing the batch size beyond the memory available on your GPU.

nanhui69 commented:

> (quoting @AlexeyAB's CBN explanation above)

Do we need to change the batch_normalize setting in every [convolutional] section of the cfg file? The number of convolutional layers is 73. @AlexeyAB

pullmyleg commented:

@AlexeyAB have you seen this implementation for decreasing memory usage, allowing larger batches with the same GPU memory? https://github.com/MegEngine/MegEngine/wiki/Reduce-GPU-memory-usage-by-Dynamic-Tensor-Rematerialization
