The OpenBLAS inference benchmark seems wrong in IntelOptimizedPaddle.md.
In all three networks, when BatchSize increases, the images/second decreases.
VGG-19 (images/second)

| BatchSize | 1 | 2 | 4 | 8 | 16 |
|-----------|---|---|---|---|----|
| OpenBLAS | 1.07 | 1.08 | 1.06 | 0.88 | 0.65 |
ResNet-50 (images/second)

| BatchSize | 1 | 2 | 4 | 8 | 16 |
|-----------|---|---|---|---|----|
| OpenBLAS | 3.35 | 3.19 | 3.09 | 2.55 | 1.96 |
GoogLeNet (images/second)

| BatchSize | 1 | 2 | 4 | 8 | 16 |
|-----------|---|---|---|---|----|
| OpenBLAS | 12.04 | 11.31 | 10.00 | 9.07 | 4.34 |
Possible Reason
Why does images/second increase with BatchSize in the training benchmark?
- The settings follow `OPENBLAS_NUM_THREADS * trainer_count = core number`.
- The minimum BatchSize used in training is 64, which is larger than the core number (40 in the experiment). Thus, we `export OPENBLAS_NUM_THREADS=1` and set `trainer_count=40`.
- However, in inference the BatchSize is smaller than the core number. For example, when `BatchSize=2`, we `export OPENBLAS_NUM_THREADS=20` and set `trainer_count=2`, which may cause a conflict in thread affinity.
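The rule above can be sketched in a few lines. This is only an illustration: `thread_settings` is a hypothetical helper, and the choice `trainer_count = min(BatchSize, core number)` is an assumption that happens to reproduce the two configurations mentioned here (BatchSize=64 and BatchSize=2 on 40 cores), not something taken from the benchmark scripts.

```python
def thread_settings(batch_size, core_count=40):
    """Return (OPENBLAS_NUM_THREADS, trainer_count) for a given batch size.

    Assumption: use one trainer per sample, capped at the core count, and
    give each trainer an equal share of the cores.
    """
    if batch_size < 1 or core_count < 1:
        raise ValueError("batch_size and core_count must be positive")
    trainer_count = min(batch_size, core_count)
    # Integer division: some cores stay idle when the split is not even.
    openblas_num_threads = core_count // trainer_count
    return openblas_num_threads, trainer_count


for batch_size in (1, 2, 4, 8, 16, 64):
    threads, trainers = thread_settings(batch_size)
    print(f"BatchSize={batch_size:>2}: "
          f"OPENBLAS_NUM_THREADS={threads}, trainer_count={trainers}")
```

Note that with 40 cores and BatchSize=16 this split uses only 32 cores (2 threads for each of 16 trainers), so the product does not always equal the core number exactly.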
From the OpenBLAS FAQ, "How could I disable OpenBLAS threading affinity on runtime?":

> You can define the OPENBLAS_MAIN_FREE or GOTOBLAS_MAIN_FREE environment variable to disable threading affinity on runtime. For example, before the running, `export OPENBLAS_MAIN_FREE=1`.
> Alternatively, you can disable affinity feature with enabling `NO_AFFINITY=1` in Makefile.rule.

https://github.com/xianyi/OpenBLAS/wiki/Faq#no_affinity
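For scripted benchmark runs, the same environment variable can be passed to a child process instead of exporting it in the shell. The sketch below assumes the benchmark is launched as a separate command; `env_without_affinity` and `run_with_affinity_disabled` are hypothetical helper names, and the child command used in the example only echoes the variable back.

```python
import os
import subprocess
import sys


def env_without_affinity(base_env=None):
    """Copy the environment and add OPENBLAS_MAIN_FREE=1."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENBLAS_MAIN_FREE"] = "1"
    return env


def run_with_affinity_disabled(cmd):
    """Run cmd in a child process that sees OPENBLAS_MAIN_FREE=1."""
    return subprocess.run(cmd, env=env_without_affinity(), check=True,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in for the real benchmark command: print what the child sees.
    child = [sys.executable, "-c",
             "import os; print(os.environ['OPENBLAS_MAIN_FREE'])"]
    print(run_with_affinity_disabled(child).stdout.strip())
```

Setting the variable in the child's environment means it is already present when the child process starts, which matches the "before the running" wording in the FAQ.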
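Thus, I ran `export OPENBLAS_MAIN_FREE=1` and tested VGG inference again. The result speeds up (images/second, before -> after):

| BatchSize | 1 | 2 | 4 | 8 | 16 |
|-----------|---|---|---|---|----|
| OpenBLAS | 1.07->1.08 | 1.08->1.99 | 1.06->3.64 | 0.88->3.57 | 0.65->2.27 |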
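To make the size of the change explicit, the speedup ratios can be computed directly from the before/after numbers reported above. This is plain arithmetic on those figures, not a new measurement.

```python
batch_sizes = [1, 2, 4, 8, 16]
before = [1.07, 1.08, 1.06, 0.88, 0.65]  # images/second, default affinity
after = [1.08, 1.99, 3.64, 3.57, 2.27]   # images/second, OPENBLAS_MAIN_FREE=1

# Speedup = throughput after / throughput before, per batch size.
speedups = {bs: a / b for bs, b, a in zip(batch_sizes, before, after)}

for bs, ratio in speedups.items():
    print(f"BatchSize={bs:>2}: {ratio:.2f}x")
```

BatchSize=1 is essentially unchanged, while BatchSize=8 shows the largest relative gain at roughly 4x.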
@tensor-tang Can you help double check this result?
Solution
If OpenBLAS threading affinity does affect the elapsed time, should we set it automatically in the program, like we do for MKL?
- Should we `export OPENBLAS_MAIN_FREE` in paddle/scripts/submit_local.sh.in and python/paddle/v2/__init__.py, or directly add `NO_AFFINITY=1` in openblas.cmake?
- Is `OPENBLAS_MAIN_FREE` related to hyperthreading, like `KMP_AFFINITY` is?

@tensor-tang Can you give some suggestions about this?
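As a starting point for discussion, setting it automatically on the Python side could look roughly like the sketch below. Everything here is an assumption: `set_openblas_env_vars` is a hypothetical helper, not an existing Paddle function, and it would have to run before the OpenBLAS library is loaded for the variables to be picked up.

```python
import os


def set_openblas_env_vars(trainer_count, core_count=None, environ=None):
    """Hypothetical helper: fill in OpenBLAS variables the user has not set.

    Uses setdefault so that values exported by the user always win.
    """
    environ = os.environ if environ is None else environ
    if core_count is None:
        core_count = os.cpu_count() or 1
    # Split the cores evenly across trainers, with at least one thread each.
    threads = max(1, core_count // max(1, trainer_count))
    environ.setdefault("OPENBLAS_NUM_THREADS", str(threads))
    # Disable OpenBLAS threading affinity unless the user chose otherwise.
    environ.setdefault("OPENBLAS_MAIN_FREE", "1")
    return environ


if __name__ == "__main__":
    demo_env = {}
    set_openblas_env_vars(trainer_count=2, core_count=40, environ=demo_env)
    print(demo_env)
```

Using `setdefault` rather than plain assignment keeps the behaviour opt-out: anyone who already exported their own values is not overridden.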