Performance comparison: AMD with ROCm vs NVIDIA with cuDNN? #173

Open
NIIAS3050 opened this issue Sep 20, 2018 · 150 comments

@NIIAS3050

It would be very useful to compare real training performance on AMD and NVIDIA cards.
For NVIDIA cards we have plenty of graphs and tests, for example:
https://github.com/u39kun/deep-learning-benchmark
But for AMD cards there are no performance metrics.
It would be great to make a direct comparison between AMD and NVIDIA with the latest cuDNN.

@pricebenjamin

If you happen to have access to some AMD GPUs that are supported by the ROCm stack, consider running some benchmarks from the TensorFlow benchmarks repository. The README in the benchmarks/scripts/tf_cnn_benchmarks directory provides some example usage.

Those scripts were used for the benchmarks shown on TensorFlow's website.

I've run the following on a Vega FE (tensorflow-rocm==1.11.0 and rocm-dkms==1.9.211).

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

This yields the following.

[...]
Done warm up
Step	Img/sec	total_loss
1	images/sec: 182.2 +/- 0.0 (jitter = 0.0)	8.325
10	images/sec: 182.3 +/- 0.1 (jitter = 0.2)	8.170
20	images/sec: 182.3 +/- 0.1 (jitter = 0.3)	8.247
30	images/sec: 182.1 +/- 0.1 (jitter = 0.3)	8.369
40	images/sec: 182.0 +/- 0.1 (jitter = 0.4)	8.401
50	images/sec: 181.9 +/- 0.1 (jitter = 0.5)	8.147
60	images/sec: 181.8 +/- 0.1 (jitter = 0.6)	8.340
70	images/sec: 181.6 +/- 0.1 (jitter = 0.7)	8.120
80	images/sec: 181.3 +/- 0.2 (jitter = 0.9)	8.415
90	images/sec: 180.5 +/- 0.3 (jitter = 1.1)	8.278
100	images/sec: 179.5 +/- 0.4 (jitter = 1.4)	8.328
----------------------------------------------------------------
total images/sec: 179.44
----------------------------------------------------------------

For comparison, here is the same command run on a Tesla P100-PCIE-16GB (CUDA==9.2, cuDNN==7.1.4, and tf.__version__ == '1.11.0').

[...]
Done warm up
Step	Img/sec	total_loss
1	images/sec: 248.6 +/- 0.0 (jitter = 0.0)	8.325
10	images/sec: 248.6 +/- 0.2 (jitter = 0.6)	8.164
20	images/sec: 248.5 +/- 0.1 (jitter = 0.8)	8.251
30	images/sec: 248.4 +/- 0.1 (jitter = 0.7)	8.355
40	images/sec: 248.3 +/- 0.1 (jitter = 0.6)	8.417
50	images/sec: 248.2 +/- 0.1 (jitter = 0.6)	8.152
60	images/sec: 248.2 +/- 0.1 (jitter = 0.6)	8.353
70	images/sec: 248.1 +/- 0.1 (jitter = 0.7)	8.109
80	images/sec: 247.7 +/- 0.1 (jitter = 0.8)	8.405
90	images/sec: 247.5 +/- 0.1 (jitter = 0.9)	8.266
100	images/sec: 247.2 +/- 0.2 (jitter = 1.2)	8.344
----------------------------------------------------------------
total images/sec: 247.13
----------------------------------------------------------------

Bear in mind, I haven't done anything to try and optimize performance on the Vega FE. These are essentially "out-of-the-box" results.
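
For anyone who wants to compare runs without eyeballing the logs, here is a minimal sketch that pulls the "total images/sec" line out of a log and computes the ratio between two cards. The function name and the shortened log snippets are only illustrative; the numbers are the two totals posted above.

```python
import re

# Matches the summary line printed at the end of a run,
# e.g. "total images/sec: 179.44"
TOTAL_RE = re.compile(r"^total images/sec:\s*([0-9.]+)\s*$", re.MULTILINE)


def total_images_per_sec(log_text):
    """Return the 'total images/sec' value from a benchmark log, or None."""
    match = TOTAL_RE.search(log_text)
    return float(match.group(1)) if match else None


# Shortened copies of the two logs above (last step plus summary).
vega_fe_log = """\
100\timages/sec: 179.5 +/- 0.4 (jitter = 1.4)\t8.328
----------------------------------------------------------------
total images/sec: 179.44
----------------------------------------------------------------
"""
p100_log = """\
100\timages/sec: 247.2 +/- 0.2 (jitter = 1.2)\t8.344
----------------------------------------------------------------
total images/sec: 247.13
----------------------------------------------------------------
"""

vega_fe = total_images_per_sec(vega_fe_log)
p100 = total_images_per_sec(p100_log)
ratio = vega_fe / p100
print(f"Vega FE reaches {ratio:.1%} of the P100 throughput")
```

With these two totals the Vega FE lands at roughly 73% of the P100 for ResNet-50 at batch size 128.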

@Mandrewoid

@pricebenjamin when I try to run that same script (I cloned the repo), I get an import error:

ImportError: No module named 'tensorflow.python.data.experimental'

@pricebenjamin

pricebenjamin commented Nov 17, 2018

@Mandrewoid, if you haven't already, I'd recommend checking out the branch corresponding to your version of tensorflow, e.g.

cd /path/to/benchmarks
git checkout cnn_tf_v1.11_compatible
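
If it helps, the branch name can be derived from the installed TensorFlow version. This is only a sketch: the helper name is made up, and it assumes the cnn_tf_v<major>.<minor>_compatible naming pattern seen in this thread. A branch may not exist for every release, so check the repository first.

```python
def benchmarks_branch(tf_version):
    """Map a TensorFlow version string to the matching benchmarks branch name.

    Assumes the cnn_tf_v<major>.<minor>_compatible naming pattern; this does
    not check that the branch actually exists in the repository.
    """
    major, minor = tf_version.split(".")[:2]
    return f"cnn_tf_v{major}.{minor}_compatible"


print(benchmarks_branch("1.11.0"))
```

In practice you would pass tf.__version__ instead of a hard-coded string.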

@Mandrewoid

Nice, that seems to have done it. I did not realize mainline TF had already advanced to 1.12. Rookie mistake.

@kazulittlefox

kazulittlefox commented Nov 23, 2018

I have tried running the benchmarks in my environment (kernel 4.15, ROCm 1.9.2, TF 1.12, RX 580).

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=(32|64) \
--model=(alexnet|inceptionv3|vgg16|googlenet|resnet50)

The results are as follows (images/sec):

Model        batch=32  batch=64
AlexNet      397.27    518.03
InceptionV3  47.78     50.66
GoogLeNet    239.28    256.05
ResNet50     86.81     98.57

In my environment, VGG16 did not run well.

@fshi98

fshi98 commented Nov 30, 2018

I have tested with a Vega 64, Ubuntu 18.04, ROCm 1.9.2, TF 1.12:

1. resnet50: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
   1080ti: 212 images/sec (278 fp16)
   vega64: 191 images/sec (190 fp16)
2. resnet101: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet101
   1080ti: 121.14 images/sec (168 fp16)
   vega64: 101.15 images/sec (93 fp16); with fp16, --batch_size can be 64, while with fp32 a batch size of 64 crashes
3. inception3: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=inception3
   1080ti: 140.08 images/sec (166 fp16)
   vega64: 99.02 images/sec (50 fp16)
4. mobilenet: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=mobilenet
   1080ti: 2865 images/sec
   vega64: 462 images/sec

The NVIDIA GTX 1080 Ti was tested on another machine with CUDA 10 and Ubuntu 18.04.

Two things don't add up:

  1. For mobilenet, the 1080ti result doesn't make sense.
  2. I also tested with --use_fp16, which gives a fair amount of speedup on the 1080ti. On the vega64, however, every test ends up slower with --use_fp16. This is especially true for inception3.

Vega 64 supports native half precision, and fp16 should be a good selling point for AMD Vega, so how is it slower with fp16? I guess this is probably due to software support, especially ROCm. Can anyone else please test with --use_fp16 and see whether they get similar results?
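
To make the fp16 gap concrete, here is a small sketch that turns the figures posted above into fp16/fp32 ratios. The numbers are taken as posted (mobilenet is left out because no fp16 figure was given), and the dictionary layout and function name are only for illustration.

```python
# (fp32, fp16) images/sec, as posted above.
results = {
    "resnet50":   {"1080ti": (212.0, 278.0),  "vega64": (191.0, 190.0)},
    "resnet101":  {"1080ti": (121.14, 168.0), "vega64": (101.15, 93.0)},
    "inception3": {"1080ti": (140.08, 166.0), "vega64": (99.02, 50.0)},
}


def fp16_speedup(fp32, fp16):
    """Ratio of fp16 to fp32 throughput; values below 1.0 mean fp16 is slower."""
    return fp16 / fp32


for model, gpus in results.items():
    for gpu, (fp32, fp16) in gpus.items():
        print(f"{model:<11} {gpu:<7} fp16/fp32 = {fp16_speedup(fp32, fp16):.2f}x")
```

Every 1080ti ratio comes out above 1.0 and every vega64 ratio below 1.0, which is the mismatch described above.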

@kazulittlefox my Vega runs smoothly with vgg16 at 105 images/sec.

@Mandrewoid

@fshi98 that might be because of
#143 (comment)

@fshi98

fshi98 commented Dec 1, 2018

@Mandrewoid Thanks. That may be the reason. However, my rocBLAS version is 0.14.3.0,
and I ran //tensorflow/python/kernel_tests:batch_matmul_op_test, which passed all 47 tests in 10.653s, as in #143.
I also ran the test from ROCm/rocBLAS#340, and it passed.

This may not be the same bug as #143, but there may be some performance issues.

@pricebenjamin

@sebpuetz Would you be willing to post some numbers for the Radeon VII, including fp16 performance? I have yet to find any cloud providers with these cards. Trying to get some info for #288.

@sebpuetz

#288
Radeon VII
rocm==2.1.96 installed through apt
tensorflow==1.12 installed through pip
no further tuning

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 190.3 +/- 0.0 (jitter = 0.0)	8.217
10	images/sec: 195.7 +/- 0.9 (jitter = 3.1)	8.123
20	images/sec: 196.4 +/- 0.5 (jitter = 1.8)	8.231
30	images/sec: 196.8 +/- 0.4 (jitter = 1.1)	8.268
40	images/sec: 197.1 +/- 0.3 (jitter = 0.9)	8.355
50	images/sec: 197.2 +/- 0.2 (jitter = 0.8)	8.013
60	images/sec: 197.3 +/- 0.2 (jitter = 0.7)	8.263
70	images/sec: 196.8 +/- 0.3 (jitter = 1.1)	8.304
80	images/sec: 196.9 +/- 0.2 (jitter = 1.1)	8.228
90	images/sec: 196.9 +/- 0.2 (jitter = 0.9)	8.283
100	images/sec: 197.0 +/- 0.2 (jitter = 0.8)	8.271
----------------------------------------------------------------
total images/sec: 196.98
----------------------------------------------------------------

FP16:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50  --use_fp16
Step	Img/sec	total_loss
1	images/sec: 262.9 +/- 0.0 (jitter = 0.0)	8.162
10	images/sec: 261.9 +/- 0.6 (jitter = 0.7)	8.211
20	images/sec: 260.4 +/- 0.6 (jitter = 2.6)	8.375
30	images/sec: 260.6 +/- 0.5 (jitter = 2.6)	8.264
40	images/sec: 259.6 +/- 0.6 (jitter = 3.1)	8.116
50	images/sec: 259.6 +/- 0.5 (jitter = 3.1)	8.169
60	images/sec: 259.9 +/- 0.5 (jitter = 2.6)	8.325
70	images/sec: 259.3 +/- 0.5 (jitter = 3.5)	8.374
80	images/sec: 259.4 +/- 0.4 (jitter = 3.4)	8.041
90	images/sec: 259.3 +/- 0.4 (jitter = 3.6)	8.298
100	images/sec: 259.4 +/- 0.3 (jitter = 3.5)	8.376
----------------------------------------------------------------
total images/sec: 259.29
----------------------------------------------------------------

This one made the GPU sound like a jet engine:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 216.3 +/- 0.0 (jitter = 0.0)	8.219
10	images/sec: 215.9 +/- 0.3 (jitter = 0.3)	8.289
20	images/sec: 216.0 +/- 0.2 (jitter = 0.3)	8.064
30	images/sec: 215.9 +/- 0.1 (jitter = 0.3)	8.310
40	images/sec: 215.9 +/- 0.1 (jitter = 0.3)	8.197
50	images/sec: 215.9 +/- 0.1 (jitter = 0.3)	8.277
60	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.162
70	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.159
80	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.139
90	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.196
100	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.163
----------------------------------------------------------------
total images/sec: 215.72
----------------------------------------------------------------

FP16:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step	Img/sec	total_loss
1	images/sec: 288.2 +/- 0.0 (jitter = 0.0)	8.209
10	images/sec: 283.8 +/- 1.1 (jitter = 2.7)	8.189
20	images/sec: 284.0 +/- 0.9 (jitter = 4.6)	8.316
30	images/sec: 284.9 +/- 0.7 (jitter = 4.5)	8.195
40	images/sec: 284.5 +/- 0.6 (jitter = 4.0)	8.180
50	images/sec: 284.3 +/- 0.5 (jitter = 3.7)	8.402
60	images/sec: 285.0 +/- 0.5 (jitter = 4.8)	8.271
70	images/sec: 285.4 +/- 0.4 (jitter = 3.7)	8.134
80	images/sec: 285.7 +/- 0.4 (jitter = 2.7)	8.299
90	images/sec: 286.0 +/- 0.4 (jitter = 1.5)	8.349
100	images/sec: 286.2 +/- 0.3 (jitter = 1.4)	8.213
----------------------------------------------------------------
total images/sec: 286.17
----------------------------------------------------------------

@sunway513

@sebpuetz

sebpuetz commented Feb 18, 2019

Improvements across the board with TF_ROCM_FUSION_ENABLE=1. The temperature displayed in rocm-smi went above 90°C in all tests. The rocm-smi output didn't include clocks, so I can't tell whether any thermal throttling was happening.

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 208.4 +/- 0.0 (jitter = 0.0)	8.217
10	images/sec: 207.6 +/- 0.5 (jitter = 0.5)	8.124
20	images/sec: 207.7 +/- 0.3 (jitter = 0.5)	8.235
30	images/sec: 207.3 +/- 0.4 (jitter = 0.4)	8.268
40	images/sec: 207.2 +/- 0.4 (jitter = 0.4)	8.357
50	images/sec: 207.2 +/- 0.4 (jitter = 0.4)	8.012
60	images/sec: 207.2 +/- 0.3 (jitter = 0.4)	8.248
70	images/sec: 207.1 +/- 0.3 (jitter = 0.4)	8.305
80	images/sec: 207.0 +/- 0.3 (jitter = 0.5)	8.223
90	images/sec: 205.7 +/- 0.9 (jitter = 0.5)	8.322
100	images/sec: 205.7 +/- 0.8 (jitter = 0.5)	8.268
----------------------------------------------------------------
total images/sec: 205.65
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16
Step	Img/sec	total_loss
1	images/sec: 273.0 +/- 0.0 (jitter = 0.0)	8.171
10	images/sec: 272.6 +/- 0.9 (jitter = 1.0)	8.223
20	images/sec: 271.5 +/- 1.1 (jitter = 0.9)	8.375
30	images/sec: 272.0 +/- 0.8 (jitter = 0.9)	8.282
40	images/sec: 272.1 +/- 0.6 (jitter = 0.9)	8.122
50	images/sec: 272.1 +/- 0.6 (jitter = 0.8)	8.144
60	images/sec: 272.0 +/- 0.5 (jitter = 0.8)	8.333
70	images/sec: 271.5 +/- 0.5 (jitter = 1.0)	8.357
80	images/sec: 271.2 +/- 0.5 (jitter = 1.3)	8.034
90	images/sec: 271.2 +/- 0.4 (jitter = 1.3)	8.289
100	images/sec: 270.9 +/- 0.4 (jitter = 1.5)	8.361
----------------------------------------------------------------
total images/sec: 270.81
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 227.7 +/- 0.0 (jitter = 0.0)	8.221
10	images/sec: 225.6 +/- 0.5 (jitter = 2.2)	8.289
20	images/sec: 225.5 +/- 0.4 (jitter = 1.9)	8.068
30	images/sec: 225.7 +/- 0.3 (jitter = 1.8)	8.304
40	images/sec: 225.4 +/- 0.5 (jitter = 1.2)	8.183
50	images/sec: 225.5 +/- 0.4 (jitter = 1.0)	8.261
60	images/sec: 225.6 +/- 0.4 (jitter = 1.1)	8.203
70	images/sec: 225.6 +/- 0.3 (jitter = 1.1)	8.165
80	images/sec: 225.6 +/- 0.3 (jitter = 1.0)	8.168
90	images/sec: 225.7 +/- 0.3 (jitter = 1.0)	8.196
100	images/sec: 225.6 +/- 0.2 (jitter = 1.1)	8.138
----------------------------------------------------------------
total images/sec: 225.62
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step	Img/sec	total_loss
1	images/sec: 302.0 +/- 0.0 (jitter = 0.0)	8.213
10	images/sec: 300.2 +/- 0.5 (jitter = 1.5)	8.181
20	images/sec: 298.7 +/- 0.8 (jitter = 2.5)	8.324
30	images/sec: 297.7 +/- 0.8 (jitter = 2.2)	8.197
40	images/sec: 297.7 +/- 0.6 (jitter = 3.0)	8.173
50	images/sec: 297.9 +/- 0.6 (jitter = 3.0)	8.400
60	images/sec: 297.9 +/- 0.5 (jitter = 3.0)	8.267
70	images/sec: 298.4 +/- 0.5 (jitter = 2.8)	8.140
80	images/sec: 298.6 +/- 0.4 (jitter = 2.7)	8.283
90	images/sec: 298.6 +/- 0.4 (jitter = 2.8)	8.337
100	images/sec: 298.7 +/- 0.4 (jitter = 2.6)	8.208
----------------------------------------------------------------
total images/sec: 298.60
----------------------------------------------------------------
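
To put a number on "improvements across the board", here is a quick sketch that compares the Radeon VII ResNet-50 totals posted in this thread without and with TF_ROCM_FUSION_ENABLE=1. The labels and the helper name are only illustrative.

```python
# total images/sec for ResNet-50 on the Radeon VII:
# (without fusion, with TF_ROCM_FUSION_ENABLE=1)
runs = {
    "batch 64, fp32":  (196.98, 205.65),
    "batch 64, fp16":  (259.29, 270.81),
    "batch 128, fp32": (215.72, 225.62),
    "batch 128, fp16": (286.17, 298.60),
}


def percent_gain(before, after):
    """Relative throughput change in percent."""
    return (after - before) / before * 100.0


gains = {name: percent_gain(before, after) for name, (before, after) in runs.items()}
for name, gain in gains.items():
    print(f"{name:<16} +{gain:.1f}%")
```

All four configurations gain between 4% and 5%.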

@sunway513

Hi @sebpuetz, thanks for the update!
However, the performance numbers don't seem right.
Can you provide the VBIOS version of your board? The following command should do:
/opt/rocm/bin/rocm-smi -v

@sebpuetz

/opt/rocm/bin/rocm-smi -v 
GPU[0] 		: VBIOS version: 113-D3600200-105

@WrightChen

Radeon RX Vega 64
memoryClockRate (GHz) 1.63
Total memory: 7.98GiB
Free memory: 7.73GiB
rocm==2.1.96 installed through apt
tensorflow==1.12 installed through pip

For some models the option 'TF_ROCM_FUSION_ENABLE=1' doesn't change much, so I'm not giving the FUSION = 1 results for them. Due to lack of memory, some models can't run at batch_size=128.

                            ResNet50  AlexNet  Inception v3  VGG16   GoogLeNet  ResNet152
batch_size=512              /         1573.01  /             /       /          /
batch_size=256              /         1420.65  /             /       /          /
batch_size=128              /         1345.73  /             /       498.73     /
batch_size=64               190.58    1151.98  103.82        101.95  474.07     /
batch_size=32               171.70    971.85   98.50         91.80   424.32     68.71
batch_size=128; FUSION = 1  /         /        /             /       /          /
batch_size=64; FUSION = 1   208.78    /        109.66        /       /          /
batch_size=32; FUSION = 1   187.76    /        105.20        /       /          75.81

@sunway513

Hi @sebpuetz, could you try to refresh your performance numbers using our official Docker image?
If you haven't set up Docker yet, the following script should do:
curl -sSL https://get.docker.com/ | sh

To run the benchmarks inside the Docker image:

alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx -v /data/imagenet/tf:/imagenet'
drun rocm/tensorflow:rocm2.1-tf1.12-python3
cd ~/benchmarks/scripts/tf_cnn_benchmarks
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

Thanks for your attention, and looking forward to your updates :-)

@jimdowling

jimdowling commented Feb 21, 2019

6-core Intel i7 8700 with 16GB ram, and 400GB SSD disk.
Radeon VII
rocm==2.1.96 installed through apt
tensorflow==1.12 installed through pip
no further tuning

TC_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step Img/sec total_loss
1 images/sec: 250.0 +/- 0.0 (jitter = 0.0) 8.348
10 images/sec: 248.0 +/- 1.4 (jitter = 0.7) 8.144
20 images/sec: 248.7 +/- 0.8 (jitter = 0.4) 8.440
30 images/sec: 248.8 +/- 0.6 (jitter = 0.4) 8.140
40 images/sec: 248.7 +/- 0.6 (jitter = 0.4) 8.474
50 images/sec: 248.5 +/- 0.5 (jitter = 0.4) 8.322
60 images/sec: 248.5 +/- 0.5 (jitter = 0.5) 8.317
70 images/sec: 248.5 +/- 0.4 (jitter = 0.6) 8.010
80 images/sec: 248.4 +/- 0.4 (jitter = 0.6) 8.272
90 images/sec: 248.5 +/- 0.4 (jitter = 0.6) 8.289
100 images/sec: 248.4 +/- 0.3 (jitter = 0.6) 8.108

total images/sec: 248.34

TC_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step Img/sec total_loss
1 images/sec: 265.1 +/- 0.0 (jitter = 0.0) 8.324
10 images/sec: 264.3 +/- 0.5 (jitter = 0.3) 8.168
20 images/sec: 264.5 +/- 0.3 (jitter = 0.2) 8.261
30 images/sec: 264.4 +/- 0.3 (jitter = 0.3) 8.377
40 images/sec: 264.2 +/- 0.2 (jitter = 0.4) 8.408
50 images/sec: 264.1 +/- 0.2 (jitter = 0.5) 8.160
60 images/sec: 263.9 +/- 0.2 (jitter = 0.6) 8.341
70 images/sec: 263.8 +/- 0.2 (jitter = 0.6) 8.107
80 images/sec: 263.8 +/- 0.2 (jitter = 0.8) 8.404
90 images/sec: 263.8 +/- 0.2 (jitter = 0.7) 8.296
100 images/sec: 263.7 +/- 0.2 (jitter = 0.6) 8.348

total images/sec: 263.65

With a batch size of 256, I get out-of-memory errors.
Funnily enough, a batch size of 155 works, but is slower.

TC_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=155 --model=resnet50

Step Img/sec total_loss
1 images/sec: 195.3 +/- 0.0 (jitter = 0.0) 8.394
10 images/sec: 194.6 +/- 0.7 (jitter = 0.6) 8.313
20 images/sec: 194.5 +/- 0.5 (jitter = 0.6) 8.154
30 images/sec: 194.4 +/- 0.3 (jitter = 0.7) 8.249
40 images/sec: 194.5 +/- 0.3 (jitter = 0.8) 8.165
50 images/sec: 194.4 +/- 0.2 (jitter = 1.0) 8.292
60 images/sec: 194.3 +/- 0.2 (jitter = 1.0) 8.340
70 images/sec: 194.3 +/- 0.2 (jitter = 0.9) 8.268
80 images/sec: 194.2 +/- 0.2 (jitter = 0.8) 8.227
90 images/sec: 194.2 +/- 0.2 (jitter = 0.8) 8.257
100 images/sec: 194.1 +/- 0.2 (jitter = 0.9) 8.183

total images/sec: 194.04

@jimdowling

Leaving out TC_ROCM_FUSION_ENABLE does not make any difference.
/opt/rocm/bin/rocm-smi -v
VBIOS version: 113-D3600200-105

@jimdowling

According to this blog, https://www.pugetsystems.com/labs/hpc/NVIDIA-RTX-2080-Ti-vs-2080-vs-1080-Ti-vs-Titan-V-TensorFlow-Performance-with-CUDA-10-0-1247/, the 2080Ti gets 280 images/sec and the 1080Ti gets 207 images/sec for FP32 training.

@jimdowling

jimdowling commented Feb 21, 2019

One more:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step Img/sec total_loss
1 images/sec: 377.7 +/- 0.0 (jitter = 0.0) 8.246
10 images/sec: 375.9 +/- 2.2 (jitter = 0.7) 8.261
20 images/sec: 377.9 +/- 1.2 (jitter = 0.9) 8.279
30 images/sec: 378.3 +/- 0.9 (jitter = 0.9) 8.365
40 images/sec: 378.2 +/- 0.7 (jitter = 0.5) 8.237
50 images/sec: 378.3 +/- 0.6 (jitter = 0.4) 8.295
60 images/sec: 378.4 +/- 0.5 (jitter = 0.4) 8.203
70 images/sec: 378.4 +/- 0.5 (jitter = 0.5) 8.129
80 images/sec: 377.9 +/- 0.6 (jitter = 0.6) 8.264
90 images/sec: 378.0 +/- 0.5 (jitter = 0.8) 8.163
100 images/sec: 377.9 +/- 0.5 (jitter = 0.8) 8.239

total images/sec: 377.79

@Sumenia

Sumenia commented Feb 21, 2019

@jimdowling that's some impressive perf !

@sebpuetz

@jimdowling these numbers seem substantially higher than the ones I got, what OS and kernel are you on?

@sebpuetz

Hi,
I executed the benchmarks in the docker container:

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 229.7 +/- 0.0 (jitter = 0.0)	8.221
10	images/sec: 225.4 +/- 0.8 (jitter = 2.7)	8.289
20	images/sec: 225.9 +/- 0.5 (jitter = 3.6)	8.054
30	images/sec: 226.6 +/- 0.4 (jitter = 2.1)	8.313
40	images/sec: 226.9 +/- 0.3 (jitter = 0.8)	8.187
50	images/sec: 227.2 +/- 0.3 (jitter = 0.7)	8.240
60	images/sec: 227.3 +/- 0.2 (jitter = 0.5)	8.192
70	images/sec: 227.4 +/- 0.2 (jitter = 0.5)	8.143
80	images/sec: 227.6 +/- 0.2 (jitter = 0.5)	8.150
90	images/sec: 227.6 +/- 0.2 (jitter = 0.5)	8.217
100	images/sec: 227.7 +/- 0.2 (jitter = 0.5)	8.163
----------------------------------------------------------------
total images/sec: 227.66
----------------------------------------------------------------

and

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step	Img/sec	total_loss
1	images/sec: 300.8 +/- 0.0 (jitter = 0.0)	8.205
10	images/sec: 300.3 +/- 0.4 (jitter = 0.2)	8.170
20	images/sec: 300.3 +/- 0.3 (jitter = 0.5)	8.317
30	images/sec: 300.5 +/- 0.2 (jitter = 0.6)	8.201
40	images/sec: 300.6 +/- 0.2 (jitter = 0.5)	8.176
50	images/sec: 300.5 +/- 0.2 (jitter = 0.5)	8.398
60	images/sec: 300.3 +/- 0.2 (jitter = 0.5)	8.268
70	images/sec: 300.3 +/- 0.2 (jitter = 0.6)	8.140
80	images/sec: 300.4 +/- 0.2 (jitter = 0.6)	8.279
90	images/sec: 300.4 +/- 0.2 (jitter = 0.6)	8.328
100	images/sec: 300.3 +/- 0.2 (jitter = 0.6)	8.214
----------------------------------------------------------------
total images/sec: 300.29
----------------------------------------------------------------

@sunway513 these numbers are still pretty far away from what @jimdowling got, do you see a reason for this to happen?

@jimdowling

jimdowling commented Feb 21, 2019

Ubuntu 18.04. Python 2.7. Kernel is 4.15.
I was not running Docker - bare metal.

@sunway513

Hi @jimdowling, thanks for posting! However, there seems to be a typo in your script, so TF fusion is not actually enabled there. Could you try the following commands again?
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
If fusion is enabled, you should see the following message at run time:
2019-02-21 13:41:32.304325: I tensorflow/core/graph/gpu_fusion_pass.cc:454] ROCm Fusion is enabled.
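
A misspelled environment variable fails silently, so a quick check can save a rerun. The sketch below has two hypothetical helpers: one looks for near-miss names such as TC_ROCM_FUSION_ENABLE in an environment mapping, the other looks for the log message quoted above. It is not part of tf_cnn_benchmarks.

```python
FUSION_VAR = "TF_ROCM_FUSION_ENABLE"
FUSION_MARKER = "ROCm Fusion is enabled"


def misspelled_fusion_vars(env):
    """Return names that end in ROCM_FUSION_ENABLE but are not the real variable."""
    return sorted(
        name for name in env
        if name.endswith("ROCM_FUSION_ENABLE") and name != FUSION_VAR
    )


def fusion_enabled_in_log(log_text):
    """True if the run log contains the message printed when fusion is on."""
    return FUSION_MARKER in log_text


env = {"TC_ROCM_FUSION_ENABLE": "1"}
print(misspelled_fusion_vars(env))

log = (
    "2019-02-21 13:41:32.304325: I tensorflow/core/graph/gpu_fusion_pass.cc:454] "
    "ROCm Fusion is enabled."
)
print(fusion_enabled_in_log(log))
```

For a real run you would pass os.environ and the captured stderr of the benchmark instead of the sample values.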

@sunway513

Hi @sebpuetz, thanks for your updated numbers with Docker!
In a parallel issue, you mentioned your system is Linux Mint 19.1. Is that the same OS you ran the benchmark on? May I know the kernel and driver versions of your configuration? The following commands would help:
uname -a
apt --installed list | grep rock-dkms
I believe your user-space components are properly configured, as you got similar perf numbers using our official Docker image. The VBIOS version is good as well. We need to look into the kernel and firmware.

@sebpuetz

Hi @sunway513 ,
I ran all benchmarks on Linux Mint 19.1

uname -a
Linux seb-desktop 4.20.7-042007-generic #201902061234 SMP Wed Feb 6 17:36:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
apt list --installed | grep rock-dkms
rock-dkms/Ubuntu 16.04,now 2.1-96 all [installed]

Linux Mint 19.1 is based on Ubuntu 18.04, so this looks like a mismatch here?

@ghostplant

ghostplant commented Feb 21, 2019

@sunway513

I am also using an RX Vega 64, but I get warnings like this:

2019-02-21 14:26:23.732074: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter
warning: <unknown>:0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
2019-02-21 14:26:27.702436: I tensorflow/core/kernels/conv_grad_input_ops.cc:1023] running auto-tune for Backward-Data
2019-02-21 14:26:29.084753: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter
warning: <unknown>:0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
2019-02-21 14:26:33.818470: I tensorflow/core/kernels/conv_grad_input_ops.cc:1023] running auto-tune for Backward-Data
2019-02-21 14:26:33.839322: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter

And the performance is ~10% lower than in others' benchmarks:

Step    Img/sec total_loss
1       images/sec: 182.8 +/- 0.0 (jitter = 0.0)        8.217
10      images/sec: 187.2 +/- 0.9 (jitter = 0.7)        8.122
20      images/sec: 187.3 +/- 0.5 (jitter = 0.7)        8.229
30      images/sec: 187.1 +/- 0.4 (jitter = 0.9)        8.264
40      images/sec: 187.0 +/- 0.4 (jitter = 0.9)        8.347
50      images/sec: 187.0 +/- 0.3 (jitter = 1.1)        8.014
60      images/sec: 187.0 +/- 0.3 (jitter = 1.0)        8.264
70      images/sec: 186.8 +/- 0.3 (jitter = 1.1)        8.316
80      images/sec: 186.7 +/- 0.3 (jitter = 1.1)        8.231
90      images/sec: 186.7 +/- 0.2 (jitter = 1.2)        8.305

But I would expect about 207 images/sec.
Is the performance affected by the warning above, and how can I fix it?

@huanzhang12

Benchmark dump and recreation of @kazulittlefox's results. My ROCm 2.8.13 results were significantly lower (~65%) than kazulittlefox's 1.9.2 results so I was concerned I may have a hardware issue. Always compare apples to apples. My 1.9.3 results are consistent with kazulittlefox's.

@mwrnd The performance regression on gfx803 has been fixed in ROCm v3.3. The issue was that assembly kernels were all disabled on gfx803 (see ROCm/MIOpen#134).
On my RX570, resnet fp32 performance recovered from 50 images/sec (ROCm v3.1) to 95 images/sec (ROCm v3.3).
I have a script for patching miopen.db for gfx803 targets with 32 CUs (duplicating performance db from 36 CU devices). This improves performance by about 20 images/sec.

@mwrnd

mwrnd commented Apr 12, 2020

GPU: MSI Radeon RX 580 Armor 8GB OC
GPU BIOS: 015.050.002001 2017/11/13 21:41 according to Win10 Adrenalin 20.2.2
OS: Ubuntu 18.04.4
Kernel: 5.3.0-45-generic
rocm-dkms: 3.3.19 installed through apt
Python: 3.6.9
tensorflow-rocm: 2.1.1 installed through pip
tensorflow benchmarks: cnn_tf_v2.1_compatible
tensorflow_models: 2.1.0

Benchmark dump. Command-line permutations were generated with cmds.py and log output processed with parse.py.
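
For readers who want to run a similar sweep, here is a rough sketch of how command-line permutations could be generated. This is not the cmds.py mentioned above, just an illustration that uses flags appearing in the commands in this thread; the function name and default values are assumptions.

```python
import itertools


def benchmark_commands(models, batch_sizes, data_name="imagenet", num_batches=40):
    """Yield one tf_cnn_benchmarks command line per (model, batch size) pair."""
    for model, batch_size in itertools.product(models, batch_sizes):
        yield (
            "python3 tf_cnn_benchmarks.py --device=GPU --num_gpus=1 "
            f"--num_batches={num_batches} --batch_size={batch_size} "
            f"--model={model} --data_name={data_name}"
        )


commands = list(benchmark_commands(["resnet50", "vgg16"], [16, 32, 64]))
for cmd in commands:
    print(cmd)
```

Each printed line can be run as-is from the tf_cnn_benchmarks directory, with the output redirected to one log file per configuration.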

Comparing ROCm 3.3.19 resnet50 performance to previous versions, 3.3.19 has improved throughput and stability. It did not crash even once for me. However, I ran into the ROCmSoftwarePlatform/MIOpen#130 issue. MIOpen pre-computations take longer than most of these benchmarks. I would not mind giving up drive space for a MIOpen database/cache but prefer the raw throughput for faster training runs on large models/datasets.

             batchsize=16     32     032F   032XRF   64     064XR   128
ROCm1.9.3/TF1.12.0     78.6   92.0   91.9   59.4     100    112     60.7
ROCm2.8.13/TF1.14.2    51.4   57.9   65.8   67.7     61.0   70.0    64.1
ROCm3.3.19/TF2.1.1     77.6   92.6   65.3   65.7     106    105     71.9

imagenet dataset total images/sec:

python tf_cnn_benchmarks.py --device=GPU --num_gpus=1 --num_batches=40 \
--batch_size={16,32,64,128,256} --model={model} --data_name=imagenet

XR means XLA and ROCm Fusion were enabled
  export TF_XLA_FLAGS=--tf_xla_cpu_global_jit
  export TF_ROCM_FUSION_ENABLE=1
F means --use_fp16 option was used
na means the batch size was too large or benchmark would not run
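
If you want to load these tables into a script, the column labels can be decoded mechanically. The sketch below assumes every label is a batch size, optionally followed by XR and then F, as in the legend above; the function names are made up for illustration.

```python
import re

# Batch size digits, then optional "XR" (XLA + ROCm Fusion), then optional "F" (fp16).
LABEL_RE = re.compile(r"^(\d+)(XR)?(F)?$")


def parse_column(label):
    """Split a column label such as '032XRF' into its settings."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognised column label: {label!r}")
    digits, xr, f = match.groups()
    return {
        "batch_size": int(digits),
        "xla_and_fusion": xr is not None,
        "fp16": f is not None,
    }


def parse_cell(cell):
    """Convert a table cell to a float; 'na' becomes None."""
    return None if cell == "na" else float(cell)


for label in ["16", "032F", "032XRF", "064XR"]:
    print(label, parse_column(label))
```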

model/batchsize=16      32      032F    032XRF  64      064XR   128     256
trivial         4016    7942    1129    1126    13648   13895   21851   30133
alexnet         317     491     318     319     669     672     764     861
googlenet       207     241     155     162     277     279     288     290
inception3      49.8    56.7    37.4    37.5    58.4    58.6    34.7    na
inception4      22.6    25.4    17.6    18.2    17.6    na      na      na
lenet5          4541    7625    7536    7617    12178   12106   17257   22254
official_ncf    1373    2694    2767    2848    5440    5490    10812   21140
overfeat        95.7    145     81.6    82.1    198     na      233     250
resnet101       44.7    55.5    35.8    36.1    37.1    na      na      na
resnet101_v2    47.8    56.2    35.9    36.2    63.3    63.3    na      na
resnet152       33.5    38.9    24.2    24.5    25.2    na      na      na
resnet152_v2    33.9    39.4    24.5    24.7    25.4    na      na      na
resnet50        77.6    92.6    65.3    65.7    106     105     71.9    na
resnet50_v1.5   70.0    83.8    61.0    61.4    94.9    94.7    66.6    na
resnet50_v2     78.9    94.2    65.7    66.5    108     108     72.5    na
vgg11           70.4    87.7    44.4    44.6    100     100     103     47.2
vgg16           38.9    48.4    21.8    22.0    50.1    50.6    22.6    na
vgg19           33.3    39.4    17.5    17.6    41.4    41.4    18.1    na

cifar10 dataset total images/sec:

python tf_cnn_benchmarks.py --device=GPU --num_gpus=1 --num_batches=40 \
--batch_size={16,32,64,128,256} --model={model} --data_name=cifar10

model/batchsize=16      32      032F    032XRF  64      064XR   128     256
trivial         8651    15968   11978   11708   27686   29923   44124   89755
alexnet         3485    5403    472     480     7210    7159    513     10455
resnet110       na      na      725     727     na      na      1023    na
resnet110_v2    503     729     495     495     902     902     840     1032
resnet20        2372    3421    2483    2490    4364    4353    4246    5217
resnet20_v2     2330    3386    2483    2448    4242    4242    4178    5068
resnet32        1584    2301    1613    1618    2891    2876    2751    3399
resnet32_v2     1579    2268    1609    1614    2841    2836    2732    3335
resnet44        1180    1723    1197    1193    2153    2154    2033    2517
resnet44_v2     1172    1717    1195    1195    2134    2134    2028    2480
resnet56        944     1379    946     945     1720    1723    1616    2004
resnet56_v2     944     1375    952     952     1715    1711    1614    1981

DeepSpeech worked with a batch size of 16:

python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --num_batches=40 \
--model=deepspeech2 --data_name=librispeech
  [...]
  total images/sec: 0.56

CPU (Ryzen 5 3600X) total images/sec:

python3 tf_cnn_benchmarks.py --device=CPU  {--use_fp16} --num_batches=40 \
--batch_size={32,64,128} --model={model} --data_name=imagenet

F means --use_fp16 option was used

model/dataset batchsize=32      32F     64      64F     128     128F
trivial/cifar10         35401   2701    51733   2942    64842   3134
trivial/imagenet        2249    65.9    2821    66.4    4489    67.0
ncf/imagenet            347     326     701     558     1407    863

rocm-bandwidth-test
    RocmBandwidthTest Version: 2.3.11
    Device: 0,  AMD Ryzen 5 3600X 6-Core Processor
    Device: 1,  Ellesmere [Radeon RX 470/480/570/570X/580/580X],  2d:0.0

    Unidirectional copy peak bandwidth GB/s
    D/D       0           1
    0         N/A         11.325769
    1         11.244692   24.659122

    Bdirectional copy peak bandwidth GB/s
    D/D       0           1
    0         N/A         14.674771
    1         14.674771   N/A
python3 all_reduce_benchmark.py --variable_update=replicated
  Average time per step: 0.00011957406997680663
dkms status | grep amd
  amdgpu, 3.3-19, 5.3.0-45-generic, x86_64: installed
rocm-smi
  ========================ROCm System Management Interface==================
  ==========================================================================
  GPU  Temp   AvgPwr   SCLK     MCLK     Fan     Perf  PwrCap  VRAM%  GPU%
  0    31.0c  43.124W  1366Mhz  2000Mhz  26.67%  high  135.0W   98%   100%
  ==========================================================================
  ==============================End of ROCm SMI Log ========================

@ashaver

ashaver commented Apr 19, 2020

Is anyone else still fighting the AMD/ROCm drivers on a laptop? Even with the latest driver (Rev 20.10) and/or the latest ROCm, I have the following persistent bugs related to https://bugzilla.kernel.org/show_bug.cgi?id=203035:

  • First, this is not so much the fault of AMD as the fault of ACPI not detecting AC power in a laptop (in combination with AMD starting to drive power levels from real values e.g., torvalds/linux@600ae89).
  • I would love to fix the root problem, but have not had any success.
  • After rebooting, the laptop CPU thinks it is on battery, so it throttles each core to about 550 MHz (instead of the base 1500 MHz). This hamstrings basically everything: it doesn't matter that I have 8 cores and 16 threads when each one runs at 386-class clock speeds. The workaround for the CPU is to unplug the power and plug it back in.
  • Using amdgpu-utils (https://github.com/Ricks-Lab/amdgpu-utils/) seems to allow setting higher clock frequencies. In contrast, I cannot do anything with rocm-smi (the changes don't seem to stick).
  • Stock laptop: Acer Predator Helios 500 PH517-61-R0GX Gaming Laptop, AMD Ryzen 7 2700 Desktop Processor, AMD Radeon RX Vega 56
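
One way to confirm the throttling described in the list above is to read the per-core clocks from /proc/cpuinfo. This is a minimal sketch, assuming a Linux x86 system whose /proc/cpuinfo exposes `cpu MHz` lines; the 1000 MHz threshold and the function names are illustrative and not from the original post.

```python
import re

def parse_cpu_mhz(cpuinfo_text):
    """Return the per-core clock speeds (MHz) found in /proc/cpuinfo-style text."""
    return [float(v) for v in
            re.findall(r"^cpu MHz\s*:\s*([\d.]+)", cpuinfo_text, flags=re.M)]

def throttled_cores(mhz_values, threshold_mhz=1000.0):
    """Return indices of cores running below the (hypothetical) threshold."""
    return [i for i, mhz in enumerate(mhz_values) if mhz < threshold_mhz]

# Example with two cores, one stuck near the 550 MHz mentioned above.
sample = "processor\t: 0\ncpu MHz\t\t: 550.000\nprocessor\t: 1\ncpu MHz\t\t: 1500.000\n"
print(throttled_cores(parse_cpu_mhz(sample)))
```

On a real machine, pass the contents of /proc/cpuinfo instead of `sample`, and read it while a benchmark is running: idle cores clock down on their own, so a low reading at idle does not by itself indicate the ACPI problem.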

Specs and results:

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 131.4 +/- 0.0 (jitter = 0.0)	8.458
10	images/sec: 130.0 +/- 0.9 (jitter = 2.9)	7.997
20	images/sec: 129.1 +/- 0.6 (jitter = 2.2)	8.260
30	images/sec: 128.6 +/- 0.5 (jitter = 2.0)	8.338
40	images/sec: 128.4 +/- 0.4 (jitter = 2.3)	8.190
50	images/sec: 128.0 +/- 0.4 (jitter = 2.7)	7.742
60	images/sec: 128.2 +/- 0.4 (jitter = 2.4)	8.061
70	images/sec: 128.3 +/- 0.3 (jitter = 2.4)	inf
80	images/sec: 128.3 +/- 0.3 (jitter = 2.5)	inf
90	images/sec: 128.2 +/- 0.3 (jitter = 2.5)	inf
100	images/sec: 128.2 +/- 0.3 (jitter = 2.5)	inf
----------------------------------------------------------------
total images/sec: 128.13
----------------------------------------------------------------
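
Note that the run above reports an `inf` loss from step 70 onward, so it may be worth checking a log for non-finite losses before quoting its throughput. A minimal sketch, assuming the step-line layout shown above (step, `images/sec: X +/- Y (jitter = Z)`, loss); the regex and function name are illustrative and not part of tf_cnn_benchmarks:

```python
import math
import re

# One benchmark step line: step, mean images/sec, uncertainty, jitter, loss.
STEP_RE = re.compile(
    r"^(\d+)\s+images/sec:\s+([\d.]+)\s+\+/-\s+[\d.]+\s+\(jitter = [\d.]+\)\s+(\S+)\s*$"
)

def first_non_finite_step(log_text):
    """Return the first step whose loss is inf or nan, or None if all are finite."""
    for line in log_text.splitlines():
        match = STEP_RE.match(line.strip())
        if match and not math.isfinite(float(match.group(3))):
            return int(match.group(1))
    return None
```

Applied to the log above, this would flag the step at which the loss first becomes `inf`. It says nothing about the cause, only that the throughput figure comes from a run that stopped producing a finite loss.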

@sunway513

@qixiang109 , MIOpen released pre-compiled kernels in ROCm3.5 release, aiming to reduce the overheads on startup. For more details, you can refer to the following document:
https://github.com/ROCmSoftwarePlatform/MIOpen#installing-miopen-kernels-package

@papadako

papadako commented Jun 6, 2020

I guess the following numbers are a bit problematic. Any ideas? Could it be the kernel?

GPU: Radeon VII
Kernel: 5.7.0
rocm-dkms: from kernel
Python: 3.8.2
rocm: 3.5
tensorflow-rocm: 2.2 compiled from source
tensorflow benchmarks: master

python tf_cnn_benchmarks.py --model=resnet50 --batch_size=128

Step    Img/sec total_loss
1       images/sec: 95.9 +/- 0.0 (jitter = 0.0) 7.781
10      images/sec: 95.9 +/- 0.0 (jitter = 0.1) 7.740
20      images/sec: 95.9 +/- 0.0 (jitter = 0.1) 7.827
30      images/sec: 95.8 +/- 0.0 (jitter = 0.1) 7.965
40      images/sec: 95.8 +/- 0.0 (jitter = 0.1) 7.881
50      images/sec: 95.7 +/- 0.0 (jitter = 0.2) 7.795
60      images/sec: 95.7 +/- 0.0 (jitter = 0.1) 8.005
70      images/sec: 95.7 +/- 0.0 (jitter = 0.2) 7.863
80      images/sec: 95.7 +/- 0.0 (jitter = 0.2) 7.922
90      images/sec: 95.7 +/- 0.0 (jitter = 0.1) 7.740
100     images/sec: 95.7 +/- 0.0 (jitter = 0.1) 7.998
----------------------------------------------------------------
total images/sec: 95.66
----------------------------------------------------------------

@huanzhang12

@papadako Can you try to set MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0 and/or MIOPEN_DEBUG_CONV_GEMM=0 and see if it can improve performance?

MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0 MIOPEN_DEBUG_CONV_GEMM=0 python tf_cnn_benchmarks.py --model=resnet50 --batch_size=128
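
Since the suggestion is to try the two variables "and/or" each other, a small driver can run every combination and collect the totals. This is only a sketch: it assumes the benchmark prints a `total images/sec:` line as in the logs in this thread, it leaves a variable unset rather than guessing what its default value is, and the helper names are hypothetical.

```python
import itertools
import os
import re
import subprocess

TOTAL_RE = re.compile(r"total images/sec:\s*([\d.]+)")

def parse_total_images_per_sec(output):
    """Extract the 'total images/sec' figure from benchmark output, or None."""
    match = TOTAL_RE.search(output)
    return float(match.group(1)) if match else None

def disable_subsets(names):
    """Every subset of `names` set to "0"; the empty dict means library defaults."""
    return [
        {name: "0" for name in combo}
        for r in range(len(names) + 1)
        for combo in itertools.combinations(names, r)
    ]

def sweep(cmd, names):
    """Run `cmd` once per subset and return (env overrides, images/sec) pairs."""
    results = []
    for overrides in disable_subsets(names):
        proc = subprocess.run(cmd, env={**os.environ, **overrides},
                              capture_output=True, text=True)
        # Search both streams; which one carries the summary is not assumed here.
        text = proc.stdout + proc.stderr
        results.append((overrides, parse_total_images_per_sec(text)))
    return results
```

For the command above, this could be called as `sweep(["python", "tf_cnn_benchmarks.py", "--model=resnet50", "--batch_size=128"], ["MIOPEN_DEBUG_CONV_IMPLICIT_GEMM", "MIOPEN_DEBUG_CONV_GEMM"])`, which gives four runs: defaults, each variable disabled alone, and both disabled.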

@papadako

papadako commented Jun 8, 2020

> @papadako Can you try to set MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0 and/or MIOPEN_DEBUG_CONV_GEMM=0 and see if it can improve performance?
>
> MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0 MIOPEN_DEBUG_CONV_GEMM=0 python tf_cnn_benchmarks.py --model=resnet50 --batch_size=128

I get even worse results with the above settings:

Step    Img/sec total_loss
1       images/sec: 75.8 +/- 0.0 (jitter = 0.0) 7.781
10      images/sec: 75.6 +/- 0.0 (jitter = 0.1) 7.740
20      images/sec: 75.6 +/- 0.0 (jitter = 0.1) 7.826
30      images/sec: 75.5 +/- 0.0 (jitter = 0.1) 7.964
40      images/sec: 75.5 +/- 0.0 (jitter = 0.1) 7.880
50      images/sec: 75.5 +/- 0.0 (jitter = 0.1) 7.793
60      images/sec: 75.4 +/- 0.0 (jitter = 0.1) 8.007
70      images/sec: 75.4 +/- 0.0 (jitter = 0.1) 7.865
80      images/sec: 75.3 +/- 0.0 (jitter = 0.1) 7.928
90      images/sec: 75.2 +/- 0.0 (jitter = 0.2) 7.741
100     images/sec: 75.1 +/- 0.1 (jitter = 0.2) 7.998

I will try to use a rocm-dkms supported kernel (i.e., 5.4.0) and report back.

@witeko

witeko commented Jun 8, 2020

@papadako, @huanzhang12, I have the same (or a similar) performance issue. I use Vega 7nm, RHEL 8.2, DKMS drivers, ROCm 3.5, and TensorFlow 2.2.0 (2.1.0 works fine).

@logan-dunbar

Running inside a Singularity container (v3.5.2) on host Ubuntu 18.04.

GPU: Asus Radeon RX Vega 56 ROG Strix OC 8GB
Kernel: 5.4.0-37
Driver: amdgpu-pro 20.20 (Ubuntu would freeze sporadically with rock-dkms)
Python: 3.7.7 (deadsnakes)
rocm: 3.5.1 (apt)
tensorflow-rocm: 2.2 (PyPI)
tensorflow benchmarks: master (449e900)

python3.7 tf_cnn_benchmarks.py --model=resnet50 --batch_size=64

Step	Img/sec	total_loss
1	images/sec: 132.0 +/- 0.0 (jitter = 0.0)	7.608
10	images/sec: 131.7 +/- 0.4 (jitter = 0.7)	7.849
20	images/sec: 131.4 +/- 0.3 (jitter = 0.8)	8.013
30	images/sec: 131.5 +/- 0.2 (jitter = 0.8)	7.940
40	images/sec: 131.4 +/- 0.2 (jitter = 0.8)	8.136
50	images/sec: 131.2 +/- 0.2 (jitter = 1.1)	8.052
60	images/sec: 131.2 +/- 0.1 (jitter = 1.0)	7.782
70	images/sec: 131.1 +/- 0.1 (jitter = 1.1)	7.853
80	images/sec: 131.2 +/- 0.1 (jitter = 1.1)	8.012
90	images/sec: 131.1 +/- 0.1 (jitter = 1.1)	7.843
100	images/sec: 131.0 +/- 0.1 (jitter = 1.3)	8.088
----------------------------------------------------------------
total images/sec: 130.97
----------------------------------------------------------------

@webber26232

webber26232 commented Jul 5, 2020

Radeon VII
rocm==3.5 installed through apt
tensorflow==2.2 installed through pip

python3.7 tf_cnn_benchmarks.py --model=resnet50 --batch_size=128

Step	Img/sec	total_loss
1	images/sec: 183.8 +/- 0.0 (jitter = 0.0)	7.781
10	images/sec: 183.7 +/- 0.1 (jitter = 0.3)	7.740
20	images/sec: 183.5 +/- 0.1 (jitter = 0.3)	7.827
30	images/sec: 183.4 +/- 0.1 (jitter = 0.2)	7.964
40	images/sec: 183.3 +/- 0.1 (jitter = 0.4)	7.882
50	images/sec: 183.3 +/- 0.1 (jitter = 0.3)	7.791
60	images/sec: 183.2 +/- 0.1 (jitter = 0.4)	8.016
70	images/sec: 183.2 +/- 0.1 (jitter = 0.4)	7.870
80	images/sec: 183.1 +/- 0.1 (jitter = 0.4)	7.933
90	images/sec: 183.1 +/- 0.1 (jitter = 0.4)	7.739
100	images/sec: 183.1 +/- 0.0 (jitter = 0.4)	8.008
----------------------------------------------------------------
total images/sec: 183.10

This seems worse than the other Radeon VII results posted here. I also got overhead similar to that mentioned in qixiang109's post.

@nickdon2007

I have a similar issue: lower-than-expected performance. The memory bandwidth is also slow, and I don't know why.

CPU: AMD Ryzen 7 3700X
GPU: AMD Radeon RX Vega 56
OS: Ubuntu 18.04
Python: 3.6
rocm: 3 (apt)
tensorflow-rocm: 2.2 (PyPI)

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

Done warm up
Step    Img/sec total_loss
1   images/sec: 81.0 +/- 0.0 (jitter = 0.0) 7.765
10  images/sec: 80.7 +/- 0.1 (jitter = 0.2) 8.049
20  images/sec: 80.7 +/- 0.0 (jitter = 0.1) 7.808
30  images/sec: 80.7 +/- 0.0 (jitter = 0.1) 7.976
40  images/sec: 80.9 +/- 0.1 (jitter = 0.2) 7.591
50  images/sec: 81.2 +/- 0.1 (jitter = 0.3) 7.549
60  images/sec: 81.5 +/- 0.1 (jitter = 0.6) 7.819
70  images/sec: 81.7 +/- 0.1 (jitter = 1.1) 7.820
80  images/sec: 81.8 +/- 0.1 (jitter = 1.5) 7.847
90  images/sec: 82.0 +/- 0.1 (jitter = 0.8) 8.025
100 images/sec: 82.1 +/- 0.1 (jitter = 0.6) 8.029
----------------------------------------------------------------
total images/sec: 82.07
----------------------------------------------------------------

clinfo

Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 2.0 AMD-APP (3137.0)
  Platform Name:                 AMD Accelerated Parallel Processing
  Platform Vendor:               Advanced Micro Devices, Inc.
  Platform Extensions:               cl_khr_icd cl_amd_event_callback 


  Platform Name:                 AMD Accelerated Parallel Processing
Number of devices:               1
  Device Type:                   CL_DEVICE_TYPE_GPU
  Vendor ID:                     1002h
  Board name:                    Vega 10 XT [Radeon RX Vega 64]
  Device Topology:               PCI[ B#47, D#0, F#0 ]
  Max compute units:                 56
  Max work items dimensions:             3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               1024
  Max work group size:               256
  Preferred vector width char:           4
  Preferred vector width short:          2
  Preferred vector width int:            1
  Preferred vector width long:           1
  Preferred vector width float:          1
  Preferred vector width double:         1
  Native vector width char:          4
  Native vector width short:             2
  Native vector width int:           1
  Native vector width long:          1
  Native vector width float:             1
  Native vector width double:            1
  Max clock frequency:               1590Mhz
  Address bits:                  64
  Max memory allocation:             7287183769
  Image support:                 Yes
  Max number of images read arguments:       128
  Max number of images write arguments:      8
  Max image 2D width:                16384
  Max image 2D height:               16384
  Max image 3D width:                2048
  Max image 3D height:               2048
  Max image 3D depth:                2048
  Max samplers within kernel:            26751
  Max size of kernel argument:           1024
  Alignment (bits) of base address:      1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     Yes
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               64
  Cache size:                    16384
  Global memory size:                8573157376
  Constant buffer size:              7287183769
  Max number of constant args:           8
  Local memory type:                 Scratchpad
  Local memory size:                 65536
  Max pipe arguments:                16
  Max pipe active reservations:          16
  Max pipe packet size:              2992216473
  Max global variable size:          7287183769
  Max global variable preferred total size:  8573157376
  Max read/write image args:             64
  Max on device events:              1024
  Queue on device max size:          8388608
  Max on device queues:              1
  Queue on device preferred size:        262144
  SVM capabilities:              
    Coarse grain buffer:             Yes
    Fine grain buffer:               Yes
    Fine grain system:               No
    Atomics:                     No
  Preferred platform atomic alignment:       0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:      0
  Kernel Preferred work group size multiple:     64
  Error correction support:          0
  Unified memory for Host and Device:        0
  Profiling timer resolution:            1
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:                
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue on Host properties:              
    Out-of-Order:                No
    Profiling :                  Yes
  Queue on Device properties:                
    Out-of-Order:                Yes
    Profiling :                  Yes
  Platform ID:                   0x7fe56aa5fcf0
  Name:                      gfx900
  Vendor:                    Advanced Micro Devices, Inc.
  Device OpenCL C version:           OpenCL C 2.0 
  Driver version:                3137.0 (HSA1.1,LC)
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 2.0 
  Extensions:                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 

rocminfo

ROCk module is loaded
Able to open /dev/kfd read-write
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 3700X 8-Core Processor 
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 3700X 8-Core Processor 
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   0                                  
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    16436616(0xfacd88) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16436616(0xfacd88) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx900                             
  Uuid:                    GPU-02151e1bb9ee2144               
  Marketing Name:          Vega 10 XT [Radeon RX Vega 64]     
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26751(0x687f)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1590                               
  BDFID:                   12032                              
  Internal Node ID:        1                                  
  Compute Unit:            56                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx900          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***   

rocm-bandwidth-test

          RocmBandwidthTest Version: 2.3.11

          Launch Command is: rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)


          Device: 0,  AMD Ryzen 7 3700X 8-Core Processor
          Device: 1,  Vega 10 XT [Radeon RX Vega 64],  2f:0.0

          Inter-Device Access

          D/D       0         1         

          0         1         0         

          1         1         1         


          Inter-Device Numa Distance

          D/D       0         1         

          0         0         N/A       

          1         20        0         


          Unidirectional copy peak bandwidth GB/s

          D/D       0           1           

          0         N/A         9.295924    

          1         8.892247    72.654038   


          Bdirectional copy peak bandwidth GB/s

          D/D       0           1           

          0         N/A         17.103560   

          1         17.103560   N/A         
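
To judge whether the CPU/GPU copy figure is actually low, it can help to compare it against the theoretical peak of the PCIe link. A rough sketch, assuming a PCIe 3.0 x16 link (8 GT/s per lane, 128b/130b encoding); the post does not state the link generation or width, so both are assumptions, and the function name is illustrative:

```python
def pcie_peak_gb_per_s(gt_per_s, lanes, encoding_efficiency=128 / 130):
    """Theoretical one-direction PCIe rate in GB/s, before packet overhead."""
    return gt_per_s * lanes * encoding_efficiency / 8  # bits -> bytes

# Assumption: PCIe 3.0 x16. This is not stated in the post above.
peak = pcie_peak_gb_per_s(gt_per_s=8.0, lanes=16)

# Device 0 / device 1 entry of the unidirectional table above, in GB/s.
measured = 9.295924
fraction = measured / peak
print(f"theoretical peak: {peak:.2f} GB/s, measured: {fraction:.0%} of peak")
```

Real transfers never reach the theoretical figure because of packet and protocol overhead, so a result somewhat below peak is expected. If the card negotiated fewer lanes or a lower generation, the peak drops accordingly and the measured number may turn out to be normal.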

@sunway513

Hi @nickdon2007 @webber26232 , thanks for reporting your observations.
We've been looking into the performance drop reported for the TF2.2 release branch. The issue has been identified, and we'll try to provide the fixes in the next few weeks with the next ROCm release.
cc @ekuznetsov139 @deven-amd

@joket1999

joket1999 commented Sep 13, 2020

Ubuntu 20.04

Radeon VII
VBIOS version: 113-D3600200-106

rocm==3.7
tensorflow==2.3
benchmarks==cnn_tf_v2.1_compatible

python3 ./tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50

Step	Img/sec	total_loss
1	images/sec: 284.8 +/- 0.0 (jitter = 0.0)	7.608
10	images/sec: 284.0 +/- 0.3 (jitter = 0.7)	7.849
20	images/sec: 284.0 +/- 0.2 (jitter = 0.6)	8.013
30	images/sec: 284.0 +/- 0.1 (jitter = 0.7)	7.939
40	images/sec: 283.9 +/- 0.1 (jitter = 0.8)	8.137
50	images/sec: 283.8 +/- 0.2 (jitter = 0.8)	8.051
60	images/sec: 283.7 +/- 0.1 (jitter = 0.8)	7.781
70	images/sec: 283.7 +/- 0.1 (jitter = 0.8)	7.856
80	images/sec: 283.7 +/- 0.1 (jitter = 0.9)	8.012
90	images/sec: 283.7 +/- 0.1 (jitter = 0.8)	7.842
100	images/sec: 283.7 +/- 0.1 (jitter = 0.7)	8.090
----------------------------------------------------------------
total images/sec: 283.60
----------------------------------------------------------------
python3 ./tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16

Done warm up
Step	Img/sec	total_loss
1	images/sec: 391.8 +/- 0.0 (jitter = 0.0)	7.573
10	images/sec: 394.2 +/- 0.5 (jitter = 1.9)	7.848
20	images/sec: 394.6 +/- 0.3 (jitter = 1.4)	7.966
30	images/sec: 394.7 +/- 0.3 (jitter = 1.1)	7.907
40	images/sec: 394.1 +/- 0.3 (jitter = 1.7)	8.070
50	images/sec: 394.2 +/- 0.2 (jitter = 1.6)	8.047
60	images/sec: 394.3 +/- 0.2 (jitter = 1.6)	7.769
70	images/sec: 394.4 +/- 0.2 (jitter = 1.5)	7.859
80	images/sec: 394.2 +/- 0.2 (jitter = 1.6)	7.965
90	images/sec: 394.1 +/- 0.2 (jitter = 1.7)	7.822
100	images/sec: 394.1 +/- 0.2 (jitter = 1.7)	8.058
----------------------------------------------------------------
total images/sec: 393.89
----------------------------------------------------------------

python3 ./tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

Done warm up
Step	Img/sec	total_loss
1	images/sec: 292.8 +/- 0.0 (jitter = 0.0)	7.781
10	images/sec: 292.6 +/- 0.2 (jitter = 0.7)	7.740
20	images/sec: 292.3 +/- 0.1 (jitter = 0.6)	7.827
30	images/sec: 292.2 +/- 0.1 (jitter = 0.3)	7.963
40	images/sec: 292.0 +/- 0.1 (jitter = 0.4)	7.884
50	images/sec: 291.9 +/- 0.1 (jitter = 0.5)	7.792
60	images/sec: 291.8 +/- 0.1 (jitter = 0.5)	8.015
70	images/sec: 291.7 +/- 0.1 (jitter = 0.6)	7.868
80	images/sec: 291.6 +/- 0.1 (jitter = 0.6)	7.933
90	images/sec: 291.5 +/- 0.1 (jitter = 0.6)	7.746
100	images/sec: 291.4 +/- 0.1 (jitter = 0.7)	7.997
----------------------------------------------------------------
total images/sec: 291.38
----------------------------------------------------------------

python3 ./tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

Done warm up
Step	Img/sec	total_loss
1	images/sec: 426.1 +/- 0.0 (jitter = 0.0)	7.794
10	images/sec: 428.1 +/- 0.3 (jitter = 0.9)	7.737
20	images/sec: 427.7 +/- 0.3 (jitter = 0.9)	7.828
30	images/sec: 427.5 +/- 0.2 (jitter = 1.0)	7.960
40	images/sec: 427.2 +/- 0.2 (jitter = 1.3)	7.889
50	images/sec: 427.0 +/- 0.2 (jitter = 1.3)	7.788
60	images/sec: 427.0 +/- 0.1 (jitter = 1.2)	8.019
70	images/sec: 426.8 +/- 0.1 (jitter = 1.2)	7.869
80	images/sec: 426.7 +/- 0.1 (jitter = 1.1)	7.931
90	images/sec: 426.6 +/- 0.1 (jitter = 1.2)	7.731
100	images/sec: 426.4 +/- 0.1 (jitter = 1.2)	7.992
----------------------------------------------------------------
total images/sec: 426.36
----------------------------------------------------------------

@dcominottim

dcominottim commented Jan 15, 2021

Here are some RTX 3080 10GB results.

(Note: the low scores marked (UM) at higher batch sizes occur because CUDA Unified Memory and shared memory were used due to lack of VRAM.)

Ryzen 9 5950X
32GB 3200MHz RAM
Pop_OS! 20.04.1
NVIDIA 460 driver
tensorflow-gpu 2.4.0
NVIDIA 20.12-tf2-py3 Docker image

sudo docker run --gpus all --name tf-20.12 --shm-size=10g --ulimit memlock=-1 --ulimit stack=67108864 -it --rm -v $HOME/Projects/nvidia/tensorflow-gpu/benchmarks-master/scripts/tf_cnn_benchmarks:/projects nvcr.io/nvidia/tensorflow:20.12-tf2-py3

FP32            ResNet50   AlexNet  Inception v3  VGG16   GoogLeNet  ResNet152
batch_size=512  /          4715.95  /             /       /          /
batch_size=256  54.2 (UM)  4578.22  /             /       /          /
batch_size=128  62.8 (UM)  4237.48  52.8 (UM)     /       1016.12    /
batch_size=64   396.26     3373.96  278.23        245.71  906.01     /
batch_size=32   362.88     2467.48  260.47        238.11  802.6      150.18

FP16            ResNet50   AlexNet  Inception v3  VGG16   GoogLeNet  ResNet152
batch_size=512  /          6504.74  /             /       /          /
batch_size=256  /          5819.6   /             /       1790.52    /
batch_size=128  947.3      4919.44  635.26        355.78  1645.71    /
batch_size=64   900.25     3797.61  578.34        326.88  1498.69    384.89
batch_size=32   736.35     2512.88  517.68        295.81  1307.13    321.85
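
To make the FP16 gain easier to read off, the batch_size=64 rows of the two tables above can be turned into per-model speedup factors. A small sketch; the numbers are copied from the tables, while the dictionary layout and function name are just for illustration:

```python
# batch_size=64 rows from the RTX 3080 tables above.
FP32_BS64 = {"ResNet50": 396.26, "AlexNet": 3373.96, "Inception v3": 278.23,
             "VGG16": 245.71, "GoogLeNet": 906.01}
FP16_BS64 = {"ResNet50": 900.25, "AlexNet": 3797.61, "Inception v3": 578.34,
             "VGG16": 326.88, "GoogLeNet": 1498.69, "ResNet152": 384.89}

def fp16_speedup(fp32, fp16):
    """FP16 / FP32 throughput ratio for every model present in both rows."""
    return {model: fp16[model] / fp32[model] for model in fp32 if model in fp16}

speedups = fp16_speedup(FP32_BS64, FP16_BS64)
for model, ratio in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<14} {ratio:.2f}x")
```

ResNet152 is skipped because it has no FP32 result at this batch size. The (UM) entries should be left out of any such comparison, since they measure the Unified Memory fallback rather than the GPU itself.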

@EmilPi

EmilPi commented Feb 2, 2021

Any 6900 XT benchmarks?

@Daniel451

@EmilPi 6900 XT would be very interesting indeed

@qixiang109

qixiang109 commented Mar 12, 2021

@dcominottim Here are my GTX 1080 and Radeon VII results, in training examples per second:

image

@dcominottim

@Daniel451 @EmilPi @qixiang109 Unfortunately, without ROCm support for RDNA*, we can't test ROCm performance yet. However, I've managed to test a 6800 XT with tensorflow-directml (1.15.4, the latest version as of now) on Windows 10! That's at least a small ray of light for RDNA owners who are interested in ML. Here are the numbers:

Ryzen 9 5950X
32GB 3200MHz RAM
6800 XT
Windows 10 20H2 19042.867
AMD Adrenalin 21.3.1
Python 3.7.10
tensorflow-directml 1.15.4

FP32            ResNet50   AlexNet  Inception v3  VGG16   GoogLeNet  ResNet152
batch_size=128  63.2       590.1    52.6          29.6    244.0      /
batch_size=64   /          /        /             /       /          27.9

FP16            ResNet50   AlexNet  Inception v3  VGG16   GoogLeNet  ResNet152
batch_size=128  52         528.2    41.0          23.9    174.0      23.1

@plinnie

plinnie commented May 20, 2021

I have an MI50 and a V100 available which I can use for benchmarking. What would be the best benchmarks to run? The original benchmarks seem outdated.

@cjm-sfw

cjm-sfw commented Nov 7, 2021

Has anyone tried the benchmark on ROCm 4.5? It seems ROCm now supports gfx1030 (6800 XT/6900 XT)?

Someone has tested a 6700 XT:

Link: https://www.zhihu.com/question/469674526/answer/2189926640
From: Zhihu

Linux-5.10, nvidia driver 460.91.03, cuda 11.2, pytorch-1.9.1, Tesla A100:
running benchmark for framework pytorch
cuda version= 11.2
cudnn version= 8100
pytorch's vgg16 eval at fp32: 11.3ms avg
pytorch's vgg16 train at fp32: 46.5ms avg
pytorch's resnet152 eval at fp32: 44.4ms avg
pytorch's resnet152 train at fp32: 157.3ms avg
pytorch's densenet161 eval at fp32: 45.8ms avg
pytorch's densenet161 train at fp32: 154.8ms avg
pytorch's vgg16 eval at fp16: 7.8ms avg
pytorch's vgg16 train at fp16: 41.9ms avg
pytorch's resnet152 eval at fp16: 48.2ms avg
pytorch's resnet152 train at fp16: 163.0ms avg
pytorch's densenet161 eval at fp16: 48.1ms avg
pytorch's densenet161 train at fp16: 174.4ms avg

Linux-5.14.5, ROCm-4.3.0, pytorch-1.9.1, Radeon 6700XT:
running benchmark for framework pytorch
cuda version= None
cudnn version= 2012000
pytorch's vgg16 eval at fp32: 67.7ms avg
pytorch's vgg16 train at fp32: 194.5ms avg
pytorch's resnet152 eval at fp32: 57.8ms avg
pytorch's resnet152 train at fp32: 226.2ms avg
pytorch's densenet161 eval at fp32: 63.9ms avg
pytorch's densenet161 train at fp32: 228.0ms avg
pytorch's vgg16 eval at fp16: 25.8ms avg
pytorch's vgg16 train at fp16: 118.2ms avg
pytorch's resnet152 eval at fp16: 52.4ms avg
pytorch's resnet152 train at fp16: 183.4ms avg
pytorch's densenet161 eval at fp16: 54.5ms avg
pytorch's densenet161 train at fp16: 195.7ms avg
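
The two logs above share one line format, so they can be parsed and compared mechanically instead of by eye. A minimal sketch, assuming each result line looks like `pytorch's <model> <eval|train> at <precision>: <N>ms avg`; the regex and function names are illustrative and not part of the benchmark script that produced the logs:

```python
import re

LINE_RE = re.compile(r"pytorch's (\w+) (eval|train) at (fp\d+): ([\d.]+)ms avg")

def parse_latencies(log_text):
    """Map (model, phase, precision) -> average latency in ms."""
    return {(m.group(1), m.group(2), m.group(3)): float(m.group(4))
            for m in LINE_RE.finditer(log_text)}

def slowdown(candidate, baseline):
    """candidate / baseline latency ratio for every key present in both logs."""
    return {key: candidate[key] / baseline[key]
            for key in candidate if key in baseline}

# Two lines copied from each log above.
a100 = parse_latencies("pytorch's vgg16 train at fp32: 46.5ms avg\n"
                       "pytorch's resnet152 train at fp16: 163.0ms avg\n")
rx6700xt = parse_latencies("pytorch's vgg16 train at fp32: 194.5ms avg\n"
                           "pytorch's resnet152 train at fp16: 183.4ms avg\n")
for key, ratio in slowdown(rx6700xt, a100).items():
    print(key, f"{ratio:.2f}x")
```

Feeding in the full logs gives a ratio for every model, phase and precision. Keep in mind the two runs also differ in kernel and software stack, so the ratios describe these two setups rather than the chips in isolation.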

@Djip007

Djip007 commented Dec 11, 2021

> Any 6900 XT benchmarks?

With unofficial support on rocm-4.5: ;)

>TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 

Step	Img/sec	total_loss
1	images/sec: 305.5 +/- 0.0 (jitter = 0.0)	7.781
10	images/sec: 304.3 +/- 0.3 (jitter = 1.1)	7.740
20	images/sec: 304.0 +/- 0.2 (jitter = 1.0)	7.827
30	images/sec: 304.0 +/- 0.2 (jitter = 0.9)	7.967
40	images/sec: 304.1 +/- 0.1 (jitter = 0.8)	7.885
50	images/sec: 304.1 +/- 0.1 (jitter = 0.7)	7.792
60	images/sec: 304.0 +/- 0.1 (jitter = 0.8)	8.011
70	images/sec: 303.9 +/- 0.1 (jitter = 0.8)	7.870
80	images/sec: 303.8 +/- 0.1 (jitter = 0.9)	7.923
90	images/sec: 303.8 +/- 0.1 (jitter = 0.9)	7.745
100	images/sec: 303.8 +/- 0.1 (jitter = 0.8)	7.990
----------------------------------------------------------------
total images/sec: 303.76
----------------------------------------------------------------
>TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step	Img/sec	total_loss
1	images/sec: 457.7 +/- 0.0 (jitter = 0.0)	7.788
10	images/sec: 453.4 +/- 0.8 (jitter = 3.0)	7.738
20	images/sec: 452.8 +/- 0.6 (jitter = 2.0)	7.821
30	images/sec: 452.7 +/- 0.4 (jitter = 2.3)	7.962
40	images/sec: 452.6 +/- 0.4 (jitter = 2.3)	7.888
50	images/sec: 452.5 +/- 0.3 (jitter = 2.3)	7.795
60	images/sec: 452.5 +/- 0.3 (jitter = 2.5)	8.018
70	images/sec: 452.5 +/- 0.3 (jitter = 2.7)	7.868
80	images/sec: 452.7 +/- 0.3 (jitter = 2.9)	7.916
90	images/sec: 452.5 +/- 0.3 (jitter = 2.8)	7.739
100	images/sec: 452.4 +/- 0.3 (jitter = 2.9)	8.006
----------------------------------------------------------------
total images/sec: 452.35
----------------------------------------------------------------

@WannaBeOCer

My previous Radeon VII results are here: #173 (comment)

Hopefully AMD's next generation consumer GPUs include Matrix cores if they are actually bringing support to RDNA.

Titan RTX:

python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 

Step	Img/sec	total_loss
1	images/sec: 352.6 +/- 0.0 (jitter = 0.0)	7.781
10	images/sec: 355.6 +/- 0.7 (jitter = 1.8)	7.740
20	images/sec: 355.5 +/- 0.4 (jitter = 1.1)	7.827
30	images/sec: 355.4 +/- 0.3 (jitter = 0.9)	7.966
40	images/sec: 355.6 +/- 0.2 (jitter = 1.0)	7.880
50	images/sec: 355.5 +/- 0.2 (jitter = 1.0)	7.790
60	images/sec: 355.4 +/- 0.2 (jitter = 0.9)	8.013
70	images/sec: 355.4 +/- 0.2 (jitter = 0.9)	7.866
80	images/sec: 355.3 +/- 0.1 (jitter = 0.9)	7.920
90	images/sec: 355.3 +/- 0.1 (jitter = 0.9)	7.743
100	images/sec: 355.3 +/- 0.1 (jitter = 0.9)	7.991
----------------------------------------------------------------
total images/sec: 355.08
----------------------------------------------------------------
python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

Step	Img/sec	total_loss
1	images/sec: 1127.1 +/- 0.0 (jitter = 0.0)	7.788
10	images/sec: 1119.9 +/- 3.7 (jitter = 7.1)	7.741
20	images/sec: 1122.1 +/- 2.6 (jitter = 9.3)	7.826
30	images/sec: 1121.3 +/- 2.1 (jitter = 5.0)	7.962
40	images/sec: 1121.6 +/- 1.9 (jitter = 5.6)	7.885
50	images/sec: 1119.3 +/- 1.7 (jitter = 8.4)	7.795
60	images/sec: 1117.8 +/- 1.6 (jitter = 9.5)	8.012
70	images/sec: 1116.1 +/- 1.6 (jitter = 12.6)	7.874
80	images/sec: 1115.2 +/- 1.4 (jitter = 13.9)	7.929
90	images/sec: 1114.7 +/- 1.5 (jitter = 13.8)	7.739
100	images/sec: 1114.1 +/- 1.4 (jitter = 14.1)	8.000
----------------------------------------------------------------
total images/sec: 1112.65
----------------------------------------------------------------
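As a rough way to read the two pairs of runs above, here is a minimal sketch that computes the FP16 over FP32 speedup from the reported `total images/sec` figures. The numbers are copied from the logs in this thread; the label `previous comment` and the variable names are only illustrative.

```python
# FP32 and FP16 totals (images/sec) copied from the logs above.
# "previous comment" is the run posted just before the Titan RTX one.
runs = {
    "previous comment": (303.76, 452.35),
    "Titan RTX": (355.08, 1112.65),
}

# Speedup obtained by passing --use_fp16, per card.
speedups = {name: fp16 / fp32 for name, (fp32, fp16) in runs.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x faster with --use_fp16")
```

The much larger FP16 gain on the Titan RTX is what you would expect from a card with dedicated matrix hardware, though this single benchmark cannot prove that on its own.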

@Djip007

Djip007 commented Dec 13, 2021

CDNA and later compute cards (MI100 ...) already have Matrix cores ... but yes, the "consumer" cards do not.
The support is not official yet... hopefully there will be some optimisations once it is official.

Compared with the RTX 3080 ... the old Titan is just as fast... This benchmark is not optimised and is no longer updated for TensorFlow 2... Maybe we can find a more up-to-date benchmark...

@WannaBeOCer

The Titan RTX is slightly faster than an RTX 3080 because its FP32 accumulation runs at full throughput when using Tensor cores, like Nvidia's professional/data center cards. GeForce RTX cards are limited to 0.5x throughput for FP32 accumulation when using Tensor cores; it's a way to segment the lineup, similar to how AMD/Nvidia limit FP64 throughput.

https://lambdalabs.com/gpu-benchmarks

I used this benchmark to test my MacBook M1 Pro and my Titan RTX: https://github.com/tlkh/tf-metal-experiments

python train_benchmark.py --type cnn --model resnet50 --xla --fp16 --steps 100
python train_benchmark.py --type cnn --model mobilenetv2 --xla --fp16 --steps 100
python train_benchmark.py --type transformer --model distilbert-base-uncased --xla --fp16 --steps 100
python train_benchmark.py --type transformer --model bert-large-uncased --bs 16 --xla --fp16 --steps 30

Model | GPU | Batch size | Throughput
ResNet50 | Titan RTX | 128 | 901.0 img/sec
MobileNetV2 | Titan RTX | 128 | 1467.2 img/sec
DistilBERT | Titan RTX | 64 | 1216.9 seq/sec
BERTLarge | Titan RTX | 16 | 126.7 seq/sec

@tedliosu

tedliosu commented Jan 4, 2022

If it's alright, I'd like to post the benchmark results I got with my RX 6800 using ROCm 4.5.2 and amdgpu-dkms version 1:5.11.32.40502-1350682 (from the latest amdgpu-pro driver stack):

> TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 226.9 +/- 0.0 (jitter = 0.0)        7.781
10      images/sec: 226.4 +/- 0.3 (jitter = 0.1)        7.740
20      images/sec: 226.5 +/- 0.2 (jitter = 0.2)        7.827
30      images/sec: 226.4 +/- 0.1 (jitter = 0.2)        7.966
40      images/sec: 226.4 +/- 0.1 (jitter = 0.2)        7.883
50      images/sec: 226.5 +/- 0.1 (jitter = 0.2)        7.800
60      images/sec: 226.5 +/- 0.1 (jitter = 0.3)        8.008
70      images/sec: 226.5 +/- 0.1 (jitter = 0.2)        7.872
80      images/sec: 226.5 +/- 0.1 (jitter = 0.2)        7.930
90      images/sec: 226.5 +/- 0.1 (jitter = 0.2)        7.743
100     images/sec: 226.6 +/- 0.1 (jitter = 0.2)        7.996
----------------------------------------------------------------
total images/sec: 226.54
----------------------------------------------------------------
> TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

Step    Img/sec total_loss
1       images/sec: 316.8 +/- 0.0 (jitter = 0.0)        7.786
10      images/sec: 316.4 +/- 0.3 (jitter = 0.5)        7.742
20      images/sec: 316.7 +/- 0.2 (jitter = 0.6)        7.826
30      images/sec: 316.8 +/- 0.1 (jitter = 0.7)        7.964
40      images/sec: 316.8 +/- 0.1 (jitter = 0.7)        7.884
50      images/sec: 316.9 +/- 0.1 (jitter = 0.6)        7.799
60      images/sec: 316.8 +/- 0.1 (jitter = 0.6)        8.015
70      images/sec: 316.7 +/- 0.1 (jitter = 0.7)        7.867
80      images/sec: 316.7 +/- 0.1 (jitter = 0.6)        7.922
90      images/sec: 316.7 +/- 0.1 (jitter = 0.6)        7.753
100     images/sec: 316.7 +/- 0.1 (jitter = 0.6)        7.999
----------------------------------------------------------------
total images/sec: 316.61
----------------------------------------------------------------

I guess these results are somewhat in line with @Djip007's results, since, at least according to userbenchmark.com, the 6900 XT is around 40% faster than the 6800.
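A quick sanity check of that "around 40%" figure against the training numbers, as a sketch. It assumes the 303.76 / 452.35 totals posted earlier in the thread are the 6900 XT runs being referred to; the variable names are illustrative.

```python
# Totals (images/sec) from the RX 6800 runs above.
rx6800 = {"fp32": 226.54, "fp16": 316.61}

# Assumption: these earlier totals in the thread are the 6900 XT results.
rx6900xt = {"fp32": 303.76, "fp16": 452.35}

# Throughput of the 6900 XT relative to the RX 6800, per precision.
ratios = {mode: rx6900xt[mode] / rx6800[mode] for mode in rx6800}

for mode, ratio in ratios.items():
    print(f"{mode}: 6900 XT is {100 * (ratio - 1):.0f}% faster")
```

Under that assumption the gap is roughly 34% in FP32 and 43% in FP16, which brackets the 40% estimate.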

However, what I don't understand is how the Radeon VII is able to basically match the 6900 XT and beat the RX 6800 in the benchmarks I ran above, when both the 6900 XT and the 6800 are clearly the faster GPUs, at least according to Geekbench's OpenCL benchmarks (the 6900 XT scores 167460, the 6800 scores 129251, and the Vega 20, i.e. the Radeon VII, only scores 96073). Could this discrepancy be explained by the Radeon VII's massive memory bandwidth, or by the RDNA series of GPUs not being as optimized for compute (microarchitecture-wise) as the Radeon VII? Or is it just that ROCm and the amdgpu-pro drivers haven't been optimized for compute tasks on RDNA cards?
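To put the Geekbench gap in numbers, here is a small sketch that normalises the three OpenCL scores quoted above to the Radeon VII. Only the scores come from this comment; the dictionary and variable names are illustrative.

```python
# Geekbench OpenCL scores quoted in the comment above.
opencl_scores = {
    "6900 XT": 167460,
    "RX 6800": 129251,
    "Radeon VII": 96073,
}

# Each score relative to the Radeon VII (Vega 20).
baseline = opencl_scores["Radeon VII"]
relative = {gpu: score / baseline for gpu, score in opencl_scores.items()}

for gpu, factor in relative.items():
    print(f"{gpu}: {factor:.2f}x the Radeon VII OpenCL score")
```

So the synthetic OpenCL score would predict the RDNA cards to be well ahead of the Radeon VII, which is exactly the expectation the ResNet-50 training runs do not match.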

@gururise

Ran some benchmarks on my RX6800XT (Quiet Mode Switch) in the Docker ROCm5.2.0-TF2.8-dev container:

>TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 258.0 +/- 0.0 (jitter = 0.0)        7.781
10      images/sec: 168.1 +/- 20.7 (jitter = 0.5)       7.740
20      images/sec: 203.7 +/- 10.6 (jitter = 0.3)       7.826
30      images/sec: 219.1 +/- 7.1 (jitter = 0.4)        7.965
40      images/sec: 227.7 +/- 5.4 (jitter = 0.4)        7.878
50      images/sec: 233.1 +/- 4.3 (jitter = 0.3)        7.790
60      images/sec: 236.9 +/- 3.6 (jitter = 0.3)        8.006
70      images/sec: 239.7 +/- 3.1 (jitter = 0.4)        7.866
80      images/sec: 241.8 +/- 2.7 (jitter = 0.4)        7.929
90      images/sec: 243.4 +/- 2.4 (jitter = 0.5)        7.745
100     images/sec: 244.8 +/- 2.2 (jitter = 0.5)        7.997
----------------------------------------------------------------
total images/sec: 244.74
----------------------------------------------------------------
> TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

Step    Img/sec total_loss
1       images/sec: 360.7 +/- 0.0 (jitter = 0.0)        7.784
10      images/sec: 368.3 +/- 0.9 (jitter = 1.5)        7.745
20      images/sec: 368.8 +/- 0.5 (jitter = 0.9)        7.827
30      images/sec: 368.9 +/- 0.3 (jitter = 0.6)        7.964
40      images/sec: 368.9 +/- 0.3 (jitter = 0.6)        7.881
50      images/sec: 368.9 +/- 0.2 (jitter = 0.6)        7.792
60      images/sec: 368.7 +/- 0.2 (jitter = 0.8)        8.013
70      images/sec: 368.6 +/- 0.2 (jitter = 1.0)        7.873
80      images/sec: 368.6 +/- 0.2 (jitter = 1.0)        7.926
90      images/sec: 368.5 +/- 0.1 (jitter = 1.0)        7.739
100     images/sec: 368.2 +/- 0.2 (jitter = 1.2)        7.999
----------------------------------------------------------------
total images/sec: 368.10
----------------------------------------------------------------

I would try it in Power Mode, but that would involve opening my case and flipping the BIOS switch on the card, which is too much work at this time.
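For anyone comparing several of these runs, a minimal sketch of pulling the `total images/sec` figure out of a saved log with a regular expression. The helper name `total_images_per_sec` and the `sample_log` text are illustrative; the log format is taken from the output pasted in this thread.

```python
import re

# Matches the summary line printed at the end of a run,
# e.g. "total images/sec: 368.10".
TOTAL_RE = re.compile(r"^total images/sec:\s*([0-9]+(?:\.[0-9]+)?)\s*$", re.MULTILINE)


def total_images_per_sec(log_text: str) -> float:
    """Return the 'total images/sec' value found in a benchmark log."""
    match = TOTAL_RE.search(log_text)
    if match is None:
        raise ValueError("no 'total images/sec' line found")
    return float(match.group(1))


# Tail of the FP16 run above, pasted as an example input.
sample_log = """\
100     images/sec: 368.2 +/- 0.2 (jitter = 1.2)        7.999
----------------------------------------------------------------
total images/sec: 368.10
----------------------------------------------------------------
"""

fp16_total = total_images_per_sec(sample_log)
fp32_total = 244.74  # total from the FP32 run above
print(f"FP16 speedup: {fp16_total / fp32_total:.2f}x")
```

Note that the FP32 total above includes a slow stretch around step 10, so the speedup computed this way is probably somewhat overstated for this card.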
