EfficientNet inference yields incorrect results on GPU #519

Closed
liamnr2 opened this issue Jun 24, 2019 · 11 comments
Labels
gfx803 issue specific to gfx803 GPUs

Comments

@liamnr2
liamnr2 commented Jun 24, 2019

I'm using ROCm 2.5, tensorflow-rocm 1.13.3 and Python 3.6 with an RX 470.

When running the simple EfficientNet-B0 inference example here:
https://github.com/qubvel/efficientnet/blob/master/examples/inference_example.ipynb

the inference of the example image yields incorrect and non-deterministic results. Some examples:
[[('n01773549', 'barn_spider', 0.4544877), ('n01776313', 'tick', 0.14279026), ('n03271574', 'electric_fan', 0.06995272), ('n01774750', 'tarantula', 0.059890375), ('n01531178', 'goldfinch', 0.04341215)]]

[[('n01776313', 'tick', 0.52216125), ('n01773549', 'barn_spider', 0.24521892), ('n03271574', 'electric_fan', 0.17396057), ('n01774750', 'tarantula', 0.015509106), ('n03982430', 'pool_table', 0.008740869)]]

[[('n02497673', 'Madagascar_cat', 0.24683656), ('n03976657', 'pole', 0.20120004), ('n03710721', 'maillot', 0.078447856), ('n01773549', 'barn_spider', 0.046732053), ('n01774750', 'tarantula', 0.04341184)]]

When I force it to run on the CPU by setting CUDA_VISIBLE_DEVICES= (empty), it yields the expected result:
[[('n02510455', 'giant_panda', 0.8347932), ('n02134084', 'ice_bear', 0.015602067), ('n02509815', 'lesser_panda', 0.0045535103), ('n02133161', 'American_black_bear', 0.0024719117), ('n02132136', 'brown_bear', 0.0020707578)]]
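
One way to quantify the instability described above is to collect the decoded lists from several runs and compare their top-1 labels. The following is only a minimal sketch: `top1_labels` and `is_stable` are hypothetical helper names, and the sample data keeps just the first entry of each list reported in this comment.

```python
from collections import Counter


def top1_labels(runs):
    """Return the top-1 class name of each run.

    Each run has the same shape as the lists shown above:
    a batch (list) of lists of (wnid, name, score) tuples.
    """
    return [run[0][0][1] for run in runs]


def is_stable(runs):
    """True if every run agrees on the top-1 class."""
    return len(set(top1_labels(runs))) == 1


# First entries only, copied from the GPU and CPU outputs in this comment.
gpu_runs = [
    [[('n01773549', 'barn_spider', 0.4544877)]],
    [[('n01776313', 'tick', 0.52216125)]],
    [[('n02497673', 'Madagascar_cat', 0.24683656)]],
]
cpu_runs = [[[('n02510455', 'giant_panda', 0.8347932)]]]

print(Counter(top1_labels(gpu_runs)))
print("GPU stable:", is_stable(gpu_runs))
print("CPU stable:", is_stable(cpu_runs))
```

With real data, `runs` would hold one decoded prediction list per inference call on the same image.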

@Bengt
Bengt commented Jun 28, 2019

Hi, @liamnr2!

Welcome to GitHub and thanks for reporting this issue. Seemingly random results are hard to test for, so it is very valuable that you found some.

Unfortunately, the RX 470 uses a Polaris 10 chip, which shares the gfx803 compile target with a number of other popular GPUs. For a list of the affected GPUs, see #479.

There have been quite a number of issues with this compile target, only some of which have been resolved so far. For a full list, see the gfx803 tag.

To find the cause of this behavior, we need to reproduce these issues with various combinations of hardware and software. I can try to help create a reproduction procedure.

A wild guess would be to try downgrading rocm-opencl, which has helped with gfx803 in some cases:

#300 (comment)
#302 (comment)

@Bengt
Bengt commented Jun 29, 2019

Procedure for reproduction:

docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx rocm/tensorflow:rocm2.5-tf1.13-python3
python3 -m pip install scikit-image numpy keras efficientnet pytest
wget https://upload.wikimedia.org/wikipedia/commons/f/fe/Giant_Panda_in_Beijing_Zoo_1.JPG
wget https://gist.githubusercontent.com/Bengt/308c7d05dc755f1bfe0aeda9220e4eed/raw//test_efficientnet_gfx803.py
HIP_VISIBLE_DEVICES=0 python3 -m pytest -s test_efficientnet_gfx803.py
HIP_VISIBLE_DEVICES=-1 python3 -m pytest -s test_efficientnet_gfx803.py
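
The last two commands can also be driven from Python when the comparison needs to be repeated many times. This is a rough sketch, not part of the original procedure: `build_env`, `run_test` and `main` are hypothetical names, and the only thing taken from the thread is the `HIP_VISIBLE_DEVICES` variable, the test file name, and the use of `-1` to fall back to the CPU.

```python
import os
import subprocess
import sys

TEST_FILE = "test_efficientnet_gfx803.py"


def build_env(device, base=None):
    """Copy the environment and pin HIP_VISIBLE_DEVICES to `device`."""
    env = dict(os.environ if base is None else base)
    env["HIP_VISIBLE_DEVICES"] = str(device)
    return env


def run_test(device):
    """Run the pytest file once with the given device; return the exit code."""
    cmd = [sys.executable, "-m", "pytest", "-s", TEST_FILE]
    return subprocess.run(cmd, env=build_env(device)).returncode


def main():
    # "0" selects the first GPU; "-1" is the value used above for the CPU run.
    for device in ("0", "-1"):
        print("HIP_VISIBLE_DEVICES=%s -> exit code %d" % (device, run_test(device)))
```

Calling `main()` inside the container runs the test once per setting; a non-zero exit code means pytest reported a failure or error.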

@Bengt
Bengt commented Jun 29, 2019

I can reproduce this issue.

Using GPU 0 fails:

# HIP_VISIBLE_DEVICES=0 python3 -m pytest -s test_efficientnet_gfx803.py
[...]
>       assert actual == expected
E       AssertionError: assert ['bow_tie', '... 'guinea_pig'] == ['giant_panda'... 'brown_bear']
E         At index 0 diff: 'bow_tie' != 'giant_panda'

Using GPU 1 fails:

# HIP_VISIBLE_DEVICES=1 python3 -m pytest -s test_efficientnet_gfx803.py
[...]
>       assert actual == expected
E       AssertionError: assert ['crutch', 't...ra', 'sorrel'] == ['giant_panda'... 'brown_bear']
E         At index 0 diff: 'crutch' != 'giant_panda'

Using GPU 2 fails:

# HIP_VISIBLE_DEVICES=2 python3 -m pytest -s test_efficientnet_gfx803.py
[...]
>       assert actual == expected
E       AssertionError: assert ['jersey', 'w...an_coonhound'] == ['giant_panda'... 'brown_bear']
E         At index 0 diff: 'jersey' != 'giant_panda'

Using GPU 3 fails:

# HIP_VISIBLE_DEVICES=3 python3 -m pytest -s test_efficientnet_gfx803.py
[...]
>       assert actual == expected
E       AssertionError: assert ['bolo_tie', ...analog_clock'] == ['giant_panda'... 'brown_bear']
E         At index 0 diff: 'bolo_tie' != 'giant_panda'

These results do indeed seem random or non-deterministic:

# HIP_VISIBLE_DEVICES=3 python3 -m pytest -s test_efficientnet_gfx803.py
[...]
>       assert actual == expected
E       AssertionError: assert ['oxygen_mask...er', 'maraca'] == ['giant_panda'... 'brown_bear']
E         At index 0 diff: 'oxygen_mask' != 'giant_panda'

Using CPU works fine:

# HIP_VISIBLE_DEVICES=-1 python3 -m pytest -s test_efficientnet_gfx803.py
[...]
===================== 1 passed, 2 warnings in 9.32 seconds =====================

I am using R9 Fury X and R9 Nano GPUs, latest Ubuntu Kernel and ROCm 2.5.27:

$ lspci -v | grep VGA
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev ca) (prog-if 00 [VGA controller])
0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev ca) (prog-if 00 [VGA controller])
42:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev ca) (prog-if 00 [VGA controller])
43:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev c8) (prog-if 00 [VGA controller])
$ uname -r
4.15.0-54-generic
$ dpkg -l | grep rocm | grep stack
ii  rocm-dev                                      2.5.27                                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-dkms                                     2.5.27                                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-libs                                     2.5.27                                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-utils                                    2.5.27                                       amd64        Radeon Open Compute (ROCm) Runtime software stack

Downgrading rocm-opencl does not help in my case:

cd ~ && mkdir rocm1.9.2-opencl && cd rocm1.9.2-opencl &&
wget https://www.dropbox.com/s/rtwe1zrpuphbyqm/rocm-opencl-1.2.0-2018111340_amd64.deb &&
wget https://www.dropbox.com/s/6gp2g5zju66i4e9/rocm-opencl-dev-1.2.0-2018111340_amd64.deb &&
dpkg -i rocm-opencl*.deb &&
rm -rf ~/.cache &&
cd -
# HIP_VISIBLE_DEVICES=0 python3 -m pytest -s test_efficientnet_gfx803.py
[...]
>       assert actual == expected
E       AssertionError: assert ['artichoke',... 'sea_urchin'] == ['giant_panda'... 'brown_bear']
E         At index 0 diff: 'artichoke' != 'giant_panda'

@Bengt
Bengt commented Jun 29, 2019

This issue persists with rocm/tensorflow:rocm1.9.2-tf1.12-python3:

# HIP_VISIBLE_DEVICES=0 python3 -m pytest -s test_efficientnet_gfx803.py
[...]
>       assert actual == expected
E       AssertionError: assert ['garter_snak...r', 'echidna'] == ['giant_panda'... 'brown_bear']
E         At index 0 diff: 'garter_snake' != 'giant_panda'

sunway513 added the gfx803 label (issue specific to gfx803 GPUs) on Jul 8, 2019
@liamnr2
liamnr2 commented Aug 14, 2019

Still a problem with ROCm 2.6.

As an observation, setting MIOPEN_DEBUG_GCN_ASM_KERNELS=0 improves the results: there is still jitter, but far less of it. With EfficientNet-B7 the jitter is minimal, but still there.

EfficientNet-B0, MIOPEN_DEBUG_GCN_ASM_KERNELS=1:

[[('n02510455', 'giant_panda', 0.77773875), ('n02132136', 'brown_bear', 0.01460326), ('n02134084', 'ice_bear', 0.009905247), ('n02133161', 'American_black_bear', 0.009050588), ('n02096585', 'Boston_bull', 0.0070677395)]]
[[('n06359193', 'web_site', 0.058782034), ('n03291819', 'envelope', 0.05005152), ('n04118776', 'rule', 0.044108856), ('n03998194', 'prayer_rug', 0.04122383), ('n04409515', 'tennis_ball', 0.034883693)]]
[[('n03706229', 'magnetic_compass', 0.14726971), ('n04238763', 'slide_rule', 0.11163757), ('n04118776', 'rule', 0.094091), ('n02708093', 'analog_clock', 0.027822705), ('n02794156', 'barometer', 0.02083086)]]
[[('n04118776', 'rule', 0.1692506), ('n03706229', 'magnetic_compass', 0.0850252), ('n04238763', 'slide_rule', 0.06351954), ('n02708093', 'analog_clock', 0.024399932), ('n03857828', 'oscilloscope', 0.019325882)]]
[[('n04238763', 'slide_rule', 0.06601893), ('n04118776', 'rule', 0.057043314), ('n03706229', 'magnetic_compass', 0.043703355), ('n04357314', 'sunscreen', 0.04076335), ('n03929660', 'pick', 0.035940796)]]
[[('n06359193', 'web_site', 0.049492065), ('n04118776', 'rule', 0.049231295), ('n03998194', 'prayer_rug', 0.048374362), ('n03291819', 'envelope', 0.035772696), ('n07248320', 'book_jacket', 0.033176217)]]
[[('n04238763', 'slide_rule', 0.12657635), ('n03706229', 'magnetic_compass', 0.10579053), ('n04118776', 'rule', 0.054984488), ('n04357314', 'sunscreen', 0.04215321), ('n03047690', 'clog', 0.031784806)]]
[[('n04118776', 'rule', 0.32308587), ('n04238763', 'slide_rule', 0.14665197), ('n03706229', 'magnetic_compass', 0.044921804), ('n04357314', 'sunscreen', 0.026250241), ('n02708093', 'analog_clock', 0.023987856)]]
[[('n04118776', 'rule', 0.08757503), ('n03706229', 'magnetic_compass', 0.06836976), ('n04238763', 'slide_rule', 0.06297214), ('n02708093', 'analog_clock', 0.03041393), ('n04039381', 'racket', 0.02478665)]]
[[('n04238763', 'slide_rule', 0.073604986), ('n04357314', 'sunscreen', 0.057176016), ('n04118776', 'rule', 0.055984076), ('n03706229', 'magnetic_compass', 0.05034835), ('n03929660', 'pick', 0.029202135)]]

EfficientNet-B0, MIOPEN_DEBUG_GCN_ASM_KERNELS=0:

[[('n02510455', 'giant_panda', 0.80664486), ('n02134084', 'ice_bear', 0.006699027), ('n02132136', 'brown_bear', 0.0057221507), ('n02509815', 'lesser_panda', 0.004147317), ('n02120079', 'Arctic_fox', 0.0035862043)]]
[[('n02510455', 'giant_panda', 0.75878745), ('n02134084', 'ice_bear', 0.008354737), ('n02132136', 'brown_bear', 0.007207209), ('n02509815', 'lesser_panda', 0.004130219), ('n02120079', 'Arctic_fox', 0.0040210793)]]
[[('n02510455', 'giant_panda', 0.7587877), ('n02134084', 'ice_bear', 0.008354739), ('n02132136', 'brown_bear', 0.0072072037), ('n02509815', 'lesser_panda', 0.0041302163), ('n02120079', 'Arctic_fox', 0.0040210765)]]
[[('n02510455', 'giant_panda', 0.76415765), ('n02134084', 'ice_bear', 0.008157566), ('n02132136', 'brown_bear', 0.0061342083), ('n02509815', 'lesser_panda', 0.0036074982), ('n02120079', 'Arctic_fox', 0.0035751157)]]
[[('n02510455', 'giant_panda', 0.75936085), ('n02134084', 'ice_bear', 0.008365493), ('n02132136', 'brown_bear', 0.007142773), ('n02509815', 'lesser_panda', 0.004107962), ('n02120079', 'Arctic_fox', 0.0040129614)]]
[[('n02510455', 'giant_panda', 0.75878924), ('n02134084', 'ice_bear', 0.00835698), ('n02132136', 'brown_bear', 0.0072079534), ('n02509815', 'lesser_panda', 0.004130396), ('n02120079', 'Arctic_fox', 0.0040213186)]]
[[('n02510455', 'giant_panda', 0.7603499), ('n02134084', 'ice_bear', 0.009082864), ('n02132136', 'brown_bear', 0.006688087), ('n02120079', 'Arctic_fox', 0.0040302738), ('n02509815', 'lesser_panda', 0.0038609721)]]
[[('n02510455', 'giant_panda', 0.7493819), ('n02132136', 'brown_bear', 0.008669576), ('n02134084', 'ice_bear', 0.008599169), ('n02509815', 'lesser_panda', 0.0042907814), ('n02120079', 'Arctic_fox', 0.0039218697)]]
[[('n02510455', 'giant_panda', 0.73992616), ('n02134084', 'ice_bear', 0.008566578), ('n02132136', 'brown_bear', 0.0071503706), ('n02120079', 'Arctic_fox', 0.005537635), ('n02133161', 'American_black_bear', 0.0039643333)]]
[[('n02510455', 'giant_panda', 0.48032713), ('n02114548', 'white_wolf', 0.024954954), ('n02120079', 'Arctic_fox', 0.016971268), ('n02395406', 'hog', 0.015805786), ('n02132136', 'brown_bear', 0.00848116)]]

EfficientNet-B7, MIOPEN_DEBUG_GCN_ASM_KERNELS=1:

[[('n02093256', 'Staffordshire_bullterrier', 1.0), ('n15075141', 'toilet_tissue', 0.0), ('n02319095', 'sea_urchin', 0.0), ('n02395406', 'hog', 0.0), ('n02391049', 'zebra', 0.0)]]
[[('n15075141', 'toilet_tissue', nan), ('n02319095', 'sea_urchin', nan), ('n02395406', 'hog', nan), ('n02391049', 'zebra', nan), ('n02389026', 'sorrel', nan)]]
[[('n03482405', 'hamper', 1.0), ('n15075141', 'toilet_tissue', 0.0), ('n02319095', 'sea_urchin', 0.0), ('n02391049', 'zebra', 0.0), ('n02389026', 'sorrel', 0.0)]]
[[('n13044778', 'earthstar', 1.0), ('n02317335', 'starfish', 6.7773486e-22), ('n04033901', 'quill', 3.4295856e-33), ('n02391049', 'zebra', 0.0), ('n02389026', 'sorrel', 0.0)]]
[[('n03379051', 'football_helmet', 1.0), ('n15075141', 'toilet_tissue', 0.0), ('n02281787', 'lycaenid', 0.0), ('n02389026', 'sorrel', 0.0), ('n02364673', 'guinea_pig', 0.0)]]
[[('n07892512', 'red_wine', 1.0), ('n02317335', 'starfish', 0.0), ('n02391049', 'zebra', 0.0), ('n02389026', 'sorrel', 0.0), ('n02364673', 'guinea_pig', 0.0)]]
[[('n04447861', 'toilet_seat', 1.0), ('n15075141', 'toilet_tissue', 0.0), ('n02317335', 'starfish', 0.0), ('n02391049', 'zebra', 0.0), ('n02389026', 'sorrel', 0.0)]]
[[('n03314780', 'face_powder', 1.0), ('n15075141', 'toilet_tissue', 0.0), ('n02281787', 'lycaenid', 0.0), ('n02389026', 'sorrel', 0.0), ('n02364673', 'guinea_pig', 0.0)]]
[[('n02804610', 'bassoon', 1.0), ('n02841315', 'binoculars', 2.5764785e-13), ('n04099969', 'rocking_chair', 5.0266332e-29), ('n02328150', 'Angora', 0.0), ('n02317335', 'starfish', 0.0)]]
[[('n03887697', 'paper_towel', 1.0), ('n15075141', 'toilet_tissue', 0.0), ('n02281787', 'lycaenid', 0.0), ('n02389026', 'sorrel', 0.0), ('n02364673', 'guinea_pig', 0.0)]]

EfficientNet-B7, MIOPEN_DEBUG_GCN_ASM_KERNELS=0:

[[('n02510455', 'giant_panda', 0.8399878), ('n02509815', 'lesser_panda', 0.003146674), ('n02133161', 'American_black_bear', 0.002262074), ('n02134084', 'ice_bear', 0.0014058463), ('n02132136', 'brown_bear', 0.0013730429)]]
[[('n02510455', 'giant_panda', 0.8399878), ('n02509815', 'lesser_panda', 0.003146674), ('n02133161', 'American_black_bear', 0.002262073), ('n02134084', 'ice_bear', 0.0014058443), ('n02132136', 'brown_bear', 0.0013730436)]]
[[('n02510455', 'giant_panda', 0.8399879), ('n02509815', 'lesser_panda', 0.0031466729), ('n02133161', 'American_black_bear', 0.0022620677), ('n02134084', 'ice_bear', 0.0014058452), ('n02132136', 'brown_bear', 0.0013730424)]]
[[('n02510455', 'giant_panda', 0.8399878), ('n02509815', 'lesser_panda', 0.0031466756), ('n02133161', 'American_black_bear', 0.0022620752), ('n02134084', 'ice_bear', 0.001405845), ('n02132136', 'brown_bear', 0.0013730436)]]
[[('n02510455', 'giant_panda', 0.8399879), ('n02509815', 'lesser_panda', 0.0031466743), ('n02133161', 'American_black_bear', 0.00226207), ('n02134084', 'ice_bear', 0.0014058452), ('n02132136', 'brown_bear', 0.0013730417)]]
[[('n02510455', 'giant_panda', 0.839891), ('n02509815', 'lesser_panda', 0.003151415), ('n02133161', 'American_black_bear', 0.0022747808), ('n02134084', 'ice_bear', 0.0014124429), ('n02132136', 'brown_bear', 0.0013766055)]]
[[('n02510455', 'giant_panda', 0.8399878), ('n02509815', 'lesser_panda', 0.0031466784), ('n02133161', 'American_black_bear', 0.002262073), ('n02134084', 'ice_bear', 0.0014058456), ('n02132136', 'brown_bear', 0.0013730442)]]
[[('n02510455', 'giant_panda', 0.83998775), ('n02509815', 'lesser_panda', 0.0031466782), ('n02133161', 'American_black_bear', 0.002262075), ('n02134084', 'ice_bear', 0.0014058475), ('n02132136', 'brown_bear', 0.0013730454)]]
[[('n02510455', 'giant_panda', 0.8399878), ('n02509815', 'lesser_panda', 0.0031466756), ('n02133161', 'American_black_bear', 0.002262073), ('n02134084', 'ice_bear', 0.001405845), ('n02132136', 'brown_bear', 0.0013730442)]]
[[('n02510455', 'giant_panda', 0.8399878), ('n02509815', 'lesser_panda', 0.0031466756), ('n02133161', 'American_black_bear', 0.002262072), ('n02134084', 'ice_bear', 0.0014058456), ('n02132136', 'brown_bear', 0.0013730429)]]
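
The remaining jitter with MIOPEN_DEBUG_GCN_ASM_KERNELS=0 can be put into numbers by looking at the range of the top-1 score over the ten runs. The sketch below uses a hypothetical helper, `top1_summary`, and the top-1 (name, score) pairs copied from the two `=0` listings above.

```python
def top1_summary(runs):
    """Summarise top-1 (name, score) pairs collected over repeated runs."""
    names = [name for name, _ in runs]
    scores = [score for _, score in runs]
    return {
        "distinct_labels": len(set(names)),
        "score_range": max(scores) - min(scores),
    }


# EfficientNet-B0, MIOPEN_DEBUG_GCN_ASM_KERNELS=0 (top-1 entries from above).
b0_asm_off = [
    ("giant_panda", 0.80664486), ("giant_panda", 0.75878745),
    ("giant_panda", 0.7587877), ("giant_panda", 0.76415765),
    ("giant_panda", 0.75936085), ("giant_panda", 0.75878924),
    ("giant_panda", 0.7603499), ("giant_panda", 0.7493819),
    ("giant_panda", 0.73992616), ("giant_panda", 0.48032713),
]

# EfficientNet-B7, MIOPEN_DEBUG_GCN_ASM_KERNELS=0 (top-1 entries from above).
b7_asm_off = [
    ("giant_panda", 0.8399878), ("giant_panda", 0.8399878),
    ("giant_panda", 0.8399879), ("giant_panda", 0.8399878),
    ("giant_panda", 0.8399879), ("giant_panda", 0.839891),
    ("giant_panda", 0.8399878), ("giant_panda", 0.83998775),
    ("giant_panda", 0.8399878), ("giant_panda", 0.8399878),
]

print("B0:", top1_summary(b0_asm_off))
print("B7:", top1_summary(b7_asm_off))
```

In both cases the top-1 label never changes, but the B0 score still swings by roughly a third of the probability mass, while the B7 score moves by about one part in ten thousand.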

@ekuznetsov139
FYI, not that it helps you, but it works correctly on gfx900 (Vega 10) with rocm2.6-tf1.14-python3.
[[('n02510455', 'giant_panda', 0.83479327), ('n02134084', 'ice_bear', 0.015601887), ('n02509815', 'lesser_panda', 0.0045534954), ('n02133161', 'American_black_bear', 0.0024719073), ('n02132136', 'brown_bear', 0.002070747)]]

@Bengt
Bengt commented Aug 29, 2019

Hi @ekuznetsov139,

Thanks for the data point. While you are at it, could you rerun the test with rocm/tensorflow:rocm2.7-tf1.14-dev? That seems to be the current focus of development.

Regards,
Bengt

@ekuznetsov139
It works correctly with that tag as well.

Though in both cases there is something odd: processing takes a very long time (around 1 minute) and GPU usage is near zero all that time. (It definitely uses the GPU, I've confirmed with HIP_TRACE_API.) Not sure if it's an anomaly or it's just that EfficientNet is not being very efficient.

@Bengt
Bengt commented Sep 2, 2019

Hi, to add another data point: I can confirm that this works on gfx900 (Vega 64, Vega 10). So the issue seems to affect only gfx803. Keeping an eye on the card's GPUTach, I also noticed long idle times during the test run.

@huanzhang12
I found that the issue is caused by the ASM 1x1 kernel on gfx803: https://github.com/ROCmSoftwarePlatform/MIOpen/blob/master/src/kernels/conv1x1u.s
On gfx803, I can obtain the same result as on gfx906 by disabling this ASM 1x1 kernel:

MIOPEN_DEBUG_CONV_DIRECT_ASM_1X1U=0 python3 -m pytest -s test_efficientnet_gfx803.py

Recently, to avoid issues like this one, all ASM convolution kernels have been disabled on gfx803 (see ROCm/MIOpen@ce51a4c). But this also significantly reduces gfx803 performance (ResNet-50 becomes almost twice as slow, see #173 (comment)). I have a workload that becomes 10x slower on gfx803 after disabling the ASM kernels. I hope AMD can fix the bugs in the ASM kernels and re-enable them on gfx803.
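
For scripts and notebooks that are not launched from a shell, the same switch can be set from Python. This is a sketch under one assumption that the thread does not confirm: that MIOpen reads the variable from the environment of the process running the convolutions, so it has to be set before TensorFlow is imported. `miopen_asm_1x1_disabled` is a hypothetical helper used only to check the setting.

```python
import os

# Assumption: the variable must be in place before TensorFlow (and with it
# MIOpen) is loaded, so set it at the very top of the script or notebook.
os.environ["MIOPEN_DEBUG_CONV_DIRECT_ASM_1X1U"] = "0"


def miopen_asm_1x1_disabled():
    """Return True if the ASM 1x1 kernel switch is set to 0 in this process."""
    return os.environ.get("MIOPEN_DEBUG_CONV_DIRECT_ASM_1X1U") == "0"


print("ASM 1x1 kernel disabled:", miopen_asm_1x1_disabled())

# Import TensorFlow only after this point, for example:
# import tensorflow as tf
```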

@ROCmSupport
Thanks for reaching out.
gfx8 is no longer a supported configuration.
We do not officially support gfx8 devices with ROCm; please refer to the supported hardware section of the ROCm docs: https://github.com/RadeonOpenCompute/ROCm#Hardware-and-Software-Support

6 participants