[OCL] Batchnorm verification issues with some configs #1974

Open
junliume opened this issue Feb 9, 2023 · 7 comments

Comments

@junliume
Contributor

junliume commented Feb 9, 2023

Several batch norm configs fail verification in the forward direction (?):

MIOpenDriver bnorm -n 1 -c 64 -D 128 -H 128 -W 128 -m 1 --forw 1 -b 0 -r 1
MIOpenDriver bnorm -n 1 -c 128 -D 64 -H 64 -W 64 -m 1 --forw 1 -b 0 -r 1
MIOpenDriver bnorm -n 1 -c 256 -D 32 -H 32 -W 32 -m 1 --forw 1 -b 0 -r 1
MIOpenDriver bnorm -n 1 -c 512 -D 16 -H 16 -W 16 -m 1 --forw 1 -b 0 -r 1
MIOpenDriver bnorm -n 1 -c 640 -D 8 -H 8 -W 8 -m 1 --forw 1 -b 0 -r 1
MIOpenDriver bnorm -n 1 -c 640 -D 4 -H 4 -W 4 -m 1 --forw 1 -b 0 -r 1

The above are the problematic configs. A typical failure looks like:

# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 128 -H 128 -W 128 -m 1 --forw 1 -b 0 -r 1
MIOpenDriver bnorm -n 1 -c 64 -D 128 -H 128 -W 128 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification **FAILED on running variance: 0.00583748**
Forward batch norm verification **FAILED on output: 0.00592572**

The file where the issue most likely originates:
MIOpenBatchNormFwdTrainSpatial.cl.

@atamazov
Contributor

./bin/MIOpenDriver bnorm -n 1 -c 64 -D 128 -H 128 -W 128 -m 1 --forw 1 -b 0 -r 1

I am afraid the reason is not address computation. The size of the input tensor buffer is 0.5 GiB. 32 bit signed address computations are able to handle 2GiB buffers.
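
The size estimate above can be checked with a few lines of Python. This is only a sketch: the 4-byte element size is an assumption (fp32 data), and `tensor_bytes` is a hypothetical helper, not a MIOpen function.

```python
def tensor_bytes(n, c, d, h, w, elem_size=4):
    """Size in bytes of a dense NCDHW tensor (elem_size=4 assumes fp32)."""
    return n * c * d * h * w * elem_size

# Largest offset a signed 32-bit index can represent.
INT32_MAX = 2**31 - 1

# First failing config: -n 1 -c 64 -D 128 -H 128 -W 128
size = tensor_bytes(n=1, c=64, d=128, h=128, w=128)

print(size / 2**30)       # 0.5 (GiB)
print(size <= INT32_MAX)  # True: byte offsets still fit in a signed 32-bit int
```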

@junliume
Contributor Author

> ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 128 -H 128 -W 128 -m 1 --forw 1 -b 0 -r 1
>
> I am afraid the reason is not address computation. The size of the input tensor buffer is 0.5 GiB. 32 bit signed address computations are able to handle 2GiB buffers.

@zjing14 @atamazov yes, and I am afraid the numerical instability has been there for a while.

@atamazov
Contributor

@junliume

  1. BN is supported for 3D.
  2. 2D BN that uses exactly the same kernel (./bin/MIOpenDriver bnorm -n 1 -c 64 -W 128 -H 16384 -m 1 --forw 1 -b 0 -r 1) passes validation without any issues.

Now investigating a 3D-specific validation issue that may be related to the order of reference computations (hypothesis).
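
The 2D config in item 2 covers the same number of elements as the failing 3D config, with depth folded into height. A minimal sketch of that folding follows. `fold_3d_to_2d` is a hypothetical helper used only for illustration, not the library's code.

```python
def fold_3d_to_2d(d, h, w):
    """Fold depth into height: (D, H, W) -> (D*H, W). Hypothetical helper."""
    return d * h, w

# Failing 3D config: -D 128 -H 128 -W 128
h2, w2 = fold_3d_to_2d(128, 128, 128)

print(h2, w2)                      # 16384 128, i.e. -H 16384 -W 128
print(h2 * w2 == 128 * 128 * 128)  # True: same element count per channel
```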

@atamazov
Contributor

atamazov commented Feb 15, 2023

Confirmed that the reason is the order of computations. The library computes BN as if it is 2D, while the driver computes the reference on GPU as 3D (triple nested loop). This leads to a substantially different order of computations. I am going to implement a fix that eliminates this difference.

Alternatively, we can increase tolerance for 3D BN in the driver (depending on depth).
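
To illustrate why the order matters, here is a minimal sketch of one float32 accumulation done in two orders. It is not the driver or kernel code: the partial-sum structure of `sum_nested` is an assumption chosen to make the reordering visible, and `f32`, `sum_flat` and `sum_nested` are hypothetical names. float32 is emulated with the standard `struct` module.

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_flat(values):
    """One float32 running sum over all elements (2D-like order)."""
    acc = 0.0
    for v in values:
        acc = f32(acc + v)
    return acc

def sum_nested(values, d, h, w):
    """Triple nested loop with float32 partial sums per row and per slice."""
    total = 0.0
    for i in range(d):
        slice_sum = 0.0
        for j in range(h):
            row_sum = 0.0
            for k in range(w):
                row_sum = f32(row_sum + values[(i * h + j) * w + k])
            slice_sum = f32(slice_sum + row_sum)
        total = f32(total + slice_sum)
    return total

# One large value followed by small ones: 2**24 + 1 is not representable
# in float32, so adding 1.0 to the large running sum is lost each time.
values = [16777216.0] + [1.0] * 7   # exact sum is 16777223

print(sum_flat(values))             # 16777216.0
print(sum_nested(values, 2, 2, 2))  # 16777222.0
```

Both results are valid float32 sums of the same data. They differ only because the small terms are combined before or after they meet the large one, and the same applies to the sums that feed the mean and variance.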

@atamazov
Contributor

atamazov commented Feb 15, 2023

How the variance error changes depending on D (same tensor size, 8 MiB per channel):

# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 128 -H 128 -W 128 -m 1 --forw 1 -b 0 -r 1
MIOpenDriver bnorm -n 1 -c 64 -D 128 -H 128 -W 128 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification FAILED on running variance: 0.00683977
Forward batch norm verification FAILED on output: 0.0077077
# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 64 -H 256 -W 128 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification FAILED on running variance: 0.00516882
Forward batch norm verification FAILED on output: 0.00856796
# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 32 -H 256 -W 256 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification FAILED on running variance: 0.00355585
Forward batch norm verification FAILED on output: 0.0155435
# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 16 -H 512 -W 256 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification FAILED on running variance: 0.00262365
Forward batch norm verification FAILED on output: 0.0272853
# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 8 -H 512 -W 512 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification FAILED on running variance: 0.0017123
Forward batch norm verification FAILED on output: 0.0481208
# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 4 -H 1024 -W 512 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification FAILED on running variance: 0.0012078
Forward batch norm verification FAILED on output: 0.0793092
# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 2 -H 1024 -W 1024 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification FAILED on running variance: 0.000715023
Forward batch norm verification FAILED on output: 0.120932
# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 1 -H 2048 -W 1024 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification passed on running variance
Forward batch norm verification passed on output
Forward Batch Norm Verifies on CPU and GPU.
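
The trend in these runs can be tabulated with a short script. The numbers are copied from the log above. The script itself is only an illustrative sketch and is not part of the driver.

```python
# (D, H, W, running-variance error, output error) for the failing runs above,
# listed from D=128 down to D=2. The D=1 run passed and printed no error.
runs = [
    (128, 128, 128, 0.00683977, 0.0077077),
    (64, 256, 128, 0.00516882, 0.00856796),
    (32, 256, 256, 0.00355585, 0.0155435),
    (16, 512, 256, 0.00262365, 0.0272853),
    (8, 512, 512, 0.0017123, 0.0481208),
    (4, 1024, 512, 0.0012078, 0.0793092),
    (2, 1024, 1024, 0.000715023, 0.120932),
]

# Every config has the same number of elements per channel.
same_size = all(d * h * w == 2**21 for d, h, w, _, _ in runs)

var_errs = [v for _, _, _, v, _ in runs]
out_errs = [o for _, _, _, _, o in runs]

# As D decreases, the variance error shrinks while the output error grows.
var_shrinks = all(a > b for a, b in zip(var_errs, var_errs[1:]))
out_grows = all(a < b for a, b in zip(out_errs, out_errs[1:]))

print(same_size, var_shrinks, out_grows)  # True True True
```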

atamazov added a commit to atamazov/MIOpen that referenced this issue Feb 15, 2023
… output (resolves issue ROCm#1974). Print errors when verification passed.
@atamazov
Contributor

Fix (#1983) is ready for review/testing.

junliume pushed a commit that referenced this issue Mar 8, 2023
* driver-bnorm-fixes(01) Fixed `--forw 2` for bnormfp16

* driver-bnorm-fixes(02) Changed the order of computations of reference output (resolves issue #1974). Print errors when verification passed.
alexandraBara pushed a commit that referenced this issue Mar 14, 2023
* driver-bnorm-fixes(01) Fixed `--forw 2` for bnormfp16

* driver-bnorm-fixes(02) Changed the order of computations of reference output (resolves issue #1974). Print errors when verification passed.