[OCL] Batchnorm verification issues with some configs #1974
Comments
I am afraid the reason is not address computation. The size of the input tensor buffer is 0.5 GiB, and 32-bit signed address computations can handle buffers of up to 2 GiB.
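The arithmetic behind that statement can be checked with a short sketch. The shape below (1x64x128x128x128, fp32) is an assumption: it is one shape that yields exactly 0.5 GiB and it matches a driver command quoted in this thread. `tensor_bytes` is a hypothetical helper, not part of MIOpen.

```python
# Hypothetical helper: byte size of a dense NCDHW tensor (fp32 assumed by default).
def tensor_bytes(n, c, d, h, w, elem_size=4):
    return n * c * d * h * w * elem_size

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold

size = tensor_bytes(1, 64, 128, 128, 128)
largest_offset = size - 1  # byte offset of the last byte in the buffer

print(size / 2**30)                 # buffer size in GiB
print(largest_offset <= INT32_MAX)  # True when every byte offset fits in int32
```

Under these assumptions the largest byte offset is well below the signed 32-bit limit, which is consistent with ruling out address overflow as the cause.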
@zjing14 @atamazov yes, and I am afraid the numerical instability has been there for a while.
Now investigating a 3D-specific validation issue related to the order of reference computations (hypothesis).
Confirmed that the cause is the order of computations. The library computes BN as if the tensor were 2D, while the driver computes the reference on GPU as 3D (triple nested loop). This leads to a substantially different order of computations. I am going to implement a fix that eliminates this difference. Alternatively, we can increase the tolerance for 3D BN in the driver (depending on
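A minimal sketch of why the order of an fp32 reduction matters. This is not the MIOpen kernel or the driver's reference code; `f32`, `sum_flat` and `sum_nested` are hypothetical names, and the input is hand-picked so that the effect is exact and easy to follow. It emulates float32 accumulation with the standard `struct` module and sums the same values in a flat order and in a nested (per-slice partial sums) order.

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_flat(slices):
    """One running fp32 accumulator over all elements."""
    acc = 0.0
    for s in slices:
        for v in s:
            acc = f32(acc + v)
    return acc

def sum_nested(slices):
    """An fp32 partial sum per slice, then an fp32 sum of the partials."""
    total = 0.0
    for s in slices:
        partial = 0.0
        for v in s:
            partial = f32(partial + v)
        total = f32(total + partial)
    return total

e = 2.0 ** -25  # below half an ulp of 1.0 in float32, so 1.0 + e rounds to 1.0
slices = [[1.0, e, e, e], [e, e, e, e]]

flat = sum_flat(slices)      # every small term is absorbed by the large accumulator
nested = sum_nested(slices)  # the second slice sums to 2**-23 before meeting 1.0

print(flat, nested, flat == nested)
```

The two orders give different fp32 results for identical inputs. In a variance computation over millions of elements, differences of this kind can accumulate, which fits the explanation given above for the mismatch between the library and the reference.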
How variance error changes depending on D (same tensor size, 8MiB):
# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 128 -H 128 -W 128 -m 1 --forw 1 -b 0 -r 1
MIOpenDriver bnorm -n 1 -c 64 -D 128 -H 128 -W 128 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification FAILED on running variance: 0.00683977
Forward batch norm verification FAILED on output: 0.0077077
# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 64 -H 256 -W 128 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification FAILED on running variance: 0.00516882
Forward batch norm verification FAILED on output: 0.00856796
# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 32 -H 256 -W 256 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification FAILED on running variance: 0.00355585
Forward batch norm verification FAILED on output: 0.0155435
# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 16 -H 512 -W 256 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification FAILED on running variance: 0.00262365
Forward batch norm verification FAILED on output: 0.0272853
# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 8 -H 512 -W 512 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification FAILED on running variance: 0.0017123
Forward batch norm verification FAILED on output: 0.0481208
# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 4 -H 1024 -W 512 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification FAILED on running variance: 0.0012078
Forward batch norm verification FAILED on output: 0.0793092
# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 2 -H 1024 -W 1024 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification FAILED on running variance: 0.000715023
Forward batch norm verification FAILED on output: 0.120932
# ./bin/MIOpenDriver bnorm -n 1 -c 64 -D 1 -H 2048 -W 1024 -m 1 --forw 1 -b 0 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification passed on running variance
Forward batch norm verification passed on output
Forward Batch Norm Verifies on CPU and GPU.
Fix (#1983) is ready for review/testing.
* driver-bnorm-fixes(01) Fixed `--forw 2` for bnormfp16
* driver-bnorm-fixes(02) Changed the order of computations of reference output (resolves issue #1974). Print errors when verification passed.
Several batch norm configs have verification issues in the forward direction (?):
The above are problematic configs; a typical issue looks like:
The file where the issues appear to originate:
MIOpenBatchNormFwdTrainSpatial.cl.