
[tensorflow-pytorch-aarch64--r22.09] TensorFlow docker images are broken #147

Open
snadampal opened this issue Sep 19, 2022 · 2 comments
Labels: bug (Something isn't working)

snadampal (Contributor) commented Sep 19, 2022

Issue Description
The TensorFlow docker images built from the r22.09 tag with oneDNN/ACL, as well as those available on Docker Hub (https://hub.docker.com/r/armswdev/tensorflow-arm-neoverse, tag r22.09-tf-2.10.0-onednn-acl or latest), produce incorrect results for the MLPerf resnet50 model.

The last working tag was tensorflow-pytorch-aarch64--r22.08.
The official TF 2.10 wheel works fine, so the issue is with one of the staging patches maintained on top of TF 2.10:
https://github.com/ARM-software/Tool-Solutions/tree/main/docker/tensorflow-aarch64/patches
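One way to isolate the offending patch (a hypothetical sketch, not the project's documented workflow; the patch-file glob and the elided rebuild step are assumptions) is to apply the staging patches one at a time to a vanilla TF 2.10 tree and rerun the accuracy check after each:

```bash
# Hypothetical bisection of the staging patches against vanilla TF 2.10.
git clone --depth 1 --branch v2.10.0 https://github.com/tensorflow/tensorflow.git
git clone --depth 1 https://github.com/ARM-software/Tool-Solutions.git
cd tensorflow
for p in ../Tool-Solutions/docker/tensorflow-aarch64/patches/*.patch; do
    echo "=== applying ${p} ==="
    git apply "${p}"
    # rebuild the wheel and rerun the MLPerf accuracy run here;
    # the first patch after which accuracy drops is the suspect
done
```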

How to reproduce
docker pull armswdev/tensorflow-arm-neoverse

Follow this section to run MLPerf resnet50 inference with the "--accuracy" option:
https://github.com/ARM-software/Tool-Solutions/blob/main/docker/tensorflow-aarch64/examples/README.md#mlcommons-tm-benchmarks
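Putting the steps together (a sketch; the docker run flags are assumptions, and the tag is the one linked above):

```bash
# Pull the affected image; "latest" reproduces the issue as well.
docker pull armswdev/tensorflow-arm-neoverse:r22.09-tf-2.10.0-onednn-acl

# Start an interactive container (flags are an assumption; adjust as needed),
# then follow the MLCommons section of the examples README linked above.
docker run -it --rm armswdev/tensorflow-arm-neoverse:r22.09-tf-2.10.0-onednn-acl
```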

snadampal changed the title from "[tensorflow-pytorch-aarch64--r22.09] Data accuracy issues with TensorFlow neoverse docker images" to "[tensorflow-pytorch-aarch64--r22.09] Data accuracy issues with TensorFlow docker images" Sep 19, 2022
snadampal changed the title from "[tensorflow-pytorch-aarch64--r22.09] Data accuracy issues with TensorFlow docker images" to "[tensorflow-pytorch-aarch64--r22.09] TensorFlow docker images are broken" Sep 19, 2022
nSircombe (Contributor) commented
Thanks for the report @snadampal, we'll look into it.
Our builds include a few accuracy tests, covering the Python examples, the C++ examples, and the MLCommons examples. However, the MLCommons accuracy test runs with --count=1 and expects 100% accuracy, so it exercises only a single sample and can miss a broader accuracy regression.

Could you provide more details of the failure? Run lines, logs, etc., and the environment (what platform, whether 'fast maths' is enabled, and so on).

Many thanks.
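For context, the difference between that build-time check and a full accuracy run might look like this (a sketch; it assumes run_local.sh forwards --count to the MLPerf harness):

```bash
# Build-time smoke check: a single sample, expected to classify correctly.
./run_local.sh tf resnet50 cpu --accuracy --count=1

# Full accuracy run over the whole query set, where a regression shows up.
./run_local.sh tf resnet50 cpu --accuracy
```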

nSircombe added the "bug" label Sep 19, 2022
snadampal (Contributor, Author) commented

Hi @nSircombe, the issue is observed only with fast math enabled, i.e. with the bf16 kernels:

export DNNL_DEFAULT_FPMATH_MODE=BF16

Run command (from MLCommons/inference/vision/classification_detection):
./run_local.sh tf resnet50 cpu --accuracy

Failing scenario log: (r22.09 tag)
TestScenario.SingleStream qps=68.64, mean=0.0144, time=7.284, acc=2.600%, queries=500, tiles=50.0:0.0144,80.0:0.0145,90.0:0.0146,95.0:0.0147,99.0:0.0148,99.9:0.0151

Passing scenario log: (r22.08 tag)
TestScenario.SingleStream qps=66.96, mean=0.0148, time=7.467, acc=76.200%, queries=500, tiles=50.0:0.0147,80.0:0.0149,90.0:0.0150,95.0:0.0151,99.0:0.0153,99.9:0.0159
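A quick A/B toggle of the fast-math mode (a sketch built from the run line above) makes the comparison explicit; per the report, only the bf16 run regresses:

```bash
cd MLCommons/inference/vision/classification_detection

# fp32 kernels: expected to match the r22.08 result (~76% accuracy)
unset DNNL_DEFAULT_FPMATH_MODE
./run_local.sh tf resnet50 cpu --accuracy

# bf16 fast-math kernels: reproduces the failure (acc=2.600% on r22.09)
export DNNL_DEFAULT_FPMATH_MODE=BF16
./run_local.sh tf resnet50 cpu --accuracy
```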
