
[tensorflow-pytorch-aarch64--r22.09] TensorFlow docker images are broken #147

Open
snadampal opened this issue Sep 19, 2022 · 2 comments
Labels: bug (Something isn't working)

snadampal (Contributor) commented Sep 19, 2022

Issue Description
The TensorFlow docker images built from the r22.09 tag with oneDNN/ACL, as well as those available on Docker Hub (https://hub.docker.com/r/armswdev/tensorflow-arm-neoverse, tag r22.09-tf-2.10.0-onednn-acl or latest), produce incorrect results for the MLPerf resnet50 model.

The last working tag was tensorflow-pytorch-aarch64--r22.08.
The official TF 2.10 wheel works fine, so the issue is with one of the staging patches maintained on top of TF 2.10:
https://github.com/ARM-software/Tool-Solutions/tree/main/docker/tensorflow-aarch64/patches
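One way to isolate the offending patch (a hypothetical sketch, not the project's documented workflow; the patch-file glob and the elided rebuild step are assumptions) is to apply the staging patches one at a time to a vanilla TF 2.10 tree and rerun the accuracy check after each:

```bash
# Hypothetical bisection of the staging patches against vanilla TF 2.10.
git clone --depth 1 --branch v2.10.0 https://github.com/tensorflow/tensorflow.git
git clone --depth 1 https://github.com/ARM-software/Tool-Solutions.git
cd tensorflow
for p in ../Tool-Solutions/docker/tensorflow-aarch64/patches/*.patch; do
    echo "=== applying ${p} ==="
    git apply "${p}"
    # rebuild the wheel and rerun the MLPerf accuracy run here;
    # the first patch after which accuracy drops is the suspect
done
```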

How to reproduce
docker pull armswdev/tensorflow-arm-neoverse

Follow this section to run MLPerf resnet50 inference with the "--accuracy" option:
https://github.com/ARM-software/Tool-Solutions/blob/main/docker/tensorflow-aarch64/examples/README.md#mlcommons-tm-benchmarks
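Putting the steps together (a sketch; the docker run flags are assumptions, and the tag is the one linked above):

```bash
# Pull the affected image; "latest" reproduces the issue as well.
docker pull armswdev/tensorflow-arm-neoverse:r22.09-tf-2.10.0-onednn-acl

# Start an interactive container (flags are an assumption; adjust as needed),
# then follow the MLCommons section of the examples README linked above.
docker run -it --rm armswdev/tensorflow-arm-neoverse:r22.09-tf-2.10.0-onednn-acl
```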

snadampal changed the title from "[tensorflow-pytorch-aarch64--r22.09] Data accuracy issues with TensorFlow neoverse docker images" to "[tensorflow-pytorch-aarch64--r22.09] Data accuracy issues with TensorFlow docker images" Sep 19, 2022
snadampal changed the title from "[tensorflow-pytorch-aarch64--r22.09] Data accuracy issues with TensorFlow docker images" to "[tensorflow-pytorch-aarch64--r22.09] TensorFlow docker images are broken" Sep 19, 2022
nSircombe (Contributor) commented
Thanks for the report @snadampal, we'll look into it.
Our builds include a few accuracy tests, covering the Python examples, the C++ examples, and the MLCommons examples. However, the MLCommons accuracy test runs with --count=1 and expects 100% accuracy, so it exercises only a single sample and can miss a broader accuracy regression.

Could you provide more details of the failure? Run lines, logs, etc., and the environment (what platform, whether 'fast maths' is enabled, and so on).

Many thanks.
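For context, the difference between that build-time check and a full accuracy run might look like this (a sketch; it assumes run_local.sh forwards --count to the MLPerf harness):

```bash
# Build-time smoke check: a single sample, expected to classify correctly.
./run_local.sh tf resnet50 cpu --accuracy --count=1

# Full accuracy run over the whole query set, where a regression shows up.
./run_local.sh tf resnet50 cpu --accuracy
```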

nSircombe added the "bug" label Sep 19, 2022
snadampal (Contributor, Author) commented

Hi @nSircombe, the issue is observed only with fast math enabled, i.e. with the bf16 kernels:

export DNNL_DEFAULT_FPMATH_MODE=BF16

Run command (from MLCommons/inference/vision/classification_detection):
./run_local.sh tf resnet50 cpu --accuracy

Failing scenario log: (r22.09 tag)
TestScenario.SingleStream qps=68.64, mean=0.0144, time=7.284, acc=2.600%, queries=500, tiles=50.0:0.0144,80.0:0.0145,90.0:0.0146,95.0:0.0147,99.0:0.0148,99.9:0.0151

Passing scenario log: (r22.08 tag)
TestScenario.SingleStream qps=66.96, mean=0.0148, time=7.467, acc=76.200%, queries=500, tiles=50.0:0.0147,80.0:0.0149,90.0:0.0150,95.0:0.0151,99.0:0.0153,99.9:0.0159
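A quick A/B toggle of the fast-math mode (a sketch built from the run line above) makes the comparison explicit; per the report, only the bf16 run regresses:

```bash
cd MLCommons/inference/vision/classification_detection

# fp32 kernels: expected to match the r22.08 result (~76% accuracy)
unset DNNL_DEFAULT_FPMATH_MODE
./run_local.sh tf resnet50 cpu --accuracy

# bf16 fast-math kernels: reproduces the failure (acc=2.600% on r22.09)
export DNNL_DEFAULT_FPMATH_MODE=BF16
./run_local.sh tf resnet50 cpu --accuracy
```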
