ppc64le support #2921

Closed
smuzaffar opened this issue Jan 28, 2020 · 25 comments
Labels
contributions welcome external contributions welcome

Comments

@smuzaffar
Contributor

Is your feature request related to a problem? Please describe.
We build our software for x86, aarch64 and ppc64le, and our developers would like to use onnxruntime. However, it does not build on ppc64le, so we cannot integrate it.

System information

  • CentOS 7 on ppc64le arch with onnxversion 1.0.0

Describe the solution you'd like
We would like to build and use onnxruntime on PPC64 archs.

Describe alternatives you've considered
Nothing yet

@snnn
Member

snnn commented Jan 28, 2020

It's hard for us to make progress on this because our team doesn't have any ppc64le hardware that can be used for development and testing.

@jywu-msft jywu-msft added the contributions welcome external contributions welcome label Jan 28, 2020
@snnn
Member

snnn commented Feb 20, 2020

It seems the new manylinux2014 Docker images can help us solve this.

@smuzaffar
Contributor Author

@snnn, we were able to build onnxruntime for ppc64le using the changes in cms-externals#4, but some of our tests failed to produce identical results. One of the onnxruntime tests also failed to run. @mrodozov, do you remember which test was failing?

Have you tried using proot and qemu to emulate powerpc? We use them to install ppc64le rpm packages on our x86_64 server.

@snnn
Member

snnn commented Feb 20, 2020

@tracysh Could you please take a look at cms-externals#4 ?

@mrodozov

mrodozov commented Feb 20, 2020

Turn this on:
-Donnxruntime_BUILD_UNIT_TESTS=ON
The failing test is then:
onnxruntime_mlas_test
which is a validation test, to my understanding.
We were trying to bring this implementation to powerpc:
https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/core/mlas/lib/arm/sgemmc.cpp#L22
The failing test is SGEMM, with output like:

mismatch TransA=111, TransB=111, M=1, N=1, K=1, alpha=1.000000, beta=0.000000  0.000000 529.000000!
mismatch TransA=111, TransB=112, M=1, N=1, K=1, alpha=1.000000, beta=0.000000  0.000000 529.000000!
mismatch TransA=112, TransB=111, M=1, N=1, K=1, alpha=1.000000, beta=0.000000  0.000000 529.000000!
mismatch TransA=112, TransB=112, M=1, N=1, K=1, alpha=1.000000, beta=0.000000  0.000000 529.000000!
mismatch TransA=111, TransB=111, M=2, N=2, K=2, alpha=1.000000, beta=0.000000  0.000000 991.000000!
mismatch TransA=111, TransB=111, M=2, N=2, K=2, alpha=1.000000, beta=0.000000  0.000000 946.000000!
mismatch TransA=111, TransB=111, M=2, N=2, K=2, alpha=1.000000, beta=0.000000  0.000000 903.000000!
mismatch TransA=111, TransB=111, M=2, N=2, K=2, alpha=1.000000, beta=0.000000  0.000000 862.000000!
mismatch TransA=111, TransB=112, M=2, N=2, K=2, alpha=1.000000, beta=0.000000  0.000000 1013.000000!
mismatch TransA=111, TransB=112, M=2, N=2, K=2, alpha=1.000000, beta=0.000000  0.000000 923.000000!
mismatch TransA=111, TransB=112, M=2, N=2, K=2, alpha=1.000000, beta=0.000000  0.000000 923.000000!
mismatch TransA=111, TransB=112, M=2, N=2, K=2, alpha=1.000000, beta=0.000000  0.000000 841.000000!
mismatch TransA=112, TransB=111, M=2, N=2, K=2, alpha=1.000000, beta=0.000000  0.000000 970.000000!
mismatch TransA=112, TransB=111, M=2, N=2, K=2, alpha=1.000000, beta=0.000000  0.000000 926.000000!
mismatch TransA=112, TransB=111, M=2, N=2, K=2, alpha=1.000000, beta=0.000000  0.000000 926.000000!
mismatch TransA=112, TransB=111, M=2, N=2, K=2, alpha=1.000000, beta=0.000000  0.000000 884.000000!
mismatch TransA=112, TransB=112, M=2, N=2, K=2, alpha=1.000000, beta=0.000000  0.000000 991.000000!
mismatch TransA=112, TransB=112, M=2, N=2, K=2, alpha=1.000000, beta=0.000000  0.000000 903.000000!
mismatch TransA=112, TransB=112, M=2, N=2, K=2, alpha=1.000000, beta=0.000000  0.000000 946.000000!
mismatch TransA=112, TransB=112, M=2, N=2, K=2, alpha=1.000000, beta=0.000000  0.000000 862.000000!
mismatch TransA=111, TransB=111, M=3, N=3, K=3, alpha=1.000000, beta=0.000000  1194.000000 1326.000000!

The other tests:

Conv2D tests.
Pool2D tests.
Pool3D tests.
Activation tests.

run fine (at least, no mismatches are printed).

@tracysh
Contributor

tracysh commented Feb 20, 2020

That's strange, because the Conv2D tests build on the GEMM routine. MlasFgemmTest::ExecuteShort first loops over small GEMMs from 1-15 which stresses some of the partial vector stores. Do the tests after this, which are multiples of 16, work okay?
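
For readers following the mismatch output: the TransA/TransB values 111 and 112 are the CBLAS codes for "no transpose" and "transpose". The kind of check described here can be sketched as a comparison against a naive reference GEMM over small shapes. This is only an illustration under stated assumptions: the names `ReferenceSgemm`, `CountMismatches`, `SgemmFn`, `kNoTrans` and `kTrans` are hypothetical and this is not the actual MLAS test code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CBLAS transpose codes, as printed in the mismatch lines (111 / 112).
enum Transpose { kNoTrans = 111, kTrans = 112 };

// Naive row-major reference: C = alpha * op(A) * op(B) + beta * C.
// op(A) is M x K, op(B) is K x N, C is M x N with leading dimension N.
void ReferenceSgemm(Transpose transA, Transpose transB,
                    std::size_t M, std::size_t N, std::size_t K,
                    float alpha, const float* A, const float* B,
                    float beta, float* C) {
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                // Stored A is M x K (no transpose) or K x M (transpose).
                const float a = (transA == kNoTrans) ? A[m * K + k] : A[k * M + m];
                // Stored B is K x N (no transpose) or N x K (transpose).
                const float b = (transB == kNoTrans) ? B[k * N + n] : B[n * K + k];
                sum += a * b;
            }
            // With beta == 0 the old contents of C are ignored entirely.
            C[m * N + n] = (beta == 0.0f) ? alpha * sum
                                          : alpha * sum + beta * C[m * N + n];
        }
    }
}

// Signature shared by the reference and any kernel under test (hypothetical).
using SgemmFn = void (*)(Transpose, Transpose, std::size_t, std::size_t, std::size_t,
                         float, const float*, const float*, float, float*);

// Runs square problems of size 1..maxDim for all four transpose combinations
// and counts output elements where `kernel` differs from the reference.
std::size_t CountMismatches(SgemmFn kernel, std::size_t maxDim) {
    std::size_t mismatches = 0;
    const Transpose modes[] = {kNoTrans, kTrans};
    for (std::size_t d = 1; d <= maxDim; ++d) {
        std::vector<float> A(d * d), B(d * d);
        for (std::size_t i = 0; i < d * d; ++i) {
            // Small integers keep every product and sum exact in float.
            A[i] = static_cast<float>(i % 7) - 3.0f;
            B[i] = static_cast<float>(i % 5) - 2.0f;
        }
        for (Transpose ta : modes) {
            for (Transpose tb : modes) {
                std::vector<float> expected(d * d, 0.0f), actual(d * d, -1.0f);
                ReferenceSgemm(ta, tb, d, d, d, 1.0f, A.data(), B.data(), 0.0f, expected.data());
                kernel(ta, tb, d, d, d, 1.0f, A.data(), B.data(), 0.0f, actual.data());
                for (std::size_t i = 0; i < d * d; ++i) {
                    if (expected[i] != actual[i]) ++mismatches;
                }
            }
        }
    }
    return mismatches;
}
```

Calling `CountMismatches(kernel, 15)` covers the sizes 1 to 15 mentioned above, which are the ones that do not fill a whole number of vector registers and therefore exercise partial stores in a vectorized kernel.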

@mrodozov

This is the full unit test output:
unit_test.txt

@tracysh
Contributor

tracysh commented Feb 27, 2020

Update: I was curious about the latest state of Power ISA (I worked on Xbox 360, a PowerPC 2.02 implementation), so I updated MLAS to directly use VSX intrinsics. I verified the GEMM using gcc 7.4 to cross compile then run from qemu. I'll get my changes into a branch you can try on your end in a few days.

@smuzaffar
Contributor Author

We will be happy to test it as soon as it is available. Many thanks for looking into this.

@smuzaffar
Contributor Author

@tracysh, is there an update that we can test?

@tracysh
Contributor

tracysh commented Mar 13, 2020

I'm going to need a few more days to clean this up. Just curious, which POWER versions are you using this on?

@smuzaffar
Contributor Author

We are using POWER8:

> lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    8
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Model:                 2.0 (pvr 004d 0200)
Model name:            POWER8 (raw), altivec supported
CPU max MHz:           3857.0000
CPU min MHz:           2061.0000
L1d cache:             64K
L1i cache:             32K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     64-127
NUMA node8 CPU(s):     0-63

@slava77

slava77 commented Apr 7, 2020

> I'm going to need a few more days to clean this up.

I'm curious whether there has been an update on this issue.
Please let me know.
Thank you.

@tracysh
Contributor

tracysh commented Apr 9, 2020

Apologies for the delay on this. I've put the changes into the branch tracysh/mlas_powerpc. With this, I was able to build with gcc 7.5 and run under qemu. I ran onnxruntime_mlas_test and was able to run a subset of the GEMM tests. There are more GEMM tests that I usually run to validate big changes, but qemu was too slow to tackle those.

I was also able to run through onnxruntime_test_all (run as part of the build), but there was a MathSinFloat test that uses Eigen that was failing. I'm curious what happens on real hardware to know if this is worth investigating further.

I was also able to point onnx_test_runner at resnet50 and bertsquad from the onnx model zoo and both passed successfully.

I have no idea how performant the SGEMM might be. It may be possible to scale up the GEMM further, but I'll need some help from you to measure on real hardware. Also, I want to make a few changes to onnxruntime_mlas_test to test a few more things out.

Let me know how it goes.

@smuzaffar
Contributor Author

Thanks @tracysh , we are testing your changes now and will let you know soon.

@mrodozov

mrodozov commented Apr 9, 2020

Hello again,
the code now builds on our powerpc machine,
which is different from the previous one:

lscpu 
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    8
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Model:                 1.0 (pvr 004c 0100)
Model name:            POWER8NVL (raw), altivec supported
CPU max MHz:           4023.0000
CPU min MHz:           2061.0000
L1d cache:             64K
L1i cache:             32K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-63
NUMA node1 CPU(s):     64-127

when I run:

onnxruntime_mlas_test
SGEMM tests.
Conv2D tests.
Pool2D tests.
Pool3D tests.
Done.
SGEMM tests.
Conv2D tests.
Pool2D tests.
Pool3D tests.
Done.
Activation tests.
mismatch activation kind=3 i=2 value=bf800000 expected=7ff00002
mismatch activation kind=3 i=3 value=bf800000 expected=fff00002
mismatch activation kind=4 i=2 value=b3800000 expected=7ff00002
mismatch activation kind=4 i=3 value=b3800000 expected=fff00002

./onnxruntime_shared_lib_test 
[==========] Running 21 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 16 tests from CApiTest
[ RUN      ] CApiTest.dim_param
[       OK ] CApiTest.dim_param (19 ms)
[ RUN      ] CApiTest.custom_op_handler
Running custom op inference
Running simple inference with default provider
[       OK ] CApiTest.custom_op_handler (16 ms)
[ RUN      ] CApiTest.create_tensor
[       OK ] CApiTest.create_tensor (0 ms)
[ RUN      ] CApiTest.create_tensor_with_data
[       OK ] CApiTest.create_tensor_with_data (0 ms)
[ RUN      ] CApiTest.override_initializer
[       OK ] CApiTest.override_initializer (16 ms)
[ RUN      ] CApiTest.end_profiling
[       OK ] CApiTest.end_profiling (31 ms)
[ RUN      ] CApiTest.model_metadata
[       OK ] CApiTest.model_metadata (15 ms)
[ RUN      ] CApiTest.session_options_graph_optimization_level
[       OK ] CApiTest.session_options_graph_optimization_level (0 ms)
[ RUN      ] CApiTest.run_options
[       OK ] CApiTest.run_options (0 ms)
[ RUN      ] CApiTest.allocation_info
[       OK ] CApiTest.allocation_info (0 ms)
[ RUN      ] CApiTest.DefaultAllocator
[       OK ] CApiTest.DefaultAllocator (0 ms)
[ RUN      ] CApiTest.CreateGetVectorOfMapsInt64Float
[       OK ] CApiTest.CreateGetVectorOfMapsInt64Float (0 ms)
[ RUN      ] CApiTest.CreateGetVectorOfMapsStringFloat
[       OK ] CApiTest.CreateGetVectorOfMapsStringFloat (0 ms)
[ RUN      ] CApiTest.CreateGetSeqTensors
[       OK ] CApiTest.CreateGetSeqTensors (0 ms)
[ RUN      ] CApiTest.CreateGetSeqStringTensors
[       OK ] CApiTest.CreateGetSeqStringTensors (0 ms)
[ RUN      ] CApiTest.model_from_array
[       OK ] CApiTest.model_from_array (16 ms)
[----------] 16 tests from CApiTest (113 ms total)

[----------] 5 tests from CApiTestWithProviders/CApiTestWithProvider
[ RUN      ] CApiTestWithProviders/CApiTestWithProvider.simple/0
Running simple inference with default provider
[       OK ] CApiTestWithProviders/CApiTestWithProvider.simple/0 (15 ms)
[ RUN      ] CApiTestWithProviders/CApiTestWithProvider.simple/1
[       OK ] CApiTestWithProviders/CApiTestWithProvider.simple/1 (0 ms)
[ RUN      ] CApiTestWithProviders/CApiTestWithProvider.simple/2
[       OK ] CApiTestWithProviders/CApiTestWithProvider.simple/2 (0 ms)
[ RUN      ] CApiTestWithProviders/CApiTestWithProvider.simple/3
[       OK ] CApiTestWithProviders/CApiTestWithProvider.simple/3 (0 ms)
[ RUN      ] CApiTestWithProviders/CApiTestWithProvider.simple/4
Running simple inference with default provider
[       OK ] CApiTestWithProviders/CApiTestWithProvider.simple/4 (15 ms)
[----------] 5 tests from CApiTestWithProviders/CApiTestWithProvider (31 ms total)

[----------] Global test environment tear-down
[==========] 21 tests from 2 test suites ran. (144 ms total)
[  PASSED  ] 21 tests.

  YOU HAVE 1 DISABLED TEST

./onnxruntime_global_thread_pools_test 
[==========] Running 15 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 15 tests from CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider
[ RUN      ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/0
Running simple inference with default provider
2020-04-09 12:10:09.920787225 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:09.923644854 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:09.923682633 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:09.926045591 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:09.927908098 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:09.928285204 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:09.930865047 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:09.930901035 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:09.930963711 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:09.931095141 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:09.931800325 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:09.932915858 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:09.933515767 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
2020-04-09 12:10:09.933594683 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
[       OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/0 (46 ms)
[ RUN      ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/1
[       OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/1 (0 ms)
[ RUN      ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/2
[       OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/2 (1 ms)
[ RUN      ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/3
[       OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/3 (0 ms)
[ RUN      ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/4
Running simple inference with default provider
2020-04-09 12:10:09.967065780 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:09.969736207 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:09.969773430 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:09.970006107 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:09.971855079 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:09.972233963 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:09.974794212 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:09.974830396 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:09.974893681 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:09.974995939 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:09.975696547 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:09.977450182 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:09.977908551 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
2020-04-09 12:10:09.977978610 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
[       OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/4 (39 ms)
[ RUN      ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/0
Running simple inference with default provider
2020-04-09 12:10:10.006238624 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:10.008900539 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:10.008938569 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:10.009153048 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.011013542 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.011391594 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.013950101 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:10.013986316 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:10.014049687 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:10.014151584 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:10.014849628 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:10.016065118 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:10.016520071 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
Running simple inference with default provider
2020-04-09 12:10:10.019779584 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:10.022475948 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:10.022514226 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:10.022723799 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.024586024 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.024964934 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.027514511 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:10.027550402 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:10.027612250 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:10.027713176 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:10.028405672 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:10.029567600 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:10.030039855 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
2020-04-09 12:10:10.030111410 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
2020-04-09 12:10:10.051622176 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
[       OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/0 (76 ms)
[ RUN      ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/1
[       OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/1 (1 ms)
[ RUN      ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/2
[       OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/2 (0 ms)
[ RUN      ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/3
[       OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/3 (0 ms)
[ RUN      ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/4
Running simple inference with default provider
2020-04-09 12:10:10.083698038 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:10.086331852 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:10.086368899 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:10.086582950 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.088431406 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.088810771 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.091389283 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:10.091425277 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:10.091488425 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:10.091611267 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:10.092309856 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:10.093717814 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:10.094172376 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
Running simple inference with default provider
2020-04-09 12:10:10.097515136 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:10.100225496 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:10.100262135 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:10.100469505 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.102329418 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.102707731 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.105260567 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:10.105296315 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:10.105358030 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:10.105459450 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:10.106151105 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:10.107641782 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:10.108095530 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
2020-04-09 12:10:10.108166743 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
2020-04-09 12:10:10.134985230 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
[       OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/4 (78 ms)
[ RUN      ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/0
Running simple inference with default provider
2020-04-09 12:10:10.161779645 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:10.164416227 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:10.164454129 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:10.164667892 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.166516273 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.166895228 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.169455112 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:10.169490844 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:10.169555170 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:10.169671501 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:10.170383589 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:10.171827361 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:10.172278874 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
2020-04-09 12:10:10.172346424 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
Running simple inference with default provider
2020-04-09 12:10:10.195078460 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:10.197688391 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:10.197725612 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:10.197937578 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.199797713 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.200216561 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.202773912 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:10.202809814 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:10.202872762 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:10.202974787 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:10.203672710 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:10.204626779 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:10.205075086 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
2020-04-09 12:10:10.205141170 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
[       OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/0 (67 ms)
[ RUN      ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/1
[       OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/1 (1 ms)
[ RUN      ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/2
[       OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/2 (0 ms)
[ RUN      ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/3
[       OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/3 (0 ms)
[ RUN      ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/4
Running simple inference with default provider
2020-04-09 12:10:10.229769044 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:10.232439349 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:10.232477255 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:10.232690281 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.234536881 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.234915160 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.237472849 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:10.237509013 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:10.237572490 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:10.237674529 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:10.238372696 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:10.239761337 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:10.240230078 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
2020-04-09 12:10:10.240297610 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
Running simple inference with default provider
2020-04-09 12:10:10.267648379 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:10.270301774 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:10.270338515 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:10.270549998 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.272409207 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.272785896 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.275340095 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:10.275375767 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:10.275439329 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:10.275562638 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:10.276257615 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:10.277270498 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:10.277721931 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
2020-04-09 12:10:10.277787556 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
[       OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/4 (78 ms)
[----------] 15 tests from CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider (387 ms total)

[----------] Global test environment tear-down
[==========] 15 tests from 1 test suite ran. (387 ms total)
[  PASSED  ] 15 tests.

@mrodozov

mrodozov commented Apr 9, 2020

And the result from
onnxruntime_test_all:
test_results_ppc_onnxruntime_test_all.txt

@tracysh
Contributor

tracysh commented Apr 9, 2020

I pushed some new changes to clean up the GEMM kernel templating.

How is performance of the runtime? I'm curious what you see for resnet50 or other test models from the ONNX model zoo. If you download some models + test data from the zoo (https://github.com/onnx/models), you can use onnx_test_runner to verify that the models run. And you can use "onnxruntime_perf_test -e cpu -t 30 path/to/model_and_data" to get a reference time.

Once you have some timing data, can you try updating MlasSgemmKernel in SgemmKernelPower.cpp to see if doing 6 rows improves or degrades performance? GCC seemed to build this and keep everything in registers, but this isn't always faster.

if (CountM >= 6) {
    RowsHandled = MlasSgemmProcessCount<6>(A, B, C, CountK, CountN, lda, ldc, AlphaBroadcast, ZeroMode);

As for the onnxruntime_mlas_test errors, I see the same problem in the ARM64 build. The expected data is based on what is observed with x86/x64.
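
For anyone trying the 6-row experiment, here is a minimal, self-contained sketch of the dispatch pattern. It is not the MLAS code: `ProcessRows`, `DispatchRows` and `Sgemm` are hypothetical stand-ins, the inner loop is a plain scalar loop rather than a vectorized kernel, and the smaller block sizes (4, 2, 1) are assumed for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a row-blocked kernel (NOT the MLAS implementation):
// computes `Rows` rows of C = alpha * A * B and returns how many rows it handled.
// B is assumed to be a dense, row-major CountK x CountN matrix.
template <size_t Rows>
size_t ProcessRows(const float* A, const float* B, float* C, size_t CountK,
                   size_t CountN, size_t lda, size_t ldc, float alpha) {
    for (size_t r = 0; r < Rows; ++r) {
        for (size_t n = 0; n < CountN; ++n) {
            float acc = 0.0f;
            for (size_t k = 0; k < CountK; ++k) {
                acc += A[r * lda + k] * B[k * CountN + n];
            }
            C[r * ldc + n] = alpha * acc;
        }
    }
    return Rows;
}

// Picks the largest row block that fits in the remaining rows.
// The 6-row branch is the experiment; remove it to fall back to 4 rows.
size_t DispatchRows(const float* A, const float* B, float* C, size_t CountM,
                    size_t CountK, size_t CountN, size_t lda, size_t ldc,
                    float alpha) {
    if (CountM >= 6) return ProcessRows<6>(A, B, C, CountK, CountN, lda, ldc, alpha);
    if (CountM >= 4) return ProcessRows<4>(A, B, C, CountK, CountN, lda, ldc, alpha);
    if (CountM >= 2) return ProcessRows<2>(A, B, C, CountK, CountN, lda, ldc, alpha);
    return ProcessRows<1>(A, B, C, CountK, CountN, lda, ldc, alpha);
}

// Drives the dispatcher until all M rows of a dense row-major A (M x K) are done.
void Sgemm(const float* A, const float* B, float* C, size_t M, size_t K,
           size_t N, float alpha) {
    while (M > 0) {
        const size_t handled = DispatchRows(A, B, C, M, K, N, K, N, alpha);
        assert(handled > 0 && handled <= M);
        A += handled * K;
        C += handled * N;
        M -= handled;
    }
}
```

Whether the wider block helps depends on whether the compiler can keep all the accumulators in registers, so it is worth timing both variants with the same model and data.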

@tracysh
Contributor

tracysh commented Apr 16, 2020

@smuzaffar Are there any additional comments on these changes (see my last comment for some questions)? Are you able to run your models successfully with these changes?

@smuzaffar
Contributor Author

@tracysh , we are working on it in cms-sw/cmsdist#5743 . We needed a few fixes on top of v1.2.0 to build it ( https://github.com/cms-externals/onnxruntime/commits/cms/v1.2.0_plus_ppc_update_pb31130 ) . @mrodozov is working on it.

@tracysh
Contributor

tracysh commented Apr 26, 2020

I merged all of the pending Power changes into master.

@tracysh tracysh closed this as completed Apr 26, 2020
@smuzaffar
Contributor Author

Thanks @tracysh , we have integrated this in our software and things look to be in a much better state.

@tracysh
Contributor

tracysh commented May 5, 2020

Hi, @smuzaffar, just checking in: how does the performance of ONNX Runtime compare to the other runtimes you were using on Power? Do these systems have GPUs too that might benefit from using the CUDA support?

@smuzaffar
Contributor Author

@tracysh , x86_64 is our production architecture, so when we migrated to onnxruntime we ran a performance test on x86_64. You can find the performance results here: cms-sw/cmssw#28112 . In short, we saw a 7x gain in the modules where we use onnxruntime.

Unfortunately, we do not have the same exact comparison for Power (i.e. the exact same cmssw with and without onnxruntime). But comparing cmssw from Dec 2019 (which was without onnxruntime) with the latest nightlies, we see a much better gain (this could be due to both onnxruntime and improvements in our code).

CMSSW 2019-12-04 + without ONNXRuntime

TimeReport   0.101799     0.101799     0.101799  pfDeepFlavourJetTagsWithDeepInfo
TimeReport   0.001237     0.001237     0.001237  pfDeepFlavourTagInfosWithDeepInfo
TimeReport   0.009642     0.009642     0.009642  pfMassDecorrelatedDeepBoostedJetTagsAK8WithDeepInfo

CMSSW 2020-05-07 + ONNXRuntime

TimeReport   0.009132     0.009132     0.009132  pfDeepFlavourJetTagsWithDeepInfo
TimeReport   0.001243     0.001243     0.001243  pfDeepFlavourTagInfosWithDeepInfo
TimeReport   0.000803     0.000803     0.000803  pfMassDecorrelatedDeepBoostedJetTagsAK8WithDeepInfo

Although our Power machines have GPUs, we are currently not building with CUDA support. Hopefully we will enable it in the near future and report back the results.
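
For reference, the per-module ratios implied by the two TimeReport listings above can be worked out with a few lines of C++. The timing values are copied verbatim from the listings; the struct and function names are only illustrative.

```cpp
#include <cstdio>

// One module from the two TimeReport listings (values copied verbatim).
struct ModuleTiming {
    const char* name;
    double before;  // CMSSW 2019-12-04, without ONNX Runtime
    double after;   // CMSSW 2020-05-07, with ONNX Runtime
};

// A ratio above 1 means the newer build spends less time in that module.
double Speedup(const ModuleTiming& t) { return t.before / t.after; }

const ModuleTiming kTimings[] = {
    {"pfDeepFlavourJetTagsWithDeepInfo", 0.101799, 0.009132},
    {"pfDeepFlavourTagInfosWithDeepInfo", 0.001237, 0.001243},
    {"pfMassDecorrelatedDeepBoostedJetTagsAK8WithDeepInfo", 0.009642, 0.000803},
};

// Prints one line per module with its before/after ratio.
void PrintSpeedups() {
    for (const ModuleTiming& t : kTimings) {
        std::printf("%-55s %.2fx\n", t.name, Speedup(t));
    }
}
```

The two JetTags modules come out at roughly 11x and 12x, while the TagInfos module is essentially unchanged. As noted above, part of that gain may come from other improvements in our code between the two builds.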

@mrodozov

mrodozov commented May 7, 2020

Hi, @tracysh, there is a comparison between ONNX Runtime and another runtime that measures performance on x86_64; the results are available here:
cms-sw/cmssw#28711
and in the comments the researchers conclude "it depends on the use case", IIRC.
We haven't run a performance comparison on Power yet, but we might, at least to know for ourselves, although we use x86_64 as our production arch and our Arm and Power builds are, let's call it, "a research interest".
Having ONNX Runtime for Power was needed to cover the external package requirements for the PPC build. Yes, we have GPU devices that can benefit from the CUDA support; you can read about that effort here: https://patatrack.web.cern.ch/patatrack/ . Because the direction is for any heavy computation to be executed on GPUs, we are interested in CUDA support on any arch.
