forked from microsoft/onnxruntime
Commit
Enable parallel output reordering in MlasReorderOutputNchw() (microsoft#13643)

### Description

This PR speeds up the output reordering operation (as implemented in [MlasReorderOutputNchw](https://github.com/microsoft/onnxruntime/blob/9954454c65086c49b7c00f83b23ada76975f3546/onnxruntime/core/mlas/lib/reorder.cpp#L400)) by replacing the sequential implementation with a parallelized one. The parallelization uses the existing [TryBatchParallelFor](https://github.com/microsoft/onnxruntime/blob/9954454c65086c49b7c00f83b23ada76975f3546/include/onnxruntime/core/platform/threadpool.h#L284) construct.

### Motivation and Context

The output reordering operation is frequently executed in image processing models. Its implementation is easy to parallelize and therefore runs faster on a multi-core machine.

The amount of speedup achieved by this PR depends on the actual input. The table below summarizes the results of some of the experiments I conducted on a 16-core VM running on an AMD EPYC 7742 64-core processor. The experiment is based on the existing [unit test](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/mlas/unittest/test_reorder_output.cpp) for the output reordering operation.

The first column gives the shape of the output as BatchCount:Channels:Height:Width, and the other columns give the latency (in µs, averaged over 100 runs) for the tested variants. Specifically, I compare the (sequential) baseline (second column) with the (parallelized) variants, each using 1, 2, 4, 8 or 16 worker threads (as specified in [the constructor to the threadpool object](https://github.com/microsoft/onnxruntime/blob/9954454c65086c49b7c00f83b23ada76975f3546/onnxruntime/test/mlas/unittest/test_main.cpp#L12)). The numbers in parentheses give the speedup over the baseline.
| Input | Baseline | 1 Thread | 2 Threads | 4 Threads | 8 Threads | 16 Threads |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| 1:1:112:112 | 20.8 | 21.5 (x0.97) | 21.9 (x0.95) | 22.2 (x0.94) | 22.5 (x0.92) | 23.0 (x0.90) |
| 1:128:160:84 | 540.4 | 712.5 (x0.76) | 404.0 (x1.34) | 327.8 (x1.65) | 377.9 (x1.43) | 371.8 (x1.45) |
| 13:240:4:314 | 1484.0 | 1851.1 (x0.80) | 1080.9 (x1.37) | 570.2 (x2.60) | 531.8 (x2.79) | 511.2 (x2.90) |
| 13:96:4:314 | 471.0 | 679.9 (x0.69) | 427.2 (x1.10) | 372.1 (x1.27) | 445.5 (x1.06) | 428.5 (x1.10) |
| 1:64:320:168 | 1215.1 | 1497.8 (x0.81) | 863.8 (x1.41) | 456.7 (x2.66) | 435.7 (x2.79) | 462.5 (x2.63) |
| 30:240:4:140 | 1711.5 | 2181.4 (x0.78) | 1182.6 (x1.45) | 657.4 (x2.60) | 592.5 (x2.89) | 578.0 (x2.96) |
| 30:336:4:140 | 2432.5 | 3039.2 (x0.80) | 1695.6 (x1.43) | 920.7 (x2.64) | 817.1 (x2.98) | 819.2 (x2.97) |

The slowdown between the baseline and the variant using just one worker thread can be attributed to the overhead of invoking the reordering loop as a functor in TryBatchParallelFor. For all but the smallest shape (1:1:112:112), this overhead is outweighed by the speedup from parallel processing once two or more worker threads are used.
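To illustrate the general idea of the change, here is a minimal, self-contained sketch of a parallel output reordering loop. It is not the code from this PR: `BatchParallelFor` is a hypothetical stand-in built on `std::thread` rather than the ONNX Runtime thread pool, and the blocked input layout `[N][C/block][H][W][block]`, the function name `ReorderOutputNchw` and the requirement that the channel count is a multiple of `block` are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a thread-pool batch loop (not the ONNX Runtime API):
// splits [0, total) into contiguous chunks and runs fn(begin, end) on each chunk.
void BatchParallelFor(std::size_t total, std::size_t num_threads,
                      const std::function<void(std::size_t, std::size_t)>& fn) {
    num_threads = std::max<std::size_t>(1, std::min(num_threads, total));
    if (num_threads == 1) {
        fn(0, total);  // Sequential path: no threads are spawned.
        return;
    }
    const std::size_t chunk = (total + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < total; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, total);
        workers.emplace_back(fn, begin, end);
    }
    for (auto& worker : workers) {
        worker.join();
    }
}

// Assumed layouts: src is blocked as [N][C/block][H][W][block],
// dst is plain [N][C][H][W]. c is assumed to be a multiple of block.
void ReorderOutputNchw(const float* src, float* dst, std::size_t n, std::size_t c,
                       std::size_t h, std::size_t w, std::size_t block,
                       std::size_t num_threads) {
    const std::size_t spatial = h * w;
    // One work item per (batch, channel) output plane. Planes never overlap,
    // so the workers need no synchronization.
    BatchParallelFor(n * c, num_threads, [=](std::size_t begin, std::size_t end) {
        for (std::size_t plane = begin; plane < end; ++plane) {
            const std::size_t batch = plane / c;
            const std::size_t channel = plane % c;
            // Start of this channel's lane inside its channel block.
            const float* in = src +
                (batch * c + (channel / block) * block) * spatial + channel % block;
            float* out = dst + plane * spatial;
            for (std::size_t i = 0; i < spatial; ++i) {
                out[i] = in[i * block];  // Strided gather from the blocked input.
            }
        }
    });
}
```

Calling `ReorderOutputNchw(src, dst, n, c, h, w, block, 1)` runs the loop on the calling thread, while a larger last argument splits the output planes across that many threads. Spawning threads per call is far more expensive than dispatching to a persistent pool, so this sketch only illustrates the work partitioning and says nothing about the latencies reported above.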
1 parent 8a7e605, commit 061b9fd
Showing 6 changed files with 164 additions and 60 deletions.