From 701728956e84dd2a5c02920abbb007061bb75efe Mon Sep 17 00:00:00 2001
From: ivberg
Date: Mon, 25 Sep 2023 14:36:50 -0700
Subject: [PATCH] Update ORT Performance threading docs (#16995)

---
 .../performance/tune-performance/threading.md | 67 ++++++++++++++-----
 1 file changed, 49 insertions(+), 18 deletions(-)

diff --git a/docs/performance/tune-performance/threading.md b/docs/performance/tune-performance/threading.md
index 522980a55a158..c3ba0691c73ba 100644
--- a/docs/performance/tune-performance/threading.md
+++ b/docs/performance/tune-performance/threading.md
@@ -14,58 +14,89 @@ nav_order: 3
 {:toc}
 
-For the default CPU execution provider, you can use the following knobs in the Python API to control the thread number:
+For the default CPU execution provider, sensible defaults are provided for fast inference performance. You can customize performance using the following knobs in the API to control the thread count and other settings:
 
+Python (Defaults):
 ```python
 import onnxruntime as rt
 
 sess_options = rt.SessionOptions()
-sess_options.intra_op_num_threads = 2
+sess_options.intra_op_num_threads = 0
 sess_options.execution_mode = rt.ExecutionMode.ORT_SEQUENTIAL
 sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
+sess_options.add_session_config_entry('session.intra_op.allow_spinning', '1')  # kOrtSessionOptionsConfigAllowIntraOpSpinning
 ```
 
-* Thread Count
+* INTRA Thread Count
+  * Controls the _total_ number of INTRA threads used to run the model.
+  * INTRA = parallelize computation _inside_ each operator
+  * Default: (not specified or 0). `sess_options.intra_op_num_threads = 0`
+    * INTRA Threads Total = Number of physical CPU cores. Leaving this at the default also enables some affinitization (explained below)
+    * E.g. a 6-core machine (with 12 HT logical processors) = 6 total INTRA threads (see the sketch after this list)
 
-  * `sess_options.intra_op_num_threads = 2` controls the number of threads to use to run the model.
 * Sequential vs Parallel Execution
-  * `sess_options.execution_mode = rt.ExecutionMode.ORT_SEQUENTIAL` controls whether the operators in the graph run sequentially or in parallel. Usually when a model has many branches, setting this option to `ORT_PARALLEL` will provide better performance.
+  * Controls whether _multiple_ operators in the graph (_across_ nodes) run sequentially or in parallel.
+  * Default: `sess_options.execution_mode = rt.ExecutionMode.ORT_SEQUENTIAL`
+  * Usually when a model has many branches, setting this option to `ORT_PARALLEL` will provide better performance. It can also hurt performance on models without many branches.
   * When `sess_options.execution_mode = rt.ExecutionMode.ORT_PARALLEL`, you can set `sess_options.inter_op_num_threads` to control the
-number of threads used to parallelize the execution of the graph (across nodes).
+number of threads used to parallelize the execution of the graph (_across_ nodes).
 
 * Graph Optimization Level
 
-  * `sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL` enables all optimizations which is the default. Please see [onnxruntime_c_api.h](https://github.com/microsoft/onnxruntime/tree/main/include/onnxruntime/core/session/onnxruntime_c_api.h#L286) (enum `GraphOptimizationLevel`) for the full list of all optimization levels. For details regarding available optimizations and usage, please refer to the [Graph Optimizations](../model-optimizations/graph-optimizations.md) documentation.
+  * Default: `sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL` enables all optimizations.
+  * Please see [onnxruntime_c_api.h](https://github.com/microsoft/onnxruntime/tree/main/include/onnxruntime/core/session/onnxruntime_c_api.h#L286) (enum `GraphOptimizationLevel`) for the full list of all optimization levels. For details regarding available optimizations and usage, please refer to the [Graph Optimizations](../model-optimizations/graph-optimizations.md) documentation.
+* Thread-Pool Spinning Behavior
+  * Controls whether additional INTRA or INTER threads spin waiting for work. Spinning provides faster inference but consumes more CPU cycles, resources, and power.
+  * Default: 1 (Enabled)
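+
+As a quick check of what the default INTRA thread count will be on a given machine, you can inspect the physical core count. A minimal sketch (it assumes the third-party `psutil` package, which is not part of ONNX Runtime):
+
+```python
+import psutil
+
+# The default INTRA pool is sized from physical cores,
+# not from the logical processors exposed by HT/SMT.
+print(psutil.cpu_count(logical=False))  # e.g. 6 on a 6-core machine
+print(psutil.cpu_count(logical=True))   # e.g. 12 with hyper-threading
+```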
 
 ## Set number of intra-op threads
 
-Onnxruntime sessions utilize multi-threading to parallelize computation inside each operator.
-Customer could configure the number of threads like:
+Onnxruntime sessions utilize multi-threading to parallelize computation _inside_ each operator.
+
+By default, with intra_op_num_threads=0 or not set, each session starts with the main thread on the 1st core (not affinitized). Extra threads are then created, one per additional physical core, and affinitized to that core (1 or 2 logical processors).
+
+Customers can manually configure the total number of threads like:
+
+[Python](https://onnxruntime.ai/docs/api/python/api_summary.html#onnxruntime.SessionOptions.intra_op_num_threads) (below) - [C/C++](https://onnxruntime.ai/docs/api/c/struct_ort_api.html) - [.NET/C#](https://onnxruntime.ai/docs/api/csharp/api/Microsoft.ML.OnnxRuntime.SessionOptions.html#Microsoft_ML_OnnxRuntime_SessionOptions_IntraOpNumThreads)
 ```python
 sess_opt = SessionOptions()
 sess_opt.intra_op_num_threads = 3
 sess = ort.InferenceSession('model.onnx', sess_opt)
 ```
 
-With above configuration, two threads will be created in the pool, so along with the main calling thread, there will be three threads in total to participate in intra-op computation.
-By default, each session will create one thread per phyical core (except the 1st core) and attach the thread to that core.
-However, if customer explicitly set the number of threads like showcased above, there will be no affinity set to any of the created thread.
+With the above configuration of 3 total threads, two extra threads are created in the INTRA pool, so along with the main calling thread there are three threads in total participating in intra-op computation. However, when customers explicitly set the number of threads as shown above, no affinity is set on any of the created threads.
 
 In addition, Onnxruntime also allows customers to create a global intra-op thread pool to prevent heavy contention among session thread pools; please find its usage [here](https://github.com/microsoft/onnxruntime/blob/68b5b2d7d33b6aa2d2b5cf8d89befb4a76e8e7d8/onnxruntime/test/global_thread_pools/test_main.cc#L98).
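+
+The global-pool API itself is only illustrated in the C++ usage linked above. As a Python-level sketch of the same concern (a hedged example with hypothetical model file names), concurrent sessions can each be given a small explicit INTRA budget so that together they do not oversubscribe the cores:
+
+```python
+import onnxruntime as ort
+
+# Two sessions intended to run concurrently on one machine:
+# cap each session's intra-op pool instead of letting both
+# default to one thread per physical core.
+opts_a = ort.SessionOptions()
+opts_a.intra_op_num_threads = 2
+sess_a = ort.InferenceSession('model_a.onnx', opts_a)
+
+opts_b = ort.SessionOptions()
+opts_b.intra_op_num_threads = 2
+sess_b = ort.InferenceSession('model_b.onnx', opts_b)
+```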
+
+## Thread spinning behavior
+
+Controls whether additional INTRA or INTER threads spin waiting for work. Spinning provides faster inference but consumes more CPU cycles, resources, and power.
+
+The example below disables spinning so the WorkerLoop doesn't consume extra active cycles spinning, waiting, or attempting to steal work:
+
+[Python](https://onnxruntime.ai/docs/api/python/api_summary.html#onnxruntime.SessionOptions.add_session_config_entry) (below) - [C++](https://onnxruntime.ai/docs/api/c/struct_ort_api.html) - [.NET/C#](https://onnxruntime.ai/docs/api/csharp/api/Microsoft.ML.OnnxRuntime.SessionOptions.html#Microsoft_ML_OnnxRuntime_SessionOptions_AddSessionConfigEntry_System_String_System_String_) - [Keys](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h#L90)
+```python
+sess_opt = SessionOptions()
+sess_opt.add_session_config_entry('session.intra_op.allow_spinning', '0')  # kOrtSessionOptionsConfigAllowIntraOpSpinning
+sess_opt.add_session_config_entry('session.inter_op.allow_spinning', '0')  # kOrtSessionOptionsConfigAllowInterOpSpinning
+```
+
 ## Set number of inter-op threads
 
-A inter-op thread pool is for parallelism between operators, and will only be created when session execution mode set to parallel:
+An inter-op thread pool is used for parallelism _between_ operators, and is only created when the session execution mode is set to parallel.
+
+By default, the inter-op thread pool will also have one thread per physical core.
+
+[Python](https://onnxruntime.ai/docs/api/python/api_summary.html#onnxruntime.SessionOptions.inter_op_num_threads) (below) - [C/C++](https://onnxruntime.ai/docs/api/c/struct_ort_api.html) - [.NET/C#](https://onnxruntime.ai/docs/api/csharp/api/Microsoft.ML.OnnxRuntime.SessionOptions.html#Microsoft_ML_OnnxRuntime_SessionOptions_InterOpNumThreads)
 ```python
 sess_opt = SessionOptions()
 sess_opt.execution_mode = ExecutionMode.ORT_PARALLEL
@@ -73,16 +104,15 @@ sess_opt.inter_op_num_threads = 3
 sess = ort.InferenceSession('model.onnx', sess_opt)
 ```
 
-By default, inter-op thread pool will also have one thread per physical core.
-
 ## Set intra-op thread affinity
 
-For certain scenarios, it may be beneficial to customize intra-op thread affinities, for example:
+It is normally best not to set thread affinity and to let the OS handle thread assignment, for both performance and power reasons. However, for certain scenarios it may be beneficial to customize intra-op thread affinities, for example:
 * There are multiple sessions running in parallel; customers might prefer their intra-op thread pools to run on separate cores to avoid contention.
 * Customers want to limit an intra-op thread pool to run on only one of the NUMA nodes to reduce the overhead of expensive cache misses among nodes.
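+
+For instance, to address the first scenario above, two concurrent sessions can be pinned to disjoint cores. A rough sketch (hypothetical model names; the affinity-string format and 1-based processor numbering follow the configuration key described below):
+
+```python
+import onnxruntime as ort
+
+# Session A: 3 total INTRA threads; its 2 extra threads are pinned to processors 1 and 2.
+opts_a = ort.SessionOptions()
+opts_a.intra_op_num_threads = 3
+opts_a.add_session_config_entry('session.intra_op_thread_affinities', '1;2')
+sess_a = ort.InferenceSession('model_a.onnx', opts_a)
+
+# Session B: 3 total INTRA threads; its 2 extra threads are pinned to processors 3 and 4.
+opts_b = ort.SessionOptions()
+opts_b.intra_op_num_threads = 3
+opts_b.add_session_config_entry('session.intra_op_thread_affinities', '3;4')
+sess_b = ort.InferenceSession('model_b.onnx', opts_b)
+```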
 
 For session intra-op thread pool, please read the [configuration](https://github.com/microsoft/onnxruntime/blob/68b5b2d7d33b6aa2d2b5cf8d89befb4a76e8e7d8/include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h#L180) and consume it like:
+
+[Python](https://onnxruntime.ai/docs/api/python/api_summary.html#onnxruntime.SessionOptions.add_session_config_entry) (below) - [C++](https://onnxruntime.ai/docs/api/c/struct_ort_api.html) - [.NET/C#](https://onnxruntime.ai/docs/api/csharp/api/Microsoft.ML.OnnxRuntime.SessionOptions.html#Microsoft_ML_OnnxRuntime_SessionOptions_AddSessionConfigEntry_System_String_System_String_) - [Keys](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h#L176)
 ```python
 sess_opt = SessionOptions()
 sess_opt.intra_op_num_threads = 3
@@ -95,12 +125,13 @@ For global thread pool, please read the [API](https://github.com/microsoft/onnxr
 
 ## Numa support and performance tuning
 
 Since release 1.14, the Onnxruntime thread pool can utilize all physical cores that are available over NUMA nodes.
-The intra-op thread pool will create a thread on every physical core (except the 1st core). E.g. assume there is a system of 2 NUMA nodes, each has 24 cores.
+The intra-op thread pool will create an extra thread on every physical core (except the 1st core). E.g. assume a system of 2 NUMA nodes, each with 24 cores.
 Hence the intra-op thread pool will create 47 threads, and set thread affinity to each core.
 
 For NUMA systems, it is recommended to test a few thread settings to explore for best performance, since threads allocated across NUMA nodes may have higher cache-miss overhead when cooperating with each other.
 For example, when the number of intra-op threads has to be 8, there are different ways to set affinity:
 
-```
+[Python](https://onnxruntime.ai/docs/api/python/api_summary.html#onnxruntime.SessionOptions.add_session_config_entry) (below) - [C++](https://onnxruntime.ai/docs/api/c/struct_ort_api.html) - [.NET/C#](https://onnxruntime.ai/docs/api/csharp/api/Microsoft.ML.OnnxRuntime.SessionOptions.html#Microsoft_ML_OnnxRuntime_SessionOptions_AddSessionConfigEntry_System_String_System_String_)
+```python
 sess_opt = SessionOptions()
 sess_opt.intra_op_num_threads = 8
 sess_opt.add_session_config_entry('session.intra_op_thread_affinities', '3,4;5,6;7,8;9,10;11,12;13,14;15,16') # set affinities of all 7 threads to cores in the first NUMA node
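 ```
+
+Another possible layout (an illustrative sketch, not from the patch; it assumes the second node's logical processors are numbered 49-96 on this 2-node, 24-cores-per-node box) spreads the 7 extra threads across both NUMA nodes:
+
+```python
+sess_opt = SessionOptions()
+sess_opt.intra_op_num_threads = 8
+# 4 extra threads on the first NUMA node, 3 on the second; measure both layouts,
+# since cross-node cooperation raises cache-miss overhead.
+sess_opt.add_session_config_entry('session.intra_op_thread_affinities', '3,4;5,6;7,8;9,10;51,52;53,54;55,56')
+```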