rfcs: Auto-tuning API #1764
# RFC for Auto-Tuning API

## Introduction & Motivation

Auto-tuning is a feature supported as a mode in PyTorch's [torch.compile](https://pytorch.org/docs/stable/generated/torch.compile.html) function
and in TensorFlow via [XLA](https://github.com/sourcecode369/tensorflow-1/blob/9f446aba8aaeb2b3c4c6e5ba1ab4cf31494b8a64/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc#L279).
While the median model-level improvement is generally modest, some cases see large speedups.
This has been observed on Intel hardware (see [Using tuning to evaluate the level of oneDNN performance](https://github.com/intel-innersource/libraries.performance.math.onednn/pull/5931))
and externally on other [hardware](https://mangpo.net/papers/xla-autotuning-pact2021.pdf).

The goal is to allow end users to optionally try auto-tuning when out-of-the-box performance is insufficient.

## Proposal

oneDNN should implement auto-tuning as a feature that can be exposed to end users of frameworks.
Major requirements for framework integration:
1) No changes to the primitive API.
2) Primitive and kernel cache states should not be affected.
3) No regressions after tuning.
4) A simple knob to enable/disable tuning.

### Option 1 - Tune During Primitive Execution (Recommended)

Tuning happens during one call of the execute function on the primitive.
Subsequent execute calls on the same primitive will not re-tune it.
Tuned results will be stored in a primitive-implementation-specific lookup table that will be referenced
when the primitive is (re)created. (Some GPU implementations, such as convolution and batch normalization, already use lookup tables.)
Tuning will happen under a cold-cache mode and will be limited to `max_nconfigs` configurations.
The primitive cache entry for the primitive(s) being tuned will be updated to point to the tuned implementation when tuning is complete.
Kernel cache entries will be unmodified for now, but can be modified later if we want to enable tuning of reusable kernels.
If the user wants to persist the tuned configs between sessions, the lookup tables can optionally be written to files.
If an implementation does not support tuning, the tuning process will be skipped.
In tune mode, there is no guarantee of correctness.

***Option 1.1 - As a Primitive Attr***
```c
primitive_attr_t attr;
primitive_desc_t prim_desc;
attr.tunable = true;
create_primitive_desc(prim_desc, attr);
create_primitive(prim, prim_desc);
execute(prim); // tuning happens here

attr.tunable = false;
create_primitive_desc(prim_desc, attr);
create_primitive(prim, prim_desc); // primitive is created with config selected from tuning
execute(prim); // normal execution
```
***Option 1.2 - As a Global Variable***
```c
create_primitive(prim);
dnnl_set_tune(true);
execute(prim); // tuning happens here
dnnl_set_tune(false);
create_primitive(prim); // primitive is created with config selected from tuning
execute(prim); // normal execution
```

> **Comment:** Using a global state to implement tuning seems to introduce usage limitations for multi-threaded programs. In addition, there are some non-obvious conflicts with the primitive cache when tuned/non-tuned implementations are created at different times. Would it make sense to add something like
> ```c
> enum dnnl_dispatch_mode_t {
>     dnnl_dispatch_default,
>     dnnl_dispatch_heuristic,
>     dnnl_dispatch_tune,
> };
> ```
> to the primitive attributes instead? This could be extended to allow users further control over creation time and performance in the future, and could be used to implement PRs such as #1743 (if it is accepted).
>
> **Comment:** The main reason a global variable was used was to minimize coding changes for frameworks. I can check with them whether they like this option better. The tuning state is set once for all primitives, and configurations created during tuning are cached separately from the default one. Can you give an example of a conflict?
>
> **Comment:** Another potential "creation" scenario seen recently: a user wants to probe the primitive cache, that is, skip creation (oneDNN returns an error) in case of a primitive cache miss. All of this (together with tuning) could be combined under a "create mode" sub-attribute.
>
> **Comment:** The core issue is that there is significant internal state (visible via realized performance) that depends on execution order, which can cause reproduction issues. The ordering issues that come to mind:
> ```c
> create(p);      // Primitive cache miss, get default version
> tune_create(p); // Primitive cache hit, get default version
>
> tune_create(p); // Get tuned version
> create(p);      // Primitive cache hit, get tuned version
> ...             // create many primitives
> create(p);      // Primitive cache miss, get default version
> ```
> This becomes significantly more complicated in a multi-threaded scenario where the ordering can vary from run to run.
>
> **Comment:** The config iteration number is now part of the […]. To me it seems like primitive cache management has to be addressed in a similar way regardless of the API.

***Option 1.3 - As a Primitive Attr Whose Default Value Can Be Set by a Global Variable***
```c
set_tune_mode(true);
primitive_attr_t attr;
primitive_desc_t prim_desc;
create_primitive_desc(prim_desc, attr);
create_primitive(prim, prim_desc);
execute(prim); // tuning happens here

set_tune_mode(false);
primitive_attr_t new_attr; // attr must be recreated or the tunable field manually set to false
create_primitive_desc(prim_desc, new_attr);
create_primitive(prim, prim_desc); // primitive is created with config selected from tuning
execute(prim); // normal execution
```

### Option 2 - Tune During Primitive Creation

Unlike the first option, the primitive does not have to be recreated afterward.
However, oneDNN would have to allocate and initialize all memory needed for execution internally during creation.
This adds complexity to the implementation, potentially high memory consumption,
and the need for an optimized data-filling routine.

Since frameworks seem comfortable with the first option, Option 1 is recommended.

***Option 2.1 - As a Global Variable***
```c
dnnl_set_tune(true);
create_primitive(prim); // tuning happens here
dnnl_set_tune(false);
execute(prim); // normal execution with tuned implementation
```
***Option 2.2 - As a Primitive Attr***

### Implementation Details

The following structure will be added to `primitive_attr_t`:
```c
struct tune_info_t {
    void set_tune_iter(int i); // set which configuration to try
    void set_tune_profile(int i, double time); // record min time for the ith configuration

    enum tune_status_t { searching /*default*/, finalize };

    int iter = -1; // which configuration to try
    std::vector<double> iter_profile_time; // measured time for the ith configuration
    int max_iters = 1; // max number of configurations to try, obtained by querying the implementation
    tune_status_t tune_status = searching; // search or finalize status
};
```
During the primitive execute call, oneDNN will query the implementation for the number of configs it has via
`query::tune_nconfigs`. For each config it will create the primitive, execute it 5 times, and record the min
time in the `tune_info_t` structure. If the number of configurations to try is greater than 40, it will stop there.
It will then recreate the primitive with `tune_status` set to `finalize`. During this call, the config with the best
performance will be stored in a lookup table managed by the primitive, and the primitive cache will be updated to point
to this implementation.

> **Comment:** If this mechanism is to be extended to many other primitives, it may make sense to manage tuned parameters at a higher level in a generic way. Otherwise every primitive would have to implement lookup tables, logic for the finalization step, etc. But this is an implementation detail, and can be adjusted in the future if needed without API changes.


### Additional Considerations

***Tuning across different implementations:*** This can be tricky for nested primitives because `primitive_desc_iterator` only
iterates through the outermost implementations. Nested implementations may use scratchpad-allocated buffers or take
different arguments than the outermost primitive. One way to dispatch correctly between implementations after
tuning would be to use the lookup tables to decide whether to return unimplemented or not. That would require
all implementations of a particular primitive to generate keys for their lookup tables in the same way.
Given that the most relevant case for this is currently GeMM-based primitives, and that the dispatching logic between
the two implementations seems to work well, this issue is best addressed later if the need arises.

***Multi-threaded behavior:*** Since most of the tuning time will be spent creating primitives, threading can
likely reduce tuning time: each primitive can be tuned on a different thread. In that scenario,
the implementation must be thread-safe: lookup tables must be thread-safe, and performance profiling must
be done in a thread-safe way.

***Dynamic Tuning:*** Currently tuning happens in a predetermined way: configs are pregenerated and executed blindly.
Implementations can dynamically adjust which configuration to try next by looking at the `iter_profile_time` vector, which
holds the times of previously executed configs. However, the implementation is then responsible for maintaining the mapping of iter
number to the actual configuration tried between primitive creation calls; the `primitive_attr` struct is const, so the implementation cannot
write back into this structure.

***Performance Measurement:*** Performance is measured with the profiling API. To simulate cold-cache mode, a reorder is done
between each execution to wipe the cache. This implementation should closely replicate the behavior of benchdnn; there
are memory-bound cases that are highly sensitive to cache behavior, and if the performance measurement is inaccurate during
tuning, it can result in regressions.

### API

```c
/// include/oneapi/dnnl/dnnl.h

/// Enables or disables tuning. All primitives executed while tuning is
/// enabled will be tuned (if the underlying implementation supports tuning).
/// For tuned implementations to take effect, tuning must be disabled by
/// passing "false" and the primitives recreated.
///
/// @param tune Set/unset tuning status.
/// @returns #dnnl_success on success or a status describing the error
///     otherwise.
dnnl_status_t DNNL_API dnnl_set_tune(int tune);
```

```c++
/// include/oneapi/dnnl/dnnl.hpp
inline void set_tune(bool tune) {
    error::wrap_c_api(dnnl_set_tune((int)tune), "could not set tune status");
}
```

### Performance Example (PVC)

`./benchdnn --conv --engine=gpu --mb=1 --dt=s8 --mode=f --cold-cache=all --batch=shapes_resnet_50_v1_5`

- Total time before tuning: 0.54 ms
- Total time after tuning: 0.45 ms
- Speedup: 1.2x

> **Comment:** In this scenario it looks like the initially created primitive goes unused (it is not needed for the following tuning), and we have unnecessary overhead. Is that correct? Is there a way to avoid it?
>
> **Comment:** It can be avoided if the tuning state is set before primitive creation, or if the proposed `dispatch_mode_t` is used. On the other hand, there will be many primitive creations during the tuning process itself, so I don't anticipate the first primitive creation to be a big overhead.
>
> **Comment:** If there is no reason to keep the flow in this order (create -> set_tune), then I think it makes sense to flip these steps and use an attribute. In general the API would work like this: `dnnl_set_tune()` modifies the attribute on creation automatically; on the next `execute()` call, oneDNN performs all necessary benchmarking, or no benchmarking (e.g. when the primitive doesn't support tuning or when this case was already tuned). oneDNN also doesn't guarantee correctness upon execution for tunable primitives. Do you see any issues with such an API?
>
> **Comment:** It looks good to me, except that I believe `primitive_attr_t` is passed in as const during creation, so in this scenario it would be implemented the same way as currently proposed.
>
> **Comment:** I thought about following the same approach as for FP math mode (`src/common/primitive_attr.hpp`, line 620 at commit c900643): a global setter sets a global value that is used as the default value during primitive attribute creation.
>
> **Comment:** I see. Does this imply that after tuning is unset, the `primitive_attr_t` needs to be recreated?
>
> **Comment:** Yes, it will be required. It might be less flexible, but having a distinctive property between normal and tunable primitives looks like an easy-to-understand approach.