Solver/Solution framework #866

atamazov · 2021-04-19T15:24:44Z

This is a copy of a presentation I gave to the MIOpen team a couple of years ago, when we introduced and implemented the Solver/Solution architecture. It does not cover recent additions such as GetWti() and Invokers. I would like to make it available to all MIOpen developers, including collaborators.

Though a bit outdated, this should provide a good overview of how device code is abstracted away from the rest of the library.

1. Intent

Problem: Variety of convolution kernels

Different formats (OCL, asm, binary)
Capable of handling different (but often intersecting) subsets of convolution configs
Some may benefit from auto-tuning, some may not
Each kernel may require its own set of #defines for build
There are cases when two (or more) kernels need to be run to do the job
Kernels need different parameters during invocation

Experience shows that straightforward attempts to support such a set of kernels result in host code which is large, fragile, and difficult to develop and maintain. You may see leftovers of this in convolutionocl.cpp.

Provide abstractions which are able to represent, in a single place, all the information required to:

Select kernel(s) which can do the required job
Build these kernel(s)
Run kernels
Auto-tune kernel(s)

Such abstractions allow working with all convolutions in a unified manner. Currently, there are:

ProblemDescriptions
Solvers
- Non-searchable
- Searchable, which implement auto-tune internally (legacy OpenCL)
- Searchable, which employ Generic Search
Solutions
PerformanceConfigs (paired with searchable Solvers)
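To make the roles of these abstractions concrete, here is a minimal sketch of the "select" step: a list of Solvers is asked which of them can handle a problem, and each applicable one produces a Solution. Every type name below (`ProblemSketch`, `SolverSketch`, `FindSolutions`, etc.) is hypothetical, and the virtual interface is an assumption made only to keep the example short; it is not meant to mirror how the library dispatches Solvers.

```cpp
#include <memory>
#include <string>
#include <vector>

// Simplified stand-ins for ProblemDescription and Solution.
struct ProblemSketch
{
    int kernel_size = 0;
};

struct SolutionSketch
{
    std::string kernel_name;
};

// Hypothetical common interface (an assumption of this sketch).
struct SolverSketch
{
    virtual ~SolverSketch() = default;
    virtual bool IsApplicable(const ProblemSketch& problem) const = 0;
    virtual SolutionSketch GetSolution(const ProblemSketch& problem) const = 0;
};

// Handles 3x3 filters only.
struct Direct3x3Sketch : SolverSketch
{
    bool IsApplicable(const ProblemSketch& problem) const override
    {
        return problem.kernel_size == 3;
    }
    SolutionSketch GetSolution(const ProblemSketch&) const override
    {
        return SolutionSketch{"Direct3x3"};
    }
};

// Handles everything, presumably slower.
struct FallbackSketch : SolverSketch
{
    bool IsApplicable(const ProblemSketch&) const override { return true; }
    SolutionSketch GetSolution(const ProblemSketch&) const override
    {
        return SolutionSketch{"Fallback"};
    }
};

// Collects a Solution from every Solver that reports itself applicable.
inline std::vector<SolutionSketch>
FindSolutions(const std::vector<std::unique_ptr<SolverSketch>>& solvers,
              const ProblemSketch& problem)
{
    std::vector<SolutionSketch> found;
    for(const auto& solver : solvers)
    {
        if(solver->IsApplicable(problem))
            found.push_back(solver->GetSolution(problem));
    }
    return found;
}
```

The point of the sketch is that the caller never needs to know which kernels exist; it only talks to Solvers and receives Solutions.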

atamazov · 2021-04-19T15:26:09Z

2. Problem Description and Context

2.1. Problem Description for an operation, e.g. `conv::ProblemDescription`

This is an input for the Solver.

This is an object which represents a set of parameters that shall unambiguously describe the specific library primitive. Example for convolutions:

struct ProblemDescription
{
    int n_inputs         = 0;
    int in_height        = 0;
    int in_width         = 0;
    int kernel_size1     = 0;
    int kernel_size0     = 0;
    int n_outputs        = 0;
...
    struct Direction direction; // has members like IsBackwardData() etc
...
};

If perf-db is used, the object must also provide "void Serialize(std::ostream& stream) const;". This is used to generate perf-db keys.
- Each unique problem shall have a unique db-key.
May contain other helper members in order to ease the implementation of Solvers, for example:

struct ProblemDescription
{...
    int GetBackwardPad0() const { return kernel_size0 - pad0 - 1; }
    int GetBackwardPad1() const { return kernel_size1 - pad1 - 1; }
...}
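As an illustration of the Serialize() requirement, here is a minimal sketch of how a problem description could be turned into a perf-db key. The struct is a cut-down stand-in, and the key layout (field order, separators), the `direction` encoding, and the `MakeDbKey` helper are all assumptions of this example, not the format the library uses.

```cpp
#include <sstream>
#include <string>

// Simplified stand-in for conv::ProblemDescription; only a few fields are shown.
struct ProblemDescriptionSketch
{
    int n_inputs     = 0;
    int in_height    = 0;
    int in_width     = 0;
    int kernel_size1 = 0;
    int kernel_size0 = 0;
    int n_outputs    = 0;
    char direction   = 'F'; // hypothetical encoding: F = forward, B = backward data

    // Writes every field that distinguishes one problem from another.
    // Field order and separators are assumptions of this sketch.
    void Serialize(std::ostream& stream) const
    {
        stream << n_inputs << '-' << in_height << '-' << in_width << '-'
               << kernel_size1 << 'x' << kernel_size0 << '-'
               << n_outputs << '-' << direction;
    }
};

// Hypothetical helper: turns a problem into a db-key string.
inline std::string MakeDbKey(const ProblemDescriptionSketch& problem)
{
    std::ostringstream ss;
    problem.Serialize(ss);
    return ss.str();
}
```

Two problems that differ in any serialized field get different keys, which is what the uniqueness rule above asks for. A field left out of Serialize() would make distinct problems collide in the database.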

2.2. `ExecutionContext`

TBD

2.3. Operation Context, e.g. `ConvolutionContext`

Inherits from ProblemDescription and ExecutionContext, so for example an instance of ConvolutionContext can be used as an instance of conv::ProblemDescription. More info TBD.
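The inheritance relation can be sketched as follows. The member names are hypothetical (in particular, the contents of the execution context are still TBD above, so `device_name` is purely an assumption); the only point is that code written against the problem description accepts a context unchanged.

```cpp
#include <string>

// Simplified stand-ins; the real classes carry many more members.
struct ProblemDescriptionSketch
{
    int n_inputs  = 0;
    int n_outputs = 0;
};

struct ExecutionContextSketch
{
    std::string device_name; // hypothetical: describes where kernels will run
};

// The operation context combines "what to compute" with "where to compute it".
struct ConvolutionContextSketch : ProblemDescriptionSketch, ExecutionContextSketch
{
};

// A function written against the problem description only.
inline int CountChannels(const ProblemDescriptionSketch& problem)
{
    return problem.n_inputs + problem.n_outputs;
}
```

Passing a `ConvolutionContextSketch` to `CountChannels()` compiles without any conversion code, because the context is-a problem description.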

atamazov · 2021-04-19T16:01:54Z

3. What is Solver

A Solver is an object which encapsulates the implementation of a specific primitive.

It "knows" all the information required to properly use the kernel(s) it hides.
Searchable (Tunable) Solvers work together with paired PerformanceConfig objects. In such a case the information about kernel(s) is distributed between Solver and its PerformanceConfig.

Member functions (see here for current prototypes):

bool IsApplicable(ConvolutionContext&);
- Takes problem description (e.g. convolution parameters) and answers if this Solver is able to provide the Solution.
- This must be fast, i.e. much faster than the kernel execution time.
  - [Informative] The typical execution time should be ~10 us.
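A minimal sketch of what such a check might look like, assuming a hypothetical Solver that wraps a forward-only 3x3 kernel. The context struct, its fields, and the Solver name are invented for the example, and the function is declared const with a const reference here, which is a choice of the sketch rather than the prototype shown above.

```cpp
// Simplified stand-in for ConvolutionContext; only the fields used below.
struct ContextSketch
{
    int kernel_size0 = 0;
    int kernel_size1 = 0;
    int pad0         = 0;
    int pad1         = 0;
    bool is_forward  = true;
};

// Hypothetical Solver hiding a kernel that supports forward 3x3 convolutions only.
struct ConvDirect3x3Sketch
{
    // Cheap comparisons only: no kernel builds, no device queries, no allocations.
    bool IsApplicable(const ContextSketch& ctx) const
    {
        if(!ctx.is_forward)
            return false;
        if(ctx.kernel_size0 != 3 || ctx.kernel_size1 != 3)
            return false;
        // Assumed kernel limitation: padding of at most 1 in each dimension.
        return ctx.pad0 >= 0 && ctx.pad0 <= 1 && ctx.pad1 >= 0 && ctx.pad1 <= 1;
    }
};
```

Keeping the function down to a handful of integer comparisons is what makes the speed requirement above realistic.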

If a Solver needs workspace:

size_t GetWorkspaceSize(const ConvolutionContext&) const;
bool MayNeedWorkspace() const;
- This optional method is needed for the optimization of GWSS.

Each Solver instance s can be passed to the GetSolverDbId(s) template function, which retrieves the string id of the Solver. There is a default implementation of GetSolverDbId() which returns the class name; it can be overridden if necessary.
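The "default plus override" pattern can be sketched like this. This is not the library's implementation: `typeid(...).name()` is used only as a placeholder for "the class name" (its result is implementation-defined and usually mangled), and both Solver types and the overriding id are hypothetical.

```cpp
#include <string>
#include <typeinfo>

// Two hypothetical Solvers.
struct ConvExampleSketch
{
};
struct ConvRenamedSketch
{
};

// Default: derive the id from the type.
// typeid().name() is a placeholder; a real implementation needs a readable name.
template <class Solver>
std::string GetSolverDbId(const Solver&)
{
    return typeid(Solver).name();
}

// Override for one Solver, e.g. to keep an old id stable after a class rename.
// A non-template overload wins over the template for an exact match.
inline std::string GetSolverDbId(const ConvRenamedSketch&)
{
    return "ConvOriginalName";
}
```

An override matters because the id ends up in database records: if the id changed whenever a class was renamed, existing records would no longer be found.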

If a Solver is Dynamic:

bool IsDynamic() const { return true; }

If a Solver is NOT searchable (NOT tunable):

Solution GetSolution(ConvolutionContext&);
- Returns Solution object

If a Solver is searchable (tunable), the accompanying PerformanceConfig type shall also be defined, plus the following member functions:

PerformanceConfig GetPerformanceConfig(const ConvolutionContext&) const;
- Provides default performance parameters
PerformanceConfig Search(const ConvolutionContext&) const;
- Performs auto-tune and returns optimal parameters.
- This function could be quite complicated.
Solution GetSolution(const ConvolutionContext&, const PerformanceConfig&);
- Returns Solution object generated from PerformanceConfig instance.
bool IsValidPerformanceConfig(const ConvolutionContext&, const PerformanceConfig&);
- Checks if PerformanceConfig instance (e.g. read from the perf-db) is valid.
- "Valid" means that Solution returned by GetSolution():
  - Should build without errors and the resulting binary would be runnable
  - Result of execution would be numerically correct.
  - Checks performed by this function may be non-trivial due to limited amount of LDS, and, in case of asm kernels, SGPR/VGPR limitations etc.
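The interplay of these functions can be sketched with a hypothetical Solver that has a single tuning knob. The `tile` parameter, the LDS budget field, the assumed LDS footprint, and the `-DTILE` option are all invented for illustration; Search() is left out of the sketch.

```cpp
#include <cstddef>
#include <string>

struct ContextSketch
{
    int in_width          = 0;
    std::size_t lds_bytes = 65536; // hypothetical per-workgroup LDS budget
};

struct PerfConfigSketch
{
    int tile = 0; // hypothetical tuning knob: tile width in elements
};

struct SolutionSketch
{
    std::string comp_options; // compiler options for the kernel build
};

struct TunableSolverSketch
{
    // Default parameters, used when no tuned record is available.
    PerfConfigSketch GetPerformanceConfig(const ContextSketch&) const
    {
        PerfConfigSketch pc;
        pc.tile = 8;
        return pc;
    }

    // Rejects configs whose kernel would not fit the problem or the hardware.
    bool IsValidPerformanceConfig(const ContextSketch& ctx,
                                  const PerfConfigSketch& pc) const
    {
        if(pc.tile <= 0 || pc.tile > ctx.in_width)
            return false;
        // Assumed LDS footprint: one tile x tile block of floats.
        const std::size_t side       = static_cast<std::size_t>(pc.tile);
        const std::size_t lds_needed = side * side * sizeof(float);
        return lds_needed <= ctx.lds_bytes;
    }

    // Translates tuning parameters into build-time #defines.
    SolutionSketch GetSolution(const ContextSketch&, const PerfConfigSketch& pc) const
    {
        SolutionSketch solution;
        solution.comp_options = "-DTILE=" + std::to_string(pc.tile);
        return solution;
    }
};
```

A typical caller would read a config from the perf-db, pass it through IsValidPerformanceConfig(), and fall back to GetPerformanceConfig() when the check fails, so that a stale or corrupted record never reaches the kernel build.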

Generic search.

Modern Solvers employ Generic search.

Only legacy OpenCL kernels do not use it. These have complicated Search().
Generic Search allows implementation of Search() in Solvers as simple calls to the GenericSearch() template function.
However, this requires the paired PerformanceConfig type to provide some member functions. These functions are used by GenericSearch() in order to build the ComputedContainer object and iterate over it.
~~Right now the Solver must also define the RunAndMeasureSolution(). This is to be removed as soon as Invoker object concept is implemented.~~

The PerformanceConfig of a modern searchable Solver type shall provide some functions. These are necessary to build the ComputedContainer instances. The following member functions are required for that:

(ctor)()
- Constructs an instance with invalid value.
(ctor)(bool)
- Constructs an instance with minimal, valid value.
SetNextValue(ConvolutionContext& c)
- Note: This was SetNextValue() (no arguments) before "[NFC] ConvolutionContext made accessible in SetNextValue()" #1033.
- Advances a valid performance config to the next available valid value and returns true. If the maximum value has been reached (no more valid performance configs are left), returns false.
- IMPORTANT: the PerformanceConfig instance pc is valid if and only if IsValidPerformanceConfig(..., pc) returns true. This ensures that all the perf-configs which reside in the ComputedContainer are:
  - Buildable without errors.
  - Runnable on GPU and yield correct result.
IsValid(ConvolutionContext& c) const
- Checks if instance is valid for the given c.
operator==(const PerformanceConfig&)
- Ordinary semantics.
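The contract above can be sketched with a hypothetical one-knob PerformanceConfig and a loop that enumerates all of its valid values, which is roughly what a search needs in order to iterate. The knob, its value range, and the `EnumerateConfigs` helper are assumptions of this example and are not the ComputedContainer implementation; the context is taken by const reference here for simplicity.

```cpp
#include <vector>

struct ContextSketch
{
    int in_width = 0;
};

// Hypothetical perf-config: one knob taking power-of-two values from 1 to 64.
struct PerfConfigSketch
{
    int tile;

    PerfConfigSketch() : tile(-1) {}             // invalid value
    explicit PerfConfigSketch(bool) : tile(1) {} // minimal valid value

    bool IsValid(const ContextSketch& ctx) const
    {
        return tile >= 1 && tile <= 64 && tile <= ctx.in_width;
    }

    // Steps to the next valid value; returns false when none is left.
    bool SetNextValue(const ContextSketch& ctx)
    {
        if(tile < 1)
            return false; // cannot advance from an invalid state
        int candidate = tile;
        while(candidate < 64)
        {
            candidate *= 2;
            PerfConfigSketch next;
            next.tile = candidate;
            if(next.IsValid(ctx))
            {
                tile = candidate;
                return true;
            }
        }
        return false;
    }

    bool operator==(const PerfConfigSketch& other) const { return tile == other.tile; }
};

// Enumerates every valid config for the given context.
inline std::vector<PerfConfigSketch> EnumerateConfigs(const ContextSketch& ctx)
{
    std::vector<PerfConfigSketch> all;
    PerfConfigSketch pc(true); // start from the minimal value
    if(!pc.IsValid(ctx))
        return all;
    do
    {
        all.push_back(pc);
    } while(pc.SetNextValue(ctx));
    return all;
}
```

Because SetNextValue() skips invalid values itself, the loop never has to build or run a config that is known to be unusable.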

⚠️ IMPORTANT:

A tunable Solver is allowed to have 0 performance configs available (empty primary and spare ComputedContainers), but in this case it must be able to provide a valid default performance config (the one returned by GetPerformanceConfig(const ConvolutionContext&)).
However, it is highly recommended that a tunable Solver provide 2 or more performance configs.
- Why: If there is only 1 performance config, then there is no reason to spend resources on tuning and on saving/loading its results in the perf-db, because db accesses are not free.
- Therefore, if the applicability scope of some Solver contains many problems that have only 1 performance config, then it is highly recommended to split the Solver into two: one tunable and one non-tunable.

Serialization/de-serialization of PerformanceConfig instances

All PerformanceConfig types shall implement the following member functions:

void Serialize(std::ostream&) const;
- Converts the internal state of an instance ("value") to text.
bool Deserialize(const std::string& str);
- Reads text and converts it to the internal state ("value") of an instance.
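A minimal round-trip sketch for a hypothetical two-field PerformanceConfig follows. The field names and the comma-separated text format are assumptions of the example; the only property it tries to show is that Deserialize() accepts exactly what Serialize() writes and reports failure for anything else.

```cpp
#include <sstream>
#include <string>

struct PerfConfigSketch
{
    int tile   = 0; // hypothetical tuning knobs
    int unroll = 0;

    // Writes the value as "tile,unroll".
    void Serialize(std::ostream& stream) const { stream << tile << ',' << unroll; }

    // Returns false and leaves the instance untouched if the text is malformed.
    bool Deserialize(const std::string& str)
    {
        std::istringstream ss(str);
        int t    = 0;
        int u    = 0;
        char sep = 0;
        if(!(ss >> t >> sep >> u) || sep != ',')
            return false;
        // Anything left after the second number counts as garbage.
        if(ss.peek() != std::istringstream::traits_type::eof())
            return false;
        tile   = t;
        unroll = u;
        return true;
    }
};
```

Returning false instead of throwing lets the caller treat a damaged database record the same way as a missing one.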

atamazov · 2021-04-19T16:02:03Z

4. What is Solution

Information required to build and run a kernel (or a set of kernels), which is expected to perform computations as per the ProblemConfig.

Currently best suits a subset of existing solvers, namely some OpenCL-written forward direct convolutions. Shall be refactored (possibly, to a class hierarchy).

struct ConvSolution
{
    std::vector<KernelInfo> construction_params; // impl may consist of multiple kernels.
    miopenStatus_t status;
    std::string solver_id;
    ...
};

As you see, it contains a vector of KernelInfo objects.

Each object describes a kernel source and whatever information is required in order to build and run it (the build information is unused for binary kernels).

struct KernelInfo
{
    std::string comp_options;
    std::vector<size_t> l_wk;
    std::vector<size_t> g_wk;
    std::string kernel_file;
    std::string kernel_name;
    friend std::ostream& operator<<(std::ostream& os, const KernelInfo& k);
};
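To show how these two structures are meant to be filled, here is a sketch that builds a one-kernel solution. The structs are trimmed copies of the ones above (status is reduced to an int and the stream operator is dropped), and the file name, kernel name, solver id, work-group size, and `-DN_OUTPUTS` option are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Trimmed copies of the structures above, for a self-contained example.
struct KernelInfo
{
    std::string comp_options;
    std::vector<std::size_t> l_wk;
    std::vector<std::size_t> g_wk;
    std::string kernel_file;
    std::string kernel_name;
};

struct ConvSolution
{
    std::vector<KernelInfo> construction_params;
    int status = 0; // stand-in for miopenStatus_t
    std::string solver_id;
};

// Rounds value up to the nearest multiple (multiple must be non-zero).
inline std::size_t RoundUp(std::size_t value, std::size_t multiple)
{
    return (value + multiple - 1) / multiple * multiple;
}

// Describes a single hypothetical kernel that uses one work-item per output channel.
inline ConvSolution MakeExampleSolution(std::size_t n_outputs)
{
    const std::size_t local_size = 64; // assumed work-group size

    KernelInfo kernel;
    kernel.comp_options = "-DN_OUTPUTS=" + std::to_string(n_outputs);
    kernel.l_wk         = {local_size, 1, 1};
    kernel.g_wk         = {RoundUp(n_outputs, local_size), 1, 1};
    kernel.kernel_file  = "example_conv.cl"; // hypothetical file name
    kernel.kernel_name  = "ExampleConvFwd";  // hypothetical kernel name

    ConvSolution solution;
    solution.solver_id = "ExampleSolver"; // hypothetical solver id
    solution.construction_params.push_back(kernel);
    return solution;
}
```

A Solver whose implementation needs two kernels would simply push a second KernelInfo into construction_params.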

atamazov · 2021-04-19T17:04:06Z

5. Perf-db support

SolverDbId(solver)
Serialization and De-serialization of PerformanceConfigs

atamazov · 2021-04-19T17:05:23Z

6. Future directions

Extend usage of Solvers to primitives other than plain Direct and Winograd convolutions
- GEMM, FFT convolution algorithms (done)
- Normalization, Pooling, Activation
- Fused convolutions
Invoker objects (done)
- Allows for simplification of convolutionocl.cpp.
- Second effect is removal of RunAndMeasureSolution() from Solvers.
- More info at Design of Invokers #216

atamazov · 2023-10-05T23:18:34Z

7. Support for convolutions with non-packed tensors

Currently we are not going to include strides of non-packed tensors in the database keys. Only an optional flag (saying that at least one tensor is non-packed) should be included there. This means that the databases will share the same find-db records, the same Invoker instances and the same perf-db information for non-packed convolutions that differ only in strides.

The above design should work correctly provided that:

If an Invoker instance is able to compute some non-packed convolution, then the same instance must be able to compute any similar non-packed convolution that differs only in strides.
- [Consequence 1] An Invoker that is used for computation of non-packed convolutions must read stride information from InvokeParams and pass it to the kernels as arguments.
- [Consequence 2] Stride information should not be used for building any compile-time parameters.
If a kernel that is used to compute a non-packed convolution requires tuning, then the same tuning parameters must provide similar performance for non-packed convolutions that differ only in strides.
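The two consequences can be sketched as a split between build-time and invocation-time data. All names here are hypothetical (`InvokeParamsSketch`, `MakeCompOptions`, `MakeKernelArgs`, the `-D` options); the sketch only shows that strides never reach the compile options, so one built kernel serves every stride layout.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// What the database key and the kernel build may depend on.
struct ProblemSketch
{
    int n_inputs               = 0;
    bool has_non_packed_tensor = false; // the optional flag mentioned above
};

// What is only known at invocation time.
struct InvokeParamsSketch
{
    std::vector<std::size_t> in_strides;
};

// Build-time options: derived from the problem only, never from strides.
inline std::string MakeCompOptions(const ProblemSketch& problem)
{
    return "-DN_INPUTS=" + std::to_string(problem.n_inputs) +
           " -DNON_PACKED=" + (problem.has_non_packed_tensor ? "1" : "0");
}

// Run-time arguments: strides are read from the invoke parameters
// and would be passed to the kernel as ordinary arguments.
inline std::vector<std::size_t> MakeKernelArgs(const InvokeParamsSketch& params)
{
    return params.in_strides;
}
```

With this split, two calls that differ only in strides produce identical compile options (and therefore can share the same built kernel and database records) while still handing different arguments to the kernel.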

Originated from #2334 (comment)

atamazov added the specification label Apr 19, 2021

atamazov changed the title ~~Solver framework~~ Solver framework (outdated) Apr 19, 2021

atamazov mentioned this issue Apr 19, 2021

[implicit GEMM] CalculateGemmSize() should not be a public member function of a Solver #867

Closed

atamazov changed the title ~~Solver framework (outdated)~~ Solver/Solution framework (outdated) Apr 19, 2021

atamazov changed the title ~~Solver/Solution framework (outdated)~~ Solver/Solution framework Apr 19, 2021

atamazov mentioned this issue Jun 24, 2021

Double check a tuning parameter from perfdb #668

Closed

atamazov mentioned this issue Aug 27, 2021

[gfx90a][conv][nhwc]Asm igemm on gfx90a-fwd/bwd/wrw fp32/fp16 #1035

Merged

atamazov mentioned this issue Sep 13, 2021

[MLIR] Implement tuning - step 3: bwd, nonxdlops + xdlops #1152

Merged

atamazov mentioned this issue Oct 5, 2021

[MI100][FP32] ConvHipImplicitGemmBwdDataV4R1Xdlops verficiation failure (SWDEV-305815) #1206

Open

atamazov mentioned this issue Nov 3, 2021

[Enhancement] xdlops NCHW support by transpose #1247

Merged

atamazov mentioned this issue Nov 19, 2021

[conv] Optimize performance of GWSS #1300

Merged

atamazov mentioned this issue Mar 24, 2022

[Tensor reorder] Universal tensor transform feature, a fallback of batched transpose kernel #1419

Merged

atamazov mentioned this issue Jul 29, 2022

Where are the C++ kernel implementation of the Winograd convolution algorithm? #1658

Closed

atamazov mentioned this issue Mar 23, 2023

Mismatch in ConvHipImplicitGemmV4R1Fwd #2038

Open

atamazov mentioned this issue Jul 20, 2023

[Exhaustive Tuning] Search failed Error message when no solution is available #2253

Closed

This was referenced Aug 3, 2023

Properly failing FindSoluions on exceptions #2279

Merged

[tuning] Do not insert default perf-config into empty tuning container #2296

Closed
