Releases: NVIDIA/cudnn-frontend

v0.7.2

20 Oct 17:23
60496f4

Release Notes:

cudnn_frontend v0.7 targets the new features introduced in cuDNN v8.5 (https://developer.nvidia.com/cudnn). The following are the changes in the v0.7 release.

[New API] Added support for Resample operation.

[New API] The Tensor class has a clone method, which allows users to quickly create a new Tensor object with similar attributes.

[New API] Added support for new pointwise operations CUDNN_POINTWISE_ERF, CUDNN_POINTWISE_GELU_APPROX_TANH_FWD, CUDNN_POINTWISE_GELU_APPROX_TANH_BWD, CUDNN_POINTWISE_IDENTITY.
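To make the new modes concrete, the sketch below writes out host-side reference versions of IDENTITY, ERF and the tanh-approximated GELU forward and backward. It assumes the widely used approximation 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))) and assumes the backward mode returns dy times the derivative of that expression. The `ref::` helpers are hypothetical and are not part of cudnn_frontend.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical host-side references for the new pointwise modes.
// Assumes the standard tanh approximation of GELU:
//   gelu(x) ~= 0.5 * x * (1 + tanh(k0 * (x + k1 * x^3)))
namespace ref {

constexpr double kPi = 3.14159265358979323846;
const double k0 = std::sqrt(2.0 / kPi);
constexpr double k1 = 0.044715;

// CUDNN_POINTWISE_IDENTITY: y = x
inline double identity(double x) { return x; }

// CUDNN_POINTWISE_ERF: y = erf(x)
inline double erf_op(double x) { return std::erf(x); }

// CUDNN_POINTWISE_GELU_APPROX_TANH_FWD
inline double gelu_tanh_fwd(double x) {
    const double u = k0 * (x + k1 * x * x * x);
    return 0.5 * x * (1.0 + std::tanh(u));
}

// CUDNN_POINTWISE_GELU_APPROX_TANH_BWD, assumed to compute dx = dy * gelu'(x).
inline double gelu_tanh_bwd(double dy, double x) {
    const double u = k0 * (x + k1 * x * x * x);
    const double t = std::tanh(u);
    const double du_dx = k0 * (1.0 + 3.0 * k1 * x * x);
    // Product rule: d/dx [0.5 * x * (1 + tanh(u))]
    const double dgelu = 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * du_dx;
    return dy * dgelu;
}

}  // namespace ref
```

Comparing `gelu_tanh_bwd` against a central finite difference of `gelu_tanh_fwd` is a quick way to sanity-check references like these before using them to validate GPU output.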

[New API] Several API names have been unified and made consistent across multiple descriptors for readability.

  • setComputePrecision/setMathPrecision/setMathType have been unified into setComputeType in cudnn_frontend_ConvDesc.h, cudnn_frontend_MatMulDesc.h, cudnn_frontend_Operation.h, cudnn_frontend_PointWiseDesc.h, cudnn_frontend_ReductionDesc.h and cudnn_frontend_Resample.h.
  • Math operations such as ConvDesc and ResampleDesc use getSpatialDimCount instead of getDimCount to avoid confusion with tensor dimensions.
  • Array accessors follow the pattern [g,s]et[Spatial]<AttributeName>. The [Spatial] prefix is only needed when the attribute is common to both the Tensor descriptor and the Operation descriptor. Currently, only the Stride and DimCount attributes are ambiguous.
    • setArray functions take a size and a pointer as arguments, e.g., setStride(int dim, int64_t* arr) and setSpatialStride(int dim, int64_t* arr).
    • getArray functions return a pointer to the array, whose size is given by getDimCount or getSpatialDimCount.
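A toy illustration of the accessor convention, assuming nothing about the real descriptor classes beyond what is listed above. `ToyConvDesc` is a hypothetical class (not part of cudnn_frontend) whose setters take a count plus a pointer and whose getters return a pointer whose length comes from `getSpatialDimCount()`. Only the ambiguous attribute (stride) carries the `Spatial` prefix; `setDilation`/`getDilation` are illustrative names for an attribute that exists only on the operation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical descriptor mimicking the [g,s]et[Spatial]<AttributeName> pattern.
// It is NOT a cudnn_frontend class.
class ToyConvDesc {
  public:
    // "Spatial" prefix: Stride also exists on tensor descriptors.
    ToyConvDesc &setSpatialStride(int dim, int64_t *arr) {
        assert(dim >= 0 && (dim == 0 || arr != nullptr));
        strides_.assign(arr, arr + dim);
        return *this;
    }
    // No prefix: this attribute is not shared with tensor descriptors.
    ToyConvDesc &setDilation(int dim, int64_t *arr) {
        assert(dim >= 0 && (dim == 0 || arr != nullptr));
        dilations_.assign(arr, arr + dim);
        return *this;
    }
    int64_t getSpatialDimCount() const { return static_cast<int64_t>(strides_.size()); }
    // getArray accessors return a pointer; the length is getSpatialDimCount().
    const int64_t *getSpatialStride() const { return strides_.data(); }
    const int64_t *getDilation() const { return dilations_.data(); }

  private:
    std::vector<int64_t> strides_;
    std::vector<int64_t> dilations_;
};
```

Returning a raw pointer plus a separate count keeps the getters C-compatible, at the cost of the caller having to pair them correctly.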

[Minor Enhancement] Execution plans and operation graphs print more information in their describe() methods.

[Bug Fixes] Some samples have been updated to go over all fallback configs to ensure that a successful plan is built.

[Bug Fixes] Execution plans incorrectly initialized the numerical note CUDNN_NUMERICAL_NOTE_TYPE_TENSOR_CORE. This has been fixed.

[Samples] Added a new sample that applies scale and bias to two tensors, adds the results and then applies a ReLU, to show how fused operations work.
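Per element, the fused graph in that sample computes ReLU((x0 * scale0 + bias0) + (x1 * scale1 + bias1)). A host-side reference like the one below is handy for validating GPU output; it is a sketch that assumes scalar (per-tensor) scales and biases, whereas the sample may use per-channel tensors, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host reference for:
//   y = ReLU((x0 * scale0 + bias0) + (x1 * scale1 + bias1))
std::vector<float> scale_bias_add_relu(const std::vector<float> &x0, float scale0, float bias0,
                                       const std::vector<float> &x1, float scale1, float bias1) {
    assert(x0.size() == x1.size());
    std::vector<float> y(x0.size());
    for (std::size_t i = 0; i < x0.size(); ++i) {
        const float a = x0[i] * scale0 + bias0;  // scale + bias on the first input
        const float b = x1[i] * scale1 + bias1;  // scale + bias on the second input
        y[i] = std::max(a + b, 0.0f);            // add, then ReLU
    }
    return y;
}
```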

[Samples] Added a sample to demonstrate how the resample operation works.

[Samples] Added a new sample which shows convolution followed by multiple scales.

[Samples] Added a sample to show Fully Connected Layer fused with GeLU forward.

[Samples] Added a new sample to show fused backward activation, backward bias and backward Data Grad operation.

The current frontend (FE) is designed to be compatible with all minor releases of cuDNN 8.x.

v0.7.1
[Enhancement] Additional commit to remove an extraneous include of cudnn_ops_infer.h.

v0.7.2
[Enhancement] Fixed issues in the code which caused warnings in MSVC and clang compilers.

[Enhancement] Fixed errors in get_heuristics_list where, for certain heuristics modes in older cuDNN versions, the heuristics list could be incorrect.

[Bug fixes] Fixed several test cases that failed on unsupported GPUs so that they now exit gracefully.

[Samples] Added a sample to showcase fp8 convolution forward on NVIDIA Hopper GPUs. The sample also showcases post-convolution bookkeeping operations such as scaling and absolute maximum reduction.

[Samples] Added a sample that converts an fp16 tensor to fp8 and performs a transpose and an absolute maximum reduction.
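The bookkeeping in these fp8 samples comes down to an absolute-maximum (amax) reduction, an optional transpose, and a scale derived from the amax. A minimal host-side sketch, using float in place of fp16 for simplicity and assuming the FP8 E4M3 format (largest finite value 448); the helper names are hypothetical, and the actual fp8 conversion is left to cuDNN.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Absolute maximum reduction: max_i |x_i|.
float amax(const std::vector<float> &x) {
    float m = 0.0f;
    for (float v : x) m = std::max(m, std::fabs(v));
    return m;
}

// Transpose a row-major (rows x cols) matrix into a row-major (cols x rows) one.
std::vector<float> transpose(const std::vector<float> &x, std::size_t rows, std::size_t cols) {
    assert(x.size() == rows * cols);
    std::vector<float> out(x.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            out[c * rows + r] = x[r * cols + c];
    return out;
}

// Scale that maps amax onto the E4M3 range (assumed max finite value: 448).
float fp8_e4m3_scale(float amax_value) {
    constexpr float kE4M3Max = 448.0f;
    return amax_value > 0.0f ? kE4M3Max / amax_value : 1.0f;
}
```

Keeping the amax from one step to derive the scale for the next is a common recipe for fp8 training, but how the samples consume the amax is not stated here.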

[Samples] Added a sample to demonstrate the max pooling operation, including the tensor index dump needed to speed up the backward pass.

[Samples] Added a sample to showcase the backward pooling operation.
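Why the index dump helps: if the forward pass records where each window's maximum came from, the backward pass can scatter gradients directly instead of re-scanning every window. A host-side sketch for 2x2 max pooling with stride 2 on one H x W channel; the helper names, window size and flat-index layout are assumptions and do not describe the layout cuDNN uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical forward result: pooled values plus the flat input index of each max.
struct PoolResult {
    std::vector<float> y;
    std::vector<int64_t> idx;
};

// 2x2 max pooling, stride 2, single channel, H and W assumed even.
PoolResult maxpool2x2_fwd(const std::vector<float> &x, std::size_t H, std::size_t W) {
    assert(H % 2 == 0 && W % 2 == 0 && x.size() == H * W);
    const std::size_t OH = H / 2, OW = W / 2;
    PoolResult r{std::vector<float>(OH * OW), std::vector<int64_t>(OH * OW)};
    for (std::size_t oh = 0; oh < OH; ++oh)
        for (std::size_t ow = 0; ow < OW; ++ow) {
            float best = -std::numeric_limits<float>::infinity();
            int64_t best_i = -1;
            for (std::size_t kh = 0; kh < 2; ++kh)
                for (std::size_t kw = 0; kw < 2; ++kw) {
                    const std::size_t i = (2 * oh + kh) * W + (2 * ow + kw);
                    if (x[i] > best) {  // first maximum wins on ties
                        best = x[i];
                        best_i = static_cast<int64_t>(i);
                    }
                }
            r.y[oh * OW + ow] = best;
            r.idx[oh * OW + ow] = best_i;  // the "index dump"
        }
    return r;
}

// Backward: route each dy to the saved index; every other dx entry stays 0.
std::vector<float> maxpool_bwd(const std::vector<float> &dy, const std::vector<int64_t> &idx,
                               std::size_t input_size) {
    assert(dy.size() == idx.size());
    std::vector<float> dx(input_size, 0.0f);
    for (std::size_t o = 0; o < dy.size(); ++o) dx[static_cast<std::size_t>(idx[o])] += dy[o];
    return dx;
}
```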

v0.7.1

27 Aug 04:07
171a7a9

Release Notes:

cudnn_frontend v0.7 targets the new features introduced in cuDNN v8.5 (https://developer.nvidia.com/cudnn). The following are the changes in the v0.7 release.

[New API] Added support for Resample operation.

[New API] The Tensor class has a clone method, which allows users to quickly create a new Tensor object with similar attributes.

[New API] Added support for new pointwise operations CUDNN_POINTWISE_ERF, CUDNN_POINTWISE_GELU_APPROX_TANH_FWD, CUDNN_POINTWISE_GELU_APPROX_TANH_BWD, CUDNN_POINTWISE_IDENTITY.

[New API] Several API names have been unified and made consistent across multiple descriptors for readability.

  • setComputePrecision/setMathPrecision/setMathType have been unified into setComputeType in cudnn_frontend_ConvDesc.h, cudnn_frontend_MatMulDesc.h, cudnn_frontend_Operation.h, cudnn_frontend_PointWiseDesc.h, cudnn_frontend_ReductionDesc.h and cudnn_frontend_Resample.h.
  • Math operations such as ConvDesc and ResampleDesc use getSpatialDimCount instead of getDimCount to avoid confusion with tensor dimensions.
  • Array accessors follow the pattern [g,s]et[Spatial]<AttributeName>. The [Spatial] prefix is only needed when the attribute is common to both the Tensor descriptor and the Operation descriptor. Currently, only the Stride and DimCount attributes are ambiguous.
    • setArray functions take a size and a pointer as arguments, e.g., setStride(int dim, int64_t* arr) and setSpatialStride(int dim, int64_t* arr).
    • getArray functions return a pointer to the array, whose size is given by getDimCount or getSpatialDimCount.

[Minor Enhancement] Execution plans and operation graphs print more information in their describe() methods.

[Bug Fixes] Some samples have been updated to go over all fallback configs to ensure that a successful plan is built.

[Bug Fixes] Execution plans incorrectly initialized the numerical note CUDNN_NUMERICAL_NOTE_TYPE_TENSOR_CORE. This has been fixed.

[Samples] Added a new sample that applies scale and bias to two tensors, adds the results and then applies a ReLU, to show how fused operations work.

[Samples] Added a sample to demonstrate how the resample operation works.

[Samples] Added a new sample which shows convolution followed by multiple scales.

[Samples] Added a sample to show Fully Connected Layer fused with GeLU forward.

[Samples] Added a new sample to show fused backward activation, backward bias and backward Data Grad operation.

The current frontend (FE) is designed to be compatible with all minor releases of cuDNN 8.x.

v0.7.1

[Enhancement] Additional commit to remove an extraneous include of cudnn_ops_infer.h.

v0.7

25 Aug 01:28
581b915

Release Notes:

cudnn_frontend v0.7 targets the new features introduced in cuDNN v8.5 (https://developer.nvidia.com/cudnn). The following are the changes in the v0.7 release.

[New API] Added support for Resample operation.

[New API] The Tensor class has a clone method, which allows users to quickly create a new Tensor object with similar attributes.

[New API] Added support for new pointwise operations CUDNN_POINTWISE_ERF, CUDNN_POINTWISE_GELU_APPROX_TANH_FWD, CUDNN_POINTWISE_GELU_APPROX_TANH_BWD, CUDNN_POINTWISE_IDENTITY.

[New API] Several API names have been unified and made consistent across multiple descriptors for readability.

  • setComputePrecision/setMathPrecision/setMathType have been unified into setComputeType in cudnn_frontend_ConvDesc.h, cudnn_frontend_MatMulDesc.h, cudnn_frontend_Operation.h, cudnn_frontend_PointWiseDesc.h, cudnn_frontend_ReductionDesc.h and cudnn_frontend_Resample.h.
  • Math operations such as ConvDesc and ResampleDesc use getSpatialDimCount instead of getDimCount to avoid confusion with tensor dimensions.
  • Array accessors follow the pattern [g,s]et[Spatial]<AttributeName>. The [Spatial] prefix is only needed when the attribute is common to both the Tensor descriptor and the Operation descriptor. Currently, only the Stride and DimCount attributes are ambiguous.
    • setArray functions take a size and a pointer as arguments, e.g., setStride(int dim, int64_t* arr) and setSpatialStride(int dim, int64_t* arr).
    • getArray functions return a pointer to the array, whose size is given by getDimCount or getSpatialDimCount.

[Minor Enhancement] Execution plans and operation graphs print more information in their describe() methods.

[Bug Fixes] Some samples have been updated to go over all fallback configs to ensure that a successful plan is built.

[Bug Fixes] Execution plans incorrectly initialized the numerical note CUDNN_NUMERICAL_NOTE_TYPE_TENSOR_CORE. This has been fixed.

[Samples] Added a new sample that applies scale and bias to two tensors, adds the results and then applies a ReLU, to show how fused operations work.

[Samples] Added a sample to demonstrate how the resample operation works.

[Samples] Added a new sample which shows convolution followed by multiple scales.

[Samples] Added a sample to show Fully Connected Layer fused with GeLU forward.

[Samples] Added a new sample to show fused backward activation, backward bias and backward Data Grad operation.

v0.6.3

14 Jul 21:48
2ce4797
  • [New Feature] Serialization:

    Execution Plan Serialization and Deserialization (Experimental)

    cuDNN v8.4 and above provides execution plan serialization and deserialization, which saves the execution plan as a string in JSON format. The execution plan can then be restored from that string at a later point, which also saves compilation time compared to rebuilding the plan from scratch. Currently, this is an experimental feature that only supports the runtime fusion engine. No forward/backward or cross-device compatibility guarantee is offered at this time.

    API:

      - std::string cudnn_frontend::ExecutionPlan_v8::getJsonRepresentation() : Serialize the execution plan into a string in JSON format.
      - cudnn_frontend::ExecutionPlan_v8&& cudnn_frontend::ExecutionPlanBuilder_v8::loadFromJson(const std::string &json_plan) : Deserialize from a string containing the JSON representation of the execution plan.
    
  • [New API] Added a new API

    get_heuristics_list(std::array<std::string, SIZE> modes,
      OperationGraph_v8 &opGraph,
      std::function<bool(cudnnBackendDescriptor_t)> filter_fn,
      EngineConfigList &filtered_configs,
      bool evaluate_all = false)
    

    This function takes a parameter list of heuristics modes ("heuristics_instant", "heuristic_fallback", "heuristic_mode_b") and computes a list of engine configs that do not satisfy the blocking condition in filter_fn. The function can optionally be set to keep going even if one of the modes fails.

  • [New Features] Added support for BN Finalize, i.e., generation of the mean and variance needed to perform batch normalization.

  • [New Features] Added support for BN Stats fusion pattern. This pattern covers Scale, Bias, Relu, Conv and generation of SUM and SQSUM for batch normalization.

  • [New Features] Added support for CUDNN_POINTWISE_GEN_INDEX and CUDNN_POINTWISE_BINARY_SELECT pointwise operations added in cuDNN 8.4.0.

  • [Cleanup] Fixed a bug where using CUDNN_HEUR_MODE_B in multiple threads could lead to a crash under certain conditions.

  • [Cleanup] Added CUDNN_PATH to CMakeLists.txt, allowing users to build against a different cuDNN installation path.

  • [Cleanup] Made the Engine_v8 constructor default, which prevents the status from being overwritten during knob creation.

  • [Cleanup] The variant pack now takes UIDs as a const pointer.

  • [Cleanup] When logging was enabled and none of the plans returned by heuristics was finalizable, the frontend crashed. This is now fixed.

  • [Samples] Added a new sample to showcase CUDNN_POINTWISE_GEN_INDEX and CUDNN_POINTWISE_BINARY_SELECT pointwise operations.

  • [Samples] Modified the MHA sample to show improved numerical stability. Investigation is ongoing to further improve the MHA sample.

  • [Samples] Added samples for fused operation graph for BN Stats generation and stats finalization.

  • Added missing return statements for operation.

  • Added warn-as-error to the samples Makefile.

  • Addressed multiple compiler warnings triggered by clang:

    • Unused variables.
    • Undefined destructors for classes with virtual methods.

  • During the heuristics query, if heur_mode_b fails, the query falls back to heur_mode_a (heur_mode_instant).

  • Fixed a bug by initializing the numerical notes and behavior notes to max values instead of 0.
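The filtering behavior of get_heuristics_list can be modeled with plain containers. In the sketch below, `ToyConfig`, `ModeResult` and `toy_get_heuristics_list` are hypothetical stand-ins (the real function works on `OperationGraph_v8` and `cudnnBackendDescriptor_t`). It assumes that configs for which `filter_fn` returns true are dropped, and that without `evaluate_all` the query stops at the first failing mode; returning the number of failed modes is an illustrative choice, not the real return type.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins: a config is reduced to an engine id, and each
// heuristics mode either succeeds with a list of configs or fails.
struct ToyConfig { int engine_id; };
struct ModeResult { bool ok; std::vector<ToyConfig> configs; };

// Query each mode in order and keep configs for which filter_fn is false.
// Stops at the first failing mode unless evaluate_all is true.
// Returns the number of modes that failed.
int toy_get_heuristics_list(const std::vector<std::string> &modes,
                            const std::map<std::string, ModeResult> &backend,
                            const std::function<bool(const ToyConfig &)> &filter_fn,
                            std::vector<ToyConfig> &filtered_configs,
                            bool evaluate_all = false) {
    int failures = 0;
    for (const auto &mode : modes) {
        auto it = backend.find(mode);
        if (it == backend.end() || !it->second.ok) {
            ++failures;
            if (!evaluate_all) break;  // default: stop at the first failure
            continue;                  // evaluate_all: try the remaining modes
        }
        for (const auto &cfg : it->second.configs)
            if (!filter_fn(cfg)) filtered_configs.push_back(cfg);  // not blocked
    }
    return failures;
}
```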

v0.6.2

21 Apr 20:06
43709ab
  • [New Feature] Serialization:

    Execution Plan Serialization and Deserialization (Experimental)

    cuDNN v8.4 and above provides execution plan serialization and deserialization, which saves the execution plan as a string in JSON format. The execution plan can then be restored from that string at a later point, which also saves compilation time compared to rebuilding the plan from scratch. Currently, this is an experimental feature that only supports the runtime fusion engine. No forward/backward or cross-device compatibility guarantee is offered at this time.

    API:

      - std::string cudnn_frontend::ExecutionPlan_v8::getJsonRepresentation() : Serialize the execution plan into a string in JSON format.
      - cudnn_frontend::ExecutionPlan_v8&& cudnn_frontend::ExecutionPlanBuilder_v8::loadFromJson(const std::string &json_plan) : Deserialize from a string containing the JSON representation of the execution plan.
    
  • [New API] Added a new API

    get_heuristics_list(std::array<std::string, SIZE> modes,
      OperationGraph_v8 &opGraph,
      std::function<bool(cudnnBackendDescriptor_t)> filter_fn,
      EngineConfigList &filtered_configs,
      bool evaluate_all = false)
    

    This function takes a parameter list of heuristics modes ("heuristics_instant", "heuristic_fallback", "heuristic_mode_b") and computes a list of engine configs that do not satisfy the blocking condition in filter_fn. The function can optionally be set to keep going even if one of the modes fails.

  • [New Features] Added support for BN Finalize, i.e., generation of the mean and variance needed to perform batch normalization.

  • [New Features] Added support for BN Stats fusion pattern. This pattern covers Scale, Bias, Relu, Conv and generation of SUM and SQSUM for batch normalization.

  • [New Features] Added support for CUDNN_POINTWISE_GEN_INDEX and CUDNN_POINTWISE_BINARY_SELECT pointwise operations added in cuDNN 8.4.0.

  • [Cleanup] Fixed a bug where using CUDNN_HEUR_MODE_B in multiple threads could lead to a crash under certain conditions.

  • [Cleanup] Added CUDNN_PATH to CMakeLists.txt, allowing users to build against a different cuDNN installation path.

  • [Cleanup] Made the Engine_v8 constructor default, which prevents the status from being overwritten during knob creation.

  • [Cleanup] The variant pack now takes UIDs as a const pointer.

  • [Cleanup] When logging was enabled and none of the plans returned by heuristics was finalizable, the frontend crashed. This is now fixed.

  • [Samples] Added a new sample to showcase CUDNN_POINTWISE_GEN_INDEX and CUDNN_POINTWISE_BINARY_SELECT pointwise operations.

  • [Samples] Modified the MHA sample to show improved numerical stability. Investigation is ongoing to further improve the MHA sample.

  • [Samples] Added samples for fused operation graph for BN Stats generation and stats finalization.

  • Added missing return statements for operation.

  • Added warn-as-error to the samples Makefile.

  • Addressed multiple compiler warnings triggered by clang:

    • Unused variables.
    • Undefined destructors for classes with virtual methods.
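BN Finalize turns the per-channel SUM and SQSUM produced by the BN Stats pattern into a mean and variance. A minimal host-side sketch of that arithmetic, assuming the biased (population) variance over n elements and an epsilon added before the inverse square root; `bn_finalize` and `BnStats` are hypothetical names and do not reflect the exact outputs of the cuDNN operation.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical per-channel statistics derived from SUM and SQSUM.
struct BnStats {
    double mean;
    double var;
    double inv_std;
};

// mean = SUM / n, var = SQSUM / n - mean^2 (biased variance).
BnStats bn_finalize(double sum, double sqsum, int64_t n, double epsilon) {
    assert(n > 0);
    const double mean = sum / static_cast<double>(n);
    double var = sqsum / static_cast<double>(n) - mean * mean;
    if (var < 0.0) var = 0.0;  // clamp tiny negative values caused by rounding
    return {mean, var, 1.0 / std::sqrt(var + epsilon)};
}
```

Accumulating only SUM and SQSUM is what allows the statistics to be produced inside the fused convolution pass; the trade-off is cancellation error in SQSUM / n - mean^2 when the mean is large relative to the spread, which is why the sketch clamps the variance at 0.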

v0.6.1

09 Apr 00:22
fa61199

cuDNN Frontend v0.6 release

  • [New Feature] Serialization: (#26)

Execution Plan Serialization and Deserialization (Experimental)

cuDNN v8.4 and above provides execution plan serialization and deserialization, which saves the execution plan as a string in JSON format. The execution plan can then be restored from that string at a later point, which also saves compilation time compared to rebuilding the plan from scratch. Currently, this is an experimental feature that only supports the runtime fusion engine. No forward/backward or cross-device compatibility guarantee is offered at this time.

### API:
    - std::string cudnn_frontend::ExecutionPlan_v8::getJsonRepresentation() : Serialize the execution plan into a string in JSON format.
    - cudnn_frontend::ExecutionPlan_v8&& cudnn_frontend::ExecutionPlanBuilder_v8::loadFromJson(const std::string &json_plan) : Deserialize from a string containing the JSON representation of the execution plan.
  • [New API] Added a new API

    get_heuristics_list(std::array<std::string, SIZE> modes,
      OperationGraph_v8 &opGraph,
      std::function<bool(cudnnBackendDescriptor_t)> filter_fn,
      EngineConfigList &filtered_configs,
      bool evaluate_all = false)
    

    This function takes a parameter list of heuristics modes ("heuristics_instant", "heuristic_fallback", "heuristic_mode_b") and computes a list of engine configs that do not satisfy the blocking condition in filter_fn. The function can optionally be set to keep going even if one of the modes fails.

  • [New Features] Added support for BN Finalize, i.e., generation of the mean and variance needed to perform batch normalization.

  • [New Features] Added support for BN Stats fusion pattern. This pattern covers Scale, Bias, Relu, Conv and generation of SUM and SQSUM for batch normalization.

  • [New Features] Added support for CUDNN_POINTWISE_GEN_INDEX and CUDNN_POINTWISE_BINARY_SELECT pointwise operations added in cuDNN 8.4.0.

  • [Cleanup] Fixed a bug where using CUDNN_HEUR_MODE_B in multiple threads could lead to a crash under certain conditions.

  • [Cleanup] Added CUDNN_PATH to CMakeLists.txt, allowing users to build against a different cuDNN installation path.

  • [Cleanup] Made the Engine_v8 constructor default, which prevents the status from being overwritten during knob creation.

  • [Cleanup] The variant pack now takes UIDs as a const pointer.

  • [Cleanup] When logging was enabled and none of the plans returned by heuristics was finalizable, the frontend crashed. This is now fixed.

  • [Samples] Added a new sample to showcase CUDNN_POINTWISE_GEN_INDEX and CUDNN_POINTWISE_BINARY_SELECT pointwise operations.

  • [Samples] Modified the MHA sample to show improved numerical stability. Investigation is ongoing to further improve the MHA sample.

  • [Samples] Added samples for fused operation graph for BN Stats generation and stats finalization.

v0.6.1

  • [Cleanup] Patched compilation errors with cuDNN v8.3 and below.

cuDNN Frontend v0.6

07 Apr 18:11
e8e186a

cuDNN Frontend v0.6 release

  • [New Feature] Serialization: (#26)

Execution Plan Serialization and Deserialization (Experimental)

cuDNN v8.4 and above provides execution plan serialization and deserialization, which saves the execution plan as a string in JSON format. The execution plan can then be restored from that string at a later point, which also saves compilation time compared to rebuilding the plan from scratch. Currently, this is an experimental feature that only supports the runtime fusion engine. No forward/backward or cross-device compatibility guarantee is offered at this time.

### API:
    - std::string cudnn_frontend::ExecutionPlan_v8::getJsonRepresentation() : Serialize the execution plan into a string in JSON format.
    - cudnn_frontend::ExecutionPlan_v8&& cudnn_frontend::ExecutionPlanBuilder_v8::loadFromJson(const std::string &json_plan) : Deserialize from a string containing the JSON representation of the execution plan.
  • [New API] Added a new API

    get_heuristics_list(std::array<std::string, SIZE> modes,
      OperationGraph_v8 &opGraph,
      std::function<bool(cudnnBackendDescriptor_t)> filter_fn,
      EngineConfigList &filtered_configs,
      bool evaluate_all = false)
    

    This function takes a parameter list of heuristics modes ("heuristics_instant", "heuristic_fallback", "heuristic_mode_b") and computes a list of engine configs that do not satisfy the blocking condition in filter_fn. The function can optionally be set to keep going even if one of the modes fails.

  • [New Features] Added support for BN Finalize, i.e., generation of the mean and variance needed to perform batch normalization.

  • [New Features] Added support for BN Stats fusion pattern. This pattern covers Scale, Bias, Relu, Conv and generation of SUM and SQSUM for batch normalization.

  • [New Features] Added support for CUDNN_POINTWISE_GEN_INDEX and CUDNN_POINTWISE_BINARY_SELECT pointwise operations added in cuDNN 8.4.0.

  • [Cleanup] Fixed a bug where using CUDNN_HEUR_MODE_B in multiple threads could lead to a crash under certain conditions.

  • [Cleanup] Added CUDNN_PATH to CMakeLists.txt, allowing users to build against a different cuDNN installation path.

  • [Cleanup] Made the Engine_v8 constructor default, which prevents the status from being overwritten during knob creation.

  • [Cleanup] The variant pack now takes UIDs as a const pointer.

  • [Cleanup] When logging was enabled and none of the plans returned by heuristics was finalizable, the frontend crashed. This is now fixed.

  • [Samples] Added a new sample to showcase CUDNN_POINTWISE_GEN_INDEX and CUDNN_POINTWISE_BINARY_SELECT pointwise operations.

  • [Samples] Modified the MHA sample to show improved numerical stability. Investigation is ongoing to further improve the MHA sample.

  • [Samples] Added samples for fused operation graph for BN Stats generation and stats finalization.
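One way these two pointwise operations combine is to build a mask inside a fused graph: GEN_INDEX yields each element's coordinate along an axis, a comparison turns two index tensors into a predicate, and BINARY_SELECT picks between two inputs based on that predicate. The host-side sketch applies a causal (lower-triangular) mask to an S x S score matrix. The helpers are hypothetical, the operand order `pred ? a : b` is an assumption, and whether the samples use exactly this pattern is not stated in these notes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// GEN_INDEX-like helper: each element's index along `axis` (0 = row, 1 = column)
// for a row-major rows x cols tensor.
std::vector<int64_t> gen_index(std::size_t rows, std::size_t cols, int axis) {
    assert(axis == 0 || axis == 1);
    std::vector<int64_t> out(rows * cols);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            out[r * cols + c] = static_cast<int64_t>(axis == 0 ? r : c);
    return out;
}

// BINARY_SELECT-like helper: y = pred ? a : b, elementwise.
std::vector<float> binary_select(const std::vector<bool> &pred, const std::vector<float> &a,
                                 const std::vector<float> &b) {
    assert(pred.size() == a.size() && a.size() == b.size());
    std::vector<float> y(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) y[i] = pred[i] ? a[i] : b[i];
    return y;
}

// Causal mask: keep scores where col <= row, otherwise use -infinity.
std::vector<float> causal_mask(const std::vector<float> &scores, std::size_t s) {
    assert(scores.size() == s * s);
    const auto row = gen_index(s, s, 0);
    const auto col = gen_index(s, s, 1);
    std::vector<bool> keep(s * s);
    for (std::size_t i = 0; i < s * s; ++i) keep[i] = col[i] <= row[i];
    const std::vector<float> neg_inf(s * s, -std::numeric_limits<float>::infinity());
    return binary_select(keep, scores, neg_inf);
}
```

Using -infinity as the masked value means a subsequent softmax assigns those positions exactly zero weight.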

v0.5.1

25 Jan 04:35
7b83dba
  • Fixed an issue where the cuDNN Frontend API always used the default stream in cudnn_find_plan (autotuning). Now, the stream is queried from the handle.
  • Updated CMakeLists.txt to depend on the CUDNN_FRONTEND_PATH environment variable.
  • Fixed compilation warnings for missing return values.

cudnn_frontend v0.5

12 Nov 06:30

Release Notes:

- [New API]: Execution Plan Caching
   ## Execution Plan Caching
      Through heuristics, cuDNN provides a way to query a list of good engine configs. Based on this query, the cudnn_frontend_find_plan function runs all the engine configs on the user's system and returns a sorted list of plans.
      Running multiple plans through several iterations is time-consuming. The ExecutionPlanCache allows users to build a cache that uses the operation graph as the key for looking up an execution plan. It is the user's responsibility to maintain different caches for different types of operation graphs (e.g., a different cache for convolutionForward than for Dgrad or Wgrad).
  
      ### API:
       - add_plan_to_cache(const cudnn_frontend::OperationGraph &op_graph, const cudnn_frontend::ExecutionPlan &plan) : Creates a mapping between the operation graph and executionPlan
       - bool get_plan_from_cache(const cudnn_frontend::OperationGraph &op_graph, const cudnn_frontend::ExecutionPlan *&plan) : Sets the executionPlan in the plan pointer and returns true if found. 
       - cudnnFindPlanAndCache(cudnnHandle_t handle, cudnn_frontend::OperationGraph &opGraph, cudnn_frontend::VariantPack const &variantPack, cudnn_frontend::ExecutionPlanCache &cache, Predicate pred) -> cudnn_frontend::ExecutionPlan : The above API chains the output of cudnn_frontend_find_plan and caches the result for future usage.
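A toy model of the caching contract described above: the operation graph acts as the key, `add_plan_to_cache` stores a mapping, and `get_plan_from_cache` sets the pointer and returns true on a hit. `ToyOpGraph`, `ToyPlan`, `ToyPlanCache` and the string key are hypothetical stand-ins, not the library's types.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

struct ToyOpGraph { std::string tag; };  // stand-in for an operation graph key
struct ToyPlan {
    std::string engine_tag;
    long long workspace_bytes;
};

class ToyPlanCache {
  public:
    // Creates (or overwrites) the mapping between the graph and the plan.
    void add_plan_to_cache(const ToyOpGraph &g, const ToyPlan &plan) { plans_[g.tag] = plan; }

    // Sets `plan` and returns true on a hit; leaves `plan` untouched on a miss.
    bool get_plan_from_cache(const ToyOpGraph &g, const ToyPlan *&plan) const {
        auto it = plans_.find(g.tag);
        if (it == plans_.end()) return false;
        plan = &it->second;
        return true;
    }

  private:
    std::unordered_map<std::string, ToyPlan> plans_;
};
```

Following the note above, a program would keep one such cache per graph type (forward, Dgrad, Wgrad), look up first, and only run the expensive autotuning path on a miss. Pointers handed out by `get_plan_from_cache` stay valid across later insertions because `std::unordered_map` does not relocate its elements on rehash.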

- [New Feature]: Adds logging to the cuDNN frontend.
   ## Logging
      cuDNN Frontend API logging records execution flow through cuDNN frontend API. This functionality is disabled by default, and can be enabled through methods described in this section.

      ### Method 1: Using Environment Variables:
      | Environment variables                             | CUDNN_FRONTEND_LOG_INFO=0 | CUDNN_FRONTEND_LOG_INFO=1 |
      | --------------------------------------------------| ------------------------- | -----------               |
      | CUDNN_FRONTEND_LOG_FILE not set                   | No Logging                | No Logging                |
      | CUDNN_FRONTEND_LOG_FILE set to stdout or stderr   | No Logging                | Logging to cout or cerr   |
      | CUDNN_FRONTEND_LOG_FILE set to filename.txt       | No Logging                | Logging to the filename   |

      ### Method 2: Using API calls:
      Calling `cudnn_frontend::isLoggingEnabled() = true|false` has the same effect as setting the environment variable.
      Calling `cudnn_frontend::getStream() = stream_name` can be used to assign the output stream directly.
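The environment-variable table above reduces to a small decision function. This is a sketch of the documented behavior, not the frontend's actual implementation: `LogTarget` and `resolve_log_target` are hypothetical, and treating any value other than `1` as "off" is an assumption, since the table only lists 0 and 1.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

enum class LogTarget { None, Stdout, Stderr, File };

// Encodes the table: logging happens only when CUDNN_FRONTEND_LOG_INFO=1 AND
// CUDNN_FRONTEND_LOG_FILE is set. Arguments are raw environment values
// (nullptr when unset).
LogTarget resolve_log_target(const char *log_info, const char *log_file) {
    if (log_info == nullptr || std::string(log_info) != "1" || log_file == nullptr)
        return LogTarget::None;
    const std::string f(log_file);
    if (f == "stdout") return LogTarget::Stdout;
    if (f == "stderr") return LogTarget::Stderr;
    return LogTarget::File;  // any other value is treated as a file name
}

// Convenience wrapper that reads the current process environment.
LogTarget resolve_log_target_from_env() {
    return resolve_log_target(std::getenv("CUDNN_FRONTEND_LOG_INFO"),
                              std::getenv("CUDNN_FRONTEND_LOG_FILE"));
}
```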

- [New API]: cudnnReorderFilterAndBiasInt8x32: Reorders the filter and bias tensors, which allows tensor cores to be used during Int8x32 convolutions.
- [New Feature]: Added support for the isByValue attribute setting in tensors.

- [Samples]: Clean up Makefile and move to cmake based setup. Allows samples to be compiled on Windows machines.
- [Samples]: Updated samples to query the heuristics for fusion cases.
- [Samples]: Added a new ConvScaleBiasAct_int8 sample to address https://github.com/NVIDIA/cudnn-frontend/issues/8
- [Samples]: Added a sample to demonstrate how execution Plan caching works.
- [Samples]: Added a new sample to show how Multi-Headed Attention can be implemented with runtime fusion. Works with cuDNN 8.3.1.
- [Cleanup]: ExecutionPlan cache has a copy constructor and pre-emptively caches the workspace, numerical notes and behavior notes.
- [Cleanup]: Updated cudnn_frontend_PointWiseDesc.h to include <limits> to fix a gcc 11 compilation error (https://github.com/NVIDIA/cudnn-frontend/pull/10).
- [Cleanup]: Added an out-of-bounds iterator check in Errata.h (https://github.com/NVIDIA/cudnn-frontend/pull/11).
- [Cleanup]: Added default move assignment operators and move constructors to all classes.
- [Cleanup]: CheckcudaError and checkCudnnError now correctly assert instead of silently failing.
- [Cleanup]: Updated the errata filter to no longer block engine ID 0 when running Int8x32.
- [Cleanup]: The default value of knobs in the engine config is no longer 0.

Release 0.4.1

13 Aug 21:43
8360d4a

[Bug Fix]: Fixed an issue where the vector count was not copied over during move construction phase.
[Samples]: Added a new sample for INT8x32 config (utilizing integer tensor cores). The example includes an errata filter which blocks an engine that has a known issue running this config.
[CleanUp]: Changed all move constructors and fixed the move assignment operator.

Co-authored-by: agopal