Add kleidiai as thirdparty #27331
Conversation
Force-pushed from f05c178 to 4b52286
```diff
@@ -175,6 +176,11 @@ if(DNNL_USE_ACL)
     set(OV_CPU_WITH_ACL ON)
 endif()

+if(ENABLE_KLEIDIAI_FOR_CPU)
+    add_definitions(-DOV_CPU_WITH_KLEIDIAI)
+    set(OV_CPU_WITH_KLEIDIAI ON)
```
looks like it's not used
I plan to use it later
ok, why is `ENABLE_KLEIDIAI_FOR_CPU` not enough?
```diff
@@ -218,8 +218,8 @@ void CPUTestsBase::CheckPluginRelatedResultsImpl(const std::shared_ptr<const ov:
     auto primType = getExecValue(ov::exec_model_info::IMPL_TYPE);

-    ASSERT_TRUE(primTypeCheck(primType))
-        << "primType is unexpected : " << primType << " Expected : " << selectedType;
+    // ASSERT_TRUE(primTypeCheck(primType))
```
Should be removed?
This PR will be closed in a week because of 2 weeks of no activity.
Hi. I was trying to use these changes locally to see if KleidiAI gets used for fp32 inference. I see that the ACL executor still gets used and not Kleidi's. Does OpenVINO have to be built with any special flags for this to work, or is the integration not complete yet? I have detailed my experiment setup below for reference.

**Setup**

I replicated the changes in my fork, along with the following additions:

```cpp
using LayoutConfig = std::vector<LayoutType>;
static const LayoutConfig dnnlFCLayoutConfig{LayoutType::ncsp, LayoutType::ncsp, LayoutType::ncsp, LayoutType::ncsp};
static const LayoutConfig aclFCLayoutConfig{LayoutType::ncsp, LayoutType::ncsp, LayoutType::ncsp, LayoutType::ncsp};
// <<< ADDED BY ME >>>
static const LayoutConfig kleidiaiFCLayoutConfig{LayoutType::ncsp, LayoutType::ncsp, LayoutType::ncsp, LayoutType::ncsp};

template <dnnl::impl::cpu::x64::cpu_isa_t ISA>
struct Require {
    bool operator()() {
        return dnnl::impl::cpu::x64::mayiuse(ISA);
    }
};

// clang-format off
static const TypeMapping dnnlFCTypeMapping {
    // {src, wei, bia, dst}                        pt<src, wei, bias, dst>
    {{_bf16, _bf16 | _f32, _any, _bf16 | _f32},    pt(bypass(), bypass(), use<3>(), bypass())},
    {{_f16, _f16, _any, _f16 | _f32},              pt(bypass(), bypass(), use<3>(), bypass())},
    // integer precision outputs are not supported for float precision inputs
    {{_f32 | _bf16 | _f16, _any, _any, _i8 | _u8}, pt(bypass(), bypass(), use<0>(), use<0>())},
    // compresses float weights which do not match input data precision
    {{_f32, _half_float, _any, _any | _any},       pt(bypass(), bypass(), use<0>(), use<0>())},
    {{_bf16, _f16, _any, _any | _any},             pt(bypass(), bypass(), use<0>(), use<0>())},
    {{_f16, _bf16, _any, _any | _any},             pt(bypass(), bypass(), use<0>(), use<0>())},
    // quantization configuration
    // int8 inner_product does not support f16 output and bias
    {{_u8 | _i8, _i8, _u8 | _i8 | _i32 | _bf16 | _f32 | _undefined, _u8 | _i8 | _i32 | _bf16 | _f32}, pt(bypass(), bypass(), bypass(), bypass())},
    {{_u8 | _i8, _i8, _f16, _u8 | _i8 | _i32 | _bf16 | _f32}, pt(bypass(), bypass(), just<f32>(), bypass())},
    {{_u8 | _i8, _i8, _any, _any},                 pt(bypass(), bypass(), just<f32>(), just<f32>())},
    // compresses int weights (@todo more strict requrements for output precision?)
    {{_bf16, _u8 | _i8 | _nf4 | _u4 | _i4 | _f4e2m1, _any, _any}, pt(bypass(), bypass(), use<0>(), use<0>()),
     Require<dnnl::impl::cpu::x64::avx512_core_bf16>()}, // Ticket 122347
    {{_bf16, _u8 | _i8 | _nf4 | _u4 | _i4 | _f4e2m1, _any, _any}, pt(just<f32>(), bypass(), just<f32>(), just<f32>())},
    {{_f32, _u8 | _i8 | _nf4 | _u4 | _i4 | _f4e2m1, _any, _any},  pt(bypass(), bypass(), use<0>(), use<0>())},
    // @todo should we fallback to FPXX instead of _f32?
    {{_any, _any, _any, _any},                     pt(just<f32>(), just<f32>(), just<f32>(), just<f32>())},
    // @todo explicitly cover configuration limitations for oneDNN on ARM
};

static const TypeMapping aclFCTypeMapping {
    // {src, wei, bia, dst}                  pt<src, wei, bias, dst>
    {{_f32 | _f16, _f32 | _f16, _any, _any}, pt(bypass(), bypass(), use<0>(), use<0>())},
    {{_any, _any, _any, _any},               pt(just<f32>(), just<f32>(), just<f32>(), just<f32>())}
};

// <<< ADDED BY ME >>>
static const TypeMapping kleidiaiFCTypeMapping {
    // {src, wei, bia, dst}     pt<src, wei, bias, dst>
    {{_f32, _f32, _any, _f32},  pt(bypass(), bypass(), use<0>(), bypass())},
    {{_any, _any, _any, _any},  pt(just<f32>(), just<f32>(), just<f32>(), just<f32>())}
};

static const TypeMapping aclLowpFCTypeMapping {
    // {src, wei, bia, dst}     pt<src, wei, bias, dst>
    {{_i8, _i8, _any, _f32},    pt(bypass(), bypass(), use<3>(), bypass())}
};

static const MappingNotation dnnlConvolutionMappingNotation {
    ARG_SRC, ARG_WEI, ARG_BIAS, ARG_DST
};

static const MappingNotation aclFullyConnectedMappingNotation {
    ARG_SRC, ARG_WEI, ARG_BIAS, ARG_DST
};

// <<< ADDED BY ME >>>
static const MappingNotation kleidiaiFullyConnectedMappingNotation {
    ARG_SRC, ARG_WEI, ARG_BIAS, ARG_DST
};
```

and this change:

```cpp
// requiresFallback
[](const FCConfig& config) -> ov::optional<executor::Config<FCAttrs>> {
    return requiresFallbackCommon(config,
                                  kleidiaiFCTypeMapping,
                                  kleidiaiFCLayoutConfig,
                                  kleidiaiFullyConnectedMappingNotation);
},
```

Then I built OpenVINO with the usual commands:

```shell
cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_PYTHON=ON -DENABLE_WHEEL=ON ..
cmake --build . --parallel 32
```

and installed the generated wheel.
@dmitry-gorokhov Any insights on the above?
@NishantPrabhuFujitsu I just tried to build this PR with tests enabled (openvino/src/plugins/intel_cpu/src/nodes/executors/fullyconnected_implementations.cpp, lines 224 to 228 in e3424a0).
I am not sure which workload you are trying to run or what the difference in the graph patterns is. I would recommend checking which condition in the code linked above returns false.
```cpp
}

bool MatMulKleidiAIExecutor::supports(const FCConfig& config) {
    if (!config.attrs.weightsNonTransposed)
```
@dmitry-gorokhov I investigated further and found that this check (line 34) fails, causing the Kleidi executor not to be called. Is this behaviour expected? I was running inference for an LLM in exactly the same way as for my past contributions.
I will try running the tests in the meantime.
It is a gap in the current `MatMulKleidiAIExecutor` coverage.
@alvoron generously agreed to help: he will extend `MatMulKleidiAIExecutor` to support the `!config.attrs.weightsNonTransposed` case, so Kleidi will be used on regular LLMs.
Meanwhile, I would recommend working with `ov_cpu_func_tests` as the most convenient way to extend `MatMulKleidiAIExecutor` coverage to new precisions.
Sounds good. Thanks @alvoron, looking forward to getting this to work soon. In the meantime, I'll work on integrating the int8 microkernels.
@NishantPrabhuFujitsu I made some changes to support weights transpose. I picked the current PR changes, rebased to the latest master and applied the weights transpose changes. Could you please try my PR? #28830
I checked that all `smoke_FC_KLEIDIAI_2D` tests passed. They include several tests with `weightsNonTransposed` that are executed by kleidiai, so I assume you can try the `weightsNonTransposed` cases as well.
Please let me know if any issues are observed; I'll fix them.
@alvoron I tried your PR, and matmuls in LLM inference (the `weightsNonTransposed` case) are now executed by Kleidi. Thanks for helping!
However, I have noticed the following drawbacks.

- Inference with kleidi is really slow. Please find below some benchmarking results where I compare kleidi with `gemm:acl` for `f32:f32:f32` single-prompt inference. To generate these results, I exported `TinyLlama-1.1B-Chat-v1.0` with `optimum` in `fp32` weight format and used the `f32` precision hint during inference for both cases.
- Inference with kleidi consumes a lot of memory. While running the above benchmark, inference with ACL needed <6 GB of RAM while kleidi consumed >100 GB (and was going to consume even more); I had to cut the benchmarking short to prevent the process from getting killed. I am currently not sure what the cause of this is.

Let me know if you have any insights on the above. I'll investigate further from my end as well, while working on integrating the int8 microkernels.
Thanks @NishantPrabhuFujitsu, glad to know it works now.
I left a couple of comments in #28830. These recommendations should help dramatically improve the performance and avoid the memory leaks.
Also, I hit a build error when trying to compile with my setup.
This is something unknown; I haven't seen it before.
I am compiling with GCC 12.3.0 on Ubuntu 22.04.5 LTS, kernel version 6.8.0-1021-aws. The machine is an AWS Graviton3 with 32 cores. The exact build commands used (after installing the required dependencies) are:

```shell
openvino/build$ cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_PYTHON=ON -DENABLE_WHEEL=ON -DENABLE_TESTS=ON ..
openvino/build$ cmake --build . --parallel 32
```
I'll try to reproduce it on AWS.
UPD: I was not able to reproduce the issue using GCC 11.4.0 (Ubuntu 22.04.5 LTS / 6.8.0-1021-aws); the build completed successfully using your commands. To avoid the issue, I'd suggest downgrading to gcc-11, taking into account that Ubuntu 22.04 comes with GCC 11 by default.
@alvoron I was able to compile successfully using gcc-11, so I'll stick with that for now. There's no requirement to use gcc-12 specifically.
Since we will not merge this PR, I would suggest moving all further work/discussion into #28830
### Details:
- `kleidiai` is added as a git submodule
- `kleidiai` is built statically and linked into the CPU plugin library
- MatMul kleidiai executor is added
- weights transpose is supported in the MatMul kleidiai executor
- Initial implementation is inherited from #27331

### Tickets:
- *ticket-id*