Add kleidiai as thirdparty #27331

Closed
mory91 wants to merge 16 commits from the add-kleidiai-thirdparty branch

mory91
Contributor

@mory91 mory91 commented Oct 30, 2024

Details:

  • This PR adds KleidiAI as a third-party library.

@mory91 mory91 requested review from a team as code owners October 30, 2024 09:09
@github-actions github-actions bot added the category: CPU, category: build, and category: dependency_changes labels Oct 30, 2024
@sys-openvino-ci sys-openvino-ci added the ExternalPR label Oct 30, 2024
@mory91 mory91 force-pushed the add-kleidiai-thirdparty branch from f05c178 to 4b52286 Compare October 30, 2024 09:18
@dmitry-gorokhov dmitry-gorokhov self-assigned this Oct 30, 2024
@dmitry-gorokhov dmitry-gorokhov added the platform: arm label Oct 30, 2024
@alvoron
Contributor

alvoron commented Nov 25, 2024

build_jenkins

@mory91 mory91 requested a review from a team as a code owner November 28, 2024 07:53
@mory91 mory91 requested review from ilya-lavrenov and removed request for a team November 28, 2024 07:53
@alvoron
Contributor

alvoron commented Nov 28, 2024

build_jenkins

@@ -175,6 +176,11 @@ if(DNNL_USE_ACL)
    set(OV_CPU_WITH_ACL ON)
endif()

if(ENABLE_KLEIDIAI_FOR_CPU)
    add_definitions(-DOV_CPU_WITH_KLEIDIAI)
    set(OV_CPU_WITH_KLEIDIAI ON)

Contributor

looks like it's not used

Contributor Author

I plan to use it later

Contributor

OK, why is ENABLE_KLEIDIAI_FOR_CPU not enough?

@alvoron
Contributor

alvoron commented Nov 28, 2024

build_jenkins

@alvoron
Contributor

alvoron commented Nov 29, 2024

build_jenkins

@alvoron
Contributor

alvoron commented Nov 29, 2024

build_jenkins

@alvoron
Contributor

alvoron commented Nov 29, 2024

build_jenkins

@alvoron
Contributor

alvoron commented Nov 29, 2024

build_jenkins

@alvoron
Contributor

alvoron commented Nov 29, 2024

build_jenkins

@alvoron
Contributor

alvoron commented Dec 2, 2024

build_jenkins

@alvoron
Contributor

alvoron commented Dec 2, 2024

build_jenkins

@alvoron
Contributor

alvoron commented Dec 3, 2024

build_jenkins

@@ -218,8 +218,8 @@ void CPUTestsBase::CheckPluginRelatedResultsImpl(const std::shared_ptr<const ov:

auto primType = getExecValue(ov::exec_model_info::IMPL_TYPE);

ASSERT_TRUE(primTypeCheck(primType))
<< "primType is unexpected : " << primType << " Expected : " << selectedType;
// ASSERT_TRUE(primTypeCheck(primType))

Contributor

Should be removed?

Contributor

github-actions bot commented Jan 1, 2025

This PR will be closed in a week because of 2 weeks of no activity.

@NishantPrabhuFujitsu
Contributor

NishantPrabhuFujitsu commented Jan 27, 2025

Hi. I was trying to use these changes locally to see if KleidiAI gets used for fp32 inference. I see that the ACL executor still gets used and not Kleidi's. Does OpenVINO have to be built with any special flags for this to work, or is integration not complete yet? I have detailed my experiment setup below for reference.

Setup

I replicated the changes in my fork along with the following changes in intel_cpu/src/nodes/executors/fullyconnected_implementations.cpp (see // <<< ADDED BY ME >>> blocks):

using LayoutConfig = std::vector<LayoutType>;
static const LayoutConfig dnnlFCLayoutConfig{LayoutType::ncsp, LayoutType::ncsp, LayoutType::ncsp, LayoutType::ncsp};
static const LayoutConfig aclFCLayoutConfig{LayoutType::ncsp, LayoutType::ncsp, LayoutType::ncsp, LayoutType::ncsp};
// <<< ADDED BY ME >>>
static const LayoutConfig kleidiaiFCLayoutConfig{LayoutType::ncsp, LayoutType::ncsp, LayoutType::ncsp, LayoutType::ncsp};

template <dnnl::impl::cpu::x64::cpu_isa_t ISA>
struct Require {
    bool operator()() {
        return dnnl::impl::cpu::x64::mayiuse(ISA);
    }
};

// clang-format off
static const TypeMapping dnnlFCTypeMapping {
    // {src, wei, bia, dst}                                   pt<src, wei, bias, dst>
    {{_bf16, _bf16 | _f32, _any, _bf16 | _f32},               pt(bypass(), bypass(), use<3>(), bypass())},
    {{_f16, _f16, _any, _f16 | _f32},                         pt(bypass(), bypass(), use<3>(), bypass())},
    // integer precision outputs are not supported for float precision inputs
    {{_f32 | _bf16 | _f16, _any, _any, _i8 | _u8},            pt(bypass(), bypass(), use<0>(), use<0>())},
    // compresses float weights which do not match input data precision
    {{_f32, _half_float, _any, _any | _any},                  pt(bypass(), bypass(), use<0>(), use<0>())},
    {{_bf16, _f16, _any, _any | _any},                        pt(bypass(), bypass(), use<0>(), use<0>())},
    {{_f16, _bf16, _any, _any | _any},                        pt(bypass(), bypass(), use<0>(), use<0>())},
    // quantization configuration
    // int8 inner_product does not support f16 output and bias
    {{_u8 | _i8, _i8, _u8 | _i8 | _i32 | _bf16 | _f32 | _undefined, _u8 | _i8 | _i32 | _bf16 | _f32}, pt(bypass(), bypass(), bypass(),  bypass())},
    {{_u8 | _i8, _i8, _f16, _u8 | _i8 | _i32 | _bf16 | _f32}, pt(bypass(), bypass(), just<f32>(), bypass())},
    {{_u8 | _i8, _i8, _any, _any}, pt(bypass(), bypass(), just<f32>(), just<f32>())},
    // compresses int weights (@todo more strict requrements for output precision?)
    {{_bf16, _u8 | _i8 | _nf4 | _u4 | _i4 | _f4e2m1, _any, _any},       pt(bypass(), bypass(), use<0>(), use<0>()),
     Require<dnnl::impl::cpu::x64::avx512_core_bf16>()}, // Ticket 122347
    {{_bf16, _u8 | _i8 | _nf4 | _u4 | _i4 | _f4e2m1, _any, _any},       pt(just<f32>(), bypass(), just<f32>(), just<f32>())},
    {{_f32,  _u8 | _i8 | _nf4 | _u4 | _i4 | _f4e2m1, _any, _any},       pt(bypass(), bypass(), use<0>(), use<0>())},
    // @todo should we fallback to FPXX instead of _f32?
    {{_any, _any, _any, _any},                                pt(just<f32>(), just<f32>(), just<f32>(), just<f32>())},
    // @todo explicitly cover configuration limitations for oneDNN on ARM
};

static const TypeMapping aclFCTypeMapping {
    // {src, wei, bia, dst}                  pt<src, wei, bias, dst>
    {{_f32 | _f16, _f32 | _f16, _any, _any}, pt(bypass(), bypass(), use<0>(), use<0>())},
    {{_any, _any, _any, _any},               pt(just<f32>(), just<f32>(), just<f32>(), just<f32>())}
};
// <<< ADDED BY ME >>>
static const TypeMapping kleidiaiFCTypeMapping {
    // {src, wei, bia, dst}                 pt<src, wei, bias, dst>
    {{_f32, _f32, _any, _f32},              pt(bypass(), bypass(), use<0>(), bypass())},
    {{_any, _any, _any, _any},              pt(just<f32>(), just<f32>(), just<f32>(), just<f32>())}
};

static const TypeMapping aclLowpFCTypeMapping {
    // {src, wei, bia, dst}                  pt<src, wei, bias, dst>
    {{_i8, _i8, _any, _f32},                 pt(bypass(), bypass(), use<3>(), bypass())}
};

static const MappingNotation dnnlConvolutionMappingNotation {
    ARG_SRC, ARG_WEI, ARG_BIAS, ARG_DST
};

static const MappingNotation aclFullyConnectedMappingNotation {
    ARG_SRC, ARG_WEI, ARG_BIAS, ARG_DST
};
// <<< ADDED BY ME >>>
static const MappingNotation kleidiaiFullyConnectedMappingNotation {
    ARG_SRC, ARG_WEI, ARG_BIAS, ARG_DST
};

and this change to the requiresFallback method in the CPU instance definition:

// requiresFallback
[](const FCConfig& config) -> ov::optional<executor::Config<FCAttrs>> {
    return requiresFallbackCommon(config,
                                  kleidiaiFCTypeMapping,
                                  kleidiaiFCLayoutConfig,
                                  kleidiaiFullyConnectedMappingNotation);
},

Then I built OpenVINO with the usual commands:

cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_PYTHON=ON -DENABLE_WHEEL=ON ..
cmake --build . --parallel 32

and installed the generated .whl file before running my inference script. Inference was run on Graviton3 on AWS.

@NishantPrabhuFujitsu
Contributor

@dmitry-gorokhov Any insights on the above?

@dmitry-gorokhov
Contributor

@NishantPrabhuFujitsu I just built this PR with tests enabled (-DENABLE_TESTS=ON) and ran the corresponding FC tests: ./ov_cpu_func_tests --gtest_filter=*FC_KLEIDIAI_2D*. Most of the tests are executed via MatMulKleidiAIExecutor; the others are not supported because of the conditions in:

VERIFY(noPostOps(config), UNSUPPORTED_POST_OPS);
VERIFY(noSparseDecompression(config), UNSUPPORTED_SPARSE_WEIGHTS);
VERIFY(noWeightsDecompression(config), UNSUPPORTED_WEIGHTS_DECOMPRESSION);
VERIFY(everyone_is(f32, srcType(config), weiType(config), dstType(config)), UNSUPPORTED_SRC_PRECISIONS);
return MatMulKleidiAIExecutor::supports(config);
In other words, the executor works as expected.

I am not sure which workload you are trying to run or how its graph patterns differ. I would recommend checking which condition in the code linked above returns false.
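One way to act on that recommendation is to evaluate the conditions one at a time and report the first one that rejects the configuration. The following is a minimal standalone sketch, not OpenVINO code: `ToyConfig`, `Precision`, and `firstFailedCheck` are hypothetical names introduced only for illustration, and the fields merely mirror the conditions quoted above plus the `weightsNonTransposed` check inside `supports()`.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-in for the executor configuration.
// These are NOT OpenVINO's real types; they only mirror the quoted checks.
enum class Precision { f32, f16, i8 };

struct ToyConfig {
    bool hasPostOps = false;
    bool sparseWeights = false;
    bool weightsDecompression = false;
    bool weightsNonTransposed = true;
    Precision src = Precision::f32;
    Precision wei = Precision::f32;
    Precision dst = Precision::f32;
};

// Returns a label for the first condition that rejects the configuration,
// or an empty string when every condition passes.
// The order follows the checks quoted in the comment above.
std::string firstFailedCheck(const ToyConfig& c) {
    if (c.hasPostOps) return "UNSUPPORTED_POST_OPS";
    if (c.sparseWeights) return "UNSUPPORTED_SPARSE_WEIGHTS";
    if (c.weightsDecompression) return "UNSUPPORTED_WEIGHTS_DECOMPRESSION";
    if (c.src != Precision::f32 || c.wei != Precision::f32 || c.dst != Precision::f32) {
        return "UNSUPPORTED_SRC_PRECISIONS";
    }
    // Stand-in for the executor-specific supports() check.
    if (!c.weightsNonTransposed) return "supports(): weightsNonTransposed is false";
    return "";
}
```

In the real plugin the same effect could be had by logging from each check or stepping through them in a debugger; the sketch only shows the idea of surfacing which condition fails instead of a single boolean.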

}

bool MatMulKleidiAIExecutor::supports(const FCConfig& config) {
if (!config.attrs.weightsNonTransposed)

Contributor

@NishantPrabhuFujitsu NishantPrabhuFujitsu Feb 3, 2025

@dmitry-gorokhov I investigated further and found that this check (line 34) fails, causing the Kleidi executor not to be called. Is this behaviour expected? I was just running inference for an LLM in exactly the same way as I did for my past contributions.

I will try running the tests in the meantime.

Contributor

It is a gap in the current MatMulKleidiAIExecutor coverage.
@alvoron generously agreed to help. He will extend MatMulKleidiAIExecutor to support the !config.attrs.weightsNonTransposed case, so Kleidi will be used on regular LLMs.
Meanwhile, I would recommend working with ov_cpu_func_tests as the most convenient way to extend MatMulKleidiAIExecutor coverage to new precisions.

Contributor

Sounds good. Thanks @alvoron, looking forward to getting this to work soon. In the meantime, I'll work on integrating the int8 microkernels.

Contributor

@NishantPrabhuFujitsu I made some changes to support weights transpose.
I picked up the current PR changes, rebased onto the latest master, and applied the weights transpose changes.
Could you please try my PR?
#28830
I checked that all smoke_FC_KLEIDIAI_2D tests pass. They include several tests with weightsNonTransposed that are executed by kleidiai, so I assume you can try weightsNonTransposed cases as well.
Please let me know if you observe any issues and I'll fix them.
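The weights-transpose handling discussed in this thread can be pictured with a plain matrix transpose. This is a minimal sketch under stated assumptions, not the code from #28830: it assumes the weights arrive as a row-major [N x K] matrix and that the kernel wants a row-major [K x N] one, and `transposeRowMajor` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Transposes a row-major [rows x cols] matrix into a row-major [cols x rows] one.
// Hypothetical helper for illustration; not part of OpenVINO or KleidiAI.
std::vector<float> transposeRowMajor(const std::vector<float>& src,
                                     std::size_t rows,
                                     std::size_t cols) {
    // The buffer must hold exactly rows * cols elements.
    assert(src.size() == rows * cols);
    std::vector<float> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            // Element (r, c) of the source becomes element (c, r) of the result.
            dst[c * rows + r] = src[r * cols + c];
        }
    }
    return dst;
}
```

Where such a transpose runs (once at setup, or on every inference) is a separate design choice that this sketch does not address.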

Contributor

@NishantPrabhuFujitsu NishantPrabhuFujitsu Feb 6, 2025

@alvoron I tried your PR, and matmuls in LLM inference (weightsNonTransposed case) are now executed by Kleidi. Thanks for helping!

However, I have noticed the following drawbacks.

  1. Inference with kleidi is really slow. Please find below some benchmarking results where I compare kleidi with gemm:acl for f32:f32:f32 single-prompt inference.

[image: benchmarking results comparing kleidi with gemm:acl]

To generate these results, I exported TinyLlama-1.1B-Chat-v1.0 with optimum in fp32 weight format and used f32 precision hint during inference for both cases.

  2. Inference with kleidi consumes a lot of memory. While running the above benchmark, inference with ACL needed <6 GB RAM while kleidi consumed >100 GB of RAM (and was going to consume even more); I had to cut the benchmarking short to prevent the process from getting killed. I am currently not sure what's the cause of this.

Let me know if you have any insights on the above. I'll investigate further from my end as well, while working on integrating the int8 microkernels.

Contributor

@dmitry-gorokhov dmitry-gorokhov Feb 7, 2025

Thanks @NishantPrabhuFujitsu, glad to know it works now.
I left a couple of comments in #28830. These recommendations should help dramatically improve performance and avoid memory leaks.

@NishantPrabhuFujitsu
Contributor

Also when I try building with -DENABLE_TESTS=ON for some reason the build gets stuck. In my last build attempt, it was stuck in the state below for 20+ mins after which I decided to stop it. I'm not sure if this is expected or an issue at my end, but if you're aware of this situation please let me know.

...
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/eltwise_chain.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/fullyconnected_strided_inputs_outputs.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/fuse_muladd_ewsimple.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/fuse_non0_output_port.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/fuse_scaleshift_and_fakequantize.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/fuse_split_concat_pair_to_interpolate.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/fuse_transpose_reorder.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/index_add_scatter_elements_update.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/init_state_inplace_conflicts.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/inplace_edge.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/inplace_resolve_io.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/input_noreorder_eltwise_bf16.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/input_output_tensor_reuse.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/input_tensor_roi.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/lora_pattern.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/matmul_decompress_convert.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/matmul_strided_inputs_outputs.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/merge_transpose_reorder.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/ngram.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/not_fused_conv_simple_op.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/read_value_assign.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/remove_convert.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/reshape_chain.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/reshape_fc.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/reshape_inplace.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/reshape_permute_conv_permute_reshape_act.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/sdpa_group_beam_search.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/seq_native_order.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/shape_infer_subgraph.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/shapeof_any_layout.cpp.o
[ 99%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/split_concat_add.cpp.o
[100%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/split_matmul_concat.cpp.o
[100%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/stateful_init_graph.cpp.o
[100%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/static_zero_dims.cpp.o
[100%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/strided_slice_zero_dims.cpp.o
[100%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/tile_with_two_output_edges.cpp.o
[100%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/common/undefined_et.cpp.o
[100%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/classes/concat_sdp.cpp.o
[100%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/classes/conv_concat.cpp.o
[100%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/classes/conv_maxpool_activ.cpp.o
[100%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/classes/eltwise_chain.cpp.o
[100%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/classes/fuse_transpose_reorder.cpp.o
[100%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/classes/matmul_weights_decompression.cpp.o
[100%] Building CXX object src/plugins/intel_cpu/tests/functional/CMakeFiles/ov_cpu_func_tests.dir/custom/subgraph_tests/src/classes/undefined_et.cpp.o
[100%] Linking CXX executable /home/nishant/workspace/llm/openvino/konark_openvino/bin/aarch64/Release/ov_cpu_func_tests
[100%] Built target ov_cpu_func_tests

@dmitry-gorokhov
Contributor

This is something unknown; I haven't seen it before.
What are your compiler and OS versions?

@NishantPrabhuFujitsu
Contributor

NishantPrabhuFujitsu commented Feb 4, 2025

I am compiling with GCC 12.3.0 on Ubuntu 22.04.5 LTS, kernel version 6.8.0-1021-aws. The machine is AWS Graviton3 with 32 cores. The exact build commands used (after installing the required dependencies) are:

openvino/build$ cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_PYTHON=ON -DENABLE_WHEEL=ON -DENABLE_TESTS=ON ..
openvino/build$ cmake --build . --parallel 32

@alvoron
Contributor

alvoron commented Feb 5, 2025

I'll try to reproduce it on AWS.

UPD: I was not able to reproduce the issue using gcc 11.4.0 (Ubuntu 22.04.5 LTS / 6.8.0-1021-aws). The build completed successfully using your commands.
I upgraded gcc to 12.3.0 (I had to add the Ubuntu Toolchain Test PPA for that) and was then able to reproduce the issue. I'll look into what causes it. However, I was able to build ov_cpu_func_tests using gcc-12 when I set a specific target in the cmake command: cmake --build . --target ov_cpu_func_tests --parallel 32

To avoid the issue, I'd suggest downgrading to gcc-11, given that Ubuntu 22.04 comes with GCC 11 by default.
Or, if you'd like to stay on gcc-12, could you try building specific targets only, openvino_intel_cpu_plugin or ov_cpu_func_tests?

@NishantPrabhuFujitsu
Contributor

NishantPrabhuFujitsu commented Feb 6, 2025

@alvoron I was able to compile successfully using gcc-11, so I'll stick with that for now. There's no requirement to use gcc-12 specifically.

@dmitry-gorokhov
Contributor

dmitry-gorokhov commented Feb 7, 2025

Since we will not merge this PR, I suggest moving all further work and discussion to #28830.

github-merge-queue bot pushed a commit that referenced this pull request Feb 17, 2025
### Details:
 - `kleidiai` is added as git submodule
 - `kleidiai` is built statically and linked into cpu plugin library
 - MatMul kleidiai executor is added
 - weights transpose is supported in MatMul kleidiai executor
 - Initial implementation is inherited from #27331

### Tickets:
 - *ticket-id*