Multicast doesn't work on AArch64 with kernel version 5.x or older because of an eBPF problem. #33408
Comments
The commit is this. |
Thanks, @yushoyamaguchi, for opening this issue. There are a few things to check. The minimum kernel version required for multicast is 5.15 (this is probably for x86). It looks like arm64 needs kernel 6.0 or newer. You can add these to the documentation. Aside from the documentation, we may need to add a feature probe helper to the cilium/ebpf library and use it when initializing multicast. If the probe (subprog -> tailcall support) fails, multicast cell init should fail and an appropriate message should be logged.
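The "fail init with a clear message" idea above could look roughly like the sketch below. It is only an illustration under stated assumptions: `initMulticast`, `probeFunc`, and `errSubprogTailcallUnsupported` are hypothetical names, not existing Cilium or cilium/ebpf APIs, and the probe is passed in as a plain function so the gating logic can be shown without loading any BPF program.

```go
package main

import (
	"errors"
	"fmt"
)

// errSubprogTailcallUnsupported is a hypothetical sentinel error for this sketch.
var errSubprogTailcallUnsupported = errors.New("kernel does not support tail calls from BPF subprograms")

// probeFunc stands in for a feature probe. A real probe would try to load a
// tiny BPF program that issues a tail call from a subprogram (assumption).
type probeFunc func() error

// initMulticast refuses to start the (hypothetical) multicast cell when the
// probe fails, and wraps the error so the log explains why.
func initMulticast(enabled bool, probe probeFunc) error {
	if !enabled {
		return nil // feature is off, nothing to check
	}
	if err := probe(); err != nil {
		return fmt.Errorf("multicast cannot be enabled on this node: %w", err)
	}
	return nil
}

func main() {
	failing := func() error { return errSubprogTailcallUnsupported }
	if err := initMulticast(true, failing); err != nil {
		fmt.Println("init failed:", err)
	}
}
```

Wrapping with `%w` keeps the original cause available to callers via `errors.Is`, so the agent could decide separately whether to crash or only report the node as multicast-incapable.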
Adding documentation will definitely be one of the enhancements we can make for users. I think we should update https://docs.cilium.io/en/stable/operations/system_requirements/#required-kernel-versions-for-advanced-features once the version dependencies are clear. This raises another question: are all of these kernel versions meant for x86? If so, we should probably add a note about that as well.
If I am not mistaken, this means that once we update the ConfigMap to enable multicast, cilium-agent would fail to initialize with multicast capability on some CiliumNodes. Those cilium-agent pods would go into a crash loop, but the error from the multicast cell would tell the user that multicast cannot be enabled on that CiliumNode. Is my understanding correct? So the expectation from Cilium is that a user who wants multicast must make sure every CiliumNode can support it? I am not sure how graceful this behavior is; is it common for Cilium features? Besides, if fallback behavior is intended, it also needs to be considered together with cilium/cilium-cli#2620: if some CiliumNodes cannot manage multicast, that should be shown to the user as well. Thanks,
@harsimran-pabla @fujitatomoya I found the detailed cause of this bug. With eBPF on version 5.x and older AArch64 kernels, tail calls are prohibited inside sub-programs. To avoid this, most functions in Cilium's eBPF programs are declared inline. However, the function Line 338 in c3d943f
The reason is that the call to Line 433 in c3d943f
This is the only instance where
As @fujitatomoya said, I would like to discuss how to handle this bug. cc @YutaroHayakawa (sorry for adding you in cc). Thank you so much for telling me the cause.
Should we allow this and check kernel compatibility on the agent side by adding functionality to cilium/ebpf?
The kernel version that can enable multicast is different between AMD64 and AArch64 due to the difference in the timing of when tail-call from eBPF sub-programs is enabled. I wrote about it in the document.
ref : The commit which allow for tailcalls in BPF subprogram in each architecture are below
AMD64: torvalds/linux@e411901c0b775 This commit is reflected to version 5.10 or newer kernel
AArch64: torvalds/linux@d4609a5 This commit is reflected to 6.0 or newer kernel
This PR is a little part of the solution of cilium#33408
Signed-off-by: Yusho Yamaguchi <[email protected]>
I guess the solution here is
CC @ldelossa |
Thank you for the great suggestion. Is there a plan for community members to work on this implementation? One more important thing : I am very sorry for adding extra steps. Thank you.
[ oss commit e64311d ] The kernel version that can enable multicast is different between AMD64 and AArch64 due to the difference in the timing of when tail-call from eBPF sub-programs is enabled. I wrote about it in the document. ref : The commit which allow for tailcalls in BPF subprogram in each architecture are below AMD64: torvalds/linux@e411901c0b775 This commit is reflected to version 5.10 or newer kernel AArch64: torvalds/linux@d4609a5 This commit is reflected to 6.0 or newer kernel This PR is a little part of the solution of #33408 Signed-off-by: Yusho Yamaguchi <[email protected]> Signed-off-by: Gilberto Bertin <[email protected]>
[ upstream commit e64311d ] The kernel version that can enable multicast is different between AMD64 and AArch64 due to the difference in the timing of when tail-call from eBPF sub-programs is enabled. I wrote about it in the document. ref : The commit which allow for tailcalls in BPF subprogram in each architecture are below AMD64: torvalds/linux@e411901c0b775 This commit is reflected to version 5.10 or newer kernel AArch64: torvalds/linux@d4609a5 This commit is reflected to 6.0 or newer kernel This PR is a little part of the solution of #33408 Signed-off-by: Yusho Yamaguchi <[email protected]> Signed-off-by: Gilberto Bertin <[email protected]>
This issue has been automatically marked as stale because it has not |
This issue has not seen any activity since it was marked stale. |
Is there an existing issue for this?
What happened?
When running Cilium in this environment, cilium-agent cannot get out of a crash loop and emits the error log below.
Cilium Version
1.16.0-rc
Kernel Version
Linux cecinode91 5.15.0-1055-raspi
Kubernetes Version
1.27.1 (Maybe)
Regression
No response
Sysdump
No response
Relevant log output
Verifier error: program tail_mcast_ep_delivery: load program: invalid argument: tail_calls are not allowed in non-JITed programs with bpf-to-bpf calls (242 line(s) omitted)
Verifier log: load program: invalid argument:
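To make a log like this easier to act on, an agent could match the verifier message and print a hint. The sketch below only does a substring match on the message quoted in the log; `explainLoadError` is a hypothetical helper name, not an existing Cilium function, and the hint wording is my own.

```go
package main

import (
	"fmt"
	"strings"
)

// subprogTailcallMsg is the kernel verifier message quoted in the log above.
const subprogTailcallMsg = "tail_calls are not allowed in non-JITed programs with bpf-to-bpf calls"

// explainLoadError returns a short hint when the error text carries the
// verifier message about tail calls in subprograms, and "" otherwise.
func explainLoadError(errText string) string {
	if strings.Contains(errText, subprogTailcallMsg) {
		return "this kernel does not allow tail calls from BPF subprograms " +
			"(JIT disabled, or the architecture's JIT lacks support)"
	}
	return ""
}

func main() {
	logLine := "program tail_mcast_ep_delivery: load program: invalid argument: " + subprogTailcallMsg
	if hint := explainLoadError(logLine); hint != "" {
		fmt.Println("hint:", hint)
	}
}
```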
Anything else?
I found the error message from the kernel in the Linux source code:
https://elixir.bootlin.com/linux/v5.19.17/source/kernel/bpf/verifier.c#L6309
The error seems to occur because allow_tail_call_in_subprogs() returns false.
Error message: tail_calls are not allowed in non-JITed programs with bpf-to-bpf calls
v5.19.17 : https://elixir.bootlin.com/linux/v5.19.17/source/kernel/bpf/verifier.c
v6.0-rc1 : https://elixir.bootlin.com/linux/v6.0-rc1/source/kernel/bpf/verifier.c
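As a reading aid, the verifier decision described above can be modeled in a few lines. This is a simplified paraphrase based on my reading of verifier.c, not a copy of kernel code: the struct and its field names are invented for the sketch, and the real check sits inside a larger function with more conditions.

```go
package main

import (
	"errors"
	"fmt"
)

// verifierInputs is a simplified stand-in for the kernel state the check reads.
type verifierInputs struct {
	jitRequested                bool // the program is going to be JIT-compiled
	jitSupportsSubprogTailcalls bool // what the architecture's JIT reports
	subprogCount                int  // 1 means only the main program exists
	usesTailCall                bool // the program calls the tail call helper
}

// allowTailCallInSubprogs mirrors the idea of allow_tail_call_in_subprogs():
// both the JIT request and the architecture support must be present.
func allowTailCallInSubprogs(in verifierInputs) bool {
	return in.jitRequested && in.jitSupportsSubprogTailcalls
}

// checkTailCall rejects a tail call only when bpf-to-bpf calls are present
// (more than one subprogram) and the combination is not allowed.
func checkTailCall(in verifierInputs) error {
	if in.usesTailCall && in.subprogCount > 1 && !allowTailCallInSubprogs(in) {
		return errors.New("tail_calls are not allowed in non-JITed programs with bpf-to-bpf calls")
	}
	return nil
}

func main() {
	// An AArch64 node before 6.0: JIT is on, but the arch reports no support.
	old := verifierInputs{jitRequested: true, subprogCount: 2, usesTailCall: true}
	fmt.Println("pre-6.0 arm64:", checkTailCall(old))
}
```

The model also shows why declaring functions inline sidesteps the problem: with everything inlined there is a single subprogram, so the check never fires.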
Then I found that kernels up to 5.19.17 always return false on AArch64.
(The difference comes from whether bpf_jit_supports_subprog_tailcalls() returns true or false.)
v5.19.17 : https://elixir.bootlin.com/linux/v5.19.17/C/ident/bpf_jit_supports_subprog_tailcalls , https://elixir.bootlin.com/linux/v5.19.17/source/kernel/bpf/core.c#L2720
v6.0-rc1 : https://elixir.bootlin.com/linux/v6.0-rc1/C/ident/bpf_jit_supports_subprog_tailcalls , https://elixir.bootlin.com/linux/v6.0-rc1/source/arch/arm64/net/bpf_jit_comp.c#L1637
Therefore, it seems that on AArch64 a 6.0 or newer kernel is required to run Cilium multicast on a Kubernetes node.
This restriction is not documented anywhere in Cilium.
I would therefore like to add it to the multicast documentation.
May I create a pull request?
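The per-architecture requirement could also be expressed as a small version check. The sketch below is a heuristic, not a replacement for a real feature probe, since distribution kernels may backport changes. The minimums are the ones cited in this thread (6.0 for AArch64, 5.10 for AMD64); the function names and the use of Go-style architecture names are assumptions of this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minKernel maps a Go-style architecture name to the first kernel
// (major, minor) cited in this thread as allowing tail calls from subprograms.
var minKernel = map[string][2]int{
	"amd64": {5, 10},
	"arm64": {6, 0},
}

// parseMajorMinor extracts major and minor from a release string such as
// "5.15.0-1055-raspi" or "6.0-rc1".
func parseMajorMinor(release string) (int, int, error) {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected kernel release %q", release)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("bad major version in %q: %w", release, err)
	}
	// The minor field may carry a suffix such as "0-rc1"; keep leading digits.
	digits := parts[1]
	for i, r := range parts[1] {
		if r < '0' || r > '9' {
			digits = parts[1][:i]
			break
		}
	}
	minor, err := strconv.Atoi(digits)
	if err != nil {
		return 0, 0, fmt.Errorf("bad minor version in %q: %w", release, err)
	}
	return major, minor, nil
}

// kernelMeetsMulticastMinimum reports whether release is new enough for arch.
func kernelMeetsMulticastMinimum(arch, release string) (bool, error) {
	need, ok := minKernel[arch]
	if !ok {
		return false, fmt.Errorf("no known minimum for architecture %q", arch)
	}
	major, minor, err := parseMajorMinor(release)
	if err != nil {
		return false, err
	}
	return major > need[0] || (major == need[0] && minor >= need[1]), nil
}

func main() {
	// The kernel from this report, on an AArch64 Raspberry Pi node.
	ok, err := kernelMeetsMulticastMinimum("arm64", "5.15.0-1055-raspi")
	fmt.Println(ok, err)
}
```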
Cilium Users Document
Code of Conduct