fix: allow running as root to inject chaos #525

Merged
lenaschoenburg merged 4 commits into main from os/fix-stress-test-permissions on Apr 23, 2024

Conversation

lenaschoenburg
Member

@lenaschoenburg lenaschoenburg commented Apr 8, 2024

Closes #520

@lenaschoenburg lenaschoenburg force-pushed the os/fix-stress-test-permissions branch from d0fa86f to b2a5178 Compare April 8, 2024 13:50
@lenaschoenburg lenaschoenburg changed the title fix: allow running as root when setting up stress test fix: allow running as root to inject chaos Apr 8, 2024
@lenaschoenburg lenaschoenburg marked this pull request as ready for review April 8, 2024 13:59
@lenaschoenburg
Member Author

lenaschoenburg commented Apr 8, 2024

I'm still getting errors in testbench. Can't reproduce locally though.
Command:

--namespace 87bf3cb8-ab84-4da1-b3e9-3a855daf0469-zeebe stress broker --cpu --role=LEADER --partitionId=3 --verbose --jsonLogging --dockerImageTag 8.6.0-SNAPSHOT-main-e1bea882

Output:

panic: command terminated with exit code 100
[signal SIGSEGV: segmentation violation code=0xc000e54d70 addr=0x4f869b pc=0xc1c3f2]

goroutine 67 [running]:
github.com/zeebe-io/zeebe-chaos/go-chaos/cmd.ensureNoError(...)
	/home/ls/Source/github.com/zeebe-io/zeebe-chaos/go-chaos/cmd/disconnect.go:24
github.com/zeebe-io/zeebe-chaos/go-chaos/cmd.AddStressCmd.func1(0xc0008fab00?, {0x19d8cab?, 0x4?, 0x19d8b63?})
	/home/ls/Source/github.com/zeebe-io/zeebe-chaos/go-chaos/cmd/stress.go:55 +0x34b
github.com/spf13/cobra.(*Command).execute(0xc0008fc000, {0xc00183cf30, 0x9, 0x9})
	/home/ls/go/pkg/mod/github.com/spf13/[email protected]/command.go:987 +0xaa3
github.com/spf13/cobra.(*Command).ExecuteC(0xc000455b00)
	/home/ls/go/pkg/mod/github.com/spf13/[email protected]/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).ExecuteContextC(...)
	/home/ls/go/pkg/mod/github.com/spf13/[email protected]/command.go:1048
github.com/zeebe-io/zeebe-chaos/go-chaos/cmd.runZbChaosCommand({0xc001bd00e0, 0xb, 0xe}, {0x1c743e0?, 0xc000521dc0})
	/home/ls/Source/github.com/zeebe-io/zeebe-chaos/go-chaos/cmd/worker.go:98 +0xf8
github.com/zeebe-io/zeebe-chaos/go-chaos/worker.HandleZbChaosJob({0x1c6ac10, 0xc000270120}, {0xc0003888f0}, 0x1afe550)
	/home/ls/Source/github.com/zeebe-io/zeebe-chaos/go-chaos/worker/chaos_worker.go:107 +0x513
github.com/zeebe-io/zeebe-chaos/go-chaos/cmd.handleZbChaosJob({0x1c6ac10?, 0xc000270120?}, {0x0?})
	/home/ls/Source/github.com/zeebe-io/zeebe-chaos/go-chaos/cmd/worker.go:91 +0x25
github.com/camunda/zeebe/clients/go/v8/pkg/worker.(*jobDispatcher).run.func2()
	/home/ls/go/pkg/mod/github.com/camunda/zeebe/clients/go/[email protected]/pkg/worker/jobDispatcher.go:54 +0xc9
created by github.com/camunda/zeebe/clients/go/v8/pkg/worker.(*jobDispatcher).run in goroutine 15
	/home/ls/go/pkg/mod/github.com/camunda/zeebe/clients/go/[email protected]/pkg/worker/jobDispatcher.go:45 +0x149

@lenaschoenburg lenaschoenburg marked this pull request as draft April 9, 2024 06:32
@lenaschoenburg
Member Author

One issue was that we did not disable reconciliation before enabling root access. Still, even after fixing this, I did not get any output from running the apt commands to install stress and procps.

Executing manually in a zeebe pod that was set up for root access and then running apt update results in this:

root@zeebe-0:/usr/local/zeebe# apt update
E: setgroups 65534 failed - setgroups (1: Operation not permitted)
E: setegid 65534 failed - setegid (1: Operation not permitted)
E: seteuid 100 failed - seteuid (1: Operation not permitted)
E: setgroups 0 failed - setgroups (1: Operation not permitted)
rm: cannot remove '/var/cache/apt/archives/partial/*.deb': Permission denied
Reading package lists... Done
W: chown to _apt:root of directory /var/lib/apt/lists/partial failed - SetupAPTPartialDirectory (1: Operation not permitted)
W: chown to _apt:root of directory /var/lib/apt/lists/auxfiles failed - SetupAPTPartialDirectory (1: Operation not permitted)
E: setgroups 65534 failed - setgroups (1: Operation not permitted)
E: setegid 65534 failed - setegid (1: Operation not permitted)
E: seteuid 100 failed - seteuid (1: Operation not permitted)
E: setgroups 0 failed - setgroups (1: Operation not permitted)
E: Method gave invalid 400 URI Failure message: Failed to setgroups - setgroups (1: Operation not permitted)
E: Method gave invalid 400 URI Failure message: Failed to setgroups - setgroups (1: Operation not permitted)
E: Method http has died unexpectedly!
E: Sub-process http returned an error code (112)
E: Method http has died unexpectedly!
E: Sub-process http returned an error code (112)

I think at this point there are three unresolved issues:

  1. We need to wait for the root user changes to apply before running the apt commands
  2. For some reason we don't get any output when running commands on pods. Even a simple "echo hello" command does not result in any output.
  3. The root user change does not work because of some permission issues, see error messages above.
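For the first point, one possible shape is a small polling helper that blocks until a readiness check passes or a timeout expires. This is only a sketch and not code from this PR: `waitUntil` and its `check` callback are hypothetical names, and the real check (for example, confirming that the pod came back with the new security context) is left to the caller.

```go
package main

import (
	"fmt"
	"time"
)

// waitUntil calls check repeatedly, sleeping interval between calls, until
// check reports true, check returns an error, or timeout has elapsed.
// Hypothetical helper for illustration; it is not part of zbchaos.
func waitUntil(timeout, interval time.Duration, check func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := check()
		if err != nil {
			return fmt.Errorf("check failed: %w", err)
		}
		if done {
			return nil
		}
		// Give up if the next poll would happen after the deadline.
		if time.Now().Add(interval).After(deadline) {
			return fmt.Errorf("condition not met within %s", timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	attempts := 0
	// Stand-in check: pretend the root user change becomes visible on the third poll.
	err := waitUntil(2*time.Second, 10*time.Millisecond, func() (bool, error) {
		attempts++
		return attempts >= 3, nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Only after `waitUntil` returns without an error would the apt commands be executed.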

@lenaschoenburg
Member Author

One idea I have to overcome this is to attach ephemeral debug containers that have the necessary tools installed.
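To try this idea by hand, `kubectl debug` can attach an ephemeral container to a running pod. The sketch below only assembles the argument list for such a call. The function name `debugArgs`, the namespace `my-namespace`, and the image `ubuntu:22.04` are assumptions for illustration, and whether the debug container ends up with enough privileges depends on the cluster.

```go
package main

import (
	"fmt"
	"strings"
)

// debugArgs builds the arguments for a `kubectl debug` call that attaches an
// ephemeral container to pod, targeting the given container, and runs cmd.
// Hypothetical helper; the PR itself talks to the Kubernetes API directly.
func debugArgs(namespace, pod, target, image string, cmd []string) []string {
	args := []string{
		"debug", pod,
		"--namespace", namespace,
		"--image", image,
		"--target", target,
		"--", // everything after this is the command run in the debug container
	}
	return append(args, cmd...)
}

func main() {
	args := debugArgs("my-namespace", "zeebe-2", "zeebe", "ubuntu:22.04",
		[]string{"sh", "-c", "apt update"})
	fmt.Println("kubectl " + strings.Join(args, " "))
}
```

The output of the command can then be read with `kubectl logs <pod> -c <debug-container-name>`.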

@ChrisKujawa
Member

I was also thinking about this, but thought it might not work because of permissions?

Another idea: I was wondering whether we could leverage eBPF.

@lenaschoenburg
Member Author

I think I got the permissions working:

ls@camunda:~/Source/github.com/zeebe-io/zeebe-chaos/go-chaos$ kubectl logs zeebe-2 -c debug-tvjfnw

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Get:1 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]
Get:2 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]
Get:5 http://security.ubuntu.com/ubuntu jammy-security/multiverse amd64 Packages [44.7 kB]
Get:6 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [1690 kB]
Get:7 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [2124 kB]
Get:8 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1081 kB]
Get:9 http://archive.ubuntu.com/ubuntu jammy/restricted amd64 Packages [164 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy/universe amd64 Packages [17.5 MB]
Get:11 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages [1792 kB]
Get:12 http://archive.ubuntu.com/ubuntu jammy/multiverse amd64 Packages [266 kB]
Get:13 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1357 kB]
Get:14 http://archive.ubuntu.com/ubuntu jammy-updates/multiverse amd64 Packages [61.3 kB]
Get:15 http://archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages [2173 kB]
Get:16 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [1969 kB]
Get:17 http://archive.ubuntu.com/ubuntu jammy-backports/main amd64 Packages [80.9 kB]
Get:18 http://archive.ubuntu.com/ubuntu jammy-backports/universe amd64 Packages [33.3 kB]
Fetched 30.9 MB in 4s (7275 kB/s)
Reading package lists...
Building dependency tree...
Reading state information...
All packages are up to date.

@ChrisKujawa
Member

@oleschoenburg was it working? Can we get it merged?

@lenaschoenburg
Member Author

@Zelldon I have something ready locally that seems to work more or less, but now the scaling test doesn't work anymore?! I need to test this manually a bit more.

@lenaschoenburg lenaschoenburg force-pushed the os/fix-stress-test-permissions branch from ce187fb to d145b85 Compare April 11, 2024 13:12
This overwrites the security context on both the container and the deployment/statefulset.
Only works because reconciliation is already disabled at that point.
@lenaschoenburg lenaschoenburg force-pushed the os/fix-stress-test-permissions branch 2 times, most recently from a195ce6 to 859486b Compare April 11, 2024 15:15
Due to camunda/camunda#17347 we can't rely on a sensible value
@lenaschoenburg lenaschoenburg force-pushed the os/fix-stress-test-permissions branch from 859486b to 755f387 Compare April 11, 2024 15:19
Member

@ChrisKujawa ChrisKujawa left a comment

Thanks @oleschoenburg, if this really works, that would be great! Could you check my comments?

go-chaos/internal/network.go
cmd := []string{"ip", "route", "replace", "unreachable", podIp}
cmdWithSetup := []string{"sh", "-c", "apt update && apt install -y iproute2 && " + strings.Join(cmd, " ")}
var containerName string
if strings.Contains(podName, "gateway") {
Member

Is this to work with SaaS/SM?

Member

We also have functions to get the right pod name.

Member Author

This is to attach to the correct container, both on SaaS and self-managed. We already have the right podName here. Do you mean we have helper functions to get the right container name?

Member

The pod name you use is actually only correct in SaaS; in SM it is different. But I'm not sure whether you use the pod name here anyway.

Member Author

This code here uses the pod name to figure out the correct container name. If the pod name includes gateway, we assume that the container name is zeebe-gateway, otherwise we expect it to be zeebe. I think this should work in both SaaS and SM, right?
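The rule described above can be sketched as a small standalone function. The name `containerNameFor` is an assumption for illustration; the container names `zeebe-gateway` and `zeebe` and the `gateway` substring check come from this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// containerNameFor guesses the container to attach to from the pod name:
// pods whose name contains "gateway" are assumed to run the "zeebe-gateway"
// container, all other pods are assumed to run the "zeebe" container.
func containerNameFor(podName string) string {
	if strings.Contains(podName, "gateway") {
		return "zeebe-gateway"
	}
	return "zeebe"
}

func main() {
	for _, pod := range []string{"zeebe-gateway-5489b4cdcf-ptfl9", "zeebe-1"} {
		fmt.Printf("%s -> %s\n", pod, containerNameFor(pod))
	}
}
```

This heuristic only holds as long as gateway pods carry `gateway` in their name in both SaaS and self-managed setups, which is the assumption being discussed here.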

go-chaos/internal/pods.go
@@ -302,6 +303,35 @@ func (c K8Client) createPortForwardUrl(names []string) *url.URL {
return portForwardCreateURL
}

func (c K8Client) ExecuteCommandViaDebugContainer(podName string, containerName string, debugImage string, cmd []string) error {
Member

Did you verify whether it was really disrupting the network? Or maybe it was just silently failing? 🤔 For example, did the brokers show that they can't connect?

Member Author

Not yet! That's one of the things I still have to do :-/

Member Author

I've tested it now with a SaaS cluster 🎉

$ zbchaos --namespace 8fe648b5-1137-4d6a-af0d-be8b467fd67e-zeebe disconnect brokers --broker2NodeId 2 --broker1NodeId 1 --verbose
Flags: {1 LEADER -1  10  msg false 1 LEADER 1 2 LEADER 2 1 1713797975967 false false true false false 30 false -1 benchmark 30  8fe648b5-1137-4d6a-af0d-be8b467fd67e-zeebe 1 1 benchmark-task 0 0 0 1 -1 true}
Connecting to 8fe648b5-1137-4d6a-af0d-be8b467fd67e-zeebe
Running experiment in SaaS environment.
Patched statefulset
Port forward to zeebe-gateway-5489b4cdcf-ptfl9
Successfully created port forwarding tunnel from 46535 (local) to 26500 (remote)
Found Broker zeebe-1 with node id 1.
Found Broker zeebe-2 with node id 2.
Debug container debug-lvccl4 is running command [sh -c apt update && apt install -y iproute2 && ip route replace unreachable 10.64.73.38]
Disconnect zeebe-1 from zeebe-2
Debug container debug-658dxp is running command [sh -c apt update && apt install -y iproute2 && ip route replace unreachable 10.64.53.29]
Disconnect zeebe-2 from zeebe-1

$ zbchaos --namespace 8fe648b5-1137-4d6a-af0d-be8b467fd67e-zeebe connect brokers --verbose
Flags: {1 LEADER -1  10  msg false 1 LEADER -1 2 LEADER -1 1 1713798188959 false false true false false 30 false -1 benchmark 30  8fe648b5-1137-4d6a-af0d-be8b467fd67e-zeebe 1 1 benchmark-task 0 0 0 1 -1 true}
Connecting to 8fe648b5-1137-4d6a-af0d-be8b467fd67e-zeebe
Running experiment in SaaS environment.
Debug container debug-cv444p is running command [sh -c apt update && apt install -y iproute2 && ip route del $(ip route | grep -m 1 unreachable)]
Connected zeebe-0 again, removed unreachable routes.
Debug container debug-zg45fm is running command [sh -c apt update && apt install -y iproute2 && ip route del $(ip route | grep -m 1 unreachable)]
Connected zeebe-1 again, removed unreachable routes.
Debug container debug-tm9pdj is running command [sh -c apt update && apt install -y iproute2 && ip route del $(ip route | grep -m 1 unreachable)]
Connected zeebe-2 again, removed unreachable routes.
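The commands in the log above follow the pattern from the diff in network.go: the actual `ip` command is joined into one string and prefixed with the apt setup, so that the debug container installs iproute2 before using it. A minimal standalone sketch of that assembly, with `withSetup` as a hypothetical function name:

```go
package main

import (
	"fmt"
	"strings"
)

// withSetup wraps cmd in a shell invocation that first installs iproute2 via
// apt and then runs cmd, mirroring the cmdWithSetup line in the diff.
// The arguments are joined with spaces and are not shell-escaped.
func withSetup(cmd []string) []string {
	return []string{"sh", "-c", "apt update && apt install -y iproute2 && " + strings.Join(cmd, " ")}
}

func main() {
	disconnect := []string{"ip", "route", "replace", "unreachable", "10.64.73.38"}
	fmt.Println(withSetup(disconnect))
	// prints "[sh -c apt update && apt install -y iproute2 && ip route replace unreachable 10.64.73.38]"
}
```

Because the arguments are concatenated without escaping, this only suits fixed, trusted commands such as the ones shown here.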

(Two screenshots attached.)

Member

@ChrisKujawa ChrisKujawa left a comment

Really great stuff @oleschoenburg ❤️ thanks for your efforts!

@lenaschoenburg lenaschoenburg marked this pull request as ready for review April 23, 2024 06:35
@lenaschoenburg lenaschoenburg merged commit 1989c57 into main Apr 23, 2024
3 checks passed
@lenaschoenburg lenaschoenburg deleted the os/fix-stress-test-permissions branch April 23, 2024 06:36
Development

Successfully merging this pull request may close these issues.

Chaos injection requires root which is no longer permitted
2 participants