feat: verify cluster can survive dataloss of one broker at a time. #275
Conversation
Force-pushed from 1a9798e to 1d50652
@Zelldon The integration test fails even though I did not change anything related to it. Any idea?
@deepthidevaki Could you try to rebase to see whether the failure is gone?
Force-pushed from 1d50652 to f4e80fd
Amazing 🤩 Thanks for this @deepthidevaki 🚀
I love how easy it is now to add new experiments :)
I also ran your experiment with our integration test 😄
Create ChaosToolkit instance
Open workers: [zbchaos, readExperiments].
Handle read experiments job [key: 2251799813685265]
Read experiments successful, complete job with: {"experiments":[{"contributions":{"availability":"high","reliability":"high"},"description":"Zeebe should be able to handle data loss of one broker at a time.","method":[{"name":"Deploy process","provider":{"arguments":["deploy","process"],"path":"zbchaos","type":"process"},"timeout":900,"type":"action"},{"name":"Delete data of broker 0 and restart the pod","pauses":{"after":60},"provider":{"arguments":["dataloss","delete","--nodeId=0"],"path":"zbchaos","type":"process"},"type":"action"},{"name":"Broker 0 can recover after data loss","provider":{"arguments":["verify","readiness"],"path":"zbchaos","type":"process"},"timeout":900,"type":"probe"},{"name":"Delete data of broker 1 and restart the pod","pauses":{"after":60},"provider":{"arguments":["dataloss","delete","--nodeId=1"],"path":"zbchaos","type":"process"},"type":"action"},{"name":"Broker 1 can recover after data loss","provider":{"arguments":["verify","readiness"],"path":"zbchaos","type":"process"},"timeout":900,"type":"probe"},{"name":"Delete data of broker 2 and restart the pod","pauses":{"after":60},"provider":{"arguments":["dataloss","delete","--nodeId=2"],"path":"zbchaos","type":"process"},"type":"action"},{"name":"Broker 2 can recover after data loss","provider":{"arguments":["verify","readiness"],"path":"zbchaos","type":"process"},"timeout":900,"type":"probe"},{"name":"There is no data loss. Should be able to create process instances on partition 1","provider":{"arguments":["verify","instance-creation","--partitionId=1"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"probe"},{"name":"There is no data loss. Should be able to create process instances on partition 2","provider":{"arguments":["verify","instance-creation","--partitionId=2"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"probe"},{"name":"There is no data loss. Should be able to create process instances on partition 3","provider":{"arguments":["verify","instance-creation","--partitionId=3"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"probe"}],"rollbacks":[],"steady-state-hypothesis":{"probes":[{"name":"All pods should be ready","provider":{"arguments":["verify","readiness"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"probe"}],"title":"Zeebe is alive"},"title":"Zeebe dataloss experiment","version":"0.1.0"},{"contributions":{"availability":"high","reliability":"high"},"description":"This fake experiment is just to test the integration with Zeebe and zbchaos workers","method":[{"name":"Show again the version","provider":{"arguments":["version"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"action"}],"rollbacks":[],"steady-state-hypothesis":{"probes":[{"name":"Show version","provider":{"arguments":["version"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"probe"}],"title":"Zeebe is alive"},"title":"This is a fake experiment","version":"0.1.0"}]}.
Handle zbchaos job [key: 2251799813685328]
Running command with args: [verify readiness]
Connecting to zell-chaos
Running experiment in self-managed environment.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
All Zeebe nodes are running.
Handle zbchaos job [key: 2251799813685734]
Running command with args: [deploy process]
Connecting to zell-chaos
Running experiment in self-managed environment.
Successfully created port forwarding tunnel
Deploy file bpmn/one_task.bpmn (size: 2526 bytes).
Deployed process model bpmn/one_task.bpmn successful with key 2251799813685431.
Deployed given process model , under key 2251799813685431!
Handle zbchaos job [key: 2251799813685786]
Running command with args: [dataloss delete --nodeId=0]
Connecting to zell-chaos
Running experiment in self-managed environment.
Deleting PV pvc-3b9d67f4-0394-4a05-b553-6c256225e2db
Deleting PVC data-zell-chaos-zeebe-0 in namespace zell-chaos
Deleted pod zell-chaos-zeebe-0 in namespace zell-chaos
Handle zbchaos job [key: 2251799813687017]
Running command with args: [verify readiness]
Connecting to zell-chaos
Running experiment in self-managed environment.
All Zeebe nodes are running.
Handle zbchaos job [key: 2251799813687058]
Running command with args: [dataloss delete --nodeId=1]
Connecting to zell-chaos
Running experiment in self-managed environment.
Deleting PV pvc-7a2ff9c3-01c1-448a-a94a-98700c976044
Deleting PVC data-zell-chaos-zeebe-1 in namespace zell-chaos
Deleted pod zell-chaos-zeebe-1 in namespace zell-chaos
Handle zbchaos job [key: 2251799813688290]
Running command with args: [verify readiness]
Connecting to zell-chaos
Running experiment in self-managed environment.
All Zeebe nodes are running.
Handle zbchaos job [key: 2251799813688332]
Running command with args: [dataloss delete --nodeId=2]
Connecting to zell-chaos
Running experiment in self-managed environment.
Deleting PV pvc-83fe86fc-7a59-4e88-9e5c-3d6d70cb9818
Deleting PVC data-zell-chaos-zeebe-2 in namespace zell-chaos
Deleted pod zell-chaos-zeebe-2 in namespace zell-chaos
Handle zbchaos job [key: 2251799813689567]
Running command with args: [verify readiness]
Connecting to zell-chaos
Running experiment in self-managed environment.
Pod zell-chaos-zeebe-2 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-2 is in phase Running, but not ready. Wait for some seconds.
All Zeebe nodes are running.
Handle zbchaos job [key: 2251799813689629]
Running command with args: [verify instance-creation --partitionId=1]
Connecting to zell-chaos
Running experiment in self-managed environment.
Successfully created port forwarding tunnel
Send create process instance command, with BPMN process ID 'benchmark' and version '-1' (-1 means latest) [variables: '', awaitResult: false]
Created process instance with key 4503599627381404 on partition 2, required partition 1.
Send create process instance command, with BPMN process ID 'benchmark' and version '-1' (-1 means latest) [variables: '', awaitResult: false]
Created process instance with key 6755399441066664 on partition 3, required partition 1.
Send create process instance command, with BPMN process ID 'benchmark' and version '-1' (-1 means latest) [variables: '', awaitResult: false]
Created process instance with key 2251799813696273 on partition 1, required partition 1.
The steady-state was successfully verified!
Handle zbchaos job [key: 2251799813689697]
Running command with args: [verify instance-creation --partitionId=2]
Connecting to zell-chaos
Running experiment in self-managed environment.
Successfully created port forwarding tunnel
Send create process instance command, with BPMN process ID 'benchmark' and version '-1' (-1 means latest) [variables: '', awaitResult: false]
Created process instance with key 4503599627381426 on partition 2, required partition 2.
The steady-state was successfully verified!
Handle zbchaos job [key: 2251799813689741]
Running command with args: [verify instance-creation --partitionId=3]
Connecting to zell-chaos
Running experiment in self-managed environment.
Successfully created port forwarding tunnel
Send create process instance command, with BPMN process ID 'benchmark' and version '-1' (-1 means latest) [variables: '', awaitResult: false]
Created process instance with key 6755399441066692 on partition 3, required partition 3.
The steady-state was successfully verified!
Handle zbchaos job [key: 2251799813689787]
Running command with args: [verify readiness]
Connecting to zell-chaos
Running experiment in self-managed environment.
All Zeebe nodes are running.
Handle zbchaos job [key: 2251799813689879]
Running command with args: [version]
zbchaos development (commit: HEAD)
Handle zbchaos job [key: 2251799813689922]
Running command with args: [version]
zbchaos development (commit: HEAD)
Handle zbchaos job [key: 2251799813689965]
Running command with args: [version]
zbchaos development (commit: HEAD)
Instance 2251799813685255 [definition 2251799813685253 ] completed
--- PASS: Test_ShouldBeAbleToRunExperiments (231.88s)
PASS
Process finished with the exit code 0
I hope I can soon create some GHA workflow to set up a Zeebe cluster in our k8s environment, which we can then run the experiments against as an integration test.
After a broker has recovered from the loss of its disk, the cluster should be able to survive another broker's disk loss. After a series of disk losses of one broker at a time, the cluster should not suffer data loss. We verify this by creating instances of a process that was deployed before the disk loss.
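For illustration, the experiment's method boils down to the same sequence of zbchaos commands that shows up in the worker log above: deploy the process, delete the data of one broker at a time while verifying readiness in between, and finally verify instance creation on every partition. The following is only a minimal Go sketch that drives the zbchaos CLI directly (it assumes a zbchaos binary on the PATH); the real experiment is the JSON definition executed by the chaos worker.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// run invokes the zbchaos CLI with the given arguments and aborts on failure.
func run(args ...string) {
	out, err := exec.Command("zbchaos", args...).CombinedOutput()
	fmt.Print(string(out))
	if err != nil {
		log.Fatalf("zbchaos %v failed: %v", args, err)
	}
}

func main() {
	// Steady state: all pods ready, then deploy the process used for verification.
	run("verify", "readiness")
	run("deploy", "process")

	// Delete the data of one broker at a time and wait until the cluster has recovered.
	for nodeID := 0; nodeID <= 2; nodeID++ {
		run("dataloss", "delete", fmt.Sprintf("--nodeId=%d", nodeID))
		run("verify", "readiness")
	}

	// No data loss: instance creation must still succeed on every partition.
	for partition := 1; partition <= 3; partition++ {
		run("verify", "instance-creation", fmt.Sprintf("--partitionId=%d", partition))
	}
}
```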
In this experiment we don't have to call zbchaos dataloss prepare, because there is no need to add init containers. Since we are only deleting one broker at a time, the pod can be restarted immediately.

Related to #4
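For context, the dataloss delete step seen in the log above only removes the broker's persistent volume, its claim, and the pod, which the StatefulSet then recreates. The snippet below is a rough client-go illustration of that step, reusing the namespace and resource names from the log; it is an assumption-laden sketch, not the actual zbchaos implementation.

```go
package main

import (
	"context"
	"flag"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := flag.String("kubeconfig", filepath.Join(homedir.HomeDir(), ".kube", "config"), "path to kubeconfig")
	nodeID := flag.Int("nodeId", 0, "broker node id whose data should be deleted")
	flag.Parse()

	config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	// Names taken from the log above; adjust them to your own release and namespace.
	namespace := "zell-chaos"
	pvcName := fmt.Sprintf("data-zell-chaos-zeebe-%d", *nodeID)
	podName := fmt.Sprintf("zell-chaos-zeebe-%d", *nodeID)

	// Resolve the bound PV via the PVC, then delete PV, PVC and pod,
	// mirroring the order of the log messages above.
	pvc, err := client.CoreV1().PersistentVolumeClaims(namespace).Get(ctx, pvcName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	if err := client.CoreV1().PersistentVolumes().Delete(ctx, pvc.Spec.VolumeName, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	if err := client.CoreV1().PersistentVolumeClaims(namespace).Delete(ctx, pvcName, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	// Deleting the pod lets the StatefulSet recreate it with a fresh volume.
	if err := client.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	fmt.Printf("Deleted data of broker %d; the pod will be restarted\n", *nodeID)
}
```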