feat: verify cluster can survive dataloss of one broker at a time. #275
Conversation
Force-pushed from 1a9798e to 1d50652
@Zelldon The integration test fails even though I did not change anything related to it. Any idea?
@deepthidevaki Could you try to rebase to see whether the failure is gone?
Force-pushed from 1d50652 to f4e80fd
Amazing 🤩 Thanks for this @deepthidevaki 🚀
I love how easy it is now to add new experiments :)
I also ran your experiment with our integration test 😄
Create ChaosToolkit instance
Open workers: [zbchaos, readExperiments].
Handle read experiments job [key: 2251799813685265]
Read experiments successful, complete job with: {"experiments":[{"contributions":{"availability":"high","reliability":"high"},"description":"Zeebe should be able to handle data loss of one broker at a time.","method":[{"name":"Deploy process","provider":{"arguments":["deploy","process"],"path":"zbchaos","type":"process"},"timeout":900,"type":"action"},{"name":"Delete data of broker 0 and restart the pod","pauses":{"after":60},"provider":{"arguments":["dataloss","delete","--nodeId=0"],"path":"zbchaos","type":"process"},"type":"action"},{"name":"Broker 0 can recover after data loss","provider":{"arguments":["verify","readiness"],"path":"zbchaos","type":"process"},"timeout":900,"type":"probe"},{"name":"Delete data of broker 1 and restart the pod","pauses":{"after":60},"provider":{"arguments":["dataloss","delete","--nodeId=1"],"path":"zbchaos","type":"process"},"type":"action"},{"name":"Broker 1 can recover after data loss","provider":{"arguments":["verify","readiness"],"path":"zbchaos","type":"process"},"timeout":900,"type":"probe"},{"name":"Delete data of broker 2 and restart the pod","pauses":{"after":60},"provider":{"arguments":["dataloss","delete","--nodeId=2"],"path":"zbchaos","type":"process"},"type":"action"},{"name":"Broker 2 can recover after data loss","provider":{"arguments":["verify","readiness"],"path":"zbchaos","type":"process"},"timeout":900,"type":"probe"},{"name":"There is no data loss. Should be able to create process instances on partition 1","provider":{"arguments":["verify","instance-creation","--partitionId=1"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"probe"},{"name":"There is no data loss. Should be able to create process instances on partition 2","provider":{"arguments":["verify","instance-creation","--partitionId=2"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"probe"},{"name":"There is no data loss. Should be able to create process instances on partition 3","provider":{"arguments":["verify","instance-creation","--partitionId=3"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"probe"}],"rollbacks":[],"steady-state-hypothesis":{"probes":[{"name":"All pods should be ready","provider":{"arguments":["verify","readiness"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"probe"}],"title":"Zeebe is alive"},"title":"Zeebe dataloss experiment","version":"0.1.0"},{"contributions":{"availability":"high","reliability":"high"},"description":"This fake experiment is just to test the integration with Zeebe and zbchaos workers","method":[{"name":"Show again the version","provider":{"arguments":["version"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"action"}],"rollbacks":[],"steady-state-hypothesis":{"probes":[{"name":"Show version","provider":{"arguments":["version"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"probe"}],"title":"Zeebe is alive"},"title":"This is a fake experiment","version":"0.1.0"}]}.
Handle zbchaos job [key: 2251799813685328]
Running command with args: [verify readiness]
Connecting to zell-chaos
Running experiment in self-managed environment.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds.
All Zeebe nodes are running.
Handle zbchaos job [key: 2251799813685734]
Running command with args: [deploy process]
Connecting to zell-chaos
Running experiment in self-managed environment.
Successfully created port forwarding tunnel
Deploy file bpmn/one_task.bpmn (size: 2526 bytes).
Deployed process model bpmn/one_task.bpmn successful with key 2251799813685431.
Deployed given process model , under key 2251799813685431!
Handle zbchaos job [key: 2251799813685786]
Running command with args: [dataloss delete --nodeId=0]
Connecting to zell-chaos
Running experiment in self-managed environment.
Deleting PV pvc-3b9d67f4-0394-4a05-b553-6c256225e2db
Deleting PVC data-zell-chaos-zeebe-0 in namespace zell-chaos
Deleted pod zell-chaos-zeebe-0 in namespace zell-chaos
Handle zbchaos job [key: 2251799813687017]
Running command with args: [verify readiness]
Connecting to zell-chaos
Running experiment in self-managed environment.
All Zeebe nodes are running.
Handle zbchaos job [key: 2251799813687058]
Running command with args: [dataloss delete --nodeId=1]
Connecting to zell-chaos
Running experiment in self-managed environment.
Deleting PV pvc-7a2ff9c3-01c1-448a-a94a-98700c976044
Deleting PVC data-zell-chaos-zeebe-1 in namespace zell-chaos
Deleted pod zell-chaos-zeebe-1 in namespace zell-chaos
Handle zbchaos job [key: 2251799813688290]
Running command with args: [verify readiness]
Connecting to zell-chaos
Running experiment in self-managed environment.
All Zeebe nodes are running.
Handle zbchaos job [key: 2251799813688332]
Running command with args: [dataloss delete --nodeId=2]
Connecting to zell-chaos
Running experiment in self-managed environment.
Deleting PV pvc-83fe86fc-7a59-4e88-9e5c-3d6d70cb9818
Deleting PVC data-zell-chaos-zeebe-2 in namespace zell-chaos
Deleted pod zell-chaos-zeebe-2 in namespace zell-chaos
Handle zbchaos job [key: 2251799813689567]
Running command with args: [verify readiness]
Connecting to zell-chaos
Running experiment in self-managed environment.
Pod zell-chaos-zeebe-2 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-2 is in phase Running, but not ready. Wait for some seconds.
All Zeebe nodes are running.
Handle zbchaos job [key: 2251799813689629]
Running command with args: [verify instance-creation --partitionId=1]
Connecting to zell-chaos
Running experiment in self-managed environment.
Successfully created port forwarding tunnel
Send create process instance command, with BPMN process ID 'benchmark' and version '-1' (-1 means latest) [variables: '', awaitResult: false]
Created process instance with key 4503599627381404 on partition 2, required partition 1.
Send create process instance command, with BPMN process ID 'benchmark' and version '-1' (-1 means latest) [variables: '', awaitResult: false]
Created process instance with key 6755399441066664 on partition 3, required partition 1.
Send create process instance command, with BPMN process ID 'benchmark' and version '-1' (-1 means latest) [variables: '', awaitResult: false]
Created process instance with key 2251799813696273 on partition 1, required partition 1.
The steady-state was successfully verified!
Handle zbchaos job [key: 2251799813689697]
Running command with args: [verify instance-creation --partitionId=2]
Connecting to zell-chaos
Running experiment in self-managed environment.
Successfully created port forwarding tunnel
Send create process instance command, with BPMN process ID 'benchmark' and version '-1' (-1 means latest) [variables: '', awaitResult: false]
Created process instance with key 4503599627381426 on partition 2, required partition 2.
The steady-state was successfully verified!
Handle zbchaos job [key: 2251799813689741]
Running command with args: [verify instance-creation --partitionId=3]
Connecting to zell-chaos
Running experiment in self-managed environment.
Successfully created port forwarding tunnel
Send create process instance command, with BPMN process ID 'benchmark' and version '-1' (-1 means latest) [variables: '', awaitResult: false]
Created process instance with key 6755399441066692 on partition 3, required partition 3.
The steady-state was successfully verified!
Handle zbchaos job [key: 2251799813689787]
Running command with args: [verify readiness]
Connecting to zell-chaos
Running experiment in self-managed environment.
All Zeebe nodes are running.
Handle zbchaos job [key: 2251799813689879]
Running command with args: [version]
zbchaos development (commit: HEAD)
Handle zbchaos job [key: 2251799813689922]
Running command with args: [version]
zbchaos development (commit: HEAD)
Handle zbchaos job [key: 2251799813689965]
Running command with args: [version]
zbchaos development (commit: HEAD)
Instance 2251799813685255 [definition 2251799813685253 ] completed
--- PASS: Test_ShouldBeAbleToRunExperiments (231.88s)
PASS
Process finished with the exit code 0
I hope I can soon create some GHA workflow to set up a Zeebe cluster in our k8s environment, which we can then run the experiments against as an integration test.
After a broker has recovered from the loss of its disk, the cluster should be able to survive another broker's disk loss. After a series of disk losses of one broker at a time, the cluster should not suffer data loss. We verify this by creating instances of a process that was deployed before the disk loss.
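For illustration, the experiment's method boils down to the same sequence of zbchaos commands that shows up in the worker log above: deploy the process, delete the data of one broker at a time while verifying readiness in between, and finally verify instance creation on every partition. The following is only a minimal Go sketch that drives the zbchaos CLI directly (it assumes a zbchaos binary on the PATH); the real experiment is the JSON definition executed by the chaos worker.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// run invokes the zbchaos CLI with the given arguments and aborts on failure.
func run(args ...string) {
	out, err := exec.Command("zbchaos", args...).CombinedOutput()
	fmt.Print(string(out))
	if err != nil {
		log.Fatalf("zbchaos %v failed: %v", args, err)
	}
}

func main() {
	// Steady state: all pods ready, then deploy the process used for verification.
	run("verify", "readiness")
	run("deploy", "process")

	// Delete the data of one broker at a time and wait until the cluster has recovered.
	for nodeID := 0; nodeID <= 2; nodeID++ {
		run("dataloss", "delete", fmt.Sprintf("--nodeId=%d", nodeID))
		run("verify", "readiness")
	}

	// No data loss: instance creation must still succeed on every partition.
	for partition := 1; partition <= 3; partition++ {
		run("verify", "instance-creation", fmt.Sprintf("--partitionId=%d", partition))
	}
}
```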
In this experiment we don't have to call zbchaos dataloss prepare, because there is no need to add init containers. Since we are only deleting one broker at a time, the pod can be restarted immediately.

Related to #4
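For context, the dataloss delete step seen in the log above only removes the broker's persistent volume, its claim, and the pod, which the StatefulSet then recreates. The snippet below is a rough client-go illustration of that step, reusing the namespace and resource names from the log; it is an assumption-laden sketch, not the actual zbchaos implementation.

```go
package main

import (
	"context"
	"flag"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := flag.String("kubeconfig", filepath.Join(homedir.HomeDir(), ".kube", "config"), "path to kubeconfig")
	nodeID := flag.Int("nodeId", 0, "broker node id whose data should be deleted")
	flag.Parse()

	config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	// Names taken from the log above; adjust them to your own release and namespace.
	namespace := "zell-chaos"
	pvcName := fmt.Sprintf("data-zell-chaos-zeebe-%d", *nodeID)
	podName := fmt.Sprintf("zell-chaos-zeebe-%d", *nodeID)

	// Resolve the bound PV via the PVC, then delete PV, PVC and pod,
	// mirroring the order of the log messages above.
	pvc, err := client.CoreV1().PersistentVolumeClaims(namespace).Get(ctx, pvcName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	if err := client.CoreV1().PersistentVolumes().Delete(ctx, pvc.Spec.VolumeName, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	if err := client.CoreV1().PersistentVolumeClaims(namespace).Delete(ctx, pvcName, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	// Deleting the pod lets the StatefulSet recreate it with a fresh volume.
	if err := client.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	fmt.Printf("Deleted data of broker %d; the pod will be restarted\n", *nodeID)
}
```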