
feat: verify cluster can survive dataloss of one broker at a time. #275

Merged
merged 1 commit on Dec 7, 2022

Conversation


@deepthidevaki deepthidevaki commented Dec 7, 2022

After a broker has recovered from disk loss, the cluster should be able to survive another broker's disk loss. After a series of disk losses, one broker at a time, the cluster should not suffer data loss. We verify this by creating instances of the process that was deployed before the disk loss.

In this experiment we don't have to call zbchaos dataloss prepare, because there is no need to add init containers. Since we delete the data of only one broker at a time, the pod can be restarted immediately.
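
For illustration, the delete step is essentially a PVC plus pod deletion, after which the StatefulSet recreates the pod with an empty volume. Here is a rough client-go sketch of that idea (not the actual zbchaos implementation; the namespace, release name, and kubeconfig handling are assumptions, with resource names following the log output below):

```go
// Rough sketch of the "delete data of broker N" step: remove the broker's PVC and pod
// so the StatefulSet recreates the pod with a fresh, empty volume.
// Not the actual zbchaos code; names follow the log output (e.g. data-<release>-zeebe-<n>).
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func deleteBrokerData(client *kubernetes.Clientset, namespace, release string, nodeID int) error {
	ctx := context.Background()
	pvc := fmt.Sprintf("data-%s-zeebe-%d", release, nodeID)
	pod := fmt.Sprintf("%s-zeebe-%d", release, nodeID)

	// Deleting the PVC alone is not enough while the pod still mounts it,
	// so the pod is deleted as well and the StatefulSet brings it back.
	if err := client.CoreV1().PersistentVolumeClaims(namespace).Delete(ctx, pvc, metav1.DeleteOptions{}); err != nil {
		return err
	}
	return client.CoreV1().Pods(namespace).Delete(ctx, pod, metav1.DeleteOptions{})
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	if err := deleteBrokerData(client, "zell-chaos", "zell-chaos", 0); err != nil {
		log.Fatal(err)
	}
}
```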

related to #4
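
For reference, the experiment's method boils down to the following sequence, sketched here as a small Go driver that just shells out to the zbchaos CLI (it assumes a 3-broker, 3-partition cluster and the zbchaos binary on the PATH; the real experiment is the declarative JSON picked up by the zbchaos worker):

```go
// Minimal sketch of the experiment's method: deploy a process, wipe one broker at a time,
// wait for readiness, and finally prove that every partition still accepts instance creation.
// Assumes a 3-broker / 3-partition cluster and the zbchaos binary on the PATH.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

func zbchaos(args ...string) {
	out, err := exec.Command("zbchaos", args...).CombinedOutput()
	fmt.Print(string(out))
	if err != nil {
		log.Fatalf("zbchaos %v failed: %v", args, err)
	}
}

func main() {
	// Deploy the process before any data is lost.
	zbchaos("deploy", "process")

	// Delete the data of one broker at a time; the experiment pauses 60s after each delete
	// and then verifies that all pods become ready again.
	for nodeID := 0; nodeID <= 2; nodeID++ {
		zbchaos("dataloss", "delete", fmt.Sprintf("--nodeId=%d", nodeID))
		time.Sleep(60 * time.Second)
		zbchaos("verify", "readiness")
	}

	// No data loss: instance creation must still succeed on every partition.
	for partitionID := 1; partitionID <= 3; partitionID++ {
		zbchaos("verify", "instance-creation", fmt.Sprintf("--partitionId=%d", partitionID))
	}
}
```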

@deepthidevaki deepthidevaki force-pushed the dd-4-dataloss-experiment branch from 1a9798e to 1d50652 on December 7, 2022 10:21
@deepthidevaki deepthidevaki marked this pull request as ready for review December 7, 2022 10:27
@deepthidevaki

@Zelldon The integration test fails even though I did not change anything related to it. Any idea?

@ChrisKujawa

Yeah, I think this is fixed with #271, related to d779de4.

@ChrisKujawa

@deepthidevaki could you try to rebase to see whether the failure is gone?

@deepthidevaki deepthidevaki force-pushed the dd-4-dataloss-experiment branch from 1d50652 to f4e80fd on December 7, 2022 11:45

@ChrisKujawa ChrisKujawa left a comment


Amazing 🤩 Thanks for this @deepthidevaki 🚀

I love how easy it is now to add new experiments :)

I also ran your experiment with our integration test 😄

Create ChaosToolkit instance
Open workers: [zbchaos, readExperiments].
Handle read experiments job [key: 2251799813685265]
Read experiments successful, complete job with: {"experiments":[{"contributions":{"availability":"high","reliability":"high"},"description":"Zeebe should be able to handle data loss of one broker at a time.","method":[{"name":"Deploy process","provider":{"arguments":["deploy","process"],"path":"zbchaos","type":"process"},"timeout":900,"type":"action"},{"name":"Delete data of broker 0 and restart the pod","pauses":{"after":60},"provider":{"arguments":["dataloss","delete","--nodeId=0"],"path":"zbchaos","type":"process"},"type":"action"},{"name":"Broker 0 can recover after data loss","provider":{"arguments":["verify","readiness"],"path":"zbchaos","type":"process"},"timeout":900,"type":"probe"},{"name":"Delete data of broker 1 and restart the pod","pauses":{"after":60},"provider":{"arguments":["dataloss","delete","--nodeId=1"],"path":"zbchaos","type":"process"},"type":"action"},{"name":"Broker 1 can recover after data loss","provider":{"arguments":["verify","readiness"],"path":"zbchaos","type":"process"},"timeout":900,"type":"probe"},{"name":"Delete data of broker 2 and restart the pod","pauses":{"after":60},"provider":{"arguments":["dataloss","delete","--nodeId=2"],"path":"zbchaos","type":"process"},"type":"action"},{"name":"Broker 2 can recover after data loss","provider":{"arguments":["verify","readiness"],"path":"zbchaos","type":"process"},"timeout":900,"type":"probe"},{"name":"There is no data loss. Should be able to create process instances on partition 1","provider":{"arguments":["verify","instance-creation","--partitionId=1"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"probe"},{"name":"There is no data loss. Should be able to create process instances on partition 2","provider":{"arguments":["verify","instance-creation","--partitionId=2"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"probe"},{"name":"There is no data loss. Should be able to create process instances on partition 3","provider":{"arguments":["verify","instance-creation","--partitionId=3"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"probe"}],"rollbacks":[],"steady-state-hypothesis":{"probes":[{"name":"All pods should be ready","provider":{"arguments":["verify","readiness"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"probe"}],"title":"Zeebe is alive"},"title":"Zeebe dataloss experiment","version":"0.1.0"},{"contributions":{"availability":"high","reliability":"high"},"description":"This fake experiment is just to test the integration with Zeebe and zbchaos workers","method":[{"name":"Show again the version","provider":{"arguments":["version"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"action"}],"rollbacks":[],"steady-state-hypothesis":{"probes":[{"name":"Show version","provider":{"arguments":["version"],"path":"zbchaos","timeout":900,"type":"process"},"tolerance":0,"type":"probe"}],"title":"Zeebe is alive"},"title":"This is a fake experiment","version":"0.1.0"}]}.
Handle zbchaos job [key: 2251799813685328]
Running command with args: [verify readiness] 
Connecting to zell-chaos
Running experiment in self-managed environment.
Pod zell-chaos-zeebe-1 is in phase Pending, but not ready. Wait for some seconds. (repeated 14×)
Pod zell-chaos-zeebe-1 is in phase Running, but not ready. Wait for some seconds. (repeated 21×)
All Zeebe nodes are running.
Handle zbchaos job [key: 2251799813685734]
Running command with args: [deploy process] 
Connecting to zell-chaos
Running experiment in self-managed environment.
Successfully created port forwarding tunnel
Deploy file bpmn/one_task.bpmn (size: 2526 bytes).
Deployed process model bpmn/one_task.bpmn successful with key 2251799813685431.
Deployed given process model , under key 2251799813685431!
Handle zbchaos job [key: 2251799813685786]
Running command with args: [dataloss delete --nodeId=0] 
Connecting to zell-chaos
Running experiment in self-managed environment.
Deleting PV pvc-3b9d67f4-0394-4a05-b553-6c256225e2db
Deleting PVC data-zell-chaos-zeebe-0 in namespace zell-chaos 
Deleted pod zell-chaos-zeebe-0 in namespace zell-chaos
Handle zbchaos job [key: 2251799813687017]
Running command with args: [verify readiness] 
Connecting to zell-chaos
Running experiment in self-managed environment.
All Zeebe nodes are running.
Handle zbchaos job [key: 2251799813687058]
Running command with args: [dataloss delete --nodeId=1] 
Connecting to zell-chaos
Running experiment in self-managed environment.
Deleting PV pvc-7a2ff9c3-01c1-448a-a94a-98700c976044
Deleting PVC data-zell-chaos-zeebe-1 in namespace zell-chaos 
Deleted pod zell-chaos-zeebe-1 in namespace zell-chaos
Handle zbchaos job [key: 2251799813688290]
Running command with args: [verify readiness] 
Connecting to zell-chaos
Running experiment in self-managed environment.
All Zeebe nodes are running.
Handle zbchaos job [key: 2251799813688332]
Running command with args: [dataloss delete --nodeId=2] 
Connecting to zell-chaos
Running experiment in self-managed environment.
Deleting PV pvc-83fe86fc-7a59-4e88-9e5c-3d6d70cb9818
Deleting PVC data-zell-chaos-zeebe-2 in namespace zell-chaos 
Deleted pod zell-chaos-zeebe-2 in namespace zell-chaos
Handle zbchaos job [key: 2251799813689567]
Running command with args: [verify readiness] 
Connecting to zell-chaos
Running experiment in self-managed environment.
Pod zell-chaos-zeebe-2 is in phase Running, but not ready. Wait for some seconds.
Pod zell-chaos-zeebe-2 is in phase Running, but not ready. Wait for some seconds.
All Zeebe nodes are running.
Handle zbchaos job [key: 2251799813689629]
Running command with args: [verify instance-creation --partitionId=1] 
Connecting to zell-chaos
Running experiment in self-managed environment.
Successfully created port forwarding tunnel
Send create process instance command, with BPMN process ID 'benchmark' and version '-1' (-1 means latest) [variables: '', awaitResult: false]
Created process instance with key 4503599627381404 on partition 2, required partition 1.
Send create process instance command, with BPMN process ID 'benchmark' and version '-1' (-1 means latest) [variables: '', awaitResult: false]
Created process instance with key 6755399441066664 on partition 3, required partition 1.
Send create process instance command, with BPMN process ID 'benchmark' and version '-1' (-1 means latest) [variables: '', awaitResult: false]
Created process instance with key 2251799813696273 on partition 1, required partition 1.
The steady-state was successfully verified!
Handle zbchaos job [key: 2251799813689697]
Running command with args: [verify instance-creation --partitionId=2] 
Connecting to zell-chaos
Running experiment in self-managed environment.
Successfully created port forwarding tunnel
Send create process instance command, with BPMN process ID 'benchmark' and version '-1' (-1 means latest) [variables: '', awaitResult: false]
Created process instance with key 4503599627381426 on partition 2, required partition 2.
The steady-state was successfully verified!
Handle zbchaos job [key: 2251799813689741]
Running command with args: [verify instance-creation --partitionId=3] 
Connecting to zell-chaos
Running experiment in self-managed environment.
Successfully created port forwarding tunnel
Send create process instance command, with BPMN process ID 'benchmark' and version '-1' (-1 means latest) [variables: '', awaitResult: false]
Created process instance with key 6755399441066692 on partition 3, required partition 3.
The steady-state was successfully verified!
Handle zbchaos job [key: 2251799813689787]
Running command with args: [verify readiness] 
Connecting to zell-chaos
Running experiment in self-managed environment.
All Zeebe nodes are running.
Handle zbchaos job [key: 2251799813689879]
Running command with args: [version] 
zbchaos development (commit: HEAD)
Handle zbchaos job [key: 2251799813689922]
Running command with args: [version] 
zbchaos development (commit: HEAD)
Handle zbchaos job [key: 2251799813689965]
Running command with args: [version] 
zbchaos development (commit: HEAD)
Instance 2251799813685255 [definition 2251799813685253 ] completed
--- PASS: Test_ShouldBeAbleToRunExperiments (231.88s)
PASS

Process finished with the exit code 0

I hope I can soon create some GHA to set up a Zeebe cluster in our k8s env, which we can then run the experiments against as an integration test.
