Scripts not resilient to gateway restarts #136

Closed
deepthidevaki opened this issue Apr 21, 2022 · 7 comments
Labels
bug Something isn't working

Comments

@deepthidevaki
Contributor

Here the script finds one gateway
https://github.com/zeebe-io/zeebe-chaos/blob/aee26dc8070b93e31a37d14798504149bc867498/chaos-workers/chaos-experiments/scripts/start-instance-on-partition-with-version.sh#L19

And then it tries to exec into the gateway
https://github.com/zeebe-io/zeebe-chaos/blob/aee26dc8070b93e31a37d14798504149bc867498/chaos-workers/chaos-experiments/scripts/start-instance-on-partition-with-version.sh#L31

However, between the execution of these two lines, the gateway pod was terminated and a new pod was started to replace it. The script still tried to access the terminated gateway and eventually timed out, failing the experiment.

deepthidevaki added the bug label Apr 21, 2022
@ChrisKujawa
Member

It might make sense to go through the service to be more resilient, or to use the helper retryUntilSuccess as we do here: https://github.com/zeebe-io/zeebe-chaos/blob/aee26dc8070b93e31a37d14798504149bc867498/chaos-workers/chaos-experiments/scripts/start-instance-on-partition-with-version.sh#L38
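One possible reading of "use the service" is to let kubectl resolve a pod behind the gateway service at call time, instead of pinning a pod name earlier in the script. The sketch below is only an illustration under assumptions: the function name `exec_via_service`, the `GATEWAY_SERVICE` variable, and the default service name `zeebe-gateway` are hypothetical and not taken from the chaos scripts.

```shell
#!/bin/bash

# Runs a command in whichever pod kubectl resolves behind the gateway service
# at the moment of the call, so no pod name is cached between script steps.
# The default service name "zeebe-gateway" is an assumption for illustration.
exec_via_service() {
  local namespace="$1"; shift
  local service="${GATEWAY_SERVICE:-zeebe-gateway}"
  kubectl exec "svc/${service}" -n "$namespace" -- "$@"
}
```

A call could then look like `exec_via_service "$namespace" zbctl create instance benchmark --insecure`. This narrows the window in which the pod can disappear but does not close it, so wrapping the call in a retry would likely still be needed.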

@deepthidevaki
Contributor Author

An instance where this happened:

2022-04-21 04:34:21.442 CEST chaos-worker "++ kubectl exec zeebe-gateway-c7fdf4f5c-v7mzz -n 0b25276f-1113-4627-9c17-5b867256e62a-zeebe -- zbctl create instance benchmark --insecure"

Debug 2022-04-21 04:34:21.505 CEST chaos-worker "error: cannot exec into a container in a completed pod; current phase is Failed"

The pod zeebe-gateway-c7fdf4f5c-v7mzz was terminated before this time.

@deepthidevaki
Contributor Author

> Or use the helper retryUntilSuccess as we do here

It is already using retryUntilSuccess. The problem is that it keeps retrying against the same terminated gateway.

@ChrisKujawa
Member

Yeah, because getGateway is not called inside the retry loop.
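A minimal sketch of what moving the lookup into the loop could look like: the gateway pod is resolved again on every attempt, so a retry after a gateway restart targets the replacement pod. The functions `get_gateway`, `exec_on_gateway`, and `retry_until_success` are hypothetical stand-ins for the helpers in the chaos scripts, whose actual implementations may differ; the label selector and the `MAX_ATTEMPTS` and `RETRY_DELAY_SECONDS` variables are assumptions as well.

```shell
#!/bin/bash

# Looks up the name of one currently running gateway pod.
# The label selector is an assumption for illustration.
get_gateway() {
  local namespace="$1"
  kubectl get pods -n "$namespace" \
    -l app.kubernetes.io/component=zeebe-gateway \
    --field-selector=status.phase=Running \
    -o jsonpath='{.items[0].metadata.name}'
}

# One attempt: resolve the gateway pod, then exec the given command in it.
# Because the lookup happens here, every retry sees the current pod.
exec_on_gateway() {
  local namespace="$1"; shift
  local gateway
  gateway="$(get_gateway "$namespace")" || return 1
  [ -n "$gateway" ] || return 1
  kubectl exec "$gateway" -n "$namespace" -- "$@"
}

# Retries the given command until it succeeds or the attempts run out.
retry_until_success() {
  local max_attempts="${MAX_ATTEMPTS:-10}"
  local delay="${RETRY_DELAY_SECONDS:-5}"
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

With these in place, the failing step would be invoked as `retry_until_success exec_on_gateway "$namespace" zbctl create instance benchmark --insecure`, rather than resolving the pod name once before the retry starts.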

@deepthidevaki
Contributor Author

deepthidevaki commented Apr 21, 2022

Why don't we execute zbctl on the chaos worker? We have the authenticationDetails for the cluster available in the process variables.

@ChrisKujawa
Member

Currently, the script is independent of where and against what it is executed: local, Helm, cloud/SaaS, etc.

@ChrisKujawa
Member

I think this is no longer an issue: if we run into a problem, the zbchaos worker will restart and retry later. Gateways are now chosen at random, see #297.
