Scripts not resilient to gateway restarts #136

deepthidevaki · 2022-04-21T12:06:12Z

Here the script finds one gateway
https://github.com/zeebe-io/zeebe-chaos/blob/aee26dc8070b93e31a37d14798504149bc867498/chaos-workers/chaos-experiments/scripts/start-instance-on-partition-with-version.sh#L19

And then it tries to exec into the gateway
https://github.com/zeebe-io/zeebe-chaos/blob/aee26dc8070b93e31a37d14798504149bc867498/chaos-workers/chaos-experiments/scripts/start-instance-on-partition-with-version.sh#L31

Between the execution of these two lines, however, the gateway pod was terminated and a new pod was started to replace it. The script kept trying to access the terminated gateway and eventually timed out, failing the experiment.

ChrisKujawa · 2022-04-21T12:07:37Z

Might make sense to use the service to be more resilient. Or use the helper retryUntilSuccess as we do here https://github.com/zeebe-io/zeebe-chaos/blob/aee26dc8070b93e31a37d14798504149bc867498/chaos-workers/chaos-experiments/scripts/start-instance-on-partition-with-version.sh#L38

deepthidevaki · 2022-04-21T12:07:49Z

An instance where this happened:

2022-04-21 04:34:21.442 CEST
chaos-worker
"++ kubectl exec zeebe-gateway-c7fdf4f5c-v7mzz -n 0b25276f-1113-4627-9c17-5b867256e62a-zeebe -- zbctl create instance benchmark --insecure"
Debug
2022-04-21 04:34:21.505 CEST
chaos-worker
"error: cannot exec into a container in a completed pod; current phase is Failed"

The pod zeebe-gateway-c7fdf4f5c-v7mzz was terminated before this time.

deepthidevaki · 2022-04-21T12:09:02Z

> Or use the helper retryUntilSuccess as we do here

It is already using retryUntilSuccess. The problem is that it keeps retrying against the same terminated gateway.

ChrisKujawa · 2022-04-21T12:10:10Z

Yeah because getGateway is not included in the loop.
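
To illustrate the point above, here is a minimal sketch of moving the gateway lookup inside the retried unit, so that every attempt resolves a fresh pod. The function names (find_gateway, run_on_gateway, retry_until_success), the label selector, and the RETRY_DELAY variable are hypothetical stand-ins for illustration; they are not the definitions of getGateway or retryUntilSuccess in the repository.

```shell
# Sketch only: helper names and the label selector below are assumptions,
# not the actual helpers from the zeebe-chaos scripts.

# Print the name of one gateway pod that is currently in phase Running.
find_gateway() {
  local namespace=$1
  kubectl get pods -n "$namespace" \
    -l app.kubernetes.io/component=zeebe-gateway \
    --field-selector=status.phase=Running \
    -o jsonpath='{.items[0].metadata.name}'
}

# Resolve the gateway and exec into it as ONE unit, so that a retry
# performs a new lookup instead of reusing a terminated pod name.
run_on_gateway() {
  local namespace=$1
  shift
  local gateway
  gateway=$(find_gateway "$namespace") || return 1
  [ -n "$gateway" ] || return 1
  kubectl exec "$gateway" -n "$namespace" -- "$@"
}

# Run a command up to <attempts> times, stopping at the first success.
retry_until_success() {
  local attempts=$1
  shift
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "${RETRY_DELAY:-1}"
  done
  return 1
}
```

With these definitions, a call such as `retry_until_success 10 run_on_gateway "$namespace" zbctl create instance benchmark --insecure` would look up the gateway again on every attempt. Filtering on status.phase=Running is one way to skip pods that have already failed, though a pod can still be terminated between the lookup and the exec, which is why the lookup belongs inside the retried unit.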

deepthidevaki · 2022-04-21T12:12:59Z

Why don't we execute zbctl on the chaos worker? We have the authenticationDetails for the cluster available in the process variables.

ChrisKujawa · 2022-04-21T12:15:15Z

Currently, it is independent of where and against what it is executed. Local, helm, cloud/saas etc.

ChrisKujawa · 2022-12-21T09:22:05Z

I think this is no longer an issue. If we experience a failure, the zbchaos worker will restart and retry later. Gateways are now chosen randomly (#297).

deepthidevaki added the bug Something isn't working label Apr 21, 2022

ChrisKujawa closed this as completed Dec 21, 2022
