Hypothesis: Partition one Broker with Gateway doesn't affect other partitions #29

ChrisKujawa · 2020-06-26T06:24:37Z

Hypothesis

We believe that when we isolate one Broker (the leader of a partition) from the Gateway, other partitions are not affected.

Expected during the experiment:

The topology stays the same, since the gateway can ping indirectly (whether this is ideal is debatable)
When Broker 0 is leader for a partition, processing for that partition stops, but other partitions should not be affected
We can somehow determine from the metrics that they can't connect to each other
After reconnecting, the affected partition should recover
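
To make the second expectation checkable, here is a minimal sketch of how the outcome could be evaluated after a run. It is not part of the experiment tooling; `CommandResult`, `affected_partitions`, and `hypothesis_holds` are hypothetical names, and the input is assumed to be a list of per-command outcomes collected by whatever client drives the load.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class CommandResult:
    """Outcome of one command sent during the experiment (hypothetical record)."""
    partition_id: int
    timed_out: bool


def affected_partitions(results: Iterable[CommandResult]) -> Set[int]:
    """Return the ids of all partitions that saw at least one timed-out command."""
    return {r.partition_id for r in results if r.timed_out}


def hypothesis_holds(results: Iterable[CommandResult],
                     isolated_partitions: Iterable[int]) -> bool:
    """True if timeouts occurred only on partitions led by the isolated broker."""
    return affected_partitions(results) <= set(isolated_partitions)
```

For example, if the isolated broker leads partition 2, calling `hypothesis_holds(results, [2])` returns `False` as soon as any command against another partition timed out.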

ChrisKujawa · 2020-06-26T06:25:59Z

Yesterday we ran a chaos experiment to verify this.

Observations:

As expected, we saw no difference in the topology. All commands sent to that partition timed out. Other partitions were not affected 👍 With the metrics we have, we saw that there is no progress in the partition, the partition is still healthy (which makes sense), and a lot of timeouts are happening.

Unfortunately, we need to correlate multiple metrics to conclude that the problem might be due to connectivity issues. I think we can improve here. For example, it is not directly visible that one partition stopped processing. For that, @pihme had a good idea: we will add a new panel that directly shows the current record processing stats. I think this is also useful for exporting, to see directly whether we currently have exporting problems.
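
As a rough illustration of what such a panel would compute, the sketch below derives a per-partition processing rate from two samples of a cumulative "records processed" counter and flags partitions that made no progress. The function names and the input shape (a dict from partition id to counter value) are assumptions for illustration, not the actual metric names or dashboard query.

```python
from typing import Dict, List


def processing_rates(previous: Dict[int, int],
                     current: Dict[int, int],
                     interval_seconds: float) -> Dict[int, float]:
    """Records processed per second for each partition present in both samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    shared = previous.keys() & current.keys()
    return {p: (current[p] - previous[p]) / interval_seconds for p in shared}


def stalled_partitions(previous: Dict[int, int],
                       current: Dict[int, int],
                       interval_seconds: float) -> List[int]:
    """Partitions whose counter did not advance between the two samples."""
    rates = processing_rates(previous, current, interval_seconds)
    return sorted(p for p, rate in rates.items() if rate == 0)
```

A partition that is idle because no work arrives would also show a rate of zero, so a real panel would likely need to be read together with the incoming request rate.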

What else is missing on the metrics side from my point of view:

A panel that shows me that all requests to a specific partition currently time out.
Metrics for the transport between gateway and broker, to better analyze problems like this. It would be nice to have "Introduce gateway-broker transport metrics" (camunda/camunda#4487).
Liveness and health stats of the Gateway in the metrics. I think this is currently not supported?
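
For the first point, a minimal sketch of the aggregation behind such a panel could look like the following. It assumes request outcomes are available as `(partition_id, timed_out)` pairs; the function names and the `threshold` parameter are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def timeout_ratio_by_partition(
        requests: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Fraction of requests that timed out, per partition."""
    totals: Dict[int, int] = defaultdict(int)
    timeouts: Dict[int, int] = defaultdict(int)
    for partition_id, timed_out in requests:
        totals[partition_id] += 1
        if timed_out:
            timeouts[partition_id] += 1
    return {p: timeouts[p] / totals[p] for p in totals}


def fully_timing_out(requests: Iterable[Tuple[int, bool]],
                     threshold: float = 1.0) -> List[int]:
    """Partitions whose timeout ratio is at or above the threshold."""
    ratios = timeout_ratio_by_partition(requests)
    return sorted(p for p, ratio in ratios.items() if ratio >= threshold)
```

With the default threshold of 1.0, only partitions where every observed request timed out are reported, which matches the situation we saw for the disconnected partition.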

After reconnecting the nodes, we saw that the related partition started to process again. Interestingly, it seems that some traffic piled up, and after reconnecting we saw a burst against partition one (partition 2 was disconnected), but this caused no issues.

I think this was again a good and interesting experiment, and it gave us a bit more insight into what else we need.
