
New day, new chaos. :skull: In today's chaos day, I want to pick up a topic that has bothered people for a long time. I already ran a [chaos day three years ago](https://zeebe-io.github.io/zeebe-chaos/2020/07/16/big-multi-instance/) around this topic.

Today, we experiment with large multi-instances again. In the recent patch release [8.2.5](https://github.com/camunda/zeebe/releases/tag/8.2.5), we fixed an issue with spawning larger multi-instances. Previously, if you created a process instance with a large multi-instance, it was likely that the process instance got blacklisted, because the multi-instance spawning ran into `maxMessageSize` limitations.

This means the process instance was stuck and no longer executable. Operate did not show this, which caused a lot of friction and confusion for users. With the recent fix, Zeebe should chunk even large collections into smaller batches to spawn/execute the multi-instance without any issues.
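To illustrate the general idea, here is a minimal sketch of chunking a large collection into smaller batches. Note that this is not Zeebe's actual implementation: Zeebe's batching is driven by size limits such as `maxMessageSize`, not by a fixed element count, and all names here are made up.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkingSketch {

  /** Splits a large input collection into batches of at most maxBatchSize elements. */
  static <T> List<List<T>> chunk(final List<T> items, final int maxBatchSize) {
    final List<List<T>> batches = new ArrayList<>();
    for (int from = 0; from < items.size(); from += maxBatchSize) {
      final int to = Math.min(from + maxBatchSize, items.size());
      batches.add(new ArrayList<>(items.subList(from, to)));
    }
    return batches;
  }
}
```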

**TL;DR;** We were able to see that even large multi-instances can be executed now. :white_check_mark: At some point, we experienced performance regressions (during the creation of new multi-instance elements), but the execution of the process instance doesn't fail anymore. One problem at a time; we will likely investigate further to improve the performance of such a use case.

When we reached the `maxMessageSize`, we got a rejection; if the input collection was even larger, we saw some weird, unexpected errors from NGINX.

<!--truncate-->

## Chaos Experiment

We regularly run game days at Camunda, and for these we also create projects to make incidents etc. reproducible. In today's chaos day, I will reuse some code created by [Philipp Ossler](https://github.com/saig0), thanks for that :bow: Since we mimic customers in such game days, the process is a bit more complex than necessary for a chaos day, but I will keep it like that.

![order-process](order-process.png)

The input collection `items`, which is used in the multi-instance, is generated via:

```java
)).toList();
```
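Only the tail of that snippet is visible above. A complete sketch of how such an input collection could be generated, with a made-up item shape (the actual game-day code may differ):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.IntStream;

public class ItemsGenerator {

  /** Generates `size` items for the multi-instance input collection. */
  static List<Map<String, Object>> generateItems(final int size) {
    return IntStream.range(0, size)
        .mapToObj(i -> Map.<String, Object>of(
            "id", i,
            "name", "item-" + i))
        .toList();
  }
}
```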

In the following experiment, we will play around with the `size` value.

For the experiment, we will use a Camunda 8 SaaS cluster with the generation `Zeebe 8.2.5` (G3-S).
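For context, creating such a process instance with the official Zeebe Java client looks roughly like this. The client setup (SaaS credentials etc.) is omitted and assumed to be configured; the process ID and the `items` variable name are taken from the order process above:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.util.List;
import java.util.Map;

public class StartOrderProcess {

  /** Creates one instance of the order process with the given input collection. */
  static long startInstance(final ZeebeClient client, final List<Map<String, Object>> items) {
    return client
        .newCreateInstanceCommand()
        .bpmnProcessId("order-process") // process ID as it appears in the logs below
        .latestVersion()
        .variables(Map.of("items", items)) // the collection driving the multi-instance
        .send()
        .join()
        .getProcessInstanceKey();
  }
}
```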

### Expected

When creating a process instance with a large collection, we expect, based on the recent bug fix, that the multi-instance creation is batched and completes without issues.

One limiting factor might be the `maxMessageSize` with regard to the input collection, but in this case, I would expect the creation of the process instance to be rejected beforehand.

### Actual

Between the following experiments, I always recreated the clusters to reduce the blast radius and to better understand and isolate the impact.

#### Starting small (20k)

In previous versions, the multi-instance creation already failed quite early. For example, in the game day reproducer project, we had a collection of `20,000` items defined, which we now reuse as a starting point.

The creation of the process instance worked without any issues. We can observe the incremental creation of sub-process instances in Operate, which is great.

![incremental-creation-20k](20k-operate-inc.png)

We can see in the metrics that batch processing is limited to only 2-4 commands per batch. That is an interesting fact, which might explain why it takes a while until all instances of the multi-instance sub-process are created. We can even see rollbacks during batch processing, visible in the "Number of batch processing retries" panel.

![processing-metrics-20k](20k-processing-metrics.png)

The processing queue seems to increase dramatically.

After a while, we can see that all 20k instances are created without any bigger issues. :rocket:

![complete-20k](20k-operate-complete.png)

It took around 10 minutes. Taking a look at the metrics again, we see that in between, bigger command batches were created and processed, which allowed the processing queue to shrink.

![processing-metrics-20k-pt2](20k-processing-metrics-2.png)

In between, the backpressure was quite high, but after the creation of all instances, the cluster is in a healthy state again. The creation of such a multi-instance worked :white_check_mark:

![general-metrics-20k](20k-general-metrics.png)

#### Increase collection (200k)

Again, the creation of such a process instance was not a problem in itself. We can observe the creation of the sub-process instances (multi-instance) in Operate, which happens incrementally.

![incremental-creation-200k](200k-operate-inc.png)

It takes ages until the instances are created (after 3 hours, only ~66k instances are created). Again, we see small command batches here, and there are also rollbacks during batch processing.

![processing-metrics-200k](200k-processing-metrics.png)

The processing of that partition is in this case blocked by the multi-instance creation; we can see that in the 100% backpressure. :x:

![general-metrics-200k](200k-general-metrics.png)

Even after one hour, not all instances are created (not even 20k); this already takes longer than the entire creation of 20,000 instances before.

![incremental-creation-200k-part2](200k-operate-inc2.png)

#### Make it really big (2 million)

To escalate this even more, I increase the input collection again by a factor of 10, to 2 million items.

After the creation attempt, I see the following message in my log:

```
Failed to create process instance of 'order-process'
…
DATA-----------------------------
```



I tried to incrementally decrease the input collection until it worked again; when reaching 250k, I finally saw a more understandable error.

```shell
2023-06-02 13:53:51.485 ERROR 29870 --- [ main] i.c.cloud.gameday.ProcessApplication : Failed to create process instance of 'order-process'
```
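That manual narrowing-down can also be scripted. A rough sketch, reusing the hypothetical helpers from the sketches above and an already configured client (the step size is arbitrary):

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.command.ClientException;

public class FindWorkingSize {

  /** Decreases the collection size step by step until instance creation succeeds. */
  static int findLargestWorkingSize(final ZeebeClient client, final int startSize) {
    for (int size = startSize; size > 0; size -= 50_000) {
      try {
        StartOrderProcess.startInstance(client, ItemsGenerator.generateItems(size));
        return size; // creation succeeded at this size
      } catch (final ClientException e) {
        System.err.printf("Creation failed for size %d: %s%n", size, e.getMessage());
      }
    }
    return 0; // no size worked
  }
}
```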
