diff --git a/chaos-days/blog/2023-06-02-Using-Large-Multi-Instance/index.md b/chaos-days/blog/2023-06-02-Using-Large-Multi-Instance/index.md
index 2febd23b0..53d32c6fb 100644
--- a/chaos-days/blog/2023-06-02-Using-Large-Multi-Instance/index.md
+++ b/chaos-days/blog/2023-06-02-Using-Large-Multi-Instance/index.md
@@ -14,11 +14,11 @@ authors: zell

New day new chaos. :skull: In today's chaos day I want to pick up a topic, which had bothered people for long time. I created a [chaos day three years ago](https://zeebe-io.github.io/zeebe-chaos/2020/07/16/big-multi-instance/) around this topic as well.

-Today, we experiment with large multi-instances again. In the recent patch release [8.2.5](https://github.com/camunda/zeebe/releases/tag/8.2.5) we fixed an issue with spawning larger multi instances. Previously if you have created a process instance with a large multi-instance it was likely that this caused to blacklist the process instance, since the multi-instance spawning running into `maxMessageSize` limitations.
+Today, we experiment with large multi-instances again. In the recent patch release [8.2.5](https://github.com/camunda/zeebe/releases/tag/8.2.5) we fixed an issue with spawning larger multi-instances. Previously, if you created a process instance with a large multi-instance, it was likely that the process instance was blacklisted, since the multi-instance spawning ran into `maxMessageSize` limitations.

-This means the process instance was stuck and was no longer executable. In operate this was not shown and caused a lot of friction or confusion to users. With the recent fix Zeebe should chunk even large collections into smaller batches in order to spawn/execute the multi-instance without any issues.
+This means the process instance was stuck and no longer executable. In Operate this was not shown, which caused a lot of friction and confusion for users. With the recent fix, Zeebe should chunk even large collections into smaller batches to spawn/execute the multi-instance without any issues.

-**TL;DR;** We were able to see that even large multi-instances can be executed now. :white_check_mark: At some point we experienced performance regressions (during creating new multi-instance elements) but the execution of the process instance doesn't fail anymore. One problem at a time, we will likely investigate further in order to improve the performance of such use case.
+**TL;DR;** We were able to see that even large multi-instances can be executed now. :white_check_mark: At some point, we experienced performance regressions (when creating new multi-instance elements), but the execution of the process instance doesn't fail anymore. One problem at a time; we will likely investigate further to improve the performance of such a use case.

When we reached the `maxMessageSize` we got a rejection, if the input collection is too large we see some weird unexpected errors from NGINX.

@@ -26,7 +26,7 @@

## Chaos Experiment

-We do regularly game days in Camunda, and for such we also create projects to make incidents etc. reproducible. In today's chaos day I will reuse some code created by [Philipp Ossler](https://github.com/saig0), thanks for that :bow: Since we mimic in such game days customers, the process is a bit more complex than necessary for such chaos day, but I will keep it like that.
+We regularly run game days at Camunda, and for those we also create projects to make incidents, etc. reproducible. In today's chaos day, I will reuse some code created by [Philipp Ossler](https://github.com/saig0), thanks for that :bow: Since we mimic customers in such game days, the process is a bit more complex than necessary for such a chaos day, but I will keep it like that.

![order-process](order-process.png)

@@ -39,49 +39,49 @@ The input collection `items`, which is used in the multi-instance is generated v
 )).toList();
 ```

-In the following experiment we will play around with the `size` value.
+In the following experiment, we will play around with the `size` value.

-For the experiment we will use an Camunda 8 SaaS cluster with the generation `Zeebe 8.2.5` (G3-S).
+For the experiment, we will use a Camunda 8 SaaS cluster with the generation `Zeebe 8.2.5` (G3-S).
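
+The snippet above only shows the tail of that generator. As a rough sketch (the item fields, credentials, and client wiring below are illustrative assumptions, not the exact game day code, and the order process is assumed to be deployed already), generating the collection and creating the process instance with the Zeebe Java client could look like this:

+```java
+import io.camunda.zeebe.client.ZeebeClient;
+
+import java.util.Map;
+import java.util.stream.IntStream;
+
+public class LargeMultiInstanceExperiment {
+
+  public static void main(String[] args) {
+    final int size = 20_000; // the value we vary between the experiments
+
+    // Hypothetical item structure - the real reproducer project uses a richer payload.
+    final var items = IntStream.range(0, size)
+        .mapToObj(i -> Map.of("id", i, "name", "item-" + i))
+        .toList();
+
+    // Placeholder credentials for the Camunda 8 SaaS cluster used in the experiment.
+    try (final ZeebeClient client = ZeebeClient.newCloudClientBuilder()
+        .withClusterId("<cluster-id>")
+        .withClientId("<client-id>")
+        .withClientSecret("<client-secret>")
+        .build()) {
+
+      // The whole collection travels as a single `items` variable, so the request
+      // payload grows linearly with `size` - this is where `maxMessageSize` bites.
+      final Map<String, Object> variables = Map.of("items", items);
+      client.newCreateInstanceCommand()
+          .bpmnProcessId("order-process")
+          .latestVersion()
+          .variables(variables)
+          .send()
+          .join();
+    }
+  }
+}
+```

+However the items are structured in detail, the serialized variables grow roughly linearly with `size`, which is how large input collections eventually run into the `maxMessageSize` limit.
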
### Expected

When creating a process instance with a large collection, we expect based on the recent bug fix that the multi-instance creation is batched and created without issues.

-One limiting factor might be the `maxMessageSize` with regard to the input collection, but in this case I would expect that the creation of the process instance is already rejected before.
+One limiting factor might be the `maxMessageSize` with regard to the input collection, but in this case, I would expect that the creation of the process instance is already rejected beforehand.

### Actual

-Between the following experiments I always recreated the clusters, in order to reduce the blast radius and better understand and isolate the impact.
+Between the following experiments, I always recreated the clusters to reduce the blast radius and to better understand and isolate the impact.

#### Starting small (20k)

-In previous versions the multi-instance creation failed already quite early. For example in the game day reproducer project we had a collection defined with `20.000` items, which we are now reusing for the start.
+In previous versions, the multi-instance creation already failed quite early. For example, in the game day reproducer project, we had a collection defined with `20.000` items, which we are now reusing for the start.

The creation of the process instance worked without any issues. We can observe in Operate the incremental creation of sub-process instances, which is great.

![incremental-creation-20k](20k-operate-inc.png)

-We can see in the metrics that batch processing is limited by only 2-4 commands in a batch. Interesting fact, which might explain why it takes a while until all instances of the multi-instance sub-process are created. We can even see rollbacks during batch processing, visible in the "Number of batch processing retries" panel.
+We can see in the metrics that batch processing is limited to only 2-4 commands per batch. That is an interesting fact that might explain why it takes a while until all instances of the multi-instance sub-process are created. We can even see rollbacks during batch processing, visible in the "Number of batch processing retries" panel.

![processing-metrics-20k](20k-processing-metrics.png)

-The processing queue seem to increase dramatically.
+The processing queue seems to increase dramatically.

-After a while we can see that all 20k instances are created without any bigger issues. :rocket:
+After a while, we can see that all 20k instances are created without any major issues. :rocket:

![complete-20k](20k-operate-complete.png)

-It took around 10 minutes. Taking a look at the metrics again we see that in between big command batches have been created/processed, which allowed to reduce the processing queue.
+It took around 10 minutes. Taking a look at the metrics again, we see that in between, big command batches have been created and processed, which allowed the processing queue to shrink.

![processing-metrics-20k-pt2](20k-processing-metrics-2.png)

-In between the backpressure was quite high, but after creation of all instances the cluster is in a healthy state again. The creation of such multi-instance worked :white_check_mark:
+In between, the backpressure was quite high, but after the creation of all instances, the cluster is in a healthy state again. The creation of such a multi-instance worked :white_check_mark:

![general-metrics-20k](20k-general-metrics.png)

#### Increase collection (200k)

-Again, creation of such process instance was not a problem itself. We can observe the creation of the sub-process instances (multi-instance) in operate, which happens incrementally.
+Again, the creation of such a process instance was not a problem in itself. We can observe the creation of the sub-process instances (multi-instance) in Operate, which happens incrementally.

![incremental-creation-200k](200k-operate-inc.png)

@@ -89,19 +89,19 @@ It takes ages until the instances are created (After 3h ~66k instances are creat

![processing-metrics-200k](200k-processing-metrics.png)

-The processing of that partitions is in this case totally blocked by the multi-instance creation, we can see that on the 100% back pressure. :x:
+The processing of that partition is in this case blocked by the multi-instance creation; we can see that in the 100% backpressure. :x:

![general-metrics-200k](200k-general-metrics.png)

-Even after one hour not all instances are created (not even 20k), it takes longer than before the creation of 20.000 instances.
+Even after one hour, not all instances are created (not even 20k); it takes longer than the creation of the 20.000 instances before.

![incremental-creation-200k-part2](200k-operate-inc2.png)

#### Make it really big (2 million)

-In order to escalate this even more I increase the input collection again by factor 10 to 2 million.
+To escalate this even more, I increase the input collection again by a factor of 10, to 2 million.

-After creation I see as response the following log message in my log:
+After the creation, I see the following message in my log:

```
Failed to create process instance of 'order-process'
@@ -149,7 +149,7 @@ DATA-----------------------------


-I tried to incremently decrease the input collection until it is working again, when reaching 250k I finally see a better understandable error.
+I tried to incrementally decrease the input collection until it worked again; when reaching 250k, I finally see a more understandable error.

```shell
2023-06-02 13:53:51.485 ERROR 29870 --- [ main] i.c.cloud.gameday.ProcessApplication : Failed to create process instance of 'order-process'