Job push activation latency blog post #481
Conversation
Force-pushed from b0223c6 to f85a7fe
This was not really a "chaos" thing though, so this might not be the right blog for this 😄
Adds a blog post to summarize the improvements on activation latency we've seen with job push.
Haven't reviewed it yet. But was also wondering whether you maybe want to post it to the camunda blog instead, or on medium. :)
@npepinpe wdyt about submitting this to the camunda blog?
maybe. the idea was to do something like https://zeebe-io.github.io/zeebe-chaos/2023/12/20/Broker-scaling-performance which was also about performance and not really chaos |
regardless i could use a review from another engineer :D
@npepinpe will do when I find the time |
Ole and Deepthi can also help, considering your week so far 😄 |
Awesome @npepinpe thanks for this! 🤩❤️ Great post and a really valuable addition to the blog. I hope you will also post this to the camunda blog. I definitely think it is worth it.
Additionally, we wanted to guarantee that every component involved in streaming, including clients, would remain resilient in the face of load surges.
**TL;DR:** Job activation latency is greatly reduced, with task-based workloads seeing up to 50% lower overall execution latency. Completing a task now immediately triggers pushing out the next one, meaning the latency to activate the next task in a sequence is bounded by how long it takes Zeebe to process its completion. Activation latency is unaffected by how many partitions or brokers there are in a cluster, as opposed to job polling, thus ensuring the scalability of the system. Finally, reuse of gRPC's flow control mechanism ensures clients cannot be overloaded even in the face of load surges, without impacting other workloads in the cluster.
🤩
## Why job activation latency matters
Jobs are one of the fundamental building blocks of Zeebe, representing primarily all tasks (e.g. service, send, user), as well as some less obvious symbols (e.g. intermediate message throw event). In essence, they represent the actual unit of work in a process: the part users implement, i.e. the application code itself. To reduce the likelihood of a job being worked on by multiple clients at the same time, it first goes through an activation process, where it is soft-locked for a specific amount of time. Soft-locked here means anyone can still interact with it - they can complete the job, fail it, etc. Only the activation is locked, meaning no one else can activate the job until the lock times out.
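To make the soft-lock semantics concrete, here is a minimal sketch using the Zeebe Java client. The job type and timeout are illustrative assumptions, not values from this post: activation locks the job against re-activation for the given timeout, while completing or failing it remains possible during that window.

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.ActivateJobsResponse;
import io.camunda.zeebe.client.api.response.ActivatedJob;
import java.time.Duration;

public class ActivationExample {
  public static void main(final String[] args) {
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder().gatewayAddress("localhost:26500").usePlaintext().build()) {
      // Activation soft-locks the job for the given timeout: other workers
      // cannot activate it, but completing or failing it remains possible.
      final ActivateJobsResponse response =
          client
              .newActivateJobsCommand()
              .jobType("payment-service") // hypothetical job type
              .maxJobsToActivate(1)
              .timeout(Duration.ofMinutes(5)) // soft-lock duration
              .send()
              .join();

      for (final ActivatedJob job : response.getJobs()) {
        // Completing is allowed while the lock is held; only re-activation
        // by other workers is blocked until the timeout elapses.
        client.newCompleteCommand(job.getKey()).send().join();
      }
    }
  }
}
```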
👍 I like that you give a short intro first
## Polling: a first implementation
Back in 2018, Zeebe introduced the `ActivateJobs` RPC for its gRPC clients, analogous to fetching and locking [external tasks in Camunda 7.x](https://docs.camunda.org/manual/7.20/user-guide/process-engine/external-tasks/). This endpoint allowed clients to fetch and activate a specific number of available jobs. In other words, it allowed them to _poll_ for jobs.
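For reference, this is roughly what the polling model looks like from the client's side. A sketch only: the job type and intervals are assumed, and `client` is a `ZeebeClient` instance as in the earlier example. The Java client's built-in `JobWorker` wraps this `ActivateJobs` polling in a loop:

```java
// Sketch of a polling worker (job type and poll interval are illustrative).
// Under the hood this repeatedly sends ActivateJobs requests via the gateway,
// assuming a ZeebeClient `client` as in the earlier sketch.
final JobWorker worker =
    client
        .newWorker()
        .jobType("payment-service")
        .handler((jobClient, job) -> jobClient.newCompleteCommand(job.getKey()).send())
        .pollInterval(Duration.ofMillis(100)) // how often to poll when idle
        .timeout(Duration.ofSeconds(30)) // activation soft-lock per job
        .open();
```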
Thanks, I feel old now. I even worked on external tasks in Camunda 7
This was the first implementation for activating and working on jobs in Zeebe, for multiple reasons:
I'm wondering whether this is true. We had topic subscriptions, but I'm not sure whether we had jobs back then. But I guess this doesn't matter 😄
- Every request - whether client to gateway, or gateway to broker - adds delay to the activation latency.
- In the worst-case scenario, we have to poll _every_ partition (see the rough estimate after this list).
- The gateway does not know in advance which partitions have jobs available.
- Scaling out your clients may have adverse effects, as they send out ever more requests which all have to be processed independently.
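To make that worst case concrete, here is a back-of-envelope estimate. The numbers are assumed for illustration, not measurements from this post, and it assumes the gateway sweeps partitions one after another when looking for jobs:

```
worst-case activation latency ≈ RTT(client → gateway) + P × RTT(gateway → broker)

e.g. with P = 30 partitions and ~1 ms per gateway-to-broker round trip:
     ≈ 1 ms + 30 × 1 ms = 31 ms, even if no partition has any jobs at all
```

And since this is polling, that cost is potentially paid again on every poll interval.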
👍 that is definitely something many people run into
- Brokers push jobs out immediately as they become available, removing the need for a gateway-to-broker request (see the worker sketch after this list).
- Since the stream is long-lived, there are almost no client requests required after the initial one.
- No need to poll every partition anymore.
- No thundering herd issues when many gateways all poll at the same time in response to a notification.
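With job push, the worker from the earlier polling sketch only needs to opt into streaming. Again a sketch, assuming the `streamEnabled` flag as exposed by the Java client for brokers that support job push (Zeebe 8.4+):

```java
// Sketch: same worker as before, but opening a long-lived job stream.
// Brokers now push jobs to the gateway (and on to this worker) as soon as
// they become available, instead of waiting for the next poll.
final JobWorker worker =
    client
        .newWorker()
        .jobType("payment-service")
        .handler((jobClient, job) -> jobClient.newCompleteCommand(job.getKey()).send())
        .streamEnabled(true) // push instead of poll; polling remains as a fallback
        .timeout(Duration.ofSeconds(30))
        .open();
```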
Not sure whether this is clear. What do you mean by all polling at the same time?
I will go ahead and merge this, I think it is great.
A blog post highlighting the improvements we've seen with job push. It:
Finally, there's a section at the bottom linking to the previous blog posts and the documentation.