Job push activation latency blog post #481
Conversation
Force-pushed from b0223c6 to f85a7fe
This was not really a "chaos" thing though, so this might not be the right blog for this 😄
Adds a blog post to summarize the improvements on activation latency we've seen with job push.
Haven't reviewed it yet. But was also wondering whether you maybe want to post it to the camunda blog instead, or on medium. :)
@npepinpe wdyt about submitting this to the camunda blog?
maybe. the idea was to do something like https://zeebe-io.github.io/zeebe-chaos/2023/12/20/Broker-scaling-performance which was also about performance and not really chaos |
regardless i could use a review from another engineer :D
@npepinpe will do when I find the time |
Ole and Deepthi can also help, considering your week so far 😄 |
Awesome @npepinpe thanks for this! 🤩❤️ Great post and a really valuable addition to the blog. I hope you will also post this to the camunda blog. I definitely think it is worth it.
Additionally, we wanted to guarantee that every component involved in streaming, including clients, would remain resilient in the face of load surges.
**TL;DR:** Job activation latency is greatly reduced, with task-based workloads seeing up to 50% lower overall execution latency. Completing a task now immediately triggers pushing out the next one, meaning the latency to activate the next task in a sequence is bounded by how long it takes Zeebe to process its completion. Activation latency is unaffected by how many partitions or brokers there are in a cluster, as opposed to job polling, thus ensuring the scalability of the system. Finally, reuse of gRPC's flow control mechanism ensures clients cannot be overloaded even in the face of load surges, without impacting other workloads in the cluster.
🤩
## Why job activation latency matters
Jobs are one of the fundamental building blocks of Zeebe, representing primarily all tasks (e.g. service, send, user), as well as some less obvious symbols (e.g. intermediate message throw event). In essence, they represent the actual unit of work in a process: the part users implement, i.e. the application code itself. To reduce the likelihood of a job being worked on by multiple clients at the same time, it first goes through an activation process, where it is soft-locked for a specific amount of time. Soft-locked here means anyone can still interact with it - they can complete the job, fail it, etc. Only the activation is locked, meaning no one else can activate the job until the lock times out.
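To make the soft-lock semantics concrete, here is a minimal sketch using the Zeebe Java client. The job type and timeout are illustrative assumptions, not values from this post: activation locks the job against re-activation for the given timeout, while completing or failing it remains possible during that window.

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.ActivateJobsResponse;
import io.camunda.zeebe.client.api.response.ActivatedJob;
import java.time.Duration;

public class ActivationExample {
  public static void main(final String[] args) {
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder().gatewayAddress("localhost:26500").usePlaintext().build()) {
      // Activation soft-locks the job for the given timeout: other workers
      // cannot activate it, but completing or failing it remains possible.
      final ActivateJobsResponse response =
          client
              .newActivateJobsCommand()
              .jobType("payment-service") // hypothetical job type
              .maxJobsToActivate(1)
              .timeout(Duration.ofMinutes(5)) // soft-lock duration
              .send()
              .join();

      for (final ActivatedJob job : response.getJobs()) {
        // Completing is allowed while the lock is held; only re-activation
        // by other workers is blocked until the timeout elapses.
        client.newCompleteCommand(job.getKey()).send().join();
      }
    }
  }
}
```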
👍 I like that you give a short intro first
## Polling: a first implementation
Back in 2018, Zeebe introduced the `ActivateJobs` RPC for its gRPC clients, analogous to fetching and locking [external tasks in Camunda 7.x](https://docs.camunda.org/manual/7.20/user-guide/process-engine/external-tasks/). This endpoint allowed clients to fetch and activate a specific number of available jobs. In other words, it allowed them to _poll_ for jobs.
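For reference, this is roughly what the polling model looks like from the client's side. A sketch only: the job type and intervals are assumed, and `client` is a `ZeebeClient` instance as in the earlier example. The Java client's built-in `JobWorker` wraps this `ActivateJobs` polling in a loop:

```java
// Sketch of a polling worker (job type and poll interval are illustrative).
// Under the hood this repeatedly sends ActivateJobs requests via the gateway,
// assuming a ZeebeClient `client` as in the earlier sketch.
final JobWorker worker =
    client
        .newWorker()
        .jobType("payment-service")
        .handler((jobClient, job) -> jobClient.newCompleteCommand(job.getKey()).send())
        .pollInterval(Duration.ofMillis(100)) // how often to poll when idle
        .timeout(Duration.ofSeconds(30)) // activation soft-lock per job
        .open();
```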
Thanks, I feel old now. I even worked on external tasks in Camunda 7
This was the first implementation for activating and working on jobs in Zeebe, for multiple reasons:
I'm wondering whether this is true. We had topic subscriptions, but I'm not sure whether we had jobs back then. But I guess this doesn't matter 😄
- Every request - whether client to gateway, or gateway to broker - adds delay to the activation latency.
- In the worst-case scenario, we have to poll _every_ partition (see the rough estimate after this list).
- The gateway does not know in advance which partitions have jobs available.
- Scaling out your clients may have adverse effects, as they send out ever more requests which all have to be processed independently.
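To make that worst case concrete, here is a back-of-envelope estimate. The numbers are assumed for illustration, not measurements from this post, and it assumes the gateway sweeps partitions one after another when looking for jobs:

```
worst-case activation latency ≈ RTT(client → gateway) + P × RTT(gateway → broker)

e.g. with P = 30 partitions and ~1 ms per gateway-to-broker round trip:
     ≈ 1 ms + 30 × 1 ms = 31 ms, even if no partition has any jobs at all
```

And since this is polling, that cost is potentially paid again on every poll interval.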
👍 that is definitely something many people run into
- Brokers push jobs out immediately as they become available, removing the need for a gateway-to-broker request (see the worker sketch after this list).
- Since the stream is long-lived, there are almost no client requests required after the initial one.
- No need to poll every partition anymore.
- No thundering herd issues when many gateways all poll at the same time in response to a notification.
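With job push, the worker from the earlier polling sketch only needs to opt into streaming. Again a sketch, assuming the `streamEnabled` flag as exposed by the Java client for brokers that support job push (Zeebe 8.4+):

```java
// Sketch: same worker as before, but opening a long-lived job stream.
// Brokers now push jobs to the gateway (and on to this worker) as soon as
// they become available, instead of waiting for the next poll.
final JobWorker worker =
    client
        .newWorker()
        .jobType("payment-service")
        .handler((jobClient, job) -> jobClient.newCompleteCommand(job.getKey()).send())
        .streamEnabled(true) // push instead of poll; polling remains as a fallback
        .timeout(Duration.ofSeconds(30))
        .open();
```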
Not sure whether this is clear. What do you mean by all polling at the same time?
I will go ahead and merge this, I think it is great.
A blog post highlighting the improvements we've seen with job push. It:
Finally, there's a section at the bottom linking to the previous blog posts and the documentation.