Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Uneven job distribution between workers #2842

Closed
1 task done
nemphys opened this issue Oct 19, 2024 · 21 comments · Fixed by #2904
Closed
1 task done

[Bug]: Uneven job distribution between workers #2842

nemphys opened this issue Oct 19, 2024 · 21 comments · Fixed by #2904
Labels
bug Something isn't working

Comments

@nemphys
Copy link

nemphys commented Oct 19, 2024

Version

v5.4.3

Platform

NodeJS

What happened?

I am not sure if this is a bug, but I couldn't find a better category 😄

We are running a 5-node Redis sentinel setup, where each node is running both Redis and the application using BullMQ, therefore the node that is currently the master Redis instance has naturally the fastest connection to Redis.

What we observe is that this master node ends up executing the majority of jobs across queues, with the rest of the nodes usually picking up jobs only when this node has reached its worker concurrency limit.

I suppose this might be considered normal (I am not aware the mechanism that implements job picking, but it would make sense for it to be a on first-come-first-serve basis, with the node running the Redis master instance locally being the node that inherently arrives first), but I would like to know if there is some way to overcome this and get a more even job distribution between workers.

How to reproduce.

No response

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@nemphys nemphys added the bug Something isn't working label Oct 19, 2024
@mynameisvasco
Copy link

I'm also facing this issue on my project.
I have about 32 servers connected to a redis cluster. Usually there are some servers with 60-70% CPU usage while others are 5-15%.
Did you find any solution to this problem?

@nemphys
Copy link
Author

nemphys commented Oct 25, 2024

@mynameisvasco no, I haven't.

As mentioned, I suppose it is caused by the latency difference between the workers and Redis but it would be nice to have a confirmation (and possible workarounds) by a maintainer!

@hhanesand
Copy link

Would also be interested in some insight here, we are seeing the same thing with our setup.

@manast
Copy link
Contributor

manast commented Nov 12, 2024

Workers picks up jobs as soon as they are done with the previous job. If you have a high concurrency factor then I guess it could happen that the first worker that starts picks up a lot of jobs, and then if there are not enough jobs for the rest of the workers they will be idling more than the first one. But overtime I do expect all workers doing more or less the same amount of processing.

@manast
Copy link
Contributor

manast commented Nov 12, 2024

Btw, it would be useful to know more about the setup, are they just standard jobs, delayed, prioritised?

@hhanesand
Copy link

hhanesand commented Nov 13, 2024

Hey! Thanks 😄 I wouldn't describe this as essential for us, just what we're seeing surprised us a bit.

  • At the moment we're running [email protected] & [email protected] and are about to update to 7.20.1 & 5.25.4
  • We are using a redis cluster.
  • We have a group id and job id on all our jobs, together with removeOnComplete and removeOnFail. That's it for options.
  • We have ~100 queues running across 10 servers - so about 1000 workers.
  • Worker concurrencies are adjusted frequently, depending on the number of jobs waiting.

These are the metrics we have. Average is typically below 10%. We can definitely scale down the amount of workers we're running, but are a tad concerned about the higher level of max utilization. Spikes may be caused by the adjustments we're doing, not sure.

CleanShot 2024-11-13 at 10 10 35@2x
CleanShot 2024-11-13 at 10 10 48@2x

@manast
Copy link
Contributor

manast commented Nov 13, 2024

@hhanesand yes, it is strange, maybe it has to do with the shape of the incoming jobs, like if they come in bursts, then it could be that one eager worker picks a lot of jobs leaving much less to the rest, something like that. I will think more about it.

@nemphys
Copy link
Author

nemphys commented Nov 13, 2024

In our setup, most jobs are standard jobs (not delayed/prioritized/etc, nothing special there).

Is my hypothesis correct regarding the latency between workers and redis? It could be the most simple explanation, but it would be nice to have some kind of solution for better distribution.

@manast
Copy link
Contributor

manast commented Nov 13, 2024

I am playing with the idea of introducing an automatic concurrency mechanism, based on how utilised is the NodeJS event loop. So for example if it is below a certain value it can increase the number of concurrent jobs up until some defined maximum. Well, I am assuming that the peaks in those graphs are produced by some workers processing more jobs in parallel than their peers.

@hhanesand
Copy link

hhanesand commented Nov 13, 2024

We do definitely have bursts of jobs. When we're enqueuing jobs we queue everything for one queue in an addBulk operation. Essentially we're doing

load jobs from outside datastore
split jobs by queue

for (jobs of jobs split by queue) {
    queue.addBulk(jobs)
}

I've tried using a flow producer for this so we can avoid O(# of queues) and instead have O(# of jobs / # bulk batch size), but ran into this issue.

Maybe adopting delayed jobs could help with this issue, but we're unable to predict all of the load - we have an API endpoint which adds jobs to queue unpredictably.

@manast
Copy link
Contributor

manast commented Nov 13, 2024

@hhanesand I think addBulk may be the issue here, we have a PR to resolve this problem: #2841

@manast
Copy link
Contributor

manast commented Nov 13, 2024

oh sorry, this would only affect if you are adding prioritised jobs not standard ones.

@manast
Copy link
Contributor

manast commented Nov 13, 2024

ah no, even standard ones actually sorry for the confusion.

@nemphys
Copy link
Author

nemphys commented Nov 13, 2024

As I mentioned, we are not using addBulk in our setup, will this still have a positive effect?

@manast
Copy link
Contributor

manast commented Nov 13, 2024

As I mentioned, we are not using addBulk in our setup, will this still have a positive effect?

Not for you, but it seems like this could be the case for @hhanesand

@manast
Copy link
Contributor

manast commented Nov 13, 2024

@nemphys it could be that what is happening in addBulk also happens with standard add in some scenarios, we will investigate it as well.

@nemphys
Copy link
Author

nemphys commented Nov 13, 2024

Nice! Any hint to what is actually happening so that we can have a general idea?

@roggervalf
Copy link
Collaborator

roggervalf commented Nov 19, 2024

hi @nemphys sorry for the delay. For awakening workers, we have a mechanism based on adding markers. For standard and prioritized jobs we add 1 marker with the same score. These markers only awakes a worker that consumes it and then it can start consuming other jobs. If these markers are consumed but there are still other jobs in waiting or prioritized state and there are other workers that couldn't consume this marker, they will be waiting for the next marker for at least 5 seconds (after this period workers, will check if there are pending jobs anyway).

github-actions bot pushed a commit that referenced this issue Nov 19, 2024
# [5.27.0](v5.26.2...v5.27.0) (2024-11-19)

### Features

* **queue:** add rateLimit method ([#2896](#2896)) ([db84ad5](db84ad5))
* **queue:** add removeRateLimitKey method ([#2806](#2806)) ([ff70613](ff70613))

### Performance Improvements

* **marker:** add base markers while consuming jobs to get workers busy ([#2904](#2904)) fixes [#2842](#2842) ([1759c8b](1759c8b))
@manast
Copy link
Contributor

manast commented Nov 19, 2024

@nemphys the PR above should improve the situation in your case, please let me know how it goes.

@nemphys
Copy link
Author

nemphys commented Nov 19, 2024

@manast thank you, I will try to test it in the first available slot, although I am stuck in an pretty old version due to issues upgrading repeatable jobs to the new format.

@manast
Copy link
Contributor

manast commented Nov 19, 2024

@nemphys if you need help upgrading please let us know.

alexandresoro pushed a commit to alexandresoro/ouca that referenced this issue Nov 27, 2024
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [bullmq](https://bullmq.io/) ([source](https://github.com/taskforcesh/bullmq)) | dependencies | minor | [`5.25.6` -> `5.29.1`](https://renovatebot.com/diffs/npm/bullmq/5.25.6/5.29.1) |

---

### Release Notes

<details>
<summary>taskforcesh/bullmq (bullmq)</summary>

### [`v5.29.1`](https://github.com/taskforcesh/bullmq/releases/tag/v5.29.1)

[Compare Source](taskforcesh/bullmq@v5.29.0...v5.29.1)

##### Bug Fixes

-   **scheduler:** remove deprecation warning on immediately option ([#&#8203;2923](taskforcesh/bullmq#2923)) ([14ca7f4](taskforcesh/bullmq@14ca7f4))

### [`v5.29.0`](https://github.com/taskforcesh/bullmq/releases/tag/v5.29.0)

[Compare Source](taskforcesh/bullmq@v5.28.2...v5.29.0)

##### Features

-   **queue:** refactor a protected addJob method allowing telemetry extensions ([09f2571](taskforcesh/bullmq@09f2571))

### [`v5.28.2`](https://github.com/taskforcesh/bullmq/releases/tag/v5.28.2)

[Compare Source](taskforcesh/bullmq@v5.28.1...v5.28.2)

##### Bug Fixes

-   **queue:** change \_jobScheduler from private to protected for extension ([#&#8203;2920](taskforcesh/bullmq#2920)) ([34c2348](taskforcesh/bullmq@34c2348))

### [`v5.28.1`](https://github.com/taskforcesh/bullmq/releases/tag/v5.28.1)

[Compare Source](taskforcesh/bullmq@v5.28.0...v5.28.1)

##### Bug Fixes

-   **scheduler:** use Job class from getter for extension ([#&#8203;2917](taskforcesh/bullmq#2917)) ([5fbb075](taskforcesh/bullmq@5fbb075))

### [`v5.28.0`](https://github.com/taskforcesh/bullmq/releases/tag/v5.28.0)

[Compare Source](taskforcesh/bullmq@v5.27.0...v5.28.0)

##### Features

-   **job-scheduler:** add telemetry support to the job scheduler ([72ea950](taskforcesh/bullmq@72ea950))

### [`v5.27.0`](https://github.com/taskforcesh/bullmq/releases/tag/v5.27.0)

[Compare Source](taskforcesh/bullmq@v5.26.2...v5.27.0)

##### Features

-   **queue:** add rateLimit method ([#&#8203;2896](taskforcesh/bullmq#2896)) ([db84ad5](taskforcesh/bullmq@db84ad5))
-   **queue:** add removeRateLimitKey method ([#&#8203;2806](taskforcesh/bullmq#2806)) ([ff70613](taskforcesh/bullmq@ff70613))

##### Performance Improvements

-   **marker:** add base markers while consuming jobs to get workers busy ([#&#8203;2904](taskforcesh/bullmq#2904)) fixes [#&#8203;2842](taskforcesh/bullmq#2842) ([1759c8b](taskforcesh/bullmq@1759c8b))

### [`v5.26.2`](https://github.com/taskforcesh/bullmq/releases/tag/v5.26.2)

[Compare Source](taskforcesh/bullmq@v5.26.1...v5.26.2)

##### Bug Fixes

-   **telemetry:** do not set span on parent context if undefined ([c417a23](taskforcesh/bullmq@c417a23))

### [`v5.26.1`](https://github.com/taskforcesh/bullmq/releases/tag/v5.26.1)

[Compare Source](taskforcesh/bullmq@v5.26.0...v5.26.1)

##### Bug Fixes

-   **queue:** fix generics to be able to properly be extended ([f2495e5](taskforcesh/bullmq@f2495e5))

### [`v5.26.0`](https://github.com/taskforcesh/bullmq/releases/tag/v5.26.0)

[Compare Source](taskforcesh/bullmq@v5.25.6...v5.26.0)

##### Features

-   improve queue getters to use generic job type ([#&#8203;2905](taskforcesh/bullmq#2905)) ([c9531ec](taskforcesh/bullmq@c9531ec))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOC4xNDIuNyIsInVwZGF0ZWRJblZlciI6IjM4LjE0Mi43IiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJkZXBlbmRlbmNpZXMiXX0=-->

Reviewed-on: https://git.tristess.app/alexandresoro/ouca/pulls/340
Reviewed-by: Alexandre Soro <[email protected]>
Co-authored-by: renovate <[email protected]>
Co-committed-by: renovate <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants