[Bug] Delayed jobs get stuck in delayed queue if Redis is busy #2015
Comments
I could get it to reliably reproduce with 40,000 jobs on my machine and verified that the solution indeed works. Any feedback will be appreciated.
@swayam18 thanks for the report and the solution proposal, I will verify it and release a fix asap.
## [3.22.3](v3.22.2...v3.22.3) (2021-04-23)
### Bug Fixes
* **delayed:** re-schedule updateDelay in case of error fixes [#2015](#2015) ([16bbfad](16bbfad))
🎉 This issue has been resolved in version 3.22.3 🎉 The release is available on: Your semantic-release bot 📦🚀
* develop: (39 commits)
  * chore(release): 3.22.3 [skip ci]
  * fix(delayed): re-schedule updateDelay in case of error fixes OptimalBits#2015
  * chore(release): 3.22.2 [skip ci]
  * chore: fix github token
  * chore: correct release branch
  * test: increase timeout in test
  * chore: add semantic release scripts
  * fix(obliterate): obliterate many jobs fixes OptimalBits#2016
  * 3.22.1
  * chore: upgrade dependencies
  * chore: update CHANGELOG
  * fix(obliterate): remove repeatable jobs fixes OptimalBits#2012
  * docs: fix typo
  * docs: update README.md
  * docs: updated README.md
  * Update README.md
  * 3.22.0
  * docs: update CHANGELOG
  * feat: do not rely on comma to encode jobid in progress fixes OptimalBits#2003 (OptimalBits#2004)
  * 3.21.1
  * ...
We faced the same issue today, and it looks like we're running 3.29.0. At some point our job workers stopped doing any work. I scaled up our cluster and the new workers started picking up jobs, but the old ones were still idle after getting the "Redis is busy" error a couple of times earlier. Looking at the logs, the exact error I get before the worker stops doing any work is the following:
However I don't think …
It looks like this is happening at boot, when the application starts while Redis is already BUSY running a script (in our case we're running …). I checked the logs of two production pods that were not processing any jobs: they both hit that error exactly 16 times and just went idle after (not sure why 16, my concurrency is set to 12 at the moment). They did not process even a single job. It seems like hitting …
For now we simply added …
Description
When Redis is "busy running a script" the jobs in a queue get stuck in the delayed state unless a new job is added to the queue or the process is restarted. Upon investigation, the culprit is the following line:
bull/lib/queue.js, line 899 (commit edfbd16)
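The embedded code snippet does not survive in this copy of the issue. As a generic, self-contained illustration of the pattern involved (not Bull's actual source), the delay timer is a promise-driven loop that only re-arms itself on success:

```js
// Generic sketch of a self-rescheduling timer loop (illustration only, not Bull's code):
// the loop stays alive only as long as the async step keeps resolving.
function startDelayLoop(runScript, onError) {
  function tick() {
    runScript()
      .then(nextDelayMs => {
        // Success: schedule the next tick so delayed jobs keep being promoted.
        setTimeout(tick, nextDelayMs);
      })
      .catch(err => {
        // Failure: the error is reported, but tick() is never scheduled again,
        // so the loop silently dies until something external restarts it.
        onError(err);
      });
  }
  tick();
}

// One rejection (e.g. "BUSY Redis is busy running a script") is enough to end the loop.
let calls = 0;
startDelayLoop(
  () => (++calls === 3 ? Promise.reject(new Error('BUSY')) : Promise.resolve(10)),
  err => console.error('loop stopped:', err.message)
);
```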
If this command (the `updateDelaySet` script call) ever fails, the recursion breaks and `updateDelayTimer` is not called again until a new delayed job is added. Since that may never happen, jobs may get permanently stuck in the delayed queue.
Here is the sequence of events that leads to this scenario:
1. Redis is busy running a heavy script (e.g. `queue.clean` was run to clear failed jobs).
2. During this time, a call to `updateDelayTimer` is made, which in turn calls the `updateDelaySet` command (bull/lib/queue.js, lines 897 and 899, commit edfbd16).
3. The `updateDelaySet` command fails, because Redis is busy, with the following error: `ReplyError: BUSY Redis is busy running a script. You can only call SCRIPT KILL or SHUTDOWN NOSAVE.`
4. The promise rejects and the catch block simply emits an error (bull/lib/queue.js, line 932, commit edfbd16).
Now, because of the failure, the `updateDelayTimer` function is never called after this point, leaving the delayed jobs stuck. The only way to recover them is to add another delayed job to the queue, which seemingly triggers the message handler to call `updateDelayTimer` and restart the recursive process.
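As a concrete illustration of that recovery path (hypothetical queue name; any delayed job should do):

```js
// Hypothetical manual recovery: adding any delayed job publishes a message that
// wakes the delay handling up again, per the description above.
const Queue = require('bull');
const queue = new Queue('my-queue'); // hypothetical queue name

queue
  .add({ nudge: true }, { delay: 1000 })
  .then(() => console.log('added a dummy delayed job to restart the delay timer'));
```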
Proposed Solution
I am not 100% sure if this makes sense, but adding this line of code seems to fix the problem:
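The snippet itself is not preserved in this copy of the thread. In terms of the generic sketch above, the idea (matching the eventual 3.22.3 fix, "re-schedule updateDelay in case of error") is to re-arm the timer from the catch handler after a constant delay; the 5-second value below is purely illustrative:

```js
// Same generic loop as above, but with the proposed retry: on failure, report the
// error and re-schedule the tick after a constant delay instead of letting the loop die.
function startDelayLoopWithRetry(runScript, onError, retryDelayMs = 5000) {
  function tick() {
    runScript()
      .then(nextDelayMs => setTimeout(tick, nextDelayMs))
      .catch(err => {
        onError(err);
        // Retry later, hoping Redis is no longer busy by then.
        setTimeout(tick, retryDelayMs);
      });
  }
  tick();
}
```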
Essentially, we retry `updateDelayTimer` after a constant delay and hope that Redis is no longer busy and can now run the `updateDelaySet` command.
I am not 100% sure whether this can cause more than one `this.updateDelayTimer` loop to be active; I will need your feedback on this.
Minimal, Working Test code to reproduce the issue
Let me know if this is necessary and I will create a repo with the necessary code.
Bull version
3.22.1
Additional information