
[Bug] Delayed jobs get stuck in delayed queue if Redis is busy #2015

Closed · swayam18 opened this issue Apr 21, 2021 · 6 comments

@swayam18
Description

When Redis is "busy running a script", jobs in a queue get stuck in the delayed state unless a new job is added to the queue or the process is restarted. Upon investigation, the culprit is the following line:

.updateDelaySet(this, Date.now())

If this command ever fails, the recursion breaks and updateDelayTimer is not called again until a new delayed job is added. Since that may never happen, jobs may get permanently stuck in the delayed queue.

Here is the sequence of events that leads to this scenario:

  1. Redis is busy running a heavy script (e.g. queue.clean was run to clear failed jobs).

  2. During this time, a call to updateDelayTimer is made, which in turn calls the updateDelaySet command:

    Queue.prototype.updateDelayTimer = function() {
      // ...
      scripts.updateDelaySet(this, Date.now())
      // ...

  3. The updateDelaySet command fails with the following error, as Redis is busy: ReplyError: BUSY Redis is busy running a script. You can only call SCRIPT KILL or SHUTDOWN NOSAVE.

  4. The promise fails and the catch block simply emits an error:

    this.emit('error', err, 'Error updating the delay timer');

Because of this failure, updateDelayTimer is never called again, leaving the delayed jobs stuck. The only way to recover them is to add another delayed job to the queue, which triggers the message handler to call updateDelayTimer and restart the recursive process.
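To make the failure mode concrete, here is an illustrative sketch of the scenario (not actual test code from this report; the queue name and job data are hypothetical, and the timing depends on how long the clean script keeps Redis busy):

    const Queue = require('bull');

    const queue = new Queue('repro'); // hypothetical queue name

    async function reproduce() {
      // A delayed job arms Bull's internal delay timer.
      await queue.add({ step: 1 }, { delay: 5000 });

      // A heavy clean keeps Redis busy inside a Lua script. With enough
      // failed jobs, the pending updateDelaySet call replies BUSY and the
      // timer loop stops until a new delayed job arrives.
      queue.clean(0, 'failed');
    }

    reproduce();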

Proposed Solution

I am not 100% sure if this makes sense, but adding this line of code seems to have fixed the problem:

    .catch(err => {
      setTimeout(() => this.updateDelayTimer(), 1000); // <- this line
      this.emit('error', err, 'Error updating the delay timer');
    });

Essentially, we retry updateDelayTimer after a constant delay, hoping that Redis is no longer busy and can now run the updateDelaySet command.

I am not 100% sure whether this can cause more than one this.updateDelayTimer loop to be active; I will need your feedback on this.
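If overlapping loops turn out to be a concern, one possible guard is to allow only a single pending retry at a time (a sketch on top of the fix above; the delayRetryTimer property is hypothetical, not part of Bull):

    .catch(err => {
      this.emit('error', err, 'Error updating the delay timer');
      // Hypothetical guard: only schedule a retry if none is pending.
      if (!this.delayRetryTimer) {
        this.delayRetryTimer = setTimeout(() => {
          this.delayRetryTimer = null;
          this.updateDelayTimer();
        }, 1000);
      }
    });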

Minimal, Working Test code to reproduce the issue.

Let me know if this is necessary and I will create a repo with the necessary code.

Bull version

3.22.1

Additional information

@swayam18 (Author)

I was able to reproduce this reliably with 40,000 jobs on my machine and verified that the solution indeed works. Any feedback will be appreciated.

@swayam18 swayam18 changed the title Delayed jobs get stuck in delayed queue if Redis is busy [Bug] Delayed jobs get stuck in delayed queue if Redis is busy Apr 23, 2021
@manast (Member)

manast commented Apr 23, 2021

@swayam18 thanks for the report and the solution proposal, I will verify it and release a fix ASAP.

@manast closed this as completed in 16bbfad Apr 23, 2021
github-actions bot pushed a commit that referenced this issue Apr 23, 2021
## [3.22.3](v3.22.2...v3.22.3) (2021-04-23)

### Bug Fixes

* **delayed:** re-schedule updateDelay in case of error, fixes #2015 (16bbfad)
@manast (Member)

manast commented Apr 23, 2021

🎉 This issue has been resolved in version 3.22.3 🎉


Your semantic-release bot 📦🚀

papandreou added a commit to papandreou/bull that referenced this issue Apr 27, 2021
* develop: (39 commits)
  chore(release): 3.22.3 [skip ci]
  fix(delayed): re-schedule updateDelay in case of error fixes OptimalBits#2015
  chore(release): 3.22.2 [skip ci]
  chore: fix github token
  chore: correct release branch
  test: increase timeout in test
  chore: add semantic release scripts
  fix(obliterate): obliterate many jobs fixes OptimalBits#2016
  3.22.1
  chore: upgrade dependencies
  chore: update CHANGELOG
  fix(obliterate): remove repeatable jobs fixes OptimalBits#2012
  docs: fix typo
  docs: update README.md
  docs: updated README.md
  Update README.md
  3.22.0
  docs: update CHANGELOG
  feat: do not rely on comma to encode jobid in progress fixes OptimalBits#2003 (OptimalBits#2004)
  3.21.1
  ...
@felix-d

felix-d commented Sep 13, 2021

We faced the same issue today, and it looks like we're running 3.29.0. At some point our job workers stopped doing any work. I scaled up our cluster and the new workers started picking up jobs, but the old ones were still idle after getting the "Redis is busy" error a couple of times earlier.

Looking at the logs, the exact error I get before the worker stops doing any work is the following:

(node:39) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 8)
(node:39) UnhandledPromiseRejectionWarning: ReplyError: BUSY Redis is busy running a script. You can only call SCRIPT KILL or SHUTDOWN NOSAVE.
    at parseError (/app/node_modules/redis-parser/lib/parser.js:179:12)
    at parseType (/app/node_modules/redis-parser/lib/parser.js:302:14)

However, I don't think updateDelayTimer is the cause this time. I'll keep digging.

@felix-d

felix-d commented Sep 13, 2021

It looks like this happens at boot, when the application starts while Redis is already BUSY running a script (in our case we run queue.clean at intervals, and the number of jobs to clean can be quite large).

I checked the logs of two production pods that were not processing any jobs; they both hit that error exactly 16 times and then went idle (not sure why 16, my concurrency is set to 12 at the moment). They did not process a single job. It seems that hitting a BUSY Redis is not an issue if the workers were already started and processing jobs.
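As an aside, attaching an error listener to the queue should at least surface these failures instead of letting them become unhandled rejections (a sketch; the logging is illustrative):

    queue.on('error', err => {
      // BUSY replies from Redis land here when Bull catches them; the
      // unhandled rejections above suggest some code path does not.
      if (err.message && err.message.startsWith('BUSY')) {
        console.warn('Redis busy, workers may stall:', err.message);
      }
    });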

@felix-d

felix-d commented Sep 13, 2021

For now we simply added removeOnComplete to our jobs; we don't need to keep them around anymore, so we won't need to run queue.clean at intervals. This stops us from getting the busy-Redis error, but it's not a fix for the actual problem.
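For reference, the workaround is just Bull's standard removeOnComplete job option (the job data here is hypothetical):

    // Completed jobs are removed immediately, so the completed set
    // never grows large enough to require a heavy queue.clean run.
    queue.add({ orderId: 123 }, { removeOnComplete: true });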
