
[Bug] Delayed jobs get stuck in delayed queue if Redis is busy #2015

Closed · swayam18 opened this issue Apr 21, 2021 · 6 comments

@swayam18
Description

When Redis is "busy running a script", jobs in a queue get stuck in the delayed state unless a new job is added to the queue or the process is restarted. Upon investigation, the culprit is the following line:

.updateDelaySet(this, Date.now())

If this command ever fails, the recursion breaks and updateDelayTimer is not called again until a new delayed job is added. Since that may never happen, jobs may get permanently stuck in the delayed queue.

Here is the sequence of events that leads to this scenario:

  1. Redis is busy running a heavy script (e.g. queue.clean was run to clear failed jobs).

  2. During this time, a call to updateDelayTimer is made, which in turn calls the updateDelaySet command:

    Queue.prototype.updateDelayTimer = function() {
      // ...
      scripts.updateDelaySet(this, Date.now())
      // ...

  3. The updateDelaySet command fails with the following error, as Redis is busy: ReplyError: BUSY Redis is busy running a script. You can only call SCRIPT KILL or SHUTDOWN NOSAVE.

  4. The promise fails and the catch block simply emits an error:

    this.emit('error', err, 'Error updating the delay timer');

Because of this failure, updateDelayTimer is never called again, leaving the delayed jobs stuck. The only way to recover them is to add another delayed job to the queue, which triggers the message handler to call updateDelayTimer and restart the recursive process.
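To make the failure mode concrete, here is an illustrative sketch of the scenario (not actual test code from this report; the queue name and job data are hypothetical, and the timing depends on how long the clean script keeps Redis busy):

    const Queue = require('bull');

    const queue = new Queue('repro'); // hypothetical queue name

    async function reproduce() {
      // A delayed job arms Bull's internal delay timer.
      await queue.add({ step: 1 }, { delay: 5000 });

      // A heavy clean keeps Redis busy inside a Lua script. With enough
      // failed jobs, the pending updateDelaySet call replies BUSY and the
      // timer loop stops until a new delayed job arrives.
      queue.clean(0, 'failed');
    }

    reproduce();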

Proposed Solution

I am not 100% sure if this makes sense, but adding this line of code seems to have fixed the problem:

    .catch(err => {
      setTimeout(() => this.updateDelayTimer(), 1000); // <- this line
      this.emit('error', err, 'Error updating the delay timer');
    });

Essentially, we retry updateDelayTimer after a constant delay, hoping that Redis is no longer busy and can now run the updateDelaySet command.

I am not 100% sure whether this can cause more than one this.updateDelayTimer loop to be active; I will need your feedback on this.
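If overlapping loops turn out to be a concern, one possible guard is to allow only a single pending retry at a time (a sketch on top of the fix above; the delayRetryTimer property is hypothetical, not part of Bull):

    .catch(err => {
      this.emit('error', err, 'Error updating the delay timer');
      // Hypothetical guard: only schedule a retry if none is pending.
      if (!this.delayRetryTimer) {
        this.delayRetryTimer = setTimeout(() => {
          this.delayRetryTimer = null;
          this.updateDelayTimer();
        }, 1000);
      }
    });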

Minimal, Working Test code to reproduce the issue.

Let me know if this is necessary and I will create a repo with the necessary code.

Bull version

3.22.1

Additional information

@swayam18 (Author)

I was able to reproduce this reliably with 40,000 jobs on my machine and verified that the solution indeed works. Any feedback will be appreciated.

@swayam18 swayam18 changed the title Delayed jobs get stuck in delayed queue if Redis is busy [Bug] Delayed jobs get stuck in delayed queue if Redis is busy Apr 23, 2021
@manast (Member)

manast commented Apr 23, 2021

@swayam18 thanks for the report and the solution proposal, I will verify it and release a fix ASAP.

@manast closed this as completed in 16bbfad Apr 23, 2021
github-actions bot pushed a commit that referenced this issue Apr 23, 2021
## [3.22.3](v3.22.2...v3.22.3) (2021-04-23)

### Bug Fixes

* **delayed:** re-schedule updateDelay in case of error, fixes #2015 (16bbfad)
@manast (Member)

manast commented Apr 23, 2021

🎉 This issue has been resolved in version 3.22.3 🎉


Your semantic-release bot 📦🚀

papandreou added a commit to papandreou/bull that referenced this issue Apr 27, 2021
* develop: (39 commits)
  chore(release): 3.22.3 [skip ci]
  fix(delayed): re-schedule updateDelay in case of error fixes OptimalBits#2015
  chore(release): 3.22.2 [skip ci]
  chore: fix github token
  chore: correct release branch
  test: increase timeout in test
  chore: add semantic release scripts
  fix(obliterate): obliterate many jobs fixes OptimalBits#2016
  3.22.1
  chore: upgrade dependencies
  chore: update CHANGELOG
  fix(obliterate): remove repeatable jobs fixes OptimalBits#2012
  docs: fix typo
  docs: update README.md
  docs: updated README.md
  Update README.md
  3.22.0
  docs: update CHANGELOG
  feat: do not rely on comma to encode jobid in progress fixes OptimalBits#2003 (OptimalBits#2004)
  3.21.1
  ...
@felix-d

felix-d commented Sep 13, 2021

We faced the same issue today, and it looks like we're running 3.29.0. At some point our job workers stopped doing any work. I scaled up our cluster and the new workers started picking up jobs, but the old ones were still idle after getting the "Redis is busy" error a couple of times earlier.

Looking at the logs, the exact error I get before the worker stops doing any work is the following:

(node:39) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 8)
(node:39) UnhandledPromiseRejectionWarning: ReplyError: BUSY Redis is busy running a script. You can only call SCRIPT KILL or SHUTDOWN NOSAVE.
    at parseError (/app/node_modules/redis-parser/lib/parser.js:179:12)
    at parseType (/app/node_modules/redis-parser/lib/parser.js:302:14)

However, I don't think updateDelayTimer is the cause this time. I'll keep digging.

@felix-d

felix-d commented Sep 13, 2021

It looks like this happens at boot, when the application starts while Redis is already BUSY running a script (in our case we run queue.clean at intervals, and the number of jobs to clean can be quite large).

I checked the logs of two production pods that were not processing any jobs; they both hit that error exactly 16 times and then went idle (not sure why 16, my concurrency is set to 12 at the moment). They did not process a single job. It seems that hitting a BUSY Redis is not an issue if the workers were already started and processing jobs.
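As an aside, attaching an error listener to the queue should at least surface these failures instead of letting them become unhandled rejections (a sketch; the logging is illustrative):

    queue.on('error', err => {
      // BUSY replies from Redis land here when Bull catches them; the
      // unhandled rejections above suggest some code path does not.
      if (err.message && err.message.startsWith('BUSY')) {
        console.warn('Redis busy, workers may stall:', err.message);
      }
    });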

@felix-d

felix-d commented Sep 13, 2021

For now we simply added removeOnComplete to our jobs; we don't need to keep them around anymore, so we won't need to run queue.clean at intervals. This stops us from getting the busy-Redis error, but it's not a fix for the actual problem.
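For reference, the workaround is just Bull's standard removeOnComplete job option (the job data here is hypothetical):

    // Completed jobs are removed immediately, so the completed set
    // never grows large enough to require a heavy queue.clean run.
    queue.add({ orderId: 123 }, { removeOnComplete: true });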
