Intermittent ‘Missing lock for job’ Errors in BullMQ After Queue History Increase #2974

Closed

drmarshall opened this issue Dec 18, 2024 · 7 comments

@drmarshall

Version

v5.7.15

Platform

NodeJS

What happened?

I'm a fairly new BullMQ user, so this is more likely a misconfiguration (or a known-buggy version) than a bug. I'd appreciate any insights from the maintainers or the community on this issue.

Across several different services/queues, we have started to see numerous Missing lock for job N. moveToFinished errors, which trigger a retry and leave the job in a failed state (approximately 3-5% of jobs). Previously this was a non-existent issue for us, so it's odd for it to appear across all our queues. The only recent change is that we've increased the completed-job history to 100 jobs per queue (up from ~25-50). Some job logs can be thousands of lines; job data are at most ~400 lines of pretty-printed JSON. Did I perhaps miss something in the documentation about the importance of speed/load/memory requirements for the Redis instance?

I have been trying to build a reproducible example, but I have not yet been successful – I think it might only happen under Redis load. All the issues/documentation I can find suggest this type of issue comes from either a lockDuration that is too short while the Node event loop is blocked (which doesn't seem to be the case here – some jobs take only 30 seconds and don't block the event loop) or from calling moveToFinished or related methods manually (which we are not doing).

I looked through the issues/releases for a reference to a specific change that might resolve this if it were a known issue. Please do let me know if I missed something in the history.

How to reproduce.

// src/my-service/queue.ts

import { Queue } from "bullmq"

import { getQueueConnection } from "@/lib/redis"

export const QUEUE = "my-queue"

let queue: Queue

export const getQueue = (): Queue => {
  if (!queue) {
    queue = new Queue(QUEUE, {
      connection: getQueueConnection(),
      defaultJobOptions: {
        attempts: 4,
        backoff: {
          type: "exponential",
          delay: 5000,
        },
      },
    })
    // the "error" event receives an Error, not a Job
    queue.on("error", (err) => {
      console.error(QUEUE, "queue error", err)
    })
  }
  return queue
}

...
// src/instrumentation.ts

const { Worker } = await import("bullmq")

export const register = async () => {
...
  if (process.env.NEXT_RUNTIME === "nodejs") {
    const myWorker = new Worker(
      QUEUE,
      async (job: Job & { data: MyJob }) => {
        try {
          const jobData = jobSchema.parse(job.data)
          await process(jobData, true, makeJobLogger(job))
        } catch (e) {
          console.error("Error processing my job", e)
          throw e
        }
      },
      {
        connection: getWorkerConnection(),
        concurrency: 1,
        removeOnComplete: { count: 100 },
        removeOnFail: { count: 500 },
        runRetryDelay: 1000,
        settings: {
          backoffStrategy,
        },
      }
    )
    myWorker.on("error", (err) => {
      console.error(
        "myWorker Error processing my job",
        err.message,
        err.cause,
        err.stack
      )
    })
    myWorker.on("completed", (job: Job & { data: myJob }) => {
      console.log("myWorker Job completed", job.id)
    })
  }
}

...

// src/lib/redis.ts

import IORedis from "ioredis"

import { env } from "@/env.mjs"

const retryStrategy = (times: number) => {
  return Math.max(Math.min(Math.exp(times), 20000), 1000)
}

let workerConnection: IORedis
let queueConnection: IORedis

export const getWorkerConnection = () => {
  if (!env.REDIS_URL) {
    throw new Error("REDIS_URL is not set")
  }
  if (!workerConnection) {
    workerConnection = new IORedis(env.REDIS_URL, {
      offlineQueue: true,
      maxRetriesPerRequest: null,
      enableReadyCheck: false,
      retryStrategy,
    })
  }
  return workerConnection
}

export const getQueueConnection = () => {
  if (!env.REDIS_URL) {
    throw new Error("REDIS_URL is not set")
  }

  if (!queueConnection) {
    queueConnection = new IORedis(env.REDIS_URL, {
      offlineQueue: false,
      maxRetriesPerRequest: null,
      enableReadyCheck: false,
      retryStrategy,
    })
  }
  return queueConnection
}

Relevant log output

Error: Missing lock for job 1234. moveToFinished
    at finishedErrors (/usr/node_modules/bullmq/dist/cjs/classes/scripts.js:308:28)
    at <anonymous> (/usr/node_modules/bullmq/dist/cjs/classes/scripts.js:290:24)
    at processTicksAndRejections (native:7:39)

Code of Conduct

  • I agree to follow this project's Code of Conduct
drmarshall added the "bug" label Dec 18, 2024
@manast
Contributor

manast commented Dec 19, 2024

I think it would be useful if you upgraded to the latest version, so we can see whether you still experience this issue before we dig deeper into it.

@drmarshall
Author

Upgrading to the most recent version presents a similar issue. I am seeing lock renewal fail with meaningful frequency (~5% of jobs):

762 |             });
763 |             try {
764 |                 const erroredJobIds = await this.scripts.extendLocks(jobs.map(job => job.id), jobs.map(job => job.token), this.opts.lockDuration);
765 |                 for (const jobId of erroredJobIds) {
766 |                     // TODO: Send signal to process function that the job has been lost.
767 |                     this.emit('error', new Error(`could not renew lock for job ${jobId}`));
                                             ^
error: could not renew lock for job 1589
      at /usr/node_modules/bullmq/dist/cjs/classes/worker.js:767:40

The job ends up being marked as failed because it appears stalled to BullMQ; however, the process keeps running because no signal is sent to the processor to quit. It seems like implementing the TODO in the code above – sending a signal to the processor that the job has lost its lock – would fix my error (the job would fail and be retried appropriately rather than proceeding without a lock). Probably this should be the default behavior (though it would be a behavior change).

I looked for a task/issue tracking this TODO and didn't find one – perhaps I missed it? Or is this left to the user to implement an appropriate 'error' handler for the BullMQ errors that are abortable?
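
For illustration, a user-land approximation of that behavior might look something like this (a sketch only – the controllers map, the doWork helper, and the error-message parsing are my assumptions, not BullMQ APIs):

// Sketch: abort a job's in-flight work when the worker reports that its lock
// could not be renewed. `doWork` is a hypothetical helper that accepts an
// AbortSignal; parsing the job id out of the error message is an assumption.
import { Worker, type Job } from "bullmq"

const controllers = new Map<string, AbortController>()

const worker = new Worker(
  QUEUE,
  async (job: Job) => {
    const controller = new AbortController()
    controllers.set(String(job.id), controller)
    try {
      await doWork(job.data, controller.signal)
    } finally {
      controllers.delete(String(job.id))
    }
  },
  { connection: getWorkerConnection() }
)

worker.on("error", (err) => {
  // The worker emits `could not renew lock for job <id>` from its lock-extension loop.
  const match = /could not renew lock for job (\S+)/.exec(err.message)
  if (match) controllers.get(match[1])?.abort()
})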

@manast
Contributor

manast commented Jan 2, 2025

It is not normal to get this error so often. The question is what your process function is doing: is it running a very CPU-intensive operation that does not leave the worker enough time to renew the locks, so they are lost before the job manages to complete?

@drmarshall
Author

drmarshall commented Jan 2, 2025

Thanks for that callout. Maybe I misunderstand the requirements for proper operation of a worker. I create a promise for a fluent-ffmpeg task (which calls child_process.spawn()) and handle the events, mostly just logging progress.
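
Roughly, the wrapped task looks like this (a sketch – the paths and log callback are placeholders):

// Sketch of the promise-wrapped fluent-ffmpeg task described above; the paths and
// log callback are placeholders. ffmpeg runs as a separate child process, so the
// Node event loop only handles the (cheap) progress/end/error events.
import ffmpeg from "fluent-ffmpeg"

const transcode = (input: string, output: string, log: (msg: string) => void) =>
  new Promise<void>((resolve, reject) => {
    ffmpeg(input)
      .output(output)
      .on("progress", (p) => log(`progress: ${p.percent ?? "?"}%`))
      .on("end", () => resolve())
      .on("error", (err) => reject(err))
      .run()
  })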

So, while the child process is CPU intensive, the Node event loop should be mostly free (except when it is handling a progress/log event). Given the flake frequency, tolerating a few lock-renewal failures before a job is treated as stalled seems more likely to reduce these flakes than increasing the interval between lock renewals. Is increasing stalledInterval to 2x-3x lockDuration a reasonable approach here? I'd appreciate your insight on this tuning.
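
Concretely, something like this is what I have in mind (QUEUE, processor, and getWorkerConnection refer to the snippets above; the values are illustrative):

// Illustrative values only. lockDuration is how long a lock lives before it must be
// renewed (renewal runs at lockDuration / 2); stalledInterval is how often the
// stalled-job checker runs (default 30000 ms for both).
const worker = new Worker(QUEUE, processor, {
  connection: getWorkerConnection(),
  lockDuration: 30000,
  stalledInterval: 90000,
})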

@manast
Contributor

manast commented Jan 7, 2025

I don't think tweaking stalledInterval or lockDuration is the best solution here. If the call to ffmpeg is indeed creating a new process, then the main Node event loop should be able to renew the lock, so I wonder whether it really is running in a separate process.
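
One way to verify the event loop really is free while a job runs is Node's perf_hooks event-loop delay monitor (a sketch; the sampling interval and resolution are arbitrary):

// Sketch: sample event-loop delay while jobs run. Large p99 values would suggest the
// ffmpeg work is blocking the worker's event loop rather than running in a separate process.
import { monitorEventLoopDelay } from "node:perf_hooks"

const histogram = monitorEventLoopDelay({ resolution: 20 })
histogram.enable()

setInterval(() => {
  // percentile() returns nanoseconds; convert to milliseconds
  console.log("event loop p99 delay (ms):", histogram.percentile(99) / 1e6)
  histogram.reset()
}, 10_000)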

@manast
Contributor

manast commented Jan 7, 2025

Btw, if you could isolate the issue in a simple-to-run test, I can definitely look deeper into it to see what is going on.

manast added the "cannot reproduce" label and removed the "bug" label Jan 20, 2025
manast changed the title from "[Bug]: Intermittent ‘Missing lock for job’ Errors in BullMQ After Queue History Increase" to "Intermittent ‘Missing lock for job’ Errors in BullMQ After Queue History Increase" Jan 20, 2025
@drmarshall
Author

I can't create a reproduction – it seems like it might be related to network issues. Most likely this is on my end. I appreciate the advice.
