Intermittent ‘Missing lock for job’ Errors in BullMQ After Queue History Increase #2974

Closed

drmarshall opened this issue Dec 18, 2024 · 7 comments

@drmarshall

Version

v5.7.15

Platform

NodeJS

What happened?

I'm a fairly new BullMQ user, so this is more likely a misconfiguration (or a known-buggy version) than a bug. I'd appreciate any insights from the maintainers or the community on this issue.

Across several different services/queues, we have started to see numerous Missing lock for job N. moveToFinished errors, which trigger a retry and leave the job in a failed state (approximately 3-5% of jobs). Previously this was a non-existent issue for us, so it's odd for it to appear across all our queues. The only recent change is that we've increased the completed-job history to 100 jobs per queue (up from ~25-50). Some job logs can be thousands of lines; job data are at most ~400 lines of pretty-printed JSON. Did I perhaps miss something in the documentation about the importance of speed/load/memory requirements for the Redis instance?

I have been trying to build a reproducible example, but I have not yet been successful – I think it might only happen under Redis load. All the issues/documentation I can find suggest this type of issue comes from either a lockDuration that is too short while the Node event loop is blocked (which doesn't seem to be the case here – some jobs take only 30 seconds and don't block the event loop) or from calling moveToFinished or related methods manually (which we are not doing).

I looked through the issues/releases for a reference to a specific change that might resolve this if it were a known issue. Please do let me know if I missed something in the history.

How to reproduce.

// src/my-service/queue.ts

import { Queue } from "bullmq"

import { getQueueConnection } from "@/lib/redis"

export const QUEUE = "my-queue"

let queue: Queue

export const getQueue = (): Queue => {
  if (!queue) {
    queue = new Queue(QUEUE, {
      connection: getQueueConnection(),
      defaultJobOptions: {
        attempts: 4,
        backoff: {
          type: "exponential",
          delay: 5000,
        },
      },
    })
    // the "error" event receives an Error, not a Job
    queue.on("error", (err) => {
      console.error(QUEUE, "queue error", err)
    })
  }
  return queue
}

...
// src/instrumentation.ts

const { Worker } = await import("bullmq")

export const register = async () => {
...
  if (process.env.NEXT_RUNTIME === "nodejs") {
    const myWorker = new Worker(
      QUEUE,
      async (job: Job & { data: MyJob }) => {
        try {
          const jobData = jobSchema.parse(job.data)
          await process(jobData, true, makeJobLogger(job))
        } catch (e) {
          console.error("Error processing my job", e)
          throw e
        }
      },
      {
        connection: getWorkerConnection(),
        concurrency: 1,
        removeOnComplete: { count: 100 },
        removeOnFail: { count: 500 },
        runRetryDelay: 1000,
        settings: {
          backoffStrategy,
        },
      }
    )
    myWorker.on("error", (err) => {
      console.error(
        "myWorker Error processing my job",
        err.message,
        err.cause,
        err.stack
      )
    })
    myWorker.on("completed", (job: Job & { data: myJob }) => {
      console.log("myWorker Job completed", job.id)
    })
  }
}

...

// src/lib/redis.ts

import IORedis from "ioredis"

import { env } from "@/env.mjs"

const retryStrategy = (times: number) => {
  return Math.max(Math.min(Math.exp(times), 20000), 1000)
}

let workerConnection: IORedis
let queueConnection: IORedis

export const getWorkerConnection = () => {
  if (!env.REDIS_URL) {
    throw new Error("REDIS_URL is not set")
  }
  if (!workerConnection) {
    workerConnection = new IORedis(env.REDIS_URL, {
      offlineQueue: true,
      maxRetriesPerRequest: null,
      enableReadyCheck: false,
      retryStrategy,
    })
  }
  return workerConnection
}

export const getQueueConnection = () => {
  if (!env.REDIS_URL) {
    throw new Error("REDIS_URL is not set")
  }

  if (!queueConnection) {
    queueConnection = new IORedis(env.REDIS_URL, {
      offlineQueue: false,
      maxRetriesPerRequest: null,
      enableReadyCheck: false,
      retryStrategy,
    })
  }
  return queueConnection
}

Relevant log output

Error: Missing lock for job 1234. moveToFinished
    at finishedErrors (/usr/node_modules/bullmq/dist/cjs/classes/scripts.js:308:28)
    at <anonymous> (/usr/node_modules/bullmq/dist/cjs/classes/scripts.js:290:24)
    at processTicksAndRejections (native:7:39)

Code of Conduct

  • I agree to follow this project's Code of Conduct
drmarshall added the "bug" label Dec 18, 2024
@manast
Contributor

manast commented Dec 19, 2024

I think it would be useful if you upgraded to the latest version, so we can see whether you still experience this issue before we dig deeper into it.

@drmarshall
Author

Upgrading to the most recent version presents a similar issue. I am seeing lock renewal fail with meaningful frequency (~5% of jobs):

762 |             });
763 |             try {
764 |                 const erroredJobIds = await this.scripts.extendLocks(jobs.map(job => job.id), jobs.map(job => job.token), this.opts.lockDuration);
765 |                 for (const jobId of erroredJobIds) {
766 |                     // TODO: Send signal to process function that the job has been lost.
767 |                     this.emit('error', new Error(`could not renew lock for job ${jobId}`));
                                             ^
error: could not renew lock for job 1589
      at /usr/node_modules/bullmq/dist/cjs/classes/worker.js:767:40

The job ends up being marked as failed because it appears stalled to BullMQ; however, the process keeps running because no signal is sent to the processor to quit. It seems like implementing the TODO in the code above – sending a signal to the processor that the job has lost its lock – would fix my error (the job would fail and be retried appropriately rather than proceeding without a lock). Probably this should be the default behavior (though it would be a behavior change).

I looked for a task/issue tracking this TODO and didn't find one – perhaps I missed it? Or is this left to the user to implement an appropriate 'error' handler for the BullMQ errors that are abortable?
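
For illustration, a user-land approximation of that behavior might look something like this (a sketch only – the controllers map, the doWork helper, and the error-message parsing are my assumptions, not BullMQ APIs):

// Sketch: abort a job's in-flight work when the worker reports that its lock
// could not be renewed. `doWork` is a hypothetical helper that accepts an
// AbortSignal; parsing the job id out of the error message is an assumption.
import { Worker, type Job } from "bullmq"

const controllers = new Map<string, AbortController>()

const worker = new Worker(
  QUEUE,
  async (job: Job) => {
    const controller = new AbortController()
    controllers.set(String(job.id), controller)
    try {
      await doWork(job.data, controller.signal)
    } finally {
      controllers.delete(String(job.id))
    }
  },
  { connection: getWorkerConnection() }
)

worker.on("error", (err) => {
  // The worker emits `could not renew lock for job <id>` from its lock-extension loop.
  const match = /could not renew lock for job (\S+)/.exec(err.message)
  if (match) controllers.get(match[1])?.abort()
})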

@manast
Contributor

manast commented Jan 2, 2025

It is not normal to get this error so often. The question is what your process function is doing: is it running a very CPU-intensive operation that does not leave the worker enough time to renew the locks, so they are lost before the job manages to complete?

@drmarshall
Author

drmarshall commented Jan 2, 2025

Thanks for that callout. Maybe I misunderstand the requirements for proper operation of a worker. I create a promise for a fluent-ffmpeg task (which calls child_process.spawn()) and handle the events, mostly just logging progress.
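
Roughly, the wrapped task looks like this (a sketch – the paths and log callback are placeholders):

// Sketch of the promise-wrapped fluent-ffmpeg task described above; the paths and
// log callback are placeholders. ffmpeg runs as a separate child process, so the
// Node event loop only handles the (cheap) progress/end/error events.
import ffmpeg from "fluent-ffmpeg"

const transcode = (input: string, output: string, log: (msg: string) => void) =>
  new Promise<void>((resolve, reject) => {
    ffmpeg(input)
      .output(output)
      .on("progress", (p) => log(`progress: ${p.percent ?? "?"}%`))
      .on("end", () => resolve())
      .on("error", (err) => reject(err))
      .run()
  })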

So, while the child process is CPU intensive, the Node event loop should be mostly free (except when it is handling a progress/log event). Given the flake frequency, tolerating a few lock-renewal failures before a job is treated as stalled seems more likely to reduce these flakes than increasing the interval between lock renewals. Is increasing stalledInterval to 2x-3x lockDuration a reasonable approach here? I'd appreciate your insight on this tuning.
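
Concretely, something like this is what I have in mind (QUEUE, processor, and getWorkerConnection refer to the snippets above; the values are illustrative):

// Illustrative values only. lockDuration is how long a lock lives before it must be
// renewed (renewal runs at lockDuration / 2); stalledInterval is how often the
// stalled-job checker runs (default 30000 ms for both).
const worker = new Worker(QUEUE, processor, {
  connection: getWorkerConnection(),
  lockDuration: 30000,
  stalledInterval: 90000,
})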

@manast
Contributor

manast commented Jan 7, 2025

I don't think tweaking stalledInterval or lockDuration is the best solution here. If the call to ffmpeg is indeed creating a new process, then the main Node event loop should be able to renew the lock, so I wonder whether it really is running in a separate process.
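
One way to verify the event loop really is free while a job runs is Node's perf_hooks event-loop delay monitor (a sketch; the sampling interval and resolution are arbitrary):

// Sketch: sample event-loop delay while jobs run. Large p99 values would suggest the
// ffmpeg work is blocking the worker's event loop rather than running in a separate process.
import { monitorEventLoopDelay } from "node:perf_hooks"

const histogram = monitorEventLoopDelay({ resolution: 20 })
histogram.enable()

setInterval(() => {
  // percentile() returns nanoseconds; convert to milliseconds
  console.log("event loop p99 delay (ms):", histogram.percentile(99) / 1e6)
  histogram.reset()
}, 10_000)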

@manast
Contributor

manast commented Jan 7, 2025

Btw, if you could isolate the issue in a simple-to-run test, I can definitely look deeper into it to see what is going on.

manast added the "cannot reproduce" label and removed the "bug" label Jan 20, 2025
manast changed the title from "[Bug]: Intermittent ‘Missing lock for job’ Errors in BullMQ After Queue History Increase" to "Intermittent ‘Missing lock for job’ Errors in BullMQ After Queue History Increase" Jan 20, 2025
@drmarshall
Author

I can't create a reproduction – it seems like it might be related to network issues. Most likely this is on my end. I appreciate the advice.
