Intermittent ‘Missing lock for job’ Errors in BullMQ After Queue History Increase #2974
I think it would be useful if you upgrade to the latest version, so we can see if you still experience this issue before we dig deeper into it.
Upgrading to the most recent version seems to present a similar issue. I am seeing lock renewals fail with meaningful frequency (~5% of jobs).
The job ends up being marked as failed because it appears to BullMQ to be stalled; however, the process actually continues, because no signal is sent telling it to quit. I looked for a task/issue about this.
It is not normal to get this error so often. The question is what your process function is doing: is it performing a very CPU-intensive operation that leaves the worker no time to renew the locks, so they are lost before the job manages to complete?
Thanks for that callout. Maybe I misunderstand the requirements for proper operation of a worker. My processor creates a promise around a spawned child process (ffmpeg) and forwards its log output. So, while the child process is CPU intensive, it seems to me that the node event loop is mostly free (any time it's not handling a log event). Given the flake frequency, rather than increasing the duration between lock renewals, tolerating a few lock-renewal failures before marking the job stalled seems like it would reduce these flakes meaningfully.
I don't think tweaking stalledInterval or lockDuration is the best solution here. If the call to ffmpeg is indeed creating a new process, then the main node event loop should be able to renew the lock, so I wonder whether it really is the case that it is running in a separate process?
Btw, if you could isolate the issue in a simple-to-run test, I can definitely look deeper into it to see what is going on.
I can't create a reproduction – it seems like it might be related to network issues. Most likely this is my fault. I appreciate the advice.
Version
v5.7.15
Platform
NodeJS
What happened?
I'm a fairly new BullMQ user. This is less likely a bug than a misconfiguration or a known issue in the version we're running, but I'd appreciate any insights from the maintainers or the community.
Across several different services/queues, we have started to see numerous:
Missing lock for job N. moveToFinished
errors, which then trigger a retry and leave the job in a failed state (approximately 3–5% of jobs). Previously this was a non-existent issue for us, so it's odd for it to appear across all our queues. The only recent change is that we've increased history to 100 completed jobs per queue (up from ~25–50). Some job logs can be thousands of lines; job data are at most ~400 lines of pretty-printed JSON. Did I perhaps miss something in the documentation about the importance of speed/load/memory requirements for the Redis instance?
I have been trying to work on a reproducible example, but I have not yet been successful – I think it might only happen under Redis load? All the issues/documentation I can find suggest this type of issue comes from either too short a
lockDuration
combined with a blocked node event loop (which doesn't seem to be the case here – some jobs are only 30 seconds and don't block the event loop), or from calling moveToFinished or related methods manually (we are not).
I looked through issues/releases for a reference to specific changes that might resolve this if it were a known issue. Please do let me know if I missed something in the history.
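For context, the settings discussed in this thread correspond to BullMQ `Worker` options. A hedged sketch of where they are set — the queue name, processor, connection details, and values are placeholders, not configuration from this report:

```javascript
const { Worker } = require('bullmq');

// Illustrative worker setup showing the lock/stall-related options.
const worker = new Worker(
  'video-jobs',
  async (job) => {
    // ... processor body ...
  },
  {
    connection: { host: 'localhost', port: 6379 }, // placeholder connection
    lockDuration: 30000,    // ms a job lock is held before it must be renewed
    stalledInterval: 30000, // ms between stalled-job checks
    maxStalledCount: 1,     // times a job may be flagged stalled before failing
  }
);
```

Raising `maxStalledCount` is the closest available knob to "allowing a few lock-renewal failures", though as noted above the maintainer considers tuning these values a workaround rather than a fix.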
How to reproduce.
Relevant log output
Code of Conduct