When worker restarts, interrupted jobs are forever lost #307
Thanks for submitting this problem! How do you stop/restart the worker? With what signal? When the worker receives a SIGINT or a SIGTERM, it should wait for ongoing jobs to finish before exiting. So after the worker exits, the jobs that were running when the worker was requested to stop should be marked as "succeeded", "failed", or "todo" if the job is to be retried. Also, note that a second signal (a second Ctrl+C for example) will immediately terminate the worker, interrupting any ongoing jobs (and leaving them as "doing" in the database).
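For illustration, a minimal sketch of that stop behaviour from the outside, assuming the worker runs as its own process; the PID below is hypothetical:

```python
import os
import signal

worker_pid = 12345  # hypothetical PID of the procrastinate worker process

# First signal: the worker stops fetching new jobs and waits for ongoing ones to
# finish, so they end up as "succeeded", "failed", or "todo" (if retried).
os.kill(worker_pid, signal.SIGTERM)

# A second signal (e.g. a second Ctrl+C / SIGINT) would terminate the worker
# immediately, leaving any interrupted jobs in the "doing" state in the database.
# os.kill(worker_pid, signal.SIGINT)
```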
Hello! In addition to what @elemoine said, I'd like to add that the codebase has a function to detect that a job is stalled (stuck in "doing" for too long); we did that in #7 and the tooling is still in place, but that function is currently not used anywhere. What would be needed now is:
@dionjwa do you think this would solve your problem?
@elemoine I'm not sure what signal is used, but it's not realistic for the worker to wait until the jobs are finished before exiting, since that's not under my control: kubernetes will give the container a signal, but it will only wait so long before terminating the container. My jobs are >90s, so too long to expect them to complete after getting a termination signal. @ewjoachim Yes, a function that automatically runs periodically, detects stalled jobs and resubmits them would work. This is an approach that other job queue managers also take, e.g. https://github.com/OptimalBits/bull (this stores jobs in redis, but the exact same problem scenario exists there). In that case it would probably require a timeout or maxTime per job, so the worker isn't guessing whether a job is stalled or just taking a long time. Thanks for your quick responses, much appreciated!
For now, the existing code uses a single time value for all of the tasks. Having one value per task would likely make the request much longer. Would it work if you used the longest timeout as the timeout value? Also, I think there is a big risk in case of a network slowdown (say a DNS problem where we have 2 DNS servers and for each connection we have to wait for the first DNS to time out before we can continue; I've experienced this often enough to know it's a plausible scenario). In case of a network slowdown, I'd expect a slowdown in task execution, but with this, we could have tasks executing twice or more, and could even end up stuck repeatedly re-executing the same tasks. So what should happen? When a task reaches its timeout, should the worker kill it? If it doesn't, that's dangerous, I think.
Just a note: it might be a phrasing issue. Presenting it as a "timeout" gives the impression that if we reach the timeout before the job is done, something will happen. We need the "duration after which it's unthinkable that the task would still be running", so that's probably something like 10 to 50 times the expected length of the task.
@ewjoachim these are good questions. My two cents:
Overall, I think it's reasonable and useful to create a contract (you have 2 minutes to do this job once you start) and have procrastinate fail the job (or retry it, or whatever) if it can't keep to that. Great job! This library has a really good set of features.
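One way to get that kind of contract today, without any new procrastinate feature, is to enforce the deadline inside the task itself. A minimal sketch, assuming an async task; the task name, connector choice, 2-minute limit and retry count are all illustrative:

```python
import asyncio

from procrastinate import AiopgConnector, App

app = App(connector=AiopgConnector())


@app.task(retry=3)  # illustrative: let procrastinate retry the job if it fails
async def process_video(path: str):
    # Enforce the "2 minutes once you start" contract inside the task: if the work
    # exceeds the deadline, asyncio.wait_for raises TimeoutError, and procrastinate
    # marks the job as failed (or schedules a retry) instead of letting it run with
    # no upper bound.
    await asyncio.wait_for(do_the_actual_work(path), timeout=120)


async def do_the_actual_work(path: str) -> None:
    ...  # real job logic goes here
```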
With k8s, to stop a pod, the container runtime sends a SIGTERM signal to the main process in each container, and once the "grace period" has expired, the SIGKILL signal is sent to the remaining processes. Systemd uses the same sort of termination mechanism, with a default timeout of 90 seconds. So the commonly used runtime environments already have this notion of "stop timeouts"; do we really want to add our own mechanism to Procrastinate? An open question, really. When the runtime environment kills (with SIGKILL) the Procrastinate worker, that may leave jobs in the "doing" state in the queue. Currently, these jobs will become stalled. When the Procrastinate worker starts again it will ignore them, and they'll stay in the queue forever. How about adding a
This is what I have in mind: https://github.com/peopledoc/procrastinate/compare/ele_stalled-jobs. (Although there may be a race condition issue with this implementation.)
Not fond of having that on worker startup, because it means you'd have to restart a worker in order to do something about these jobs. I'm beginning to think documenting a
When may a job be left in the queue with the "doing" state? My assumption was that it may only happen when the worker process got killed while executing a job. This is why I thought a
And I'm not sure about
A worker being killed doesn't mean a new one will be launched: e.g. if we scale down, and one of the workers we shut down had to be killed because its task took too long, that task won't be picked up again anytime soon. That's why I'd rather have this integrated into the normal lifecycle.
Only in terms of deleting it. We may want to do any number of other things: relaunching it, updating something in our DB to say there was an error, sending an alert, ...
Yes, that's what I was asking earlier. For me, the time before a job is declared "stalled" must be long enough that there cannot be any ambiguity. Say 100+ or 1000+ times the normal duration of the task. I'm not willing to integrate a "ping" system like celery has, because it really feels like mixing business logic and infrastructure, but without that, it's really tricky to be certain whether a job is running or not. That's the heart of the "at-most-once vs at-least-once" problem.
As discussed today with @ewjoachim, we plan to add an
Oh we never answered this:
Please read up on what the queueing lock is here: https://procrastinate.readthedocs.io/en/stable/glossary.html?term-Queueing-Lock and here: https://procrastinate.readthedocs.io/en/stable/reference.html#procrastinate.tasks.Task.configure This should make it much clearer why it's expected that there's no queueing lock if you don't use this feature :)
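To make the opt-in nature concrete, a small sketch based on the `configure` API linked above; the task name, lock name, connector choice and connection parameters are made up for illustration:

```python
from procrastinate import App, Psycopg2Connector

app = App(connector=Psycopg2Connector())


@app.task
def refresh_report(report_id: int):
    ...


# With a queueing lock: at most one waiting ("todo") job holding this lock can sit
# in the queue at a time; deferring a second one while the first is still waiting
# is rejected with an error.
refresh_report.configure(queueing_lock="refresh-report-42").defer(report_id=42)

# Without configure(queueing_lock=...), the job's queueing_lock simply stays NULL,
# which is why the column is empty in the database screenshot from the issue
# description.
refresh_report.defer(report_id=42)
```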
@ewjoachim Thanks for the clarification! It's unclear from the proposal whether the periodic task to retrieve all stalled jobs is something a user will have to configure, or whether it happens automatically.
procrastinate will provide building blocks and guidance (through documentation) on how to make a periodic task retrier. The user will be in charge of implementing it according to their own rules, given we're not sure there exists a single "sane default" action. Is that ok from your point of view?
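A sketch of what such a user-implemented retrier could look like. To be clear, this is an assumption about the building blocks to be exposed, not an existing 0.15.2 API: the `get_stalled_jobs` / `retry_job` helpers, the cron expression and the 30-minute threshold are placeholders:

```python
from procrastinate import AiopgConnector, App

app = App(connector=AiopgConnector())


# Hypothetical periodic task: every 10 minutes, look for jobs that have been in
# "doing" far longer than any task could plausibly run, and reschedule them.
@app.periodic(cron="*/10 * * * *")
@app.task(queueing_lock="retry_stalled_jobs", pass_context=True)
async def retry_stalled_jobs(context, timestamp):
    # get_stalled_jobs / retry_job are assumed helpers; nb_seconds should be much
    # larger than the longest expected job duration (see the discussion above).
    stalled_jobs = await context.app.job_manager.get_stalled_jobs(nb_seconds=30 * 60)
    for job in stalled_jobs:
        await context.app.job_manager.retry_job(job)
```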
It's not ideal, honestly. The one thing I want from a distributed job queue is for it to reliably run jobs no matter when or how workers come and go. Other features can be added, but reliably running jobs without me needing extra custom logic is the main feature I always need in a job queue. In the meantime though, I will be very grateful for documentation on how to implement the task retrier, even pointers here if it's not going to be merged for a while. I basically need to solve this ASAP, whether by replacing procrastinate or finding another solution, since we are losing jobs.
So just to let you know, we've had limited time to work on this over the last few days, but we'll try to suggest a snippet that should help, either for the current stable procrastinate version or the next one. Also, from my current understanding, you're not losing tasks. As a very manual workaround, you can update jobs that have been in the "doing" state for a long time and set them back to "todo"; they should then run again (see the sketch below).
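For that manual workaround, something along these lines should do, assuming the default `procrastinate_jobs` table and `status` column, a PostgreSQL database reachable with psycopg2, and a job id you have already identified as stuck:

```python
import psycopg2

# Hypothetical connection parameters; adjust to your environment.
conn = psycopg2.connect("dbname=mydb user=myuser")

stalled_job_id = 123  # a job you have verified is stuck in "doing"

with conn, conn.cursor() as cursor:
    # Put the stalled job back in "todo" so a worker can pick it up again.
    # Table and column names assume the default procrastinate schema.
    cursor.execute(
        "UPDATE procrastinate_jobs SET status = 'todo' "
        "WHERE id = %s AND status = 'doing'",
        (stalled_job_id,),
    )

conn.close()
```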
I have been careful to apply the db migrations, but I suspect I have missed something.
procrastinate==0.15.2
Steps:
The relevant state in the db:
![image](https://user-images.githubusercontent.com/260221/92180062-f4081a00-edfa-11ea-9226-c0dcc3aea40c.png)
There's no `queueing_lock`, which makes me suspicious. What is the expected state of a running job, and how would I determine if a job can be picked up if the worker failed?
Thanks!