When worker restarts, interrupted jobs are forever lost #307

Closed
dionjwa opened this issue Sep 3, 2020 · 17 comments · Fixed by #366
Labels: Issue appropriate for: newcomers 🤩 · Issue appropriate for: Occasional contributors 😉 · Issue contains: Exploration & Design decisions 🧩 · Issue contains: Some documentation 📚 · Issue contains: Some Python 🐍

Comments

@dionjwa

dionjwa commented Sep 3, 2020

I have been careful to apply the db migrations, but I suspect I have missed something.

  • procrastinate version: procrastinate==0.15.2

Steps:

  1. start a worker
  2. submit a job
  3. restart the worker while the job is being processed
  4. on worker restart, it never picks up the incomplete job

The relevant state in the db:
[screenshot: the relevant job row in the database]

There's no queueing_lock, which makes me suspicious.

What is the expected state of a running job, and how would I determine if a job can be picked up if the worker failed?

Thanks!

@elemoine
Contributor

elemoine commented Sep 4, 2020

Thanks for submitting this problem!

How do you stop/restart the worker? With what signal?

When the worker receives a SIGINT or a SIGTERM, it should wait for ongoing jobs to finish before exiting. So after the worker exits, the jobs that were running when the worker was asked to stop should be marked as "succeeded", "failed", or "todo" if the job is to be retried.

Also, note that a second signal (a second Ctrl+C for example) will immediately terminate the worker, interrupting any ongoing jobs (and leaving them as "doing" in the database).
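
For illustration, a minimal sketch of triggering that graceful shutdown from Python; the worker PID below is a placeholder:

```python
import os
import signal

# Placeholder PID of a running procrastinate worker process.
worker_pid = 12345

# First SIGTERM (or SIGINT / Ctrl+C): the worker stops fetching new jobs
# and waits for the jobs it is currently running to finish before exiting.
os.kill(worker_pid, signal.SIGTERM)

# Sending a second signal before the worker has exited terminates it
# immediately, leaving interrupted jobs in the "doing" state:
# os.kill(worker_pid, signal.SIGTERM)
```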

@ewjoachim
Member

ewjoachim commented Sep 4, 2020

Hello !

In addition to what @elemoine said, I'd like to add that the codebase has a function to detect that a job is stalled (stuck in "doing" for too long). We did that in #7 and the tooling is still in place, but that function is currently not used anywhere. What would be needed now is:

  • A design decision on what to do with that. I'm thinking:
    • providing a built-in task that updates all stalled jobs to either failed or todo (depending on the retry argument)
    • Adding this in the admin shell too
  • ... And then execute on the plan.

@dionjwa do you think this would solve your problem?

@dionjwa
Author

dionjwa commented Sep 4, 2020

@elemoine I'm not sure what signal is used, but it's not realistic for the worker to wait until the jobs are finished before exiting, since that's not under my control: Kubernetes will send the container a signal, but it will only wait so long before terminating the container. My jobs take >90s, so that's too long to expect completion after getting a termination signal.

@ewjoachim Yes, a function that runs periodically, detects stalled jobs and automatically resubmits them would work. This is an approach that other job queue managers also take, e.g. https://github.com/OptimalBits/bull (it stores jobs in Redis, but the exact same problem scenario exists there).

In that case it would probably require a timeout or maxTime per job, so it's not guessing if a job is stalled, or just taking a long time.

Thanks for your quick responses, much appreciated!

@ewjoachim
Member

In that case it would probably require a timeout or maxTime per job, so it's not guessing if a job is stalled, or just taking a long time.

For now, the existing code uses a single time value for all of the tasks. Having one value per task would likely make the request much longer. Would it work if you used the longest timeout as the timeout value?

Also, I think there is a big risk in case of a network slowdown (let's say a DNS problem where we have 2 DNS servers and, for each connection, we have to wait for the first DNS to time out before we can continue. I have experienced this often enough in my life to know it's a plausible scenario).

In case of a network slowdown, I'd expect a slowdown in task execution, but with this, we could have tasks executing twice or even more, and could even imagine getting stuck repeatedly executing the same tasks.

So what should happen? When a task reaches its timeout, should the worker kill it? If it doesn't, it's dangerous, I think.

@ewjoachim
Member

Just a note: it might be a phrasing issue. Presenting it as a "timeout" gives the impression that if we reach the timeout before the job is done, something will happen. We need the "duration after which it's unthinkable that the task would still be running" so that's probably something like 10 to 50 times the expected length of the task.

@ewjoachim added the labels Type: Bug, Issue contains: Some documentation 📚, Issue contains: Exploration & Design decisions 🧩, Issue contains: Some Python 🐍, Issue appropriate for: newcomers 🤩 and Issue appropriate for: Occasional contributors 😉 on Sep 15, 2020
@dionjwa
Author

dionjwa commented Sep 30, 2020

@ewjoachim these are good questions. My two cents:

  • choosing where you bound your edge cases helps set some reasonable boundaries

    • I am assuming the max-job-time counts from when a worker picks up the job, not how long it has been on the queue
    • As the person putting jobs on the queue, I would prefer to have control over when I think jobs have taken too long and should be killed
    • I can add extra time to that value myself; since they are my jobs, I know their bounds better than anyone else
      • and I can pad that time myself to allow for some network latency
      • but I might choose not to: I might value quicker turnaround with some rate of expected job failure due to jobs taking a bit too long
  • under atrocious network conditions, jobs that require network resources (not all jobs do) will fail

    • if jobs get retried on some exponential interval, then hopefully they will all still get consumed eventually
    • I think a combination of configurable retries with a per-job (or per-task/job-type) timeout will handle the vast majority of use cases and allows a lot of control over durability and responsiveness
    • if network conditions stay terrible, then many jobs will fail
      • this is unavoidable in extreme conditions; what matters most then is proper recording of the error state

Overall, I think it's reasonable and useful to create a contract ("you have 2 minutes to do this job once you start") and have procrastinate fail (or retry, or whatever) the job if it cannot meet it.
I think it's hard to make those assumptions as the library developer since the parameters used to optimize are really only known by the app developer using the library, and jobs vary enormously.

Great job! This library has a really good set of features.

@elemoine
Contributor

elemoine commented Oct 2, 2020

With k8s, to stop a pod, the container runtime sends a SIGTERM signal to the main process in each container. And once the "grace period" has expired, the SIGKILL signal is sent to the remaining processes.

And systemd uses the same sort of termination mechanism, with a default timeout of 90 seconds.

So the commonly used runtime environments already have this notion of "stop timeouts"; do we really want to add our own mechanism to Procrastinate? An open question, really.

When the runtime environment kills the Procrastinate worker (with SIGKILL), that may leave jobs in the "doing" state in the queue. Currently, these jobs become stalled: when the Procrastinate worker starts again it will ignore them, and they'll stay in the queue forever. How about adding a run_stalled_jobs option to the worker? When this option is set, the worker will, at startup time, retrieve all the stalled jobs from the database and re-run them.

This is what I have in mind: https://github.com/peopledoc/procrastinate/compare/ele_stalled-jobs. (Although there may be a race condition issue with this implementation.)

@ewjoachim
Member

How about adding a run_stalled_jobs option to the worker? When this option is set, the worker will, at startup time, retrieve all the stalled jobs from the database, and re-run them.

Not fond of having that on worker startup because it means you'd have to restart a worker in order to do something on these jobs.

I'm beginning to think that documenting a get_error_jobs and/or get_stalled_jobs in the public API, and just encouraging people to implement their own "meta" periodic task that does whatever they need with those jobs, could be an idea. Empowering people to build their own workflow rather than trying to cover every use case? Not 100% sure about the approach.

@elemoine
Contributor

elemoine commented Oct 5, 2020

Not fond of having that on worker startup because it means you'd have to restart a worker in order to do something on these jobs.

When may a job be left in the queue with the "doing" state? My assumption was that it may only happen when the worker process got killed while executing a job. This is why I thought a run_stalled_jobs option would be useful.

I'm beginning to think documenting a get_error_jobs and/or get_stalled_jobs in the public API and just encouraging people to implement their own "meta" periodic task that does whatever they need to those tasks could be an idea.

get_error_jobs is sort of already covered by the remove_old_jobs built-in task.

And I'm not sure about get_stalled_jobs for now. If you implement your own periodic task that calls get_stalled_jobs and decide to re-run a "stalled" job, how do you know whether this job is "stalled" as the result of a previous KILL of a worker, or is actually a job that is currently executing?

@ewjoachim
Member

When may a job be left in the queue with the "doing" state? My assumption was that it may only happen when the worker process got killed while executing a job. This is why I thought a run_stalled_jobs option would be useful.

A worker being killed doesn't mean a new one will be launched. E.g. if we scale down and one of the workers we shut down had to be killed because its task took too long, that task won't be picked up anytime soon. That's why I'd rather this be integrated into the normal lifecycle.

get_error_jobs is sort of already covered by the remove_old_jobs built-in task.

Only in terms of deleting it. We may want to do any number of other things: relaunching it, updating something in our DB to say there was an error, sending an alert, ...

And I'm not sure about get_stalled_jobs for now. If you implement your own periodic task that calls get_stalled_jobs and decide to re-run a "stalled" job, how do you know that this job is "stalled" as the result of a previous KILL of a worker, or that it's actually a job that is currently executing?

Yes, that's what I was asking earlier. For me, the time before a job is declared "stalled" must be long enough so that there cannot be an ambiguity. Say 100+ or 1000+ times the normal duration of the task. I'm not willing to integrate a "ping" system like Celery has, because it really feels like mixing business logic and infrastructure, but without that, it's really tricky to be certain whether a job is running or not. That's the heart of the "at-most-once vs at-least-once" problem.

@elemoine
Contributor

elemoine commented Oct 9, 2020

As discussed today with @ewjoachim, we plan to add a JobStore.list_jobs_with_latest_event function. Using that function, together with JobStore.finish_job, it will be possible to create a periodic task that retrieves all the stalled jobs and restarts them.
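
A rough sketch of what such a task could look like; app.job_store, the signatures of list_jobs_with_latest_event and finish_job, and the latest_event.at attribute are all assumptions about the planned (not yet released) building blocks:

```python
import datetime

from procrastinate import AiopgConnector, App

app = App(connector=AiopgConnector())

STALLED_AFTER = datetime.timedelta(minutes=30)  # arbitrary threshold


async def retry_stalled_jobs() -> None:
    """Reset stalled "doing" jobs to "todo" so workers pick them up again."""
    now = datetime.datetime.now(datetime.timezone.utc)
    # Assumption: the planned list_jobs_with_latest_event returns
    # (job, latest_event) pairs for jobs in the given state.
    for job, latest_event in await app.job_store.list_jobs_with_latest_event(
        status="doing"
    ):
        if now - latest_event.at > STALLED_AFTER:
            # Assumption: finish_job can move the job back to "todo".
            await app.job_store.finish_job(job, status="todo")
```

The function would then be registered as a periodic task (or scheduled externally) at whatever interval and threshold make sense for the job durations involved.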

@ewjoachim
Member

Oh we never answered this:

There's no queueing_lock which makes me suspicious.

Please read about what the queueing lock is here: https://procrastinate.readthedocs.io/en/stable/glossary.html?term-Queueing-Lock and here: https://procrastinate.readthedocs.io/en/stable/reference.html#procrastinate.tasks.Task.configure

This should make it much clearer why it's expected that there's no queueing lock if you don't use this feature :)
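
For illustration, a small sketch of the difference (the app, connector setup and my_task are placeholders; connection handling is omitted):

```python
from procrastinate import App, Psycopg2Connector

app = App(connector=Psycopg2Connector())


@app.task
def my_task(report_id):
    ...


# With a queueing lock: at most one waiting job with this lock can sit in
# the queue at a time; deferring another one while the first is still
# "todo" is rejected.
my_task.configure(queueing_lock="refresh-report").defer(report_id=12)

# Without configure(queueing_lock=...), no lock is recorded for the job,
# which is why the queueing_lock column is empty in the screenshot above.
my_task.defer(report_id=12)
```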

@dionjwa
Author

dionjwa commented Oct 19, 2020

@ewjoachim Thanks for the clarification!

It's unclear from the proposal if the periodic task to retrieve all stalled jobs is something a user will have to configure, or if it happens automatically.

@ewjoachim
Member

procrastinate will provide building blocks and guidance (through documentation) on how to make a periodic task retrier. The user will be in charge of implementing it according to their own rules, given we're not sure there exists a single "sane default" action. Is that OK from your point of view?

@dionjwa
Author

dionjwa commented Nov 30, 2020

It's not ideal honestly. The one thing I want from a distributed job queue is to reliably do jobs no matter when or how workers come and go. Other features can be added, but reliably doing jobs without me needing to do extra custom logic is the main feature I always need in a job queue.

In the meantime though, I will be very grateful for documentation on how to implement the task retrier, or even pointers here if it's not going to be merged for a while. I basically need to solve this ASAP, whether that means replacing procrastinate or finding another solution, since we are losing jobs.

@ewjoachim
Member

So just to let you know, we've had limited time to work on this over the last few days, but we'll try to suggest a snippet that should help, either for the current stable procrastinate version or the next one.

Also, from my current understanding, you're not losing tasks. As a very manual workaround, you can update jobs that have been in the "doing" state for a long time and set them back to "todo". They should run once again.
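
As a rough sketch of that manual workaround, assuming the default procrastinate_jobs table name, a psycopg2 connection, and that no worker is still executing those jobs:

```python
import psycopg2

# Placeholder connection string; point it at the procrastinate database.
conn = psycopg2.connect("dbname=mydb")

with conn, conn.cursor() as cursor:
    # Put jobs stuck in "doing" back into "todo" so workers pick them up.
    cursor.execute(
        "UPDATE procrastinate_jobs SET status = 'todo' WHERE status = 'doing'"
    )
```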

@elemoine
Contributor

@dionjwa see #366, feel free to give it a try, and report back here or in the PR. Thanks.
