When worker restarts, interrupted jobs are forever lost #307

Closed
dionjwa opened this issue Sep 3, 2020 · 17 comments · Fixed by #366
Labels: Issue appropriate for: newcomers 🤩 · Issue appropriate for: Occasional contributors 😉 · Issue contains: Exploration & Design decisions 🧩 · Issue contains: Some documentation 📚 · Issue contains: Some Python 🐍

Comments

@dionjwa

dionjwa commented Sep 3, 2020

I have been careful to apply the db migrations, but I suspect I have missed something.

  • procrastinate version: procrastinate==0.15.2

Steps:

  1. start a worker
  2. submit a job
  3. restart the worker while the job is being processed
  4. on worker restart, it never picks up the incomplete job

The relevant state in the db:
[screenshot: the relevant job row in the database]

There's no queueing_lock, which makes me suspicious.

What is the expected state of a running job, and how would I determine if a job can be picked up if the worker failed?

Thanks!

@elemoine
Contributor

elemoine commented Sep 4, 2020

Thanks for submitting this problem!

How do you stop/restart the worker? With what signal?

When the worker receives a SIGINT or a SIGTERM, it should wait for ongoing jobs to finish before exiting. So after the worker exits, the jobs that were running when the worker was asked to stop should be marked as "succeeded", "failed", or "todo" if the job is to be retried.

Also, note that a second signal (a second Ctrl+C for example) will immediately terminate the worker, interrupting any ongoing jobs (and leaving them as "doing" in the database).
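
For illustration, a minimal sketch of triggering that graceful shutdown from Python; the worker PID below is a placeholder:

```python
import os
import signal

# Placeholder PID of a running procrastinate worker process.
worker_pid = 12345

# First SIGTERM (or SIGINT / Ctrl+C): the worker stops fetching new jobs
# and waits for the jobs it is currently running to finish before exiting.
os.kill(worker_pid, signal.SIGTERM)

# Sending a second signal before the worker has exited terminates it
# immediately, leaving interrupted jobs in the "doing" state:
# os.kill(worker_pid, signal.SIGTERM)
```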

@ewjoachim
Member

ewjoachim commented Sep 4, 2020

Hello !

In addition to what @elemoine said, I'd like to add that the codebase has a function to detect that a job is stalled (stuck in "doing" for too long). We did that in #7 and the tooling is still in place, but that function is currently not used anywhere. What would be needed now is:

  • A design decision on what to do with that. I'm thinking:
    • providing a built-in task that updates all stalled jobs to either failed or todo (depending on the retry argument)
    • Adding this in the admin shell too
  • ... And then execute on the plan.

@dionjwa do you think this would solve your problem?

@dionjwa
Author

dionjwa commented Sep 4, 2020

@elemoine I'm not sure what signal is used, but it's not realistic for the worker to wait until the jobs are finished before exiting, since that's not under my control: Kubernetes will send the container a signal, but it will only wait so long before terminating the container. My jobs take >90s, so that's too long to expect completion after getting a termination signal.

@ewjoachim Yes, a function that runs periodically, detects stalled jobs and automatically resubmits them would work. This is an approach that other job queue managers also take, e.g. https://github.com/OptimalBits/bull (it stores jobs in Redis, but the exact same problem scenario exists there).

In that case it would probably require a timeout or maxTime per job, so it's not guessing if a job is stalled, or just taking a long time.

Thanks for your quick responses, much appreciated!

@ewjoachim
Member

In that case it would probably require a timeout or maxTime per job, so it's not guessing if a job is stalled, or just taking a long time.

For now, the existing code uses a single time value for all of the tasks. Having one value per task would likely make the request much longer. Would it work if you used the longest timeout as the timeout value?

Also, I think there is a big risk in case of a network slowdown (let's say a DNS problem where we have 2 DNS servers and, for each connection, we have to wait for the first DNS to time out before we can continue. I have experienced this often enough in my life to know it's a plausible scenario).

In case of a network slowdown, I'd expect a slowdown in task execution, but with this, we could have tasks executing twice or even more, and could even imagine getting stuck repeatedly executing the same tasks.

So what should happen? When a task reaches its timeout, should the worker kill it? If it doesn't, it's dangerous, I think.

@ewjoachim
Member

Just a note: it might be a phrasing issue. Presenting it as a "timeout" gives the impression that if we reach the timeout before the job is done, something will happen. We need the "duration after which it's unthinkable that the task would still be running" so that's probably something like 10 to 50 times the expected length of the task.

@ewjoachim added the labels Type: Bug, Issue contains: Some documentation 📚, Issue contains: Exploration & Design decisions 🧩, Issue contains: Some Python 🐍, Issue appropriate for: newcomers 🤩 and Issue appropriate for: Occasional contributors 😉 on Sep 15, 2020
@dionjwa
Author

dionjwa commented Sep 30, 2020

@ewjoachim these are good questions. My two cents:

  • choosing where you bound your edge cases helps set some reasonable boundaries

    • I am assuming the max-job-time counts from when a worker picks up the job, not how long it has been on the queue
    • As the person putting jobs on the queue, I would prefer to have control over when I think jobs have taken too long and should be killed
    • I can add extra time to that value myself; since they are my jobs, I know their bounds better than anyone else
      • and I can pad that time myself to allow for some network latency
      • but I might choose not to: I might value quicker turnaround with some rate of expected job failure due to jobs taking a bit too long
  • under atrocious network conditions, jobs that require network resources (not all jobs do) will fail

    • if jobs get retried on some exponential interval, then hopefully they will all still get consumed eventually
    • I think a combination of configurable retries with a per-job (or per-task/job-type) timeout will handle the vast majority of use cases and allows a lot of control over durability and responsiveness
    • if network conditions stay terrible, then many jobs will fail
      • this is unavoidable in extreme conditions; what matters most then is proper recording of the error state

Overall, I think it's reasonable and useful to create a contract ("you have 2 minutes to do this job once you start") and have procrastinate fail (or retry, or whatever) the job if it cannot meet it.
I think it's hard to make those assumptions as the library developer since the parameters used to optimize are really only known by the app developer using the library, and jobs vary enormously.

Great job! This library has a really good set of features.

@elemoine
Contributor

elemoine commented Oct 2, 2020

With k8s, to stop a pod, the container runtime sends a SIGTERM signal to the main process in each container. And once the "grace period" has expired, the SIGKILL signal is sent to the remaining processes.

And systemd uses the same sort of termination mechanism, with a default timeout of 90 seconds.

So the commonly used runtime environments already have this notion of "stop timeouts"; do we really want to add our own mechanism to Procrastinate? An open question, really.

When the runtime environment kills the Procrastinate worker (with SIGKILL), that may leave jobs in the "doing" state in the queue. Currently, these jobs become stalled: when the Procrastinate worker starts again it will ignore them, and they'll stay in the queue forever. How about adding a run_stalled_jobs option to the worker? When this option is set, the worker will, at startup time, retrieve all the stalled jobs from the database and re-run them.

This is what I have in mind: https://github.com/peopledoc/procrastinate/compare/ele_stalled-jobs. (Although there may be a race condition issue with this implementation.)

@ewjoachim
Member

How about adding a run_stalled_jobs option to the worker? When this option is set, the worker will, at startup time, retrieve all the stalled jobs from the database, and re-run them.

Not fond of having that on worker startup because it means you'd have to restart a worker in order to do something on these jobs.

I'm beginning to think that documenting a get_error_jobs and/or get_stalled_jobs in the public API, and just encouraging people to implement their own "meta" periodic task that does whatever they need with those jobs, could be an idea. Empowering people to build their own workflow rather than trying to cover every use case? Not 100% sure about the approach.

@elemoine
Contributor

elemoine commented Oct 5, 2020

Not fond of having that on worker startup because it means you'd have to restart a worker in order to do something on these jobs.

When may a job be left in the queue with the "doing" state? My assumption was that it may only happen when the worker process got killed while executing a job. This is why I thought a run_stalled_jobs option would be useful.

I'm beginning to think documenting a get_error_jobs and/or get_stalled_jobs in the public API and just encouraging people to implement their own "meta" periodic task that does whatever they need to those tasks could be an idea.

get_error_jobs is sort of already covered by the remove_old_jobs built-in task.

And I'm not sure about get_stalled_jobs for now. If you implement your own periodic task that calls get_stalled_jobs and decide to re-run a "stalled" job, how do you know whether this job is "stalled" as the result of a previous KILL of a worker, or is actually a job that is currently executing?

@ewjoachim
Member

When may a job be left in the queue with the "doing" state? My assumption was that it may only happen when the worker process got killed while executing a job. This is why I thought a run_stalled_jobs option would be useful.

A worker being killed doesn't mean a new one will be launched. E.g. if we scale down and one of the workers we shut down had to be killed because its task took too long, that task won't be picked up anytime soon. That's why I'd rather this be integrated into the normal lifecycle.

get_error_jobs is sort of already covered by the remove_old_jobs built-in task.

Only in terms of deleting it. We may want to do any number of other things: relaunching it, updating something in our DB to say there was an error, sending an alert, ...

And I'm not sure about get_stalled_jobs for now. If you implement your own periodic task that calls get_stalled_jobs and decide to re-run a "stalled" job, how do you know that this job is "stalled" as the result of a previous KILL of a worker, or that it's actually a job that is currently executing?

Yes, that's what I was asking earlier. For me, the time before a job is declared "stalled" must be long enough so that there cannot be an ambiguity. Say 100+ or 1000+ times the normal duration of the task. I'm not willing to integrate a "ping" system like Celery has, because it really feels like mixing business logic and infrastructure, but without that, it's really tricky to be certain whether a job is running or not. That's the heart of the "at-most-once vs at-least-once" problem.

@elemoine
Contributor

elemoine commented Oct 9, 2020

As discussed today with @ewjoachim, we plan to add a JobStore.list_jobs_with_latest_event function. Using that function, together with JobStore.finish_job, it will be possible to create a periodic task that retrieves all the stalled jobs and restarts them.
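
A rough sketch of what such a task could look like; app.job_store, the signatures of list_jobs_with_latest_event and finish_job, and the latest_event.at attribute are all assumptions about the planned (not yet released) building blocks:

```python
import datetime

from procrastinate import AiopgConnector, App

app = App(connector=AiopgConnector())

STALLED_AFTER = datetime.timedelta(minutes=30)  # arbitrary threshold


async def retry_stalled_jobs() -> None:
    """Reset stalled "doing" jobs to "todo" so workers pick them up again."""
    now = datetime.datetime.now(datetime.timezone.utc)
    # Assumption: the planned list_jobs_with_latest_event returns
    # (job, latest_event) pairs for jobs in the given state.
    for job, latest_event in await app.job_store.list_jobs_with_latest_event(
        status="doing"
    ):
        if now - latest_event.at > STALLED_AFTER:
            # Assumption: finish_job can move the job back to "todo".
            await app.job_store.finish_job(job, status="todo")
```

The function would then be registered as a periodic task (or scheduled externally) at whatever interval and threshold make sense for the job durations involved.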

@ewjoachim
Member

Oh we never answered this:

There's no queueing_lock which makes me suspicious.

Please read about what the queueing lock is here: https://procrastinate.readthedocs.io/en/stable/glossary.html?term-Queueing-Lock and here: https://procrastinate.readthedocs.io/en/stable/reference.html#procrastinate.tasks.Task.configure

This should make it much clearer why it's expected that there's no queueing lock if you don't use this feature :)
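
For illustration, a small sketch of the difference (the app, connector setup and my_task are placeholders; connection handling is omitted):

```python
from procrastinate import App, Psycopg2Connector

app = App(connector=Psycopg2Connector())


@app.task
def my_task(report_id):
    ...


# With a queueing lock: at most one waiting job with this lock can sit in
# the queue at a time; deferring another one while the first is still
# "todo" is rejected.
my_task.configure(queueing_lock="refresh-report").defer(report_id=12)

# Without configure(queueing_lock=...), no lock is recorded for the job,
# which is why the queueing_lock column is empty in the screenshot above.
my_task.defer(report_id=12)
```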

@dionjwa
Author

dionjwa commented Oct 19, 2020

@ewjoachim Thanks for the clarification!

It's unclear from the proposal if the periodic task to retrieve all stalled jobs is something a user will have to configure, or if it happens automatically.

@ewjoachim
Member

procrastinate will provide building blocks and guidance (through documentation) on how to make a periodic task retrier. The user will be in charge of implementing it according to their own rules, given we're not sure there exists a single "sane default" action. Is that OK from your point of view?

@dionjwa
Author

dionjwa commented Nov 30, 2020

It's not ideal honestly. The one thing I want from a distributed job queue is to reliably do jobs no matter when or how workers come and go. Other features can be added, but reliably doing jobs without me needing to do extra custom logic is the main feature I always need in a job queue.

In the meantime though, I will be very grateful for documentation on how to implement the task retrier, or even pointers here if it's not going to be merged for a while. I basically need to solve this ASAP, whether that means replacing procrastinate or finding another solution, since we are losing jobs.

@ewjoachim
Member

So just to let you know, we've had limited time to work on this over the last few days, but we'll try to suggest a snippet that should help, either for the current stable procrastinate version or the next one.

Also, from my current understanding, you're not losing tasks. As a very manual workaround, you can update jobs that have been in the "doing" state for a long time and set them back to "todo". They should run once again.
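
As a rough sketch of that manual workaround, assuming the default procrastinate_jobs table name, a psycopg2 connection, and that no worker is still executing those jobs:

```python
import psycopg2

# Placeholder connection string; point it at the procrastinate database.
conn = psycopg2.connect("dbname=mydb")

with conn, conn.cursor() as cursor:
    # Put jobs stuck in "doing" back into "todo" so workers pick them up.
    cursor.execute(
        "UPDATE procrastinate_jobs SET status = 'todo' WHERE status = 'doing'"
    )
```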

@elemoine
Contributor

@dionjwa see #366, feel free to give it a try, and report back here or in the PR. Thanks.
