Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Close race condition in procrastinate_fetch_job #231

Merged
merged 3 commits into from
Jun 15, 2020
Merged

Conversation

elemoine
Copy link
Contributor

@elemoine elemoine commented Jun 5, 2020

This closes a race condition in the procrastinate_fetch_job plpgsql function, where jobs sharing the same lock can be run out of order.

With this commit jobs with the same lock are always executed in order, whatever their ETAs and queues.

In effect:

  • if job A in queue 1 (id 1) and job B in queue 2 (id 2) have the same lock, and no workers process queue 1, then job B won't be executed, because job A must be executed first
  • if job A is deferred with ETA 1 year, no other jobs with the same lock will be executed for 1 year

The lock name may change from "lock" to "serial lock" in the future.

Closes #212.

Successful PR Checklist:

  • Tests
    • (not applicable?) we have no SQL tests for the moment
  • Documentation
    • (not applicable?) although it'd be good to mention that jobs with the same lock are now always executed in order
  • Had a good time contributing? that was a highly collaborative PR!

@elemoine
Copy link
Contributor Author

elemoine commented Jun 5, 2020

Making this a draft PR for the moment, as I'd like that #224 gets merged first, and that we cut a new release right after #224.

@codecov
Copy link

codecov bot commented Jun 5, 2020

Codecov Report

Merging #231 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##            master      #231   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           24        24           
  Lines         1100      1100           
  Branches       135       135           
=========================================
  Hits          1100      1100           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 051f631...fac4636. Read the comment docs.

@ewjoachim
Copy link
Member

I unchecked documentation. I think we should have a dedicated page on serial locking. We should also explicitely mention that this is the current way of implementing a chain.

@ewjoachim ewjoachim mentioned this pull request Jun 5, 2020
5 tasks
@elemoine
Copy link
Contributor Author

elemoine commented Jun 5, 2020

I unchecked documentation. I think we should have a dedicated page on serial locking. We should also explicitely mention that this is the current way of implementing a chain.

We have https://procrastinate.readthedocs.io/en/stable/howto/locks.html already, and I think it's quite explicit.

@tmartinfr

This comment has been minimized.

@elemoine elemoine marked this pull request as ready for review June 5, 2020 12:55
@tmartinfr
Copy link

@elemoine We should remove this section of the doc : https://github.com/peopledoc/procrastinate/blob/master/docs/discussions.rst#the-procrastinate_job_locks-table, and maybe update the mention Unavailable tasks, either locked, scheduled for the future or in a queue that the worker doesn't listen to, will be ignored. here : https://github.com/peopledoc/procrastinate/blob/master/docs/discussions.rst#about-locks

@elemoine
Copy link
Contributor Author

elemoine commented Jun 5, 2020

@elemoine We should remove this section of the doc : https://github.com/peopledoc/procrastinate/blob/master/docs/discussions.rst#the-procrastinate_job_locks-table,

👍

Unavailable tasks, either locked, scheduled for the future or in a queue that the worker doesn't listen to, will be ignored. here : https://github.com/peopledoc/procrastinate/blob/master/docs/discussions.rst#about-locks

I do not see the problem with that statement, but that may be me.

@elemoine elemoine requested a review from CorBott as a code owner June 5, 2020 13:22
@tmartinfr
Copy link

Unavailable tasks, either locked, scheduled for the future or in a queue that the worker doesn't listen to, will be ignored. here : https://github.com/peopledoc/procrastinate/blob/master/docs/discussions.rst#about-locks

I do not see the problem with that statement, but that may be me.

Not wrong, but it appeared not very clear for me. I could suggest something like : When a worker requests a task, it will always receive the oldest available task. If this oldest task is unavailable, either locked, scheduled for the future or in a queue that the worker doesn't listen to, no other task will be selected. (can be improved, and if you want to keep the existing version, it's fine for me)

@@ -115,9 +115,7 @@ their identifiers could be used (there's no hard limit on the length of a lock s
but stay reasonable).

A task can only take a single lock so there's no dead-lock scenario possible where two
running tasks are waiting one another. That being said, if a worker dies with a lock, it
will be up to you to free it. If the task fails but the worker survives though, the
lock will be freed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It this outdated ?

Copy link
Member

@ewjoachim ewjoachim Jun 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could rephrasing it as:

If a worker is killed without ending its job, following jobs with the same lock will not run until the interrupted job is either manually set to "failed" or "succeeded". If a job simply fails, following jobs with the same locks may run.

@@ -0,0 +1,48 @@
DROP TABLE IF EXISTS procrastinate_job_locks;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll be explicit in the changelog that with those migrations, the workers will need to be stopped when running the migration.
Of course, one can always write better migrations.

Copy link
Contributor

@CorBott CorBott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one small suggestion...

running tasks are waiting one another. That being said, if a worker dies with a lock, it
will be up to you to free it. If the task fails but the worker survives though, the
lock will be freed.
running tasks are waiting one another.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
running tasks are waiting one another.
running tasks are waiting for one another.

This closes a race condition in the procrastinate_fetch_job plpgsql function,
where jobs sharing the same lock can be run out of order.

With this commit jobs with the same lock are **always** executed in order,
whatever their ETAs and queues.

In effect:
- if job A in queue 1 (id 1) and job B in queue 2 (id 2) have the same lock,
  and no workers process queue 1, then job B won't be executed, because
  job A must be executed first
- if job A is deferred with ETA 1 year, no other jobs with the same lock will
  be executed for 1 year

The lock name may change from "lock" to "serial lock" in the future.
The result is the same, but it makes the query more easily readable.
@ewjoachim
Copy link
Member

We wanted to rename the lock "serial lock" but this will change a LOT of code, so it will have its own PR.

@ewjoachim ewjoachim merged commit 794b564 into master Jun 15, 2020
@ewjoachim ewjoachim deleted the ele_fetch-job branch June 15, 2020 17:04
@elemoine elemoine mentioned this pull request Jun 16, 2020
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR type: bugfix 🕵️ Contains bug fix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race condition in locked tasks: wrong execution order
5 participants