Close race condition in procrastinate_fetch_job #231

elemoine · 2020-06-05T11:20:39Z

This closes a race condition in the procrastinate_fetch_job plpgsql function, where jobs sharing the same lock can be run out of order.

With this commit, jobs with the same lock are always executed in order, whatever their ETAs and queues.

In effect:

- if job A in queue 1 (id 1) and job B in queue 2 (id 2) have the same lock, and no workers process queue 1, then job B won't be executed, because job A must be executed first
- if job A is deferred with ETA 1 year, no other jobs with the same lock will be executed for 1 year
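
To make the two cases above concrete, here is a minimal pure-Python model of the rule. It is only a sketch: the `Job` class, the `fetch_job` helper and the status names are assumptions made for illustration, not the actual plpgsql function, and real concurrency is not modeled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass
class Job:
    # Hypothetical stand-in for a row of the jobs table.
    id: int
    queue: str
    lock: Optional[str] = None
    status: str = "todo"  # assumed statuses: "todo", "doing", "succeeded", "failed"
    scheduled_at: Optional[datetime] = None  # the job's ETA, if any


def fetch_job(
    jobs: Sequence[Job], queues: Optional[Sequence[str]], now: datetime
) -> Optional[Job]:
    """Return the job a worker listening on `queues` would pick, or None."""
    for job in sorted(jobs, key=lambda j: j.id):
        if job.status != "todo":
            continue
        if queues is not None and job.queue not in queues:
            continue
        if job.scheduled_at is not None and job.scheduled_at > now:
            continue
        # Ordering rule: skip the job while an earlier unfinished job holds
        # the same lock, whatever that earlier job's queue or ETA.
        blocked = job.lock is not None and any(
            other.lock == job.lock
            and other.id < job.id
            and other.status in ("todo", "doing")
            for other in jobs
        )
        if not blocked:
            return job
    return None


now = datetime(2020, 6, 5, 12, 0)

# Case 1: job A (id 1) sits in queue1, which no worker processes.
jobs = [Job(1, "queue1", lock="L"), Job(2, "queue2", lock="L")]
print(fetch_job(jobs, ["queue2"], now))  # None: job B waits for job A

# Case 2: job A is deferred with an ETA one year away.
jobs = [
    Job(1, "queue1", lock="L", scheduled_at=now + timedelta(days=365)),
    Job(2, "queue1", lock="L"),
]
print(fetch_job(jobs, None, now))  # None: job B waits for job A
```

Jobs without a lock are never blocked in this model, so they can still be picked while locked jobs wait.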

The lock name may change from "lock" to "serial lock" in the future.

Closes #212.

Successful PR Checklist:

Tests
- (not applicable?) we have no SQL tests for the moment
Documentation
- (not applicable?) although it'd be good to mention that jobs with the same lock are now always executed in order
Had a good time contributing? That was a highly collaborative PR!

elemoine · 2020-06-05T11:21:57Z

Making this a draft PR for the moment, as I'd like #224 to be merged first, and for us to cut a new release right after #224.

codecov · 2020-06-05T11:23:31Z

Codecov Report

Merging #231 into master will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##            master      #231   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           24        24           
  Lines         1100      1100           
  Branches       135       135           
=========================================
  Hits          1100      1100

Last update 051f631...fac4636.

ewjoachim · 2020-06-05T11:25:31Z

I unchecked documentation. I think we should have a dedicated page on serial locking. We should also explicitly mention that this is the current way of implementing a chain.

elemoine · 2020-06-05T11:36:18Z

> I unchecked documentation. I think we should have a dedicated page on serial locking. We should also explicitly mention that this is the current way of implementing a chain.

We have https://procrastinate.readthedocs.io/en/stable/howto/locks.html already, and I think it's quite explicit.

tmartinfr · 2020-06-05T13:08:53Z

@elemoine We should remove this section of the doc: https://github.com/peopledoc/procrastinate/blob/master/docs/discussions.rst#the-procrastinate_job_locks-table, and maybe update the sentence "Unavailable tasks, either locked, scheduled for the future or in a queue that the worker doesn't listen to, will be ignored." here: https://github.com/peopledoc/procrastinate/blob/master/docs/discussions.rst#about-locks

elemoine · 2020-06-05T13:21:17Z

> @elemoine We should remove this section of the doc: https://github.com/peopledoc/procrastinate/blob/master/docs/discussions.rst#the-procrastinate_job_locks-table,

👍

> Unavailable tasks, either locked, scheduled for the future or in a queue that the worker doesn't listen to, will be ignored. here: https://github.com/peopledoc/procrastinate/blob/master/docs/discussions.rst#about-locks

I do not see the problem with that statement, but that may be me.

tmartinfr · 2020-06-05T13:35:01Z

> > Unavailable tasks, either locked, scheduled for the future or in a queue that the worker doesn't listen to, will be ignored. here: https://github.com/peopledoc/procrastinate/blob/master/docs/discussions.rst#about-locks
>
> I do not see the problem with that statement, but that may be me.

It's not wrong, but it wasn't very clear to me. I could suggest something like: "When a worker requests a task, it will always receive the oldest available task. If this oldest task is unavailable, either locked, scheduled for the future or in a queue that the worker doesn't listen to, no other task will be selected." (This can be improved, and if you want to keep the existing version, that's fine with me.)

ewjoachim · 2020-06-05T15:28:38Z

docs/discussions.rst

@@ -115,9 +115,7 @@ their identifiers could be used (there's no hard limit on the length of a lock s
 but stay reasonable).

 A task can only take a single lock so there's no dead-lock scenario possible where two
-running tasks are waiting one another. That being said, if a worker dies with a lock, it
-will be up to you to free it. If the task fails but the worker survives though, the
-lock will be freed.


Is this outdated?

We could rephrase it as:

If a worker is killed without ending its job, the following jobs with the same lock will not run until the interrupted job is manually set to either "failed" or "succeeded". If a job simply fails, the following jobs with the same lock may run.
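
The behaviour described in that rephrasing can be sketched in a few lines of Python. This is an illustrative model only: the dict-based jobs, the `is_blocked` helper, the lock name and the status names are assumptions, not procrastinate's API.

```python
def is_blocked(job, jobs):
    """True while an earlier job with the same lock is still 'todo' or 'doing'."""
    return any(
        other["lock"] == job["lock"]
        and other["id"] < job["id"]
        and other["status"] in ("todo", "doing")
        for other in jobs
    )


first = {"id": 1, "lock": "customer-42", "status": "doing"}
second = {"id": 2, "lock": "customer-42", "status": "todo"}
jobs = [first, second]

# The worker running `first` is killed: the job stays in "doing",
# so the next job on the same lock cannot start.
stalled = is_blocked(second, jobs)

# Someone manually marks the interrupted job as failed, which releases the lock.
first["status"] = "failed"
released = is_blocked(second, jobs)

print(stalled, released)  # True False
```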

ewjoachim · 2020-06-06T21:00:53Z

procrastinate/sql/migrations/delta_0.9.0_001_close_fetch_job_race_condition.sql

@@ -0,0 +1,48 @@
+DROP TABLE IF EXISTS procrastinate_job_locks;


We'll be explicit in the changelog that with these migrations, the workers will need to be stopped while the migration runs.
Of course, one can always write better migrations.

CorBott

one small suggestion...

CorBott · 2020-06-08T07:39:49Z

docs/discussions.rst

-running tasks are waiting one another. That being said, if a worker dies with a lock, it
-will be up to you to free it. If the task fails but the worker survives though, the
-lock will be freed.
+running tasks are waiting one another.


Suggested change

-running tasks are waiting one another.
+running tasks are waiting for one another.


ewjoachim · 2020-06-15T17:04:18Z

We wanted to rename the lock "serial lock" but this will change a LOT of code, so it will have its own PR.

ewjoachim mentioned this pull request Jun 5, 2020

WIP Removing the lock table #229

Closed

5 tasks

elemoine force-pushed the ele_fetch-job branch from 5fdc6c3 to 66b3070 Compare June 5, 2020 11:33

elemoine force-pushed the ele_fetch-job branch from 66b3070 to 0c0bc83 Compare June 5, 2020 11:37

elemoine force-pushed the ele_fetch-job branch from 0c0bc83 to 1cbb84c Compare June 5, 2020 12:52

elemoine marked this pull request as ready for review June 5, 2020 12:55

elemoine requested review from DainDwarf, mgu, pmourlanne and sophie-ulti as code owners June 5, 2020 12:55

elemoine requested a review from CorBott as a code owner June 5, 2020 13:22

ewjoachim reviewed Jun 5, 2020

ewjoachim force-pushed the ele_fetch-job branch from e3bb4f1 to 113c0f9 Compare June 6, 2020 20:50

ewjoachim reviewed Jun 6, 2020

CorBott reviewed Jun 8, 2020

ewjoachim added the PR:minor and PR type: bugfix 🕵️ labels and removed the PR:minor label Jun 8, 2020

k4nar reviewed Jun 9, 2020

procrastinate/sql/schema.sql (resolved)

ewjoachim force-pushed the ele_fetch-job branch from 113c0f9 to 319f339 Compare June 15, 2020 16:48

Remove outdated material from the discussions section

6ba2ecd

Rewrite fetch_job query with NOT EXISTS

fac4636

The result is the same, but it makes the query more easily readable.
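
For readers unfamiliar with the pattern, here is a small runnable sketch of a `NOT EXISTS` filter of that kind, using Python's standard `sqlite3` module. The table layout, the column names and the query itself are simplified assumptions for illustration. This is not the query from `schema.sql`, and the PostgreSQL-specific parts (queue and ETA filters, row locking) are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified jobs table.
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, lock_name TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO jobs (id, lock_name, status) VALUES (?, ?, ?)",
    [
        (1, "a", "doing"),
        (2, "a", "todo"),  # blocked by job 1 (same lock, still running)
        (3, "b", "todo"),  # first job on lock "b": eligible
        (4, "b", "todo"),  # blocked by job 3
    ],
)

# Pick the oldest "todo" job for which no earlier unfinished job
# with the same lock exists.
FETCH = """
SELECT candidate.id
  FROM jobs AS candidate
 WHERE candidate.status = 'todo'
   AND NOT EXISTS (
       SELECT 1
         FROM jobs AS earlier
        WHERE earlier.lock_name = candidate.lock_name
          AND earlier.status IN ('todo', 'doing')
          AND earlier.id < candidate.id)
 ORDER BY candidate.id
 LIMIT 1
"""

row = conn.execute(FETCH).fetchone()
print(row)  # (3,)
```

The correlated subquery reads close to the rule in plain words ("there is no earlier unfinished job on this lock"), which is presumably the readability gain the commit message refers to.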

ewjoachim force-pushed the ele_fetch-job branch from 319f339 to fac4636 Compare June 15, 2020 16:57

ewjoachim approved these changes Jun 15, 2020

ewjoachim merged commit 794b564 into master Jun 15, 2020

ewjoachim deleted the ele_fetch-job branch June 15, 2020 17:04

elemoine mentioned this pull request Jun 16, 2020

Rename migration scripts #250

Merged

6 tasks
