race conditions on blocked state in db #8541

sanderr · 2024-12-31T11:07:17Z

There are a few issues with the current implementation of the persisting of the blocked status of a resource. The following checklist presents these. There are multiple approaches we could take, which may end up with vastly different results. But whichever we choose, we must make sure that these checkboxes' conditions are met.

when a recovery event unblocks (TRANSIENT -> NO) a dependent, it should be persisted in the database. The current implementation only persists the resource state for the resource that finished its deploy.
the three paths that write to the blocked state should all be race-free, either by writing to different columns, or through proper locking
A transaction should not interleave or wrap the scheduler lock. Interleaving is obvious, wrapping could be bad because then anything that is written explicitly under the lock, isn't committed until after the lock is released, hence there's very little point to writing it under the lock.
If we keep the transient state in the database, path 1 (see below) should also write any changes it makes from TRANSIENT -> NO. This is not currently the case.

The three paths that write blocked state are

new version: may set it from any state to NO or to YES
deploy done for resource that finished: may set it from NO to TRANSIENT (send_deploy_done())
deploy done for dependent resources: may set it from TRANSIENT to NO

Option 1: leave transient out of the database for now

This will only work as a temporary solution, but it could buy us some time for the proper fix. In iso 8.0 the new states wouldn't be exposed over the API yet, so it won't matter if it's missing one element. If we were to only persist YES/NO, then we wouldn't need to write it under the scheduler lock at all. At scheduler initialization, we then simply start fresh (no `TRANSIENT).

We would then need to keep this ticket open or to create a follow-up one with a proper solution, because if TRANSIENT lives only in memory, we can't efficiently filter on it.

Option 2: write everything, keep it race free

Option 2.1: split blocked-for-undefined and blocked-for-dependencies

Make this two separate columns in the database. The new version flow (1) would then only ever write to the first, and it has authority to simply write YES/NO. The other two flows would then only ever write to the second. Races between flow 2 and 3 would still need to be prevented somehow.

Option 2.2: put all the writes under the scheduler lock

This would require some massaging of the code, but that part is pretty doable. The part that worries me is that we'd have to keep the scheduler lock while doing IO, blocking all further work in the meantime. I'm not a big fan of this option.

Option 2.3: use a dedicated Python lock

Introduce a new lock specifically for these writes, so that we don't need to hold the scheduler lock, but a dedicated one that blocks only these writes rather than the whole operation of the scheduler. This lock should never be acquired under the scheduler lock (which would defeat the purpose), but that might not be feasible (could lead to out-of-order writes?).

The text was updated successfully, but these errors were encountered:

sanderr · 2024-12-31T11:58:02Z

Overall, I'm very much in favor of option 1 for now, to give us time for a proper solution. Option 2.2 might also be acceptable in the short term.

sanderr · 2025-01-06T14:19:42Z

Conclusion: for now we stop writing the transient state. I will leave todo notes linked to this ticket (TODO[#8541]) in the locations where they should be written. Then, for 8.1 we give this problem proper thought: can we solve it cleanly? Is the transient status sufficiently important to go through the trouble? ...

, PR #8560) # Description Turns out we had some races on the persisting of the transient state. We decided that we'd leave them out of the database for now to prevent these races. This should have no adverse effects, except that the field will never be reported to the user. At initialization time, we'll simply read `NO` for all previously transiently blocked resources, causing us to deploy them once more (which is fine, if not optimal). first part of #8541 # Self Check: Strike through any lines that are not applicable (`~~line~~`) then check the box - [x] Attached issue to pull request - [x] Changelog entry - [x] Type annotations are present - [x] Code is clear and sufficiently documented - [x] No (preventable) type errors (check using make mypy or make mypy-diff) - [x] ~~Sufficient test cases (reproduces the bug/tests the requested feature)~~ - [x] Correct, in line with design - [x] ~~End user documentation is included or an issue is created for end-user documentation (add ref to issue here: )~~ - [x] ~~If this PR fixes a race condition in the test suite, also push the fix to the relevant stable branche(s) (see [test-fixes](https://internal.inmanta.com/development/core/tasks/build-master.html#test-fixes) for more info)~~

…8585) # Description This PR addresses a concurrency issue that I forgot about in #8560: the transaction must not wrap the scheduler lock, because then parts that have been written explicitly under the lock, would only really be persisted when the transaction commits later, defeating the purpose of writing it under the lock in the first place. Now that (since #8560) we no longer write the transient state, we can afford to simply write all state outside of the lock (which was part of the motivation for #8560). Since the diff is quite large for such a conceptually small change, I'll make sure to mark all effective changes with a comment on the diff. You can safely assume all else has remained unchanged, apart from indentation. part of #8541 (first stage) # Self Check: Strike through any lines that are not applicable (`~~line~~`) then check the box - [ ] Attached issue to pull request - [ ] Changelog entry - [ ] Type annotations are present - [ ] Code is clear and sufficiently documented - [ ] No (preventable) type errors (check using make mypy or make mypy-diff) - [ ] Sufficient test cases (reproduces the bug/tests the requested feature) - [ ] Correct, in line with design - [ ] End user documentation is included or an issue is created for end-user documentation (add ref to issue here: ) - [ ] If this PR fixes a race condition in the test suite, also push the fix to the relevant stable branche(s) (see [test-fixes](https://internal.inmanta.com/development/core/tasks/build-master.html#test-fixes) for more info)

sanderr added discussion There is a discussion happening on this issue bug Something isn't working labels Dec 31, 2024

sanderr self-assigned this Jan 6, 2025

sanderr mentioned this issue Jan 6, 2025

stop persisting transient state in database #8560

Closed

9 tasks

sanderr removed their assignment Jan 6, 2025

sanderr mentioned this issue Jan 8, 2025

make sure transaction does not wrap scheduler lock #8585

Closed

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

race conditions on blocked state in db #8541

race conditions on blocked state in db #8541

sanderr commented Dec 31, 2024 •

edited

Loading

sanderr commented Dec 31, 2024 •

edited

Loading

sanderr commented Jan 6, 2025 •

edited

Loading

race conditions on blocked state in db #8541

race conditions on blocked state in db #8541

Comments

sanderr commented Dec 31, 2024 • edited Loading

Option 1: leave transient out of the database for now

Option 2: write everything, keep it race free

Option 2.1: split blocked-for-undefined and blocked-for-dependencies

Option 2.2: put all the writes under the scheduler lock

Option 2.3: use a dedicated Python lock

sanderr commented Dec 31, 2024 • edited Loading

sanderr commented Jan 6, 2025 • edited Loading

sanderr commented Dec 31, 2024 •

edited

Loading

sanderr commented Dec 31, 2024 •

edited

Loading

sanderr commented Jan 6, 2025 •

edited

Loading