-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
race conditions on blocked state in db #8541
Comments
Overall, I'm very much in favor of option 1 for now, to give us time for a proper solution. Option 2.2 might also be acceptable in the short term. |
Conclusion: for now we stop writing the transient state. I will leave todo notes linked to this ticket ( |
, PR #8560) # Description Turns out we had some races on the persisting of the transient state. We decided that we'd leave them out of the database for now to prevent these races. This should have no adverse effects, except that the field will never be reported to the user. At initialization time, we'll simply read `NO` for all previously transiently blocked resources, causing us to deploy them once more (which is fine, if not optimal). first part of #8541 # Self Check: Strike through any lines that are not applicable (`~~line~~`) then check the box - [x] Attached issue to pull request - [x] Changelog entry - [x] Type annotations are present - [x] Code is clear and sufficiently documented - [x] No (preventable) type errors (check using make mypy or make mypy-diff) - [x] ~~Sufficient test cases (reproduces the bug/tests the requested feature)~~ - [x] Correct, in line with design - [x] ~~End user documentation is included or an issue is created for end-user documentation (add ref to issue here: )~~ - [x] ~~If this PR fixes a race condition in the test suite, also push the fix to the relevant stable branche(s) (see [test-fixes](https://internal.inmanta.com/development/core/tasks/build-master.html#test-fixes) for more info)~~
…8585) # Description This PR addresses a concurrency issue that I forgot about in #8560: the transaction must not wrap the scheduler lock, because then parts that have been written explicitly under the lock, would only really be persisted when the transaction commits later, defeating the purpose of writing it under the lock in the first place. Now that (since #8560) we no longer write the transient state, we can afford to simply write all state outside of the lock (which was part of the motivation for #8560). Since the diff is quite large for such a conceptually small change, I'll make sure to mark all effective changes with a comment on the diff. You can safely assume all else has remained unchanged, apart from indentation. part of #8541 (first stage) # Self Check: Strike through any lines that are not applicable (`~~line~~`) then check the box - [ ] Attached issue to pull request - [ ] Changelog entry - [ ] Type annotations are present - [ ] Code is clear and sufficiently documented - [ ] No (preventable) type errors (check using make mypy or make mypy-diff) - [ ] Sufficient test cases (reproduces the bug/tests the requested feature) - [ ] Correct, in line with design - [ ] End user documentation is included or an issue is created for end-user documentation (add ref to issue here: ) - [ ] If this PR fixes a race condition in the test suite, also push the fix to the relevant stable branche(s) (see [test-fixes](https://internal.inmanta.com/development/core/tasks/build-master.html#test-fixes) for more info)
There are a few issues with the current implementation of the persisting of the blocked status of a resource. The following checklist presents these. There are multiple approaches we could take, which may end up with vastly different results. But whichever we choose, we must make sure that these checkboxes' conditions are met.
The three paths that write blocked state are
send_deploy_done()
)Option 1: leave transient out of the database for now
This will only work as a temporary solution, but it could buy us some time for the proper fix. In iso 8.0 the new states wouldn't be exposed over the API yet, so it won't matter if it's missing one element. If we were to only persist YES/NO, then we wouldn't need to write it under the scheduler lock at all. At scheduler initialization, we then simply start fresh (no `TRANSIENT).
We would then need to keep this ticket open or to create a follow-up one with a proper solution, because if
TRANSIENT
lives only in memory, we can't efficiently filter on it.Option 2: write everything, keep it race free
Option 2.1: split blocked-for-undefined and blocked-for-dependencies
Make this two separate columns in the database. The new version flow (1) would then only ever write to the first, and it has authority to simply write YES/NO. The other two flows would then only ever write to the second. Races between flow 2 and 3 would still need to be prevented somehow.
Option 2.2: put all the writes under the scheduler lock
This would require some massaging of the code, but that part is pretty doable. The part that worries me is that we'd have to keep the scheduler lock while doing IO, blocking all further work in the meantime. I'm not a big fan of this option.
Option 2.3: use a dedicated Python lock
Introduce a new lock specifically for these writes, so that we don't need to hold the scheduler lock, but a dedicated one that blocks only these writes rather than the whole operation of the scheduler. This lock should never be acquired under the scheduler lock (which would defeat the purpose), but that might not be feasible (could lead to out-of-order writes?).
The text was updated successfully, but these errors were encountered: