Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

race conditions on blocked state in db #8541

Open
4 tasks
sanderr opened this issue Dec 31, 2024 · 2 comments
Open
4 tasks

race conditions on blocked state in db #8541

sanderr opened this issue Dec 31, 2024 · 2 comments
Labels
bug Something isn't working discussion There is a discussion happening on this issue

Comments

@sanderr
Copy link
Contributor

sanderr commented Dec 31, 2024

There are a few issues with the current implementation of the persisting of the blocked status of a resource. The following checklist presents these. There are multiple approaches we could take, which may end up with vastly different results. But whichever we choose, we must make sure that these checkboxes' conditions are met.

  • when a recovery event unblocks (TRANSIENT -> NO) a dependent, it should be persisted in the database. The current implementation only persists the resource state for the resource that finished its deploy.
  • the three paths that write to the blocked state should all be race-free, either by writing to different columns, or through proper locking
  • A transaction should not interleave or wrap the scheduler lock. Interleaving is obvious, wrapping could be bad because then anything that is written explicitly under the lock, isn't committed until after the lock is released, hence there's very little point to writing it under the lock.
  • If we keep the transient state in the database, path 1 (see below) should also write any changes it makes from TRANSIENT -> NO. This is not currently the case.

The three paths that write blocked state are

  1. new version: may set it from any state to NO or to YES
  2. deploy done for resource that finished: may set it from NO to TRANSIENT (send_deploy_done())
  3. deploy done for dependent resources: may set it from TRANSIENT to NO

Option 1: leave transient out of the database for now

This will only work as a temporary solution, but it could buy us some time for the proper fix. In iso 8.0 the new states wouldn't be exposed over the API yet, so it won't matter if it's missing one element. If we were to only persist YES/NO, then we wouldn't need to write it under the scheduler lock at all. At scheduler initialization, we then simply start fresh (no `TRANSIENT).

We would then need to keep this ticket open or to create a follow-up one with a proper solution, because if TRANSIENT lives only in memory, we can't efficiently filter on it.

Option 2: write everything, keep it race free

Option 2.1: split blocked-for-undefined and blocked-for-dependencies

Make this two separate columns in the database. The new version flow (1) would then only ever write to the first, and it has authority to simply write YES/NO. The other two flows would then only ever write to the second. Races between flow 2 and 3 would still need to be prevented somehow.

Option 2.2: put all the writes under the scheduler lock

This would require some massaging of the code, but that part is pretty doable. The part that worries me is that we'd have to keep the scheduler lock while doing IO, blocking all further work in the meantime. I'm not a big fan of this option.

Option 2.3: use a dedicated Python lock

Introduce a new lock specifically for these writes, so that we don't need to hold the scheduler lock, but a dedicated one that blocks only these writes rather than the whole operation of the scheduler. This lock should never be acquired under the scheduler lock (which would defeat the purpose), but that might not be feasible (could lead to out-of-order writes?).

@sanderr sanderr added discussion There is a discussion happening on this issue bug Something isn't working labels Dec 31, 2024
@sanderr
Copy link
Contributor Author

sanderr commented Dec 31, 2024

Overall, I'm very much in favor of option 1 for now, to give us time for a proper solution. Option 2.2 might also be acceptable in the short term.

@sanderr sanderr self-assigned this Jan 6, 2025
@sanderr
Copy link
Contributor Author

sanderr commented Jan 6, 2025

Conclusion: for now we stop writing the transient state. I will leave todo notes linked to this ticket (TODO[#8541]) in the locations where they should be written. Then, for 8.1 we give this problem proper thought: can we solve it cleanly? Is the transient status sufficiently important to go through the trouble? ...

@sanderr sanderr removed their assignment Jan 6, 2025
inmantaci pushed a commit that referenced this issue Jan 8, 2025
, PR #8560)

# Description

Turns out we had some races on the persisting of the transient state. We decided that we'd leave them out of the database for now to prevent these races. This should have no adverse effects, except that the field will never be reported to the user. At initialization time, we'll simply read `NO` for all previously transiently blocked resources, causing us to deploy them once more (which is fine, if not optimal).

first part of #8541

# Self Check:

Strike through any lines that are not applicable (`~~line~~`) then check the box

- [x] Attached issue to pull request
- [x] Changelog entry
- [x] Type annotations are present
- [x] Code is clear and sufficiently documented
- [x] No (preventable) type errors (check using make mypy or make mypy-diff)
- [x] ~~Sufficient test cases (reproduces the bug/tests the requested feature)~~
- [x] Correct, in line with design
- [x] ~~End user documentation is included or an issue is created for end-user documentation (add ref to issue here: )~~
- [x] ~~If this PR fixes a race condition in the test suite, also push the fix to the relevant stable branche(s) (see [test-fixes](https://internal.inmanta.com/development/core/tasks/build-master.html#test-fixes) for more info)~~
inmantaci pushed a commit that referenced this issue Jan 9, 2025
…8585)

# Description

This PR addresses a concurrency issue that I forgot about in #8560: the transaction must not wrap the scheduler lock, because then parts that have been written explicitly under the lock, would only really be persisted when the transaction commits later, defeating the purpose of writing it under the lock in the first place.

Now that (since #8560) we no longer write the transient state, we can afford to simply write all state outside of the lock (which was part of the motivation for #8560).

Since the diff is quite large for such a conceptually small change, I'll make sure to mark all effective changes with a comment on the diff. You can safely assume all else has remained unchanged, apart from indentation.

part of #8541 (first stage)

# Self Check:

Strike through any lines that are not applicable (`~~line~~`) then check the box

- [ ] Attached issue to pull request
- [ ] Changelog entry
- [ ] Type annotations are present
- [ ] Code is clear and sufficiently documented
- [ ] No (preventable) type errors (check using make mypy or make mypy-diff)
- [ ] Sufficient test cases (reproduces the bug/tests the requested feature)
- [ ] Correct, in line with design
- [ ] End user documentation is included or an issue is created for end-user documentation (add ref to issue here: )
- [ ] If this PR fixes a race condition in the test suite, also push the fix to the relevant stable branche(s) (see [test-fixes](https://internal.inmanta.com/development/core/tasks/build-master.html#test-fixes) for more info)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working discussion There is a discussion happening on this issue
Projects
None yet
Development

No branches or pull requests

1 participant