Fix a race around SQLite DB config validation #17902
Conversation
@edsantiago @vrothberg PTAL. Haven't done much testing on my system, but I suspect this ought to fix it.
LGTM
Ah no, same fart as I saw yesterday when trying to fix this race.
Yep, seeing the same thing in my #17831.
But now it seems to fail consistently, not only in rootless.
It's an init problem. Super trivial reproducer:

```
$ bin/podman system reset
...
$ bin/podman --db-backend sqlite info
[5-second hang]
Error: creating tables: beginning transaction: database is locked
```
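The 5-second hang lines up with a SQLite busy timeout: a writer that finds the database locked retries until the configured timeout elapses, then gives up with exactly this error. A minimal sketch of how such a timeout is commonly configured with the mattn/go-sqlite3 driver (the helper name, DSN, and 5000 ms value are assumptions for illustration, not Podman's actual connection setup):

```go
import (
	"database/sql"

	_ "github.com/mattn/go-sqlite3" // assumed SQLite driver for this sketch
)

// openState is a hypothetical helper. With _busy_timeout=5000 in the DSN,
// a writer blocked by another connection's lock retries for up to 5 seconds
// before failing with "database is locked" -- matching the hang above.
func openState(path string) (*sql.DB, error) {
	return sql.Open("sqlite3", "file:"+path+"?_busy_timeout=5000")
}
```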
I was validating without adding … This explains many, many things.
I'm not really clear on how this is happening, though. The transaction lock should be preventing anyone else from getting into the DB at the same time.
@vrothberg I can fix the error here by avoiding the transaction, but we need to get to the bottom of why these errors are happening. The transaction lock is evidently not sufficient to avoid database locks.
I'm only half paying attention here, but I wonder if I did a poor job of explaining? This is a problem ONLY AT INIT TIME. It works perfectly fine after sqlite has been used once.

```
$ git checkout main
$ make
...
$ bin/podman system reset
...
$ bin/podman --db-backend sqlite info
...
```

That works. Now that we have sqlite used once, your PR works perfectly fine:

```
$ git checkout pr/17902
$ make
...
$ bin/podman --db-backend sqlite whatever.....
```

It all works now, because the "main" init worked. But if we use your PR immediately after … Is that a better way to explain things? Or have I missed something?
@edsantiago That is fully expected. What I'm saying is that the transaction lock was supposed to prevent this entirely. If coding things in a way sqlite does not like can give us random CI flakes from concurrent DB access, that is not a good position to be in. We need to understand why these locks are happening.
It's …
Shouldn't be. The error we're seeing is actually coming out of DB init, which should be running before validation happens. They're in the same thread (the same function, even), so it should be strictly serial.
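One subtlety that can break the "strictly serial" intuition: Go's database/sql pools connections, so two statements issued back to back from one goroutine may run on two different SQLite connections, and a write transaction that is never committed keeps holding the lock on its pinned connection. A self-contained sketch of that failure mode (generic demo code under assumed names, not Podman's):

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/mattn/go-sqlite3" // assumed driver for this demo
)

func main() {
	// Two handles simulate two pooled connections to the same database file.
	db1, err := sql.Open("sqlite3", "file:demo.db?_busy_timeout=1000")
	if err != nil {
		panic(err)
	}
	db2, err := sql.Open("sqlite3", "file:demo.db?_busy_timeout=1000")
	if err != nil {
		panic(err)
	}

	if _, err := db1.Exec("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY)"); err != nil {
		panic(err)
	}

	// Begin a write transaction and "forget" to commit it, mirroring the bug.
	tx, err := db1.Begin()
	if err != nil {
		panic(err)
	}
	if _, err := tx.Exec("INSERT INTO t VALUES (1)"); err != nil {
		panic(err)
	}

	// Strictly serial, same goroutine -- but a different connection, so this
	// write waits out the busy timeout and then fails: "database is locked".
	if _, err := db2.Exec("INSERT INTO t VALUES (2)"); err != nil {
		fmt.Println("second writer:", err)
	}

	_ = tx.Commit() // committing releases the lock; done earlier, no error occurs
}
```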
This seems to fix it:

```diff
diff --git a/libpod/sqlite_state.go b/libpod/sqlite_state.go
index 45d2db588..71066b78b 100644
--- a/libpod/sqlite_state.go
+++ b/libpod/sqlite_state.go
@@ -363,6 +363,7 @@ func (s *SQLiteState) ValidateDBConfig(runtime *Runtime) (defErr error) {
 		return fmt.Errorf("adding DB config row: %w", err)
 	}
 
+	tx.Commit()
 	return nil
 }
```
[EDIT: with …]
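The bare tx.Commit() above discards the commit error, which is presumably what the truncated EDIT refers to. A sketch of the error-checked form, styled after the surrounding function's errors (the message text is my assumption, not necessarily the PR's final wording):

```go
// Check the commit error instead of dropping it (sketch; the error
// message is an assumption styled after the function's other errors).
if err := tx.Commit(); err != nil {
	return fmt.Errorf("committing DB config: %w", err)
}
return nil
```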
With … Two new flakes, too: #17904 and #17905. And it seems like the debian auto-update flake (#17607) happens a lot more with sqlite; does that make any sense? I'll keep forcing CI runs to see what else we find.
Damn it, that makes perfect sense. Good catch. |
The DB config is a single-row table, and the first Podman process to run against the database creates it. However, there was a race where multiple Podman processes, started simultaneously, could try to write it. Only the first would succeed; subsequent processes would fail once (and then run correctly when re-run). Still, it was happening often in CI and deserves fixing.

[NO NEW TESTS NEEDED] It's a CI flake fix.

Signed-off-by: Matt Heon <[email protected]>

Force-pushed from 422cdea to e061cb9 (Compare)
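As an aside: for a single-row, first-writer-wins table like this, another common way to defuse the race is to make the row creation idempotent, so concurrent first runs cannot collide. A hedged sketch (hypothetical table and column names; not necessarily what this PR does):

```go
// INSERT OR IGNORE turns the one-time row creation into a no-op when
// another process got there first; DBConfig/ID/SchemaVersion are
// hypothetical names for illustration.
if _, err := tx.Exec(
	"INSERT OR IGNORE INTO DBConfig (ID, SchemaVersion) VALUES (1, ?)",
	schemaVersion,
); err != nil {
	return fmt.Errorf("adding DB config row: %w", err)
}
```

Either way, the transaction still has to be committed, or the lock problem above remains.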
One run with no lock/unique flakes. The next one failed in f37 root container (yes, container). I don't remember how things work in containerized testing, but this caught me by surprise. Almost bedtime. I'll start one more job but won't look at results until the morrow.
LGTM
@giuseppe PTAL
None of these strings appear in the downloaded logs:
...but please do not ignore the "database is locked" errors I'm reporting.
We're on it 👍
No critical flakes in the run I started this morning (just the "c1 c2" flake, which is being addressed). I've started a new test run and will check in on it later this afternoon.
Another successful run. LGTM. Thanks everyone.
Thanks a ton for checking, @edsantiago!
LGTM
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: flouthoc, mheon

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.