Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a race around SQLite DB config validation #17902

Merged

Conversation

mheon
Copy link
Member

@mheon mheon commented Mar 23, 2023

The DB config is a single-row table, and the first Podman process to run against the database creates it. However, there was a race where multiple Podman processes, started simultaneously, could try and write it. Only the first would succeed, with subsequent processes failing once (and then running correctly once re-ran), but it was happening often in CI and deserves fixing.

[NO NEW TESTS NEEDED] It's a CI flake fix.

NONE

@openshift-ci openshift-ci bot added release-note-none approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 23, 2023
@mheon
Copy link
Member Author

mheon commented Mar 23, 2023

@edsantiago @vrothberg PTAL. Haven't done much testing on my system but I suspect this ought to fix it.

Copy link
Member

@vrothberg vrothberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

libpod/sqlite_state.go Show resolved Hide resolved
@vrothberg
Copy link
Member

[+0150s] Error: beginning refresh transaction: database is locked

Ah no, same fart as I saw yesterday when trying to fix this race.

@edsantiago
Copy link
Member

Yep, seeing the same thing in my #17831.

@vrothberg
Copy link
Member

But now it seems to fail consistently, not only in rootless.

@edsantiago
Copy link
Member

edsantiago commented Mar 23, 2023

It's an init problem. Super trivial reproducer:

$ bin/podman system reset
...
$ bin/podman --db-backend sqlite info
[5-second hang]
Error: creating tables: beginning transaction: database is locked

@mheon
Copy link
Member Author

mheon commented Mar 23, 2023

I was validating without adding --db-backend=sqlite

This explains many, many things

@mheon
Copy link
Member Author

mheon commented Mar 23, 2023

I'm not really clear on how this is happening, though. The transaction lock should be preventing anyone from getting in the DB at the same time.

@mheon
Copy link
Member Author

mheon commented Mar 23, 2023

@vrothberg I can fix the error here by avoiding the transaction, but we need to get to the bottom of why these are happening. The transaction lock is not sufficient to avoid database locks, evidently.

@edsantiago
Copy link
Member

I'm only half paying attention here, but I wonder if I did a poor job of explaining? This is a problem ONLY AT INIT TIME. It works perfectly fine after sqlite has been used once.

$ git checkout main
$ make
...
$ bin/podman system reset
...
$ bin/podman --db-backend sqlite info
...

That works. Now that we have sqlite used once, your PR works perfectly fine:

$ git checkout pr/17902
$ make
...
$ bin/podman --db-backend sqlite whatever..... it all works now, because the "main" init worked

But if we use your PR immediately after podman system reset, when there is no sqlite DB, then it dies.

Is that a better way to explain things? Or have I missed something?

@mheon
Copy link
Member Author

mheon commented Mar 23, 2023

@edsantiago That is fully expected. What I'm saying is that the transaction lock was supposed to prevent this entirely. If we are in a position where, if we code things in a way sqlite does not like, we can get random CI flakes because of concurrent DB access, that is not a good position. We need to understand why these are happening.

@edsantiago
Copy link
Member

It's ValidateDBConfig(). Is it possible for it to get invoked while the transaction is open?

@mheon
Copy link
Member Author

mheon commented Mar 23, 2023

Shouldn't be. The error we're seeing is actually coming out of DB init, which should be running before validation happens. They're in the same thread (the same function, even) so it should be strictly serial.

@edsantiago
Copy link
Member

edsantiago commented Mar 23, 2023

This seems to fix it

diff --git a/libpod/sqlite_state.go b/libpod/sqlite_state.go
index 45d2db588..71066b78b 100644
--- a/libpod/sqlite_state.go
+++ b/libpod/sqlite_state.go
@@ -363,6 +363,7 @@ func (s *SQLiteState) ValidateDBConfig(runtime *Runtime) (defErr error) {
 				return fmt.Errorf("adding DB config row: %w", err)
 			}
 
+			tx.Commit()
 			return nil
 		}
 

[EDIT: with err checking, yada yada]

@edsantiago
Copy link
Member

With tx.Commit(), my PR ran CI... and, yay, no UNIQUE constraint failures, but, sigh, one "database is locked" in f36 rootless. Sorry.

Two new flakes, too: #17904 and #17905. And, it seems like the debian auto-update flake (#17607) happens a lot more with sqlite, does that make any sense?

I'll keep forcing CI runs to see what else we find.

@mheon
Copy link
Member Author

mheon commented Mar 23, 2023

Damn it, that makes perfect sense. Good catch.

The DB config is a single-row table, and the first Podman process
to run against the database creates it. However, there was a race
where multiple Podman processes, started simultaneously, could
try and write it. Only the first would succeed, with subsequent
processes failing once (and then running correctly once re-ran),
but it was happening often in CI and deserves fixing.

[NO NEW TESTS NEEDED] It's a CI flake fix.

Signed-off-by: Matt Heon <[email protected]>
@mheon mheon force-pushed the fix_sqlite_validate_unique branch from 422cdea to e061cb9 Compare March 23, 2023 23:48
@edsantiago
Copy link
Member

One run with no lock/unique flakes. The next one failed in f37 root container, yes, container. I don't remember how things work in containerized testing, but this caught me by surprise. Almost bedtime. I'll start one more job but won't look at results until the morrow.

Copy link
Member

@vrothberg vrothberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
@giuseppe PTAL

@edsantiago
Copy link
Member

None of these strings appear in the downloaded logs:

is locked|unique constraint|constraint fail|config row|adding db|dbconfig.id

...but please do not ignore the "database is locked" errors I'm reporting.

@vrothberg
Copy link
Member

...but please do not ignore the "database is locked" errors I'm reporting.

We're on it 👍

@edsantiago
Copy link
Member

No critical flakes in the run I started this morning (just the "c1 c2" flake, which is being addressed). I've started a new test run, will check in on it later this afternoon.

@edsantiago
Copy link
Member

Another successful run. LGTM. Thanks everyone.

@vrothberg
Copy link
Member

Thanks a ton for checking, @edsantiago !

@flouthoc @giuseppe ready to merge

Copy link
Collaborator

@flouthoc flouthoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 27, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: flouthoc, mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 30619bb into containers:main Mar 27, 2023
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 4, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. release-note-none
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants