Fix a race around SQLite DB config validation #17902

mheon · 2023-03-23T14:50:43Z

The DB config is a single-row table, and the first Podman process to run against the database creates it. However, there was a race where multiple Podman processes, started simultaneously, could all try to write it. Only the first would succeed; subsequent processes would fail once and then run correctly when rerun. It was happening often in CI and deserves fixing.
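To make the race concrete, here is a minimal sketch that models it in memory. It is not Podman's code and does not use SQLite: `configTable`, `errUnique`, and `countFailures` are hypothetical names, and a mutex-guarded flag stands in for the single-row table. In the racy mode a barrier forces every worker to check for the row before any worker writes it, which is the interleaving described above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errUnique stands in for a UNIQUE constraint error (hypothetical).
var errUnique = errors.New("UNIQUE constraint failed")

// configTable is a toy, in-memory stand-in for a single-row config table.
type configTable struct {
	mu     sync.Mutex
	hasRow bool
}

// exists reports whether the single row has been written.
func (t *configTable) exists() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.hasRow
}

// insert writes the row and fails if it is already there.
func (t *configTable) insert() error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.hasRow {
		return errUnique
	}
	t.hasRow = true
	return nil
}

// insertIfMissing checks and writes while holding the lock the whole time.
func (t *configTable) insertIfMissing() error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if !t.hasRow {
		t.hasRow = true
	}
	return nil
}

// countFailures runs n workers against a fresh table and returns how many
// of them got an error. With singleStep false, all workers check for the
// row before any of them inserts it (the worst-case interleaving).
func countFailures(n int, singleStep bool) int {
	table := &configTable{}
	var checked, done sync.WaitGroup
	var mu sync.Mutex
	failures := 0
	checked.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			var err error
			if singleStep {
				err = table.insertIfMissing()
			} else {
				missing := !table.exists()
				checked.Done()
				checked.Wait() // barrier: every worker has finished its check
				if missing {
					err = table.insert()
				}
			}
			if err != nil {
				mu.Lock()
				failures++
				mu.Unlock()
			}
		}()
	}
	done.Wait()
	return failures
}

func main() {
	fmt.Println("check-then-insert failures:", countFailures(3, false))
	fmt.Println("single-step failures:", countFailures(3, true))
}
```

With the check and the write as separate steps, every worker but one fails. When both happen under one lock, none do. Whether the real fix takes this shape is not something the sketch claims.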

[NO NEW TESTS NEEDED] It's a CI flake fix.

NONE

mheon · 2023-03-23T14:51:13Z

@edsantiago @vrothberg PTAL. Haven't done much testing on my system but I suspect this ought to fix it.

vrothberg

LGTM

libpod/sqlite_state.go

vrothberg · 2023-03-23T15:19:45Z

[+0150s] Error: beginning refresh transaction: database is locked

Ah no, same fart as I saw yesterday when trying to fix this race.

edsantiago · 2023-03-23T15:21:18Z

Yep, seeing the same thing in my #17831.

vrothberg · 2023-03-23T15:22:49Z

But now it seems to fail consistently, not only in rootless.

edsantiago · 2023-03-23T16:01:03Z

It's an init problem. Super trivial reproducer:

$ bin/podman system reset
...
$ bin/podman --db-backend sqlite info
[5-second hang]
Error: creating tables: beginning transaction: database is locked

mheon · 2023-03-23T16:05:04Z

I was validating without adding --db-backend=sqlite

This explains many, many things

mheon · 2023-03-23T16:06:05Z

I'm not really clear on how this is happening, though. The transaction lock should be preventing anyone from getting in the DB at the same time.

mheon · 2023-03-23T17:18:02Z

@vrothberg I can fix the error here by avoiding the transaction, but we need to get to the bottom of why these are happening. The transaction lock is not sufficient to avoid database locks, evidently.

edsantiago · 2023-03-23T17:23:39Z

I'm only half paying attention here, but I wonder if I did a poor job of explaining? This is a problem ONLY AT INIT TIME. It works perfectly fine after sqlite has been used once.

$ git checkout main
$ make
...
$ bin/podman system reset
...
$ bin/podman --db-backend sqlite info
...

That works. Now that we have sqlite used once, your PR works perfectly fine:

$ git checkout pr/17902
$ make
...
$ bin/podman --db-backend sqlite whatever
... it all works now, because the "main" init worked

But if we use your PR immediately after podman system reset, when there is no sqlite DB, then it dies.

Is that a better way to explain things? Or have I missed something?

mheon · 2023-03-23T17:28:24Z

@edsantiago That is fully expected. What I'm saying is that the transaction lock was supposed to prevent this entirely. If we are in a position where, if we code things in a way sqlite does not like, we can get random CI flakes because of concurrent DB access, that is not a good position. We need to understand why these are happening.

edsantiago · 2023-03-23T17:58:05Z

It's ValidateDBConfig(). Is it possible for it to get invoked while the transaction is open?

mheon · 2023-03-23T18:10:29Z

Shouldn't be. The error we're seeing is actually coming out of DB init, which should be running before validation happens. They're in the same thread (the same function, even) so it should be strictly serial.

edsantiago · 2023-03-23T19:19:33Z

This seems to fix it

diff --git a/libpod/sqlite_state.go b/libpod/sqlite_state.go
index 45d2db588..71066b78b 100644
--- a/libpod/sqlite_state.go
+++ b/libpod/sqlite_state.go
@@ -363,6 +363,7 @@ func (s *SQLiteState) ValidateDBConfig(runtime *Runtime) (defErr error) {
 				return fmt.Errorf("adding DB config row: %w", err)
 			}
 
+			tx.Commit()
 			return nil
 		}

[EDIT: with err checking, yada yada]

edsantiago · 2023-03-23T21:28:43Z

With tx.Commit(), my PR ran CI... and, yay, no UNIQUE constraint failures, but, sigh, one "database is locked" in f36 rootless. Sorry.

Two new flakes, too: #17904 and #17905. Also, the debian auto-update flake (#17607) seems to happen a lot more with sqlite. Does that make any sense?

I'll keep forcing CI runs to see what else we find.

mheon · 2023-03-23T23:46:34Z

Damn it, that makes perfect sense. Good catch.

The DB config is a single-row table, and the first Podman process to run against the database creates it. However, there was a race where multiple Podman processes, started simultaneously, could all try to write it. Only the first would succeed; subsequent processes would fail once and then run correctly when rerun. It was happening often in CI and deserves fixing.

[NO NEW TESTS NEEDED] It's a CI flake fix.

Signed-off-by: Matt Heon <[email protected]>

edsantiago · 2023-03-24T01:27:37Z

One run with no lock/unique flakes. The next one failed in f37 root container, yes, container. I don't remember how things work in containerized testing, but this caught me by surprise. Almost bedtime. I'll start one more job but won't look at results until the morrow.

vrothberg

LGTM
@giuseppe PTAL

edsantiago · 2023-03-24T12:44:20Z

None of these strings appear in the downloaded logs:

is locked|unique constraint|constraint fail|config row|adding db|dbconfig.id

...but please do not ignore the "database is locked" errors I'm reporting.

vrothberg · 2023-03-24T15:02:05Z

> ...but please do not ignore the "database is locked" errors I'm reporting.

We're on it 👍

edsantiago · 2023-03-24T18:30:54Z

No critical flakes in the run I started this morning (just the "c1 c2" flake, which is being addressed). I've started a new test run, will check in on it later this afternoon.

edsantiago · 2023-03-24T20:05:53Z

Another successful run. LGTM. Thanks everyone.

vrothberg · 2023-03-27T07:27:55Z

Thanks a ton for checking, @edsantiago !

@flouthoc @giuseppe ready to merge

flouthoc

LGTM
/lgtm
/approve

openshift-ci · 2023-03-27T09:07:08Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: flouthoc, mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [flouthoc,mheon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the release-note-none and approved labels Mar 23, 2023

vrothberg reviewed Mar 23, 2023

libpod/sqlite_state.go

mheon force-pushed the fix_sqlite_validate_unique branch from 422cdea to e061cb9 Compare March 23, 2023 23:48

vrothberg reviewed Mar 24, 2023

flouthoc approved these changes Mar 27, 2023

openshift-ci bot assigned flouthoc Mar 27, 2023

openshift-ci bot added the lgtm label Mar 27, 2023

openshift-merge-robot merged commit 30619bb into containers:main Mar 27, 2023

edsantiago mentioned this pull request Mar 28, 2023

sqlite: Error: adding DB config row: UNIQUE constraint failed: DBConfig.ID #17858

Closed

github-actions bot added the "locked - please file new issue/PR" label Sep 4, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 4, 2023
