-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sqlite fixes #17874
Sqlite fixes #17874
Conversation
It now can safely run on bare databases, before any tables are created. Signed-off-by: Matthew Heon <[email protected]>
I was searching the SQLite docs for a fix, but apparently that was the wrong place; it's a common enough error with the Go frontend for SQLite that the fix is prominently listed in the API docs for go-sqlite3. Setting cache mode to 'shared' and using a maximum of 1 simultaneous open connection should fix. Performance implications of this are unclear, but cache=shared sounds like it will be a benefit, not a curse. [NO NEW TESTS NEEDED] This fixes a flake with concurrent DB access. Signed-off-by: Matthew Heon <[email protected]>
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mheon The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@edsantiago This should get rid of the database locked flake, but I haven't been able to reproduce locally so I can't be 100% on that. |
libpod/sqlite_state.go
Outdated
@@ -45,7 +45,7 @@ func NewSqliteState(runtime *Runtime) (_ State, defErr error) { | |||
return nil, fmt.Errorf("creating root directory: %w", err) | |||
} | |||
|
|||
conn, err := sql.Open("sqlite3", filepath.Join(basePath, "db.sql?_loc=auto")) | |||
conn, err := sql.Open("sqlite3", filepath.Join(basePath, "db.sql?_loc=auto&cache=shared")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to conflict with #17867
libpod/sqlite_state.go
Outdated
@@ -57,6 +57,9 @@ func NewSqliteState(runtime *Runtime) (_ State, defErr error) { | |||
} | |||
}() | |||
|
|||
// Necessary to avoid database locked errors. | |||
conn.SetMaxOpenConns(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that effectively excluding all other processes from opening a connection?
FWIW, the string "UNIQUE" (case-sensitive, caps) does not appear in the CI logs, nor do any of the following case-insensitive strings:
Timings, with a reminder that this is not equivalent to serious performance benchmarks
|
System test runs in #17831 are done. Saw lots of "database is locked", but no "unique etc" errors. |
The symptoms in containers#17859 indicate that setting the PRAGMAs in individual EXECs outside of a transaction can lead to concurrency issues and failures when the DB is locked. Hence set all PRAGMAs when opening the connection. Move them into individual constants to improve documentation and readability. Further make transactions exclusive as containers#17859 also mentions an error that the DB is locked during a transaction. [NO NEW TESTS NEEDED] - existing tests cover the code. Fixes: containers#17859 Signed-off-by: Valentin Rothberg <[email protected]> <MH: Cherry-picked on top of my branch> Signed-off-by: Matthew Heon <[email protected]>
I have pulled in #17867 from @vrothberg |
Marking as WIP: I'm going to perf test max connections vs transaction locking. |
@edsantiago Did you pull in Valentin's earlier PR, and if so did it resolve the database locked issue? |
@mheon yes I made an earlier run of #17831 using @vrothberg's fixes, and saw no instances of the "is locked" bug. I'm now rerebasing on this PR, and will resubmit in a few seconds. |
|
Maxcons is definitely slowing us down somewhat, at least at high container counts (that was 500) |
It goes down to ~1% slower at lower container counts |
Based on current numbers, the tx lock seems slightly faster. I'm going to remove the connection count limit. |
The SQLite transaction lock Valentin found is (slightly) faster. So let's go with that. Signed-off-by: Matthew Heon <[email protected]>
Re-pushed. Max connections logic is gone. Removing WIP. Going to switch to sqlite vs bolt perf comparisons. |
Followup: I did not see
What's really curious is that this is an e2e test that says Execing, whereas usually they say "Running". And the "Execing" line does not include the Worth pointing out that this PR is being run on a system with |
@mheon @vrothberg I'm sorry to report that f37 rootless sqlite had one failure in my PR:
(where "my PR" is #17831 rebased against 0c313b6e5) |
Damn. Alright, have to add even more validation there, I guess. |
I think there's a race condition in creating the table. Checking for the table existence should be part of the transaction: diff --git a/libpod/sqlite_state.go b/libpod/sqlite_state.go
index e1007dbc3709..9df98dc4b465 100644
--- a/libpod/sqlite_state.go
+++ b/libpod/sqlite_state.go
@@ -336,23 +336,21 @@ func (s *SQLiteState) ValidateDBConfig(runtime *Runtime) (defErr error) {
runtimeGraphDriver = storeOpts.GraphDriverName
}
- row := s.conn.QueryRow("SELECT Os, StaticDir, TmpDir, GraphRoot, RunRoot, GraphDriver, VolumeDir FROM DBConfig;")
-
+ tx, err := s.conn.Begin()
+ if err != nil {
+ return fmt.Errorf("beginning DB config transaction: %w", err)
+ }
+ defer func() {
+ if defErr != nil {
+ if err := tx.Rollback(); err != nil {
+ logrus.Errorf("Rolling back transaction to create DB config: %v", err)
+ }
+ }
+ }()
+ row := tx.QueryRow("SELECT Os, StaticDir, TmpDir, GraphRoot, RunRoot, GraphDriver, VolumeDir FROM DBConfig;")
if err := row.Scan(&os, &staticDir, &tmpDir, &graphRoot, &runRoot, &graphDriver, &volumePath); err != nil {
if errors.Is(err, sql.ErrNoRows) {
// Need to create runtime config info in DB
- tx, err := s.conn.Begin()
- if err != nil {
- return fmt.Errorf("beginning DB config transaction: %w", err)
- }
- defer func() {
- if defErr != nil {
- if err := tx.Rollback(); err != nil {
- logrus.Errorf("Rolling back transaction to create DB config: %v", err)
- }
- }
- }()
-
if _, err := tx.Exec(createRow, 1, schemaVersion, runtimeOS,
runtimeStaticDir, runtimeTmpDir, runtimeGraphRoot,
runtimeRunRoot, runtimeGraphDriver, runtimeVolumePath); err != nil { |
@edsantiago, could you apply the upper diff to your PR and see whether it fixes the flake? |
@vrothberg done, run is in progress. Quick note that this flake is rare: it did not happen at all on my last CI rerun, and it did not happen in this PR's most recent run. (As in: I downloaded all the logs and grep'ed them for even single failures). I will of course continue to download-and-grep all logs for this and my PR and report my findings. |
Well, doesn't look good. Didn't even make it past the image fetch:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I vote for merging this PR as is. It's already an improvement and we can tackle the remaining flake(s) separately.
This error is really weird. @mheon, shouldn't Podman just busy-wait until the DB is unlocked by another thread/process? |
Incremental progress is totally fine with me. /lgtm |
/hold cancel |
Two fixes: