sqlite: Error: beginning transaction: database is locked #18356
Current list of flake instances
|
@edsantiago, can you try the following patch? It's (again) dropping WAL mode. Let's see if the
diff --git a/libpod/sqlite_state.go b/libpod/sqlite_state.go
index d435f10727c6..e433c0c09ca0 100644
--- a/libpod/sqlite_state.go
+++ b/libpod/sqlite_state.go
@@ -44,7 +44,7 @@ const (
// Assembled sqlite options used when opening the database.
sqliteOptions = "db.sql?" +
sqliteOptionLocation +
- sqliteOptionJournal +
+// sqliteOptionJournal +
sqliteOptionSynchronous +
sqliteOptionForeignKeys +
sqliteOptionTXLock |
@vrothberg the flake triggers infrequently, but FWIW one pass of #17831 just completed without seeing this flake. I just started another run. |
Thanks a lot, Ed! |
Seven CI runs today on #17831, lots of flakes, but did not see the |
@mheon WAL mode seems to play a role in the lock errors. Since you are keen on keeping WAL mode, I hand it over to you. |
@edsantiago @vrothberg Mind running for a few more days? I'd like to be sure about this before we make changes, we've thought we had this one locked down before and it's still here. |
I think we're up to fourteen CI runs, still haven't seen the |
Thirty-five CI runs with @vrothberg's patch: no sqlite-locked failures. I removed the patch. First CI run passed! Failed on the second:
[EDIT: it failed twice, actually, the first one above was f38 rootless, this second one I didn't notice was rawhide root] |
I think we have two options:
I let @mheon decide since he owns the sqlite backend. |
I spent a large portion of Friday attempting to identify the specific cause of the bug. Per the docs, it should only occur on concurrent read/write from within a single process to a single table (https://www2.sqlite.org/cvstrac/wiki?p=DatabaseIsLocked), but absolutely none of this explains how a locked error would occur at the start of a transaction... I went so far as to try to trace what was accessing the database (without much success; on-update hooks are a little complex, apparently). At this point, let's rip it out until we can figure out what is actually going on. |
As shown in containers#17831, WAL mode plays a role in causing `database is locked` errors. Those errors, in theory, should not happen, as the DB should busy-wait. mattn/go-sqlite3/issues/274 has some comments indicating that the busy handler behaves differently in WAL mode, which may be an explanation for the error.

For now, let's disable WAL mode and only re-enable it when we have a clearer understanding of what's going on. The upstream issue along with the SQLite documentation do not give me the clear guidance that I would need.

[NO NEW TESTS NEEDED] - flake is only reproducible in CI.

Fixes: containers#18356
Signed-off-by: Valentin Rothberg <[email protected]>
Ugh. I'm sorry. I'm really, really sorry.
This is f38 aarch64 root, in my hammer-hammer PR, fully rebased on main as of this afternoon, 287a419, which I 100% guarantee includes #18356. The other failures were in Also, #18356 is identical (except for a comment) to what I tested all last week and this weekend. Over forty CI runs, and I never saw this. |
Next theory: set SQLite max connections to 1. |
I have a love-hate relationship with your thorough testing, @edsantiago :) Please don't apologize for that. It's invaluable! There is still something we don't understand, and that's making me nervous. In theory, the busy handler should kick in when the database is locked. |
Last seen on May 9 despite many-times-per-day testing on #17831. Does anyone have an explanation for how/why this failure has disappeared? |
The only explanation I could find is bumping the sqlite bindings in commit 92309d9. But that is a month after May 9, so I'd expect the flake to have occurred in between. |
A friendly reminder that this issue had no activity for 30 days. |
The last instance was May 9. I haven't been hammering on #17831 as much the past month, but even so we're still talking O(100) runs since then. I don't want to be the one to close this without understanding the reason, but I will not object to someone else closing it. |
Thank you, Ed! This (i.e., all the data points) is super helpful! I can take the blame for closing. If it flakes again, we can re-open. |
First: yes, this is after #18339. Seen just now in f37 rootless:
Why is it always `podman ps` that triggers this flake?