Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlite: Error: beginning transaction: database is locked #18356

Closed
edsantiago opened this issue Apr 26, 2023 · 19 comments · Fixed by #18519
Closed

sqlite: Error: beginning transaction: database is locked #18356

edsantiago opened this issue Apr 26, 2023 · 19 comments · Fixed by #18519
Labels
bugweek flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. stale-issue

Comments

@edsantiago
Copy link
Member

First: yes, this is after #18339. Seen just now in f37 rootless:

  podman container restart running container
...
$ podman [options] ps -q
Error: beginning transaction: database is locked

Why is it always podman ps that triggers this flake?

@edsantiago edsantiago added flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. priority/high bugweek labels Apr 26, 2023
@edsantiago
Copy link
Member Author

Current list of flake instances

  • debian-12 : int podman debian-12 rootless host sqlite
    • 05-01 10:58 in Podman restart [It] podman restart running container
  • fedora-37 : int podman fedora-37 rootless host sqlite
    • 04-26 09:28 in Podman restart [It] podman container restart running container
  • rawhide : int podman rawhide rootless host sqlite
    • 04-26 13:47 in Podman restart [It] podman container restart running container

@vrothberg
Copy link
Member

@edsantiago, can you try the following patch? It's (again) dropping WAL mode. Let's see if the database is locked flake comes back again.

diff --git a/libpod/sqlite_state.go b/libpod/sqlite_state.go
index d435f10727c6..e433c0c09ca0 100644
--- a/libpod/sqlite_state.go
+++ b/libpod/sqlite_state.go
@@ -44,7 +44,7 @@ const (
        // Assembled sqlite options used when opening the database.
        sqliteOptions = "db.sql?" +
                sqliteOptionLocation +
-               sqliteOptionJournal +
+//             sqliteOptionJournal +
                sqliteOptionSynchronous +
                sqliteOptionForeignKeys +
                sqliteOptionTXLock

@edsantiago
Copy link
Member Author

@vrothberg the flake triggers infrequently, but FWIW one pass of #17831 just completed without seeing this flake. I just started another run.

@vrothberg
Copy link
Member

Thanks a lot, Ed!

@edsantiago
Copy link
Member Author

Seven CI runs today on #17831, lots of flakes, but did not see the transaction one. Tests do seem to be running slower, O(40m), it could just be slow VMs.

@vrothberg
Copy link
Member

@mheon WAL mode seems to play a role in the lock errors. Since you are keen on keeping WAL mode, I hand it over to you.

@mheon
Copy link
Member

mheon commented May 4, 2023

@edsantiago @vrothberg Mind running for a few more days? I'd like to be sure about this before we make changes, we've thought we had this one locked down before and it's still here.

@edsantiago
Copy link
Member Author

edsantiago commented May 5, 2023

I think we're up to fourteen CI runs, still haven't seen the locked flake. [EDIT: yes, I'll keep hammering away at it]

@edsantiago
Copy link
Member Author

edsantiago commented May 8, 2023

Thirty-five CI runs with @vrothberg's patch: no sqlite-locked failures.

I removed the patch. First CI run passed! Failed on the second:

$ podman [options] ps -q
Error: beginning transaction: database is locked

[EDIT: it failed twice, actually, the first one above was f38 rootless, this second one I didn't notice was rawhide root]

@vrothberg
Copy link
Member

I think we have two options:

  1. Disable WAL mode.
  2. If possible, address the locking issues.

I let @mheon decide since he owns the sqlite backend.

@mheon
Copy link
Member

mheon commented May 9, 2023

I spent a large portion of friday attempting to identify the specific cause of the bug. Per the docs, it should only occur on concurrent read/write from within a single process to a single table (https://www2.sqlite.org/cvstrac/wiki?p=DatabaseIsLocked) but absolutely none of this explains how a locked error would occur on the start of a transaction... I went to far as to try and trace what was accessing the database (to not much success, on-update hooks are a little complex apparently).

At this point, let's rip it out until we can figure out what is actually going on.

vrothberg added a commit to vrothberg/libpod that referenced this issue May 9, 2023
As shown in containers#17831, WAL mode plays a role in causing `database is locked`
errors.  Those are errors, in theory, should not happen as the DB should
busy wait.  mattn/go-sqlite3/issues/274 has some comments indicating
that the busy handler behaves differently in WAL mode which may be an
explanation to the error.

For now, let's disable WAL mode and only re-enable it when we have
clearer understanding of what's going on.  The upstream issue along with
the SQLite documentation do not give me the clear guidance that I would
need.

[NO NEW TESTS NEEDED] - flake is only reproducible in CI.

Fixes: containers#18356
Signed-off-by: Valentin Rothberg <[email protected]>
@edsantiago
Copy link
Member Author

Ugh. I'm sorry. I'm really, really sorry.

not ok 123 podman start --filter invalid-restart-policy - return error
 ...
# podman run -d quay.io/libpod/testimage:20221018 /bin/true
d3be2e45fb326ee8afa117698395995c2f9fb2961ae347ff315395ccf339d6b1
# podman start --filter restart-policy=fakepolicy d3be2e45fb326ee8afa117698395995c2f9fb2961ae347ff315395ccf339d6b1
Error: beginning transaction: database is locked
[ rc=125 (expected) ]
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#|     FAIL: CID of restart-policy=<not-exists> container
#| expected: 'Error: fakepolicy invalid restart policy'
#|   actual: 'Error: beginning transaction: database is locked'
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is f38 aarch64 root, in my hammer-hammer PR, fully rebased on main as of this afternoon, 287a419, which I 100% guarantee includes #18356.

The other failures were in int tests, always in podman ps, and (due to the nature of int tests) on a relatively fresh setup with fresh roots and all that. This is test 123 in system tests, not a freshly-initialized DB by any means. And no contention: sys tests are not parallel. I don't know what this means, and don't have the brains to pursue it at this hour.

Also, #18356 is identical (except for a comment) to what I tested all last week and this weekend. Over forty CI runs, and I never saw this.

@edsantiago edsantiago reopened this May 9, 2023
@mheon
Copy link
Member

mheon commented May 10, 2023

Next theory: set SQLite max connections to 1.

@vrothberg
Copy link
Member

vrothberg commented May 10, 2023

I have a love-hate-relationship with your thorough testing, @edsantiago :) Please don't excuse for that. It's invaluable!

There is still something we don't understand and that's making me nervous. In theory, the busy handler should kick in when the database is locked.

@edsantiago
Copy link
Member Author

Last seen on May 9 despite many-times-per-day testing on #17831. Does anyone have an explanation for how/why this failure has disappeared?

@vrothberg
Copy link
Member

The only explanation I could find is bumping the sqlite bindings in commit 92309d9. But that is a month after May 9, so I'd expect the flake to have occurred in between.

@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Copy link
Member Author

The last instance was May 9. I haven't been hammering on #17831 as much the past month, but even so we're still talking O(100) runs since then.

I don't want to be the one to close this without understanding the reason, but I will not object to someone else closing it.

@vrothberg
Copy link
Member

vrothberg commented Aug 10, 2023

Thank you, Ed! This (i.e., all the data points) is super helpful! I can take the blame for closing. If it flakes again, we can re-open.

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Nov 9, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bugweek flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. stale-issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants