More sqlite fixes #17889

Merged 1 commit on Mar 23, 2023

Conversation

vrothberg
Member

Does this PR introduce a user-facing change?

None

@openshift-ci
Contributor

openshift-ci bot commented Mar 22, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Mar 22, 2023
@vrothberg
Member Author

@edsantiago, feel free to test them in your PR. I am optimistic.

@vrothberg
Member Author

@cevich is the Build for fedora-37 CI_DESIRED_DATABASE:sqlite required? It's a runtime feature, so one build job should in theory be enough.

@edsantiago
Member

> is the Build for fedora-37 CI_DESIRED_DATABASE:sqlite required?

It's an annoying side effect of the way YAML does macro expansion. I don't think there's any sane way to get rid of it (without introducing horrible duplication), but maybe @cevich can think of a way.

@vrothberg
Member Author

Error: beginning refresh transaction: database is locked

Still there 😱

@edsantiago
Member

Yep, in my PR also.

@edsantiago
Member

Consistently (three out of three runs) failing in int f37 rootless. Consistently NOT failing int root nor sys rootless. What is it about int rootless that is triggering this?

@vrothberg
Member Author

> Consistently (three out of three runs) failing in int f37 rootless. Consistently NOT failing int root nor sys rootless. What is it about int rootless that is triggering this?

Very curious but I've no idea.

@edsantiago
Member

Also... there's something else broken in int sqlite tests. They are still running, 40 minutes and counting. Will post final timing results when the jobs finish, but right now it looks very bad.

SQLite developers consider the shared cache a misfeature [1], and after turning it on,
we saw a new set of flakes.  Let's turn it off and trust the developers
[1] that WAL mode is sufficient for our purposes.

Turning the shared cache off also makes the DB smaller and faster.

[NO NEW TESTS NEEDED]

[1] https://sqlite.org/forum/forumpost/1f291cdca4

Signed-off-by: Valentin Rothberg <[email protected]>
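
A minimal sketch of the configuration the commit message describes: a SQLite file opened in WAL mode with no shared cache. It uses Python's standard `sqlite3` module for illustration only; Podman itself is written in Go, and the `open_db` helper, the `containers` table, and the temporary file path are hypothetical names, not taken from the PR. The point is that the connection URI carries no `cache=shared` parameter, so every connection keeps a private page cache, and WAL mode still lets a reader query while a writer has an open transaction.

```python
import os
import sqlite3
import tempfile


def open_db(path):
    """Open a SQLite file with a private page cache and WAL journaling."""
    # No "cache=shared" in the URI: each connection gets its own cache.
    conn = sqlite3.connect(f"file:{path}?mode=rwc", uri=True, timeout=5.0)
    # PRAGMA journal_mode returns the mode that is actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


# Hypothetical database location, created fresh for the demonstration.
db_path = os.path.join(tempfile.mkdtemp(), "state.db")

writer, writer_mode = open_db(db_path)
writer.execute("CREATE TABLE containers (id TEXT PRIMARY KEY, name TEXT)")
writer.commit()

reader, reader_mode = open_db(db_path)

# The writer holds an open, uncommitted transaction ...
writer.execute("INSERT INTO containers VALUES ('abc123', 'demo')")
# ... and the reader can still query the last committed state.
rows_before_commit = reader.execute(
    "SELECT COUNT(*) FROM containers"
).fetchall()[0][0]

writer.commit()
rows_after_commit = reader.execute(
    "SELECT COUNT(*) FROM containers"
).fetchall()[0][0]
```

In this sketch the reader is not blocked by the pending insert; it only sees the new row once the writer commits.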
@vrothberg
Member Author

I dropped the first commit, which always caused the DB locked error when running rootless.

@vrothberg
Member Author

Fingers crossed. Thinking about it, we haven't seen this issue without the shared cache, which I have now dropped again.

@vrothberg
Member Author

@edsantiago this looks better on my end.

@edsantiago
Member

YAY! None of the evil strings appear anywhere in the downloaded CI logs. LGTM but let's wait for my PR to finish CI.

@vrothberg
Member Author

> YAY! None of the evil strings appear anywhere in the downloaded CI logs. LGTM but let's wait for my PR to finish CI.

\o/ it seems the cache option indeed is evil.

@edsantiago
Member

Sigh. I'm sorry. f37 rootless, two "UNIQUE constraint" failures. I've double-checked that my PR is correctly rebased on top of 17889, but please feel free to check again.

@vrothberg
Member Author

> Sigh. I'm sorry. f37 rootless, two "UNIQUE constraint" failures. I've double-checked that my PR is correctly rebased on top of 17889, but please feel free to check again.

🤯

Thanks, Ed. I'll continue digging tomorrow.

@vrothberg
Member Author

There's still a race, it seems.

@edsantiago
Member

So, from my pov: even if this doesn't fix the UNIQUE bug, I'm OK with this merging. It seems to cause no harm, and I trust your investigation about shared-cache. I don't know enough sqlite to be the one to merge it, though. Maybe @mheon can look at it tomorrow. Thanks, everyone, for your persistence on this.

@edsantiago
Member

Timing results (redacted, to only show difference between boltdb and sqlite):

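| type | distro | user | DB | local | remote | container |
| --- | --- | --- | --- | --- | --- | --- |
| int | fedora-37 | root | | 30:39 | 33:14 | 30:17 |
| int | fedora-37 | root | sqlite | 28:50 | 33:38 | |
| int | fedora-37 | rootless | | 28:55 | | |
| int | fedora-37 | rootless | sqlite | 27:23 | | |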

@vrothberg
Member Author

vrothberg commented Mar 22, 2023

> So, from my pov: even if this doesn't fix the UNIQUE bug, I'm OK with this merging. It seems to cause no harm, and I trust your investigation about shared-cache. I don't know enough sqlite to be the one to merge it, though. Maybe @mheon can look at it tomorrow. Thanks, everyone, for your persistence on this.

I am also OK with merging the PR as is. Maybe @mheon will find the needle in the haystack and get the validation code healthy.

@edsantiago
Member

OK, this is weird. Two more UNIQUE constraint errors ... but in the same task (f37 rootless) and, Twilight Zone music, the same two subtests, and almost the same delta-T (10s vs 12s) between the failures. I'm tempted to conclude that there's something weird happening when those two tests run simultaneously. But what?? All tests use a different root and runroot. Could there be some other shared directory when sqlite is involved? And only rootless??

@edsantiago
Member

edsantiago commented Mar 22, 2023

Another CI run, here's f36 and f37, both rootless, both in "restart running container" test.

[EDITED TO ADD: yes, f36. I modified my PR to just hammer on sqlite, almost all jobs, including f36 and debian]

@mheon
Member

mheon commented Mar 23, 2023

This LGTM. I know what the other issue is (TOCTOU in the validation code, two Podman processes can race to create the row) but it's a bit annoying to fix, so it will have to wait until tomorrow afternoon.
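
A minimal sketch of the kind of time-of-check/time-of-use race described in the comment above, assuming a check-then-insert pattern on a row with a unique key. This is not Podman's validation code: the `config` table and the `make_db`, `racy_create`, and `atomic_create` helpers are hypothetical, and the two "processes" are simulated by interleaving the steps sequentially on one connection. Both callers check, both see no row, and the second insert then fails with a UNIQUE constraint error. `INSERT OR IGNORE` is shown as one possible way to fold the check and the write into a single statement; the fix that actually landed in Podman may differ.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a single-row config table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE config (id INTEGER PRIMARY KEY, value TEXT)")
    return conn


def racy_create(conn):
    """Simulate two callers that both check first and insert afterwards."""
    query = "SELECT 1 FROM config WHERE id = 1"
    # Time of check: both callers look before either one has written.
    a_missing = conn.execute(query).fetchone() is None
    b_missing = conn.execute(query).fetchone() is None

    errors = []
    # Time of use: both callers act on their now-stale check result.
    for name, missing in (("A", a_missing), ("B", b_missing)):
        if missing:
            try:
                conn.execute(
                    "INSERT INTO config (id, value) VALUES (1, ?)", (name,)
                )
            except sqlite3.IntegrityError as exc:
                errors.append(str(exc))
    return errors


def atomic_create(conn, value):
    """Create the row if it is absent, in one statement, without a prior check."""
    conn.execute(
        "INSERT OR IGNORE INTO config (id, value) VALUES (1, ?)", (value,)
    )


race_errors = racy_create(make_db())

safe_db = make_db()
atomic_create(safe_db, "A")
atomic_create(safe_db, "B")
safe_rows = safe_db.execute("SELECT value FROM config").fetchall()
```

In the racy variant exactly one of the two inserts fails; in the single-statement variant the second caller simply leaves the existing row in place.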

@edsantiago
Member

Here's a really, really nice one. f37 rootless, no surprise, but what's sweet is that it's a triple-failure in the restart test (single-failure is more common). Not only that, but there's a triple-failure in the stop podman.service test. I thought I had fixed that by adding --runroot to the podman invocation, but nope. (TL;DR the stop podman.service failure is a brand-new one, happening only in sqlite tests. I've never seen it until this week).

I am over 90% convinced that there is something happening with concurrent sqlite processes, some shared file that should not be shared.

@edsantiago
Member

@mheon thanks for checking in. I'll assume you have what you need, and I'll stop my repeat testing. Good night!

@vrothberg
Member Author

@baude @mheon @edsantiago merge me please

@edsantiago
Member

My comment from yesterday still stands:

> So, from my pov: even if this doesn't fix the UNIQUE bug, I'm OK with this merging. It seems to cause no harm, and I trust your investigation about shared-cache. I don't know enough sqlite to be the one to merge it, though. Maybe @mheon can look at it tomorrow. Thanks, everyone, for your persistence on this.

@baude
Member

baude commented Mar 23, 2023

/lgtm

@openshift-ci openshift-ci bot added the "lgtm" label (Indicates that a PR is ready to be merged.) on Mar 23, 2023
@openshift-merge-robot openshift-merge-robot merged commit cb18a33 into containers:main Mar 23, 2023
@vrothberg vrothberg deleted the sqlite-fixes branch March 23, 2023 14:01
@github-actions github-actions bot added the "locked - please file new issue/PR" label (Assist humans wanting to comment on an old issue or PR with locked comments.) on Sep 4, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 4, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. release-note-none