
stop podman.service: inspecting object: no such object: "top_etc" #17904

Closed
edsantiago opened this issue Mar 23, 2023 · 48 comments
Labels
- flakes: Flakes from Continuous Integration
- kind/bug: Categorizes issue or PR as related to a bug.
- locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.
- sqlite: Bugs pertaining to the sqlite database backend

Comments

@edsantiago
Member

Happens very, very often when using sqlite:

  stop podman.service
...
Execing [...]/podman --root .../server_root --runroot .../runroot inspect --format={{.State.Running}} top_WthZBtzo

Error: inspecting object: no such object: "top_WthZBtzo"

Important note: this is an "Execing" invocation of podman, completely different from the normal e2e test "Running" invocation. It does not have the usual e2e --this --that options. sqlite is selected via containers.conf in this setup.

@edsantiago edsantiago added flakes Flakes from Continuous Integration sqlite Bugs pertaining to the sqlite database backend labels Mar 23, 2023
@edsantiago
Member Author

From the last few days

@edsantiago edsantiago added the kind/bug Categorizes issue or PR as related to a bug. label Mar 28, 2023
@edsantiago
Member Author

Most recent: last night, f36 root, in a CI run that included my ignore-enoent patch.

@vrothberg
Member

OK, I think we have a really bad problem in the e2e tests which may have been hidden by the 3 restarts.

Environment variables are process-wide state, so an os.Setenv() in one goroutine affects every other goroutine. With bad timing, tests can shoot each other down.

Instead, we need to pass environment variables via exec.Command.

@Luap99 can you double check my theory? I am still a bit slow these days.

@Luap99
Member

Luap99 commented Mar 29, 2023

I am not 100% sure on this, but I do not think this is a problem; if it were, we would see much weirder things.

Reading https://onsi.github.io/ginkgo/#spec-cleanup-aftereach-and-defercleanup shows a pattern similar to ours.
Environment variables are per-process, and as I see it locally, when you run things in parallel (-nodes 3) ginkgo actually runs them as separate processes:

pholzing   41096  1.4  0.1 1854584 36736 pts/1   Sl+  16:16   0:00 /home/pholzing/go/src/github.com/containers/podman/test/e2e/e2e.test --test.timeout=1h30m0s --test.coverprofile=e2e.coverprofile.3 --ginkgo.seed=1680099351 --ginkgo.flakeAttempts=3 --ginkgo.progress --ginkgo.debug --ginkgo.parallel.node=3 --ginkgo.parallel.total=3 --ginkgo.parallel.streamhost=http://127.0.0.1:42835 --ginkgo.parallel.synchost=http://127.0.0.1:42835 --ginkgo.noColor --ginkgo.slowSpecThreshold=5.00000 --ginkgo.v --ginkgo.trace
pholzing   41097  1.4  0.1 1780340 36416 pts/1   Sl+  16:16   0:00 /home/pholzing/go/src/github.com/containers/podman/test/e2e/e2e.test --test.timeout=1h30m0s --test.coverprofile=e2e.coverprofile.1 --ginkgo.seed=1680099351 --ginkgo.flakeAttempts=3 --ginkgo.progress --ginkgo.debug --ginkgo.parallel.node=1 --ginkgo.parallel.total=3 --ginkgo.parallel.streamhost=http://127.0.0.1:42835 --ginkgo.parallel.synchost=http://127.0.0.1:42835 --ginkgo.noColor --ginkgo.slowSpecThreshold=5.00000 --ginkgo.v --ginkgo.trace
pholzing   41099  1.3  0.1 1780596 36580 pts/1   Sl+  16:16   0:00 /home/pholzing/go/src/github.com/containers/podman/test/e2e/e2e.test --test.timeout=1h30m0s --test.coverprofile=e2e.coverprofile.2 --ginkgo.seed=1680099351 --ginkgo.flakeAttempts=3 --ginkgo.progress --ginkgo.debug --ginkgo.parallel.node=2 --ginkgo.parallel.total=3 --ginkgo.parallel.streamhost=http://127.0.0.1:42835 --ginkgo.parallel.synchost=http://127.0.0.1:42835 --ginkgo.noColor --ginkgo.slowSpecThreshold=5.00000 --ginkgo.v --ginkgo.trace

Therefore each node has its own set of envs and they should not collide. However, if a test forgets to Unsetenv(), then yes, we are leaking the env into the following tests.

@edsantiago
Member Author

Would it help to instrument ginkgo such that it dumps ENV on failure? Noisy, but if someone points me at a starting point I'll add it to my test PR.

@edsantiago
Member Author

Here we go - this is why we can't enable sqlite system tests.

This is a failure in my enable-sqlite PR. It's an int failure, not sys, and it's a triple-ginkgo failure, and this is what I see when sqlite is enabled globally via containers.conf.

As I mentioned in comment 0, but this is important and needs repeating, the podman that runs in this test is not the usual e2e podman. It does not have the long boring series of --this-and-that. Hence, in normal int tests it's actually running boltdb. Now, in #17954, it will run sqlite, and will consistently triple-fail. The podman Exec has --root but no --runroot, so I thought maybe I could fix it by adding --runroot, but nope, it still fails. That's as far as I got.

Anyhow. #17954 can't merge until this is fixed.

@vrothberg
Member

That is great news actually. I couldn't put the pieces together when thinking this was a flake. A consistent failure makes more sense.

@edsantiago
Member Author

It is a flake. If it were consistent, none of the int sqlite tests would pass. But last night before bed I force-pushed onto #17954, no change whatsoever, and all tests passed on the first try.

@vrothberg
Member

What the ...

I didn't take a look yet, but why would it flake with containers.conf being set? That is something static. Assuming the tests don't have unexpected interference (see my theory on CONTAINERS_CONF), it may be a bug/race in libpod?

@vrothberg
Member

vrothberg commented Mar 30, 2023

@edsantiago can you please try cherry-picking #17937 on top of your PR?

I have a theory that WAL mode may not write/sync immediately such that the container is kept in memory but isn't written to disk yet.

That means that podman system service is OK and remote calls work but a podman inspect doesn't see the object yet.
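For context on the theory, the behavior in question hinges on SQLite's journaling and sync pragmas. The fragment below is a hedged sketch of the settings involved, not podman's actual configuration:

```sql
-- WAL mode: commits are appended to the -wal file rather than the
-- main database file; how aggressively they are fsync'ed depends on
-- the synchronous setting.
PRAGMA journal_mode=WAL;
-- NORMAL syncs the WAL only at checkpoints; FULL syncs on every
-- commit. Stricter sync settings were what #17937 experimented with.
PRAGMA synchronous=FULL;
```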

@edsantiago
Member Author

I did. Not in #17954, but in #17831. I ran that all day yesterday, with #17937, and still see the flake. You can confirm by scrolling to the top, clicking the Build ID, and clicking the Commit SHA.

Note that this is a single-fail, because in that PR I disable triple-retries.

@vrothberg
Member

Apologies, I feel almost like I'm abusing you by trying out different ideas, @edsantiago.

To test whether tests interfere, could we try running with GINKGONODES=1 such that tests run in sequence and not in parallel?

@edsantiago
Member Author

Oh, good idea. Will try that and report back. Thanks.

@vrothberg
Member

I am more or less thinking out loud at the moment and hope for our collective swarm brain to find a solution. One thing that strikes me:

         Execing /var/tmp/go/src/github.com/containers/podman/bin/podman --root /tmp/podman_test1092595828/server_root --runroot /tmp/podman_test1092595828/runroot inspect --format={{.State.Running}} top_WPsMuMFf
         
         Error: inspecting object: no such object: "top_WPsMuMFf"
         Execing /var/tmp/go/src/github.com/containers/podman/bin/podman --root /tmp/podman_test1092595828/server_root --runroot /tmp/podman_test1092595828/runroot rm -f top_WPsMuMFf
         
         top_WPsMuMFf

While the podman inspect $CTR fails, the deferred podman rm $CTR works just fine (see https://github.com/containers/podman/blob/a7d0408761e0649dad419bd7dd447f68a0662cd4/test/e2e/systemd_activate_test.go#L129).

@edsantiago, can you confirm that it's always the podman inspect that fails? That would somehow rather strongly point to a race inside of Podman rather than test interference.

@edsantiago
Member Author

New variant (I think): seeing it in the rm. f36 root:

Execing podman [...] inspect --format={{.State.Running}} top_gAXKGPfR
Error: inspecting object: no such object: "top_gAXKGPfR"
Execing podman [...] rm -f top_gAXKGPfR
Error: removing storage for container "top_gAXKGPfR": 1 error occurred:
	* unlinkat /tmp/podman_test3195926079/server_root/overlay-containers/CID/userdata/shm: device or resource busy

@vrothberg
Member

I think that's the EBUSY flake. I've seen it a while ago in this PR (#17904 (comment)).

@edsantiago
Member Author

Yes, it's unlinkat/EBUSY, but this one is happening in the Execing podman, the one without --this --that. It's a brand-new failure mode, because it happens in a simple, clean setup: custom rootdir, very few distracting command-line options. (That is of course also the case in system tests, where we mostly see unlinkat/EBUSY, but there we usually see it near the end of the run, and system tests use the default /var/containers/whatever, so there's always been the possibility that some previous test screws up state for subsequent ones.)

Anyhow, this does not give me any useful information, it's just a hmmmm.

@edsantiago
Member Author

Still seeing it on main which currently includes #18083 as well as sync=EXTRA.

vrothberg added a commit to vrothberg/libpod that referenced this issue Apr 12, 2023
While debugging containers#17904 we found the test to be missing the common podman
flags.  Add them to the podman invocations and remove some clutter.

Signed-off-by: Valentin Rothberg <[email protected]>
@vrothberg
Member

With #18056 merged, let's close this issue.

edsantiago added a commit to edsantiago/libpod that referenced this issue Apr 12, 2023
In setup, write a containers.conf.d file with db_backend
as specified in .cirrus.yml.

This is actually much scarier and more achy-breaky than
merely "sqlite system tests": it enables sqlite in e2e
tests. ("But wait, we already do that!" -- no, not really:
sqlite in e2e is being done via the --db-backend option, and
some podman commands in e2e do not use the standard options.
See containers#17904.)

This is unlikely to get merged any time soon (March, maybe
even April) because sqlite is still too fragile; this will
trigger more flakes than are currently acceptable. Also,
the nasty auto-update flake seems to trigger much more
reliably with sqlite. We need that one fixed.

Signed-off-by: Ed Santiago <[email protected]>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 27, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 27, 2023