
stop podman.service: inspecting object: no such object: "top_etc" #17904

Closed
edsantiago opened this issue Mar 23, 2023 · 48 comments
Labels
- flakes: Flakes from Continuous Integration
- kind/bug: Categorizes issue or PR as related to a bug.
- locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.
- sqlite: Bugs pertaining to the sqlite database backend

Comments

@edsantiago
Member

Happens very, very often when using sqlite:

  stop podman.service
...
Execing [...]/podman --root .../server_root --runroot .../runroot inspect --format={{.State.Running}} top_WthZBtzo

Error: inspecting object: no such object: "top_WthZBtzo"

Important note: this is an "Execing" invocation of podman, completely different from the normal e2e test "Running" invocation. It does not have the usual e2e --this --that options. sqlite is selected via containers.conf in this setup.

@edsantiago edsantiago added flakes Flakes from Continuous Integration sqlite Bugs pertaining to the sqlite database backend labels Mar 23, 2023
@edsantiago
Member Author

From the last few days

@edsantiago edsantiago added the kind/bug Categorizes issue or PR as related to a bug. label Mar 28, 2023
@edsantiago
Member Author

Most recent: last night, f36 root, in a CI run that included my ignore-enoent patch.

@vrothberg
Member

OK, I think we have a really bad problem in the e2e tests which may have been hidden by the 3 restarts.

Environment variables are process-wide state, so an os.Setenv() in one goroutine affects every other goroutine. With bad timing, tests can shoot each other down.

Instead, we need to pass environment variables via exec.Command.

@Luap99 can you double check my theory? I am still a bit slow these days.

@Luap99
Member

Luap99 commented Mar 29, 2023

I am not 100% sure on this, but I do not think this is a problem; if it were, we would see much weirder things.

Reading https://onsi.github.io/ginkgo/#spec-cleanup-aftereach-and-defercleanup shows a pattern similar to ours.
Environment variables are per-process, and as I see it locally, when you run things in parallel (-nodes 3) ginkgo actually runs them as separate processes:

pholzing   41096  1.4  0.1 1854584 36736 pts/1   Sl+  16:16   0:00 /home/pholzing/go/src/github.com/containers/podman/test/e2e/e2e.test --test.timeout=1h30m0s --test.coverprofile=e2e.coverprofile.3 --ginkgo.seed=1680099351 --ginkgo.flakeAttempts=3 --ginkgo.progress --ginkgo.debug --ginkgo.parallel.node=3 --ginkgo.parallel.total=3 --ginkgo.parallel.streamhost=http://127.0.0.1:42835 --ginkgo.parallel.synchost=http://127.0.0.1:42835 --ginkgo.noColor --ginkgo.slowSpecThreshold=5.00000 --ginkgo.v --ginkgo.trace
pholzing   41097  1.4  0.1 1780340 36416 pts/1   Sl+  16:16   0:00 /home/pholzing/go/src/github.com/containers/podman/test/e2e/e2e.test --test.timeout=1h30m0s --test.coverprofile=e2e.coverprofile.1 --ginkgo.seed=1680099351 --ginkgo.flakeAttempts=3 --ginkgo.progress --ginkgo.debug --ginkgo.parallel.node=1 --ginkgo.parallel.total=3 --ginkgo.parallel.streamhost=http://127.0.0.1:42835 --ginkgo.parallel.synchost=http://127.0.0.1:42835 --ginkgo.noColor --ginkgo.slowSpecThreshold=5.00000 --ginkgo.v --ginkgo.trace
pholzing   41099  1.3  0.1 1780596 36580 pts/1   Sl+  16:16   0:00 /home/pholzing/go/src/github.com/containers/podman/test/e2e/e2e.test --test.timeout=1h30m0s --test.coverprofile=e2e.coverprofile.2 --ginkgo.seed=1680099351 --ginkgo.flakeAttempts=3 --ginkgo.progress --ginkgo.debug --ginkgo.parallel.node=2 --ginkgo.parallel.total=3 --ginkgo.parallel.streamhost=http://127.0.0.1:42835 --ginkgo.parallel.synchost=http://127.0.0.1:42835 --ginkgo.noColor --ginkgo.slowSpecThreshold=5.00000 --ginkgo.v --ginkgo.trace

Therefore each node has its own set of envs and they should not collide. However, if a test forgets to Unsetenv(), then yes, we are leaking the env into the following tests.

@edsantiago
Member Author

Would it help to instrument ginkgo such that it dumps ENV on failure? Noisy, but if someone points me at a starting point I'll add it to my test PR.

@edsantiago
Member Author

Here we go - this is why we can't enable sqlite system tests.

This is a failure in my enable-sqlite PR. It's an int failure, not sys, and it's a triple-ginkgo failure, and this is what I see when sqlite is enabled globally via containers.conf.

As I mentioned in comment 0, but this is important and needs repeating, the podman that runs in this test is not the usual e2e podman. It does not have the long boring series of --this-and-that. Hence, in normal int tests it's actually running boltdb. Now, in #17954, it will run sqlite, and will consistently triple-fail. The podman Exec has --root but no --runroot, so I thought maybe I could fix it by adding --runroot, but nope, it still fails. That's as far as I got.

Anyhow. #17954 can't merge until this is fixed.

@vrothberg
Member

That is great news actually. I couldn't put the pieces together when thinking this was a flake. A consistent failure makes more sense.

@edsantiago
Member Author

It is a flake. If it were consistent, none of the int sqlite tests would pass. But last night before bed I force-pushed onto #17954, no change whatsoever, and all tests passed on the first try.

@vrothberg
Member

What the ...

I didn't take a look yet, but why would it flake with containers.conf being set? That is something static. Assuming the tests don't have unexpected interference (see my theory on CONTAINERS_CONF), it may be a bug/race in libpod?

@vrothberg
Member

vrothberg commented Mar 30, 2023

@edsantiago can you please try cherry-picking #17937 on top of your PR?

I have a theory that WAL mode may not write/sync immediately such that the container is kept in memory but isn't written to disk yet.

That means that podman system service is OK and remote calls work but a podman inspect doesn't see the object yet.
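For context on the theory, the behavior in question hinges on SQLite's journaling and sync pragmas. The fragment below is a hedged sketch of the settings involved, not podman's actual configuration:

```sql
-- WAL mode: commits are appended to the -wal file rather than the
-- main database file; how aggressively they are fsync'ed depends on
-- the synchronous setting.
PRAGMA journal_mode=WAL;
-- NORMAL syncs the WAL only at checkpoints; FULL syncs on every
-- commit. Stricter sync settings were what #17937 experimented with.
PRAGMA synchronous=FULL;
```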

@edsantiago
Member Author

I did. Not in #17954, but in #17831. I ran that all day yesterday, with #17937, and still see the flake. You can confirm by scrolling to the top, clicking the Build ID, and clicking the Commit SHA.

Note that this is a single-fail, because in that PR I disable triple-retries.

@vrothberg
Member

Apologies, I feel almost like I'm abusing you by trying out different ideas, @edsantiago.

To test whether tests interfere, could we try running with GINKGONODES=1 such that tests run in sequence and not in parallel?

@edsantiago
Member Author

Oh, good idea. Will try that and report back. Thanks.

@vrothberg
Member

I am more or less thinking out loud at the moment and hope for our collective swarm brain to find a solution. One thing that strikes me:

         Execing /var/tmp/go/src/github.com/containers/podman/bin/podman --root /tmp/podman_test1092595828/server_root --runroot /tmp/podman_test1092595828/runroot inspect --format={{.State.Running}} top_WPsMuMFf
         
         Error: inspecting object: no such object: "top_WPsMuMFf"
         Execing /var/tmp/go/src/github.com/containers/podman/bin/podman --root /tmp/podman_test1092595828/server_root --runroot /tmp/podman_test1092595828/runroot rm -f top_WPsMuMFf
         
         top_WPsMuMFf

While the podman inspect $CTR fails, the deferred podman rm $CTR works just fine (see https://github.com/containers/podman/blob/a7d0408761e0649dad419bd7dd447f68a0662cd4/test/e2e/systemd_activate_test.go#L129).

@edsantiago, can you confirm that it's always the podman inspect that fails? That would somehow rather strongly point to a race inside of Podman rather than test interference.

@edsantiago
Member Author

New variant (I think): seeing it in the rm. f36 root:

Execing podman [...] inspect --format={{.State.Running}} top_gAXKGPfR
Error: inspecting object: no such object: "top_gAXKGPfR"
Execing podman [...] rm -f top_gAXKGPfR
Error: removing storage for container "top_gAXKGPfR": 1 error occurred:
	* unlinkat /tmp/podman_test3195926079/server_root/overlay-containers/CID/userdata/shm: device or resource busy

@vrothberg
Member

I think that's the EBUSY flake. I've seen it a while ago in this PR (#17904 (comment)).

@edsantiago
Member Author

Yes, it's unlinkat/EBUSY, but this one is happening in the Execing podman, the one without --this --that. It's a brand-new failure mode, because it happens in a simple, clean setup: custom rootdir, very few distracting command-line options. (That is of course also the case in system tests, where we mostly see unlinkat/EBUSY, but there we usually see it near the end of the run, and system tests use the default /var/containers/whatever, so there's always been the possibility that some previous test screws up state for subsequent ones.)

Anyhow, this does not give me any useful information, it's just a hmmmm.

@edsantiago
Member Author

Still seeing it on main which currently includes #18083 as well as sync=EXTRA.

vrothberg added a commit to vrothberg/libpod that referenced this issue Apr 12, 2023
While debugging containers#17904 we found the test to be missing the common podman
flags.  Add them to the podman invocations and remove some clutter.

Signed-off-by: Valentin Rothberg <[email protected]>
@vrothberg
Member

With #18056 merged, let's close this issue.

edsantiago added a commit to edsantiago/libpod that referenced this issue Apr 12, 2023
In setup, write a containers.conf.d file with db_backend
as specified in .cirrus.yml.

This is actually much scarier and more achy-breaky than
merely "sqlite system tests": it enables sqlite in e2e
tests. ("But wait, we already do that!" -- no, not really:
sqlite in e2e is being done via the --db-backend option, and
some podman commands in e2e do not use the standard options.
See containers#17904.)

This is unlikely to get merged any time soon (March, maybe
even April) because sqlite is still too fragile; this will
trigger more flakes than are currently acceptable. Also,
the nasty auto-update flake seems to trigger much more
reliably with sqlite. We need that one fixed.

Signed-off-by: Ed Santiago <[email protected]>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 27, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 27, 2023