Avoid RemoteSocket collisions in e2e tests #12168

mtrmac · 2021-11-03T15:44:53Z

What this PR does / why we need it:

Per #12155, we see RemoteSocket collisions. So,

Add a lock file mechanism to ensure we only generate a RemoteSocket value once, removing even the slightest possibility of a flake due to a collision in ID values
Add logging on when/where the RemoteSocket values are generated, to maybe help diagnose unexpected RemoteSocket reuse.
Add somewhat desperate logging when a collision in the generated ID happens, to maybe explain how it can happen (with such a fairly high frequency) in the first place.

How to verify it

CI should continue to run. (I didn’t test this manually even once.) Probably new log entries.

Which issue(s) this PR fixes:

Possibly #12155 , unknown.

Special notes for your reviewer:

openshift-ci · 2021-11-03T15:44:56Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mtrmac

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [mtrmac]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

test/e2e/common_test.go

cevich · 2021-11-03T16:21:51Z

Opened test PR that incorporates the failing PR and this one, then only runs the remote tests where collisions were observed.

cevich · 2021-11-04T14:57:11Z

@mtrmac (and @edsantiago if you're interested) so the good news is your changes here work, they allow the tests to pass collision free. The bad news is...searching for "RemoteSocket collision" log messages finds a TON of hits. Comment with links to annotated logs.

cevich · 2021-11-04T15:15:01Z

Ultra-curios: I searched through the logs for duplicate mentions of the seed and time from two collision messages. I found none. So somehow we're generating (or leaking) a duplicate IDs without hitting the new logging code 😦

mtrmac · 2021-11-04T15:33:23Z

Ultra-curios: I searched through the logs for duplicate mentions of the seed and time from two collision messages. I found none

If this is about the cryptoseed values, that’s not too surprising: those values are freshly generated, not actually used to generate the IDs. They are logged primarily for the accompanying error value (and perhaps to see if the seed values are weird, like all zeros or something).

At this point, for the hypothesis in #12155 (comment) , it would be useful to log:

GinkgoRandomSeed + PID, to confirm all tests share the seed
Every math/rand.seed call (and possibly the value) + PID, to see which of the seed calls actually matters.

mtrmac · 2021-11-04T15:46:22Z

I have updated this PR to include such logs, and marked it as draft because it modifies vendored code, which we definitely don’t want in a production release.

(@cevich the vendor edits might also cause test failures, if so I’d appreciate if you could include the extra logs and bypass those failures in your #12169 .)

cevich · 2021-11-04T15:50:00Z

I’d appreciate if you could include the extra logs and bypass those failures

Happy to, that's why I made that PR...

cevich · 2021-11-04T15:52:31Z

Clarification: "include the extra logs" - do I need to do anything special for this? I think we already run ginkgo in -v (verbose) mode, so will that capture all the warning-level messages?

mtrmac · 2021-11-04T16:08:37Z

Clarification: "include the extra logs" - do I need to do anything special for this?

Include the today’s version of the changes in this PR (and on top, add something that bypasses the failures already visible in this PR, caused by edits to vendor/*)

cevich · 2021-11-04T16:33:51Z

OIC, yes I simply rebased the test PR ontop of this one. The test PR already skips all tasks except for 'build' and 'remote XYZ' (so validate, consistency, etc. are all skipped already). And just for FYI (in case I'm not around) this is done by getting out your butter-knife, and spreading around skip: $CI == $CI everywhere in .cirrus.yml (and clobbering all only_if lines as well) 😀

Sadly, doing something logical like 1 == 0 will break the universe 😢

cevich · 2021-11-04T17:54:39Z

Results:

So as I suspected, the ginko seed is constant, but maybe all the "collision" message are helpful? I searched for a few, but didn't find any duplicate seeds mentioned 😢

cevich · 2021-11-04T18:00:42Z

Focusing a bit, from int remote fedora-34 root host, I see two clashes near the beginning, with different PIDs:

mtrmac · 2021-11-04T18:02:34Z

I guess the log messages need more improvement.

We don’t really care about how the Podman being tested seeds the RNG (which is all over the log), we want to see how the test runners seed their RNG. Looking at the logs, I’m not sure that is actually recorded; is it?
The seed logs should all include PIDs, so that we can actually tell which one is the latest.

cevich · 2021-11-04T18:06:59Z

Well if we're just going based on PIDs, it looks like podman pod container share Namespaces happened first then podman pod container dontshare PIDNS.

Taking a step back...we still haven't answered the most basic question of: Are we somehow unexpectedly sharing state w/in ginkgo, such that the socket filename is leaking from one test to another (i.e. the string, not the random id). In other words, maybe it's not a RNG problem at all.

cevich · 2021-11-04T18:10:31Z

we want to see how the test runners seed their RNG. Looking at the logs, I’m not sure that is actually recorded; is it?

If it's there, I don't see it. Ginkgo does report:

[+0135s] Running Suite: Libpod Suite
         ===========================
         Random Seed: 1636041696

Which matches the log messages, but this is all expected. My understanding is ginkgo uses this seed to randomize the order in which tests are run. You can specify it on the command-line (-seed) to force a specific order. But I'm not sure whether/not/how this seed relates to our tests, are you suggesting it is?

mtrmac · 2021-11-04T18:12:15Z

I’d like to see logrus output by the test process itself (the Go code that calls …/ginkgo/… packages); not logrus output of the Podman being tested (like podman container create). I’m not sure whether that’s the same output as the primary Ginkgo log tracking steps like It/BeforeEach, and listing what failed and how, or whether it’s a separate stream.

mtrmac · 2021-11-04T18:13:28Z

My hypothesis is that of 5 five rand.Seed calls, the Ginkgo -seed value is, via GinkgoRandomSeed, the one that “wins” and is used by all test runners, causing them to generate the same random values.

cevich · 2021-11-04T18:20:44Z

the Go code that calls …/ginkgo/… packages

Ginkgo is called by the Makefile

causing them to generate the same random values.

I think we can prove this by (for example) forcing it to always call ginkgo ... -seed 12345, and see if the socket filenames are all the same from one run to the next. Ya?

cevich · 2021-11-04T18:23:27Z

Maybe this is a dumb question...Shouldn't there be a call to stringid.init() somewhere in test/e2e/*.go? I find none.

Edit (answer): golang calls these init() functions automatically.

mtrmac · 2021-11-18T00:14:33Z

I have dropped all the extra logging. I don’t know how to find out the exact cause of the collisions, but in #12155 we at least have pointers to the seed only being 31 bits wide, which feels low enough that collisions just might happen from time to time. And anyway, this code should work regardless of the exact cause or exact frequency of the collisions.

mheon · 2021-11-18T16:34:05Z

Changes LGTM. Tests are red, I think this needs a rebase to pick up the upstream testing fix.

cevich · 2021-11-18T18:28:01Z

This LGTM also, but I think it should also include a change to test/utils.go to not use the super-duper evil GinkgoRandomSeed(). We don't want that anywhere in our code.

mtrmac · 2021-11-19T04:19:02Z

This LGTM also, but I think it should also include a change to test/utils.go to not use the super-duper evil GinkgoRandomSeed(). We don't want that anywhere in our code.

It might be interesting to discuss what RNG behavior we want,

A constant one, to maximize likelihood of collisions
The semi-constant GinkgoRandomSeed, to have likely collisions but to test different outcomes over time
An uniformly random one, to minimize collisions

… ultimately, at least per the log in #12155 (comment) (on macOS; I guess it’s possible that Linux uses a different order), nothing we do in test/utils.go actually matters.

If we want to control the RNG seeding, AFAICS all relevant code must use a private RNG, not the shared one. Until then, removing that line would look good but probably not change any outcome.

cevich · 2021-11-19T15:16:32Z

nothing we do in test/utils.go actually matters.

So it's effectively dead-code, let's just remove it then. Avoid causing any confusion and desk-pounding to some future developer who encounters it.

cevich · 2021-11-19T15:19:37Z

I didn't dig into it, but you may need to rebase again to fix the checkpoint tests failures (log). There was a flurry of PRs that went in right before my F35 one merged. Not sure if you picked them all up or not.

Edit: nevermind, you've got them. Those are new/flakes.

mtrmac · 2021-11-29T18:01:20Z

nothing we do in test/utils.go actually matters.

So it's effectively dead-code, let's just remove it then.

That’s a good point; adding a commit to drop that, as suggested.

github-actions · 2021-12-30T00:04:36Z

A friendly reminder that this PR had no activity for 30 days.

rhatdan · 2021-12-31T14:02:54Z

@cevich @edsantiago @mtrmac Any reason not to merge this?

Separate the code that determines the directory and file prefix from the code that chooses and applies a UUID; we will make the second part more complex in a bit. Should not change behavior. Signed-off-by: Miloslav Trmač <[email protected]>

Add lock files and re-generate the UUID if we are not a known-unique user of the socket path. Signed-off-by: Miloslav Trmač <[email protected]>

- It probably doesn't actually make a difference: in experiments, the github.com/containers/storage/pkg/stringid RNG initialization has been happening later - This makes the RNG caller-controlled (which we don't benefit from), but also the same on all nodes of multi-process Ginkgo execution. So, if it works at all, it may make collisions of random ID values more likely, and our tests are not robust against that. So don't go out of our way to make collisions more likely. Signed-off-by: Miloslav Trmač <[email protected]>

rhatdan · 2022-01-05T16:51:59Z

/lgtm

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 3, 2021

This was referenced Nov 3, 2021

Fix use of rand + improve entropy-exhaustion containers/storage#1051

Closed

Socket unexpectedly shared by multiple rootless tests #12155

Closed

cevich reviewed Nov 3, 2021

View reviewed changes

test/e2e/common_test.go Show resolved Hide resolved

cevich reviewed Nov 3, 2021

View reviewed changes

test/e2e/common_test.go Outdated Show resolved Hide resolved

edsantiago reviewed Nov 3, 2021

View reviewed changes

test/e2e/common_test.go Show resolved Hide resolved

cevich mentioned this pull request Nov 3, 2021

DO NOT MERGE: TESTING REMOTE SOCKET COLLISIONS #12169

Closed

mtrmac force-pushed the socket-collision branch from 23c040c to 2ebcc5c Compare November 3, 2021 16:56

mtrmac force-pushed the socket-collision branch from 2ebcc5c to 39eb455 Compare November 4, 2021 15:45

mtrmac marked this pull request as draft November 4, 2021 15:47

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 4, 2021

openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 9, 2021

mtrmac force-pushed the socket-collision branch from 39eb455 to f82f58f Compare November 17, 2021 23:44

openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 17, 2021

mtrmac linked an issue Nov 17, 2021 that may be closed by this pull request

Socket unexpectedly shared by multiple rootless tests #12155

Closed

mtrmac marked this pull request as ready for review November 18, 2021 00:14

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 18, 2021

mtrmac force-pushed the socket-collision branch from f82f58f to 5215c8e Compare November 19, 2021 04:12

mtrmac force-pushed the socket-collision branch from 5215c8e to f1714da Compare November 29, 2021 18:00

github-actions bot added the stale-pr label Dec 30, 2021

rhatdan removed the stale-pr label Dec 31, 2021

mtrmac added 3 commits January 3, 2022 12:31

Avoid collisions on RemoteSocket paths

90e74e7

Add lock files and re-generate the UUID if we are not a known-unique user of the socket path. Signed-off-by: Miloslav Trmač <[email protected]>

mtrmac force-pushed the socket-collision branch from f1714da to f6a3edd Compare January 3, 2022 11:32

openshift-ci bot assigned rhatdan Jan 5, 2022

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 5, 2022

openshift-merge-robot merged commit 2157414 into containers:main Jan 5, 2022

mtrmac deleted the socket-collision branch January 5, 2022 16:56

github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid RemoteSocket collisions in e2e tests #12168

Avoid RemoteSocket collisions in e2e tests #12168

mtrmac commented Nov 3, 2021

openshift-ci bot commented Nov 3, 2021

cevich commented Nov 3, 2021 •

edited

Loading

cevich commented Nov 4, 2021

cevich commented Nov 4, 2021

mtrmac commented Nov 4, 2021

mtrmac commented Nov 4, 2021

cevich commented Nov 4, 2021

cevich commented Nov 4, 2021

mtrmac commented Nov 4, 2021

cevich commented Nov 4, 2021

cevich commented Nov 4, 2021

cevich commented Nov 4, 2021

mtrmac commented Nov 4, 2021

cevich commented Nov 4, 2021

cevich commented Nov 4, 2021

mtrmac commented Nov 4, 2021

mtrmac commented Nov 4, 2021

cevich commented Nov 4, 2021

cevich commented Nov 4, 2021 •

edited

Loading

mtrmac commented Nov 18, 2021

mheon commented Nov 18, 2021

cevich commented Nov 18, 2021 •

edited

Loading

mtrmac commented Nov 19, 2021

cevich commented Nov 19, 2021

cevich commented Nov 19, 2021 •

edited

Loading

mtrmac commented Nov 29, 2021

github-actions bot commented Dec 30, 2021

rhatdan commented Dec 31, 2021

rhatdan commented Jan 5, 2022

Avoid RemoteSocket collisions in e2e tests #12168

Avoid RemoteSocket collisions in e2e tests #12168

Conversation

mtrmac commented Nov 3, 2021

What this PR does / why we need it:

How to verify it

Which issue(s) this PR fixes:

Special notes for your reviewer:

openshift-ci bot commented Nov 3, 2021

cevich commented Nov 3, 2021 • edited Loading

cevich commented Nov 4, 2021

cevich commented Nov 4, 2021

mtrmac commented Nov 4, 2021

mtrmac commented Nov 4, 2021

cevich commented Nov 4, 2021

cevich commented Nov 4, 2021

mtrmac commented Nov 4, 2021

cevich commented Nov 4, 2021

cevich commented Nov 4, 2021

cevich commented Nov 4, 2021

mtrmac commented Nov 4, 2021

cevich commented Nov 4, 2021

cevich commented Nov 4, 2021

mtrmac commented Nov 4, 2021

mtrmac commented Nov 4, 2021

cevich commented Nov 4, 2021

cevich commented Nov 4, 2021 • edited Loading

mtrmac commented Nov 18, 2021

mheon commented Nov 18, 2021

cevich commented Nov 18, 2021 • edited Loading

mtrmac commented Nov 19, 2021

cevich commented Nov 19, 2021

cevich commented Nov 19, 2021 • edited Loading

mtrmac commented Nov 29, 2021

github-actions bot commented Dec 30, 2021

rhatdan commented Dec 31, 2021

rhatdan commented Jan 5, 2022

cevich commented Nov 3, 2021 •

edited

Loading

cevich commented Nov 4, 2021 •

edited

Loading

cevich commented Nov 18, 2021 •

edited

Loading

cevich commented Nov 19, 2021 •

edited

Loading