Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid RemoteSocket collisions in e2e tests #12168

Merged
merged 3 commits into from
Jan 5, 2022

Conversation

mtrmac
Copy link
Collaborator

@mtrmac mtrmac commented Nov 3, 2021

What this PR does / why we need it:

Per #12155, we see RemoteSocket collisions. So,

  • Add a lock file mechanism to ensure we only generate a RemoteSocket value once, removing even the slightest possibility of a flake due to a collision in ID values
  • Add logging on when/where the RemoteSocket values are generated, to maybe help diagnose unexpected RemoteSocket reuse.
  • Add somewhat desperate logging when a collision in the generated ID happens, to maybe explain how it can happen (with such a fairly high frequency) in the first place.

How to verify it

CI should continue to run. (I didn’t test this manually even once.) Probably new log entries.

Which issue(s) this PR fixes:

Possibly #12155 , unknown.

Special notes for your reviewer:

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 3, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mtrmac

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 3, 2021
test/e2e/common_test.go Outdated Show resolved Hide resolved
@cevich
Copy link
Member

cevich commented Nov 3, 2021

Opened test PR that incorporates the failing PR and this one, then only runs the remote tests where collisions were observed.

@cevich
Copy link
Member

cevich commented Nov 4, 2021

@mtrmac (and @edsantiago if you're interested) so the good news is your changes here work, they allow the tests to pass collision free. The bad news is...searching for "RemoteSocket collision" log messages finds a TON of hits. Comment with links to annotated logs.

@cevich
Copy link
Member

cevich commented Nov 4, 2021

Ultra-curios: I searched through the logs for duplicate mentions of the seed and time from two collision messages. I found none. So somehow we're generating (or leaking) a duplicate IDs without hitting the new logging code 😦

@mtrmac
Copy link
Collaborator Author

mtrmac commented Nov 4, 2021

Ultra-curios: I searched through the logs for duplicate mentions of the seed and time from two collision messages. I found none

If this is about the cryptoseed values, that’s not too surprising: those values are freshly generated, not actually used to generate the IDs. They are logged primarily for the accompanying error value (and perhaps to see if the seed values are weird, like all zeros or something).

At this point, for the hypothesis in #12155 (comment) , it would be useful to log:

  • GinkgoRandomSeed + PID, to confirm all tests share the seed
  • Every math/rand.seed call (and possibly the value) + PID, to see which of the seed calls actually matters.

@mtrmac
Copy link
Collaborator Author

mtrmac commented Nov 4, 2021

I have updated this PR to include such logs, and marked it as draft because it modifies vendored code, which we definitely don’t want in a production release.

(@cevich the vendor edits might also cause test failures, if so I’d appreciate if you could include the extra logs and bypass those failures in your #12169 .)

@mtrmac mtrmac marked this pull request as draft November 4, 2021 15:47
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 4, 2021
@cevich
Copy link
Member

cevich commented Nov 4, 2021

I’d appreciate if you could include the extra logs and bypass those failures

Happy to, that's why I made that PR...

@cevich
Copy link
Member

cevich commented Nov 4, 2021

Clarification: "include the extra logs" - do I need to do anything special for this? I think we already run ginkgo in -v (verbose) mode, so will that capture all the warning-level messages?

@mtrmac
Copy link
Collaborator Author

mtrmac commented Nov 4, 2021

Clarification: "include the extra logs" - do I need to do anything special for this?

Include the today’s version of the changes in this PR (and on top, add something that bypasses the failures already visible in this PR, caused by edits to vendor/*)

@cevich
Copy link
Member

cevich commented Nov 4, 2021

OIC, yes I simply rebased the test PR ontop of this one. The test PR already skips all tasks except for 'build' and 'remote XYZ' (so validate, consistency, etc. are all skipped already). And just for FYI (in case I'm not around) this is done by getting out your butter-knife, and spreading around skip: $CI == $CI everywhere in .cirrus.yml (and clobbering all only_if lines as well) 😀

Sadly, doing something logical like 1 == 0 will break the universe 😢

@cevich
Copy link
Member

cevich commented Nov 4, 2021

Results:

So as I suspected, the ginko seed is constant, but maybe all the "collision" message are helpful? I searched for a few, but didn't find any duplicate seeds mentioned 😢

@cevich
Copy link
Member

cevich commented Nov 4, 2021

Focusing a bit, from int remote fedora-34 root host, I see two clashes near the beginning, with different PIDs:

  1. podman pod container dontshare PIDNS
  2. podman pod container share Namespaces

@mtrmac
Copy link
Collaborator Author

mtrmac commented Nov 4, 2021

I guess the log messages need more improvement.

  • We don’t really care about how the Podman being tested seeds the RNG (which is all over the log), we want to see how the test runners seed their RNG. Looking at the logs, I’m not sure that is actually recorded; is it?
  • The seed logs should all include PIDs, so that we can actually tell which one is the latest.

@cevich
Copy link
Member

cevich commented Nov 4, 2021

Well if we're just going based on PIDs, it looks like podman pod container share Namespaces happened first then podman pod container dontshare PIDNS.

Taking a step back...we still haven't answered the most basic question of: Are we somehow unexpectedly sharing state w/in ginkgo, such that the socket filename is leaking from one test to another (i.e. the string, not the random id). In other words, maybe it's not a RNG problem at all.

@cevich
Copy link
Member

cevich commented Nov 4, 2021

we want to see how the test runners seed their RNG. Looking at the logs, I’m not sure that is actually recorded; is it?

If it's there, I don't see it. Ginkgo does report:

[+0135s] Running Suite: Libpod Suite
         ===========================
         Random Seed: 1636041696

Which matches the log messages, but this is all expected. My understanding is ginkgo uses this seed to randomize the order in which tests are run. You can specify it on the command-line (-seed) to force a specific order. But I'm not sure whether/not/how this seed relates to our tests, are you suggesting it is?

@mtrmac
Copy link
Collaborator Author

mtrmac commented Nov 4, 2021

I’d like to see logrus output by the test process itself (the Go code that calls …/ginkgo/… packages); not logrus output of the Podman being tested (like podman container create). I’m not sure whether that’s the same output as the primary Ginkgo log tracking steps like It/BeforeEach, and listing what failed and how, or whether it’s a separate stream.

@mtrmac
Copy link
Collaborator Author

mtrmac commented Nov 4, 2021

My hypothesis is that of 5 five rand.Seed calls, the Ginkgo -seed value is, via GinkgoRandomSeed, the one that “wins” and is used by all test runners, causing them to generate the same random values.

@cevich
Copy link
Member

cevich commented Nov 4, 2021

the Go code that calls …/ginkgo/… packages

Ginkgo is called by the Makefile

causing them to generate the same random values.

I think we can prove this by (for example) forcing it to always call ginkgo ... -seed 12345, and see if the socket filenames are all the same from one run to the next. Ya?

@cevich
Copy link
Member

cevich commented Nov 4, 2021

Maybe this is a dumb question...Shouldn't there be a call to stringid.init() somewhere in test/e2e/*.go? I find none.

Edit (answer): golang calls these init() functions automatically.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 9, 2021
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 17, 2021
@mtrmac mtrmac linked an issue Nov 17, 2021 that may be closed by this pull request
@mtrmac
Copy link
Collaborator Author

mtrmac commented Nov 18, 2021

I have dropped all the extra logging. I don’t know how to find out the exact cause of the collisions, but in #12155 we at least have pointers to the seed only being 31 bits wide, which feels low enough that collisions just might happen from time to time. And anyway, this code should work regardless of the exact cause or exact frequency of the collisions.

@mtrmac mtrmac marked this pull request as ready for review November 18, 2021 00:14
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 18, 2021
@mheon
Copy link
Member

mheon commented Nov 18, 2021

Changes LGTM. Tests are red, I think this needs a rebase to pick up the upstream testing fix.

@cevich
Copy link
Member

cevich commented Nov 18, 2021

This LGTM also, but I think it should also include a change to test/utils.go to not use the super-duper evil GinkgoRandomSeed(). We don't want that anywhere in our code.

@mtrmac
Copy link
Collaborator Author

mtrmac commented Nov 19, 2021

This LGTM also, but I think it should also include a change to test/utils.go to not use the super-duper evil GinkgoRandomSeed(). We don't want that anywhere in our code.

It might be interesting to discuss what RNG behavior we want,

  1. A constant one, to maximize likelihood of collisions
  2. The semi-constant GinkgoRandomSeed, to have likely collisions but to test different outcomes over time
  3. An uniformly random one, to minimize collisions

… ultimately, at least per the log in #12155 (comment) (on macOS; I guess it’s possible that Linux uses a different order), nothing we do in test/utils.go actually matters.

If we want to control the RNG seeding, AFAICS all relevant code must use a private RNG, not the shared one. Until then, removing that line would look good but probably not change any outcome.

@cevich
Copy link
Member

cevich commented Nov 19, 2021

nothing we do in test/utils.go actually matters.

So it's effectively dead-code, let's just remove it then. Avoid causing any confusion and desk-pounding to some future developer who encounters it.

@cevich
Copy link
Member

cevich commented Nov 19, 2021

I didn't dig into it, but you may need to rebase again to fix the checkpoint tests failures (log). There was a flurry of PRs that went in right before my F35 one merged. Not sure if you picked them all up or not.

Edit: nevermind, you've got them. Those are new/flakes.

@mtrmac
Copy link
Collaborator Author

mtrmac commented Nov 29, 2021

nothing we do in test/utils.go actually matters.

So it's effectively dead-code, let's just remove it then.

That’s a good point; adding a commit to drop that, as suggested.

@github-actions
Copy link

A friendly reminder that this PR had no activity for 30 days.

@rhatdan
Copy link
Member

rhatdan commented Dec 31, 2021

@cevich @edsantiago @mtrmac Any reason not to merge this?

@rhatdan rhatdan removed the stale-pr label Dec 31, 2021
mtrmac added 3 commits January 3, 2022 12:31
Separate the code that determines the directory and file prefix
from the code that chooses and applies a UUID; we will make the
second part more complex in a bit.

Should not change behavior.

Signed-off-by: Miloslav Trmač <[email protected]>
Add lock files and re-generate the UUID if we
are not a known-unique user of the socket path.

Signed-off-by: Miloslav Trmač <[email protected]>
- It probably doesn't actually make a difference: in experiments,
  the github.com/containers/storage/pkg/stringid RNG initialization
  has been happening later
- This makes the RNG caller-controlled (which we don't benefit from),
  but also the same on all nodes of multi-process Ginkgo execution.
  So, if it works at all, it may make collisions of random ID values
  more likely, and our tests are not robust against that. So don't
  go out of our way to make collisions more likely.

Signed-off-by: Miloslav Trmač <[email protected]>
@rhatdan
Copy link
Member

rhatdan commented Jan 5, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 5, 2022
@openshift-merge-robot openshift-merge-robot merged commit 2157414 into containers:main Jan 5, 2022
@mtrmac mtrmac deleted the socket-collision branch January 5, 2022 16:56
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Socket unexpectedly shared by multiple rootless tests
6 participants