Avoid RemoteSocket collisions in e2e tests #12168
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mtrmac The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Opened test PR that incorporates the failing PR and this one, then only runs the remote tests where collisions were observed. |
23c040c
to
2ebcc5c
Compare
@mtrmac (and @edsantiago if you're interested) so the good news is your changes here work, they allow the tests to pass collision free. The bad news is...searching for "RemoteSocket collision" log messages finds a TON of hits. Comment with links to annotated logs. |
Ultra-curious: I searched through the logs for duplicate mentions of the seed and time from two collision messages. I found none. So somehow we're generating (or leaking) duplicate IDs without hitting the new logging code 😦
If this is about the At this point, for the hypothesis in #12155 (comment) , it would be useful to log:
|
2ebcc5c
to
39eb455
Compare
I have updated this PR to include such logs, and marked it as draft because it modifies vendored code, which we definitely don’t want in a production release. (@cevich the vendor edits might also cause test failures, if so I’d appreciate if you could include the extra logs and bypass those failures in your #12169 .) |
Happy to, that's why I made that PR... |
Clarification: "include the extra logs" - do I need to do anything special for this? I think we already run ginkgo in |
Include today’s version of the changes in this PR (and on top, add something that bypasses the failures already visible in this PR, caused by edits to
OIC, yes I simply rebased the test PR on top of this one. The test PR already skips all tasks except for 'build' and 'remote XYZ' (so validate, consistency, etc. are all skipped already). And just FYI (in case I'm not around) this is done by getting out your butter-knife, and spreading around Sadly, doing something logical like
Results: So as I suspected, the ginkgo seed is constant, but maybe all the "collision" messages are helpful? I searched for a few, but didn't find any duplicate seeds mentioned 😢
Focusing a bit, from |
I guess the log messages need more improvement.
|
Well if we're just going based on PIDs, it looks like podman pod container share Namespaces happened first then podman pod container dontshare PIDNS. Taking a step back...we still haven't answered the most basic question of: Are we somehow unexpectedly sharing state w/in ginkgo, such that the socket filename is leaking from one test to another (i.e. the string, not the random id). In other words, maybe it's not a RNG problem at all. |
If it's there, I don't see it. Ginkgo does report:
Which matches the log messages, but this is all expected. My understanding is ginkgo uses this seed to randomize the order in which tests are run. You can specify it on the command-line ( |
I’d like to see |
My hypothesis is that of 5 five |
Ginkgo is called by the Makefile
I think we can prove this by (for example) forcing it to always call |
Maybe this is a dumb question...Shouldn't there be a call to Edit (answer): golang calls these |
39eb455
to
f82f58f
Compare
I have dropped all the extra logging. I don’t know how to find out the exact cause of the collisions, but in #12155 we at least have pointers to the seed only being 31 bits wide, which feels low enough that collisions just might happen from time to time. And anyway, this code should work regardless of the exact cause or exact frequency of the collisions. |
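To put the "31 bits feels low" intuition into numbers, here is a minimal sketch of the standard birthday-bound approximation. It assumes seeds are drawn independently and uniformly from a 31-bit space, which nothing in this thread confirms, and the seed counts in the loop are purely hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProbability approximates the chance that at least two of n
// independent, uniformly drawn values from a space of size d are equal,
// using the birthday bound 1 - exp(-n(n-1)/(2d)).
func collisionProbability(n, d float64) float64 {
	// -Expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
	return -math.Expm1(-n * (n - 1) / (2 * d))
}

func main() {
	space := math.Exp2(31) // assumed 31-bit seed space, per the pointers in #12155

	// Hypothetical numbers of seeds drawn; not measured from CI.
	for _, n := range []float64{100, 1000, 10000, 50000} {
		fmt.Printf("%6.0f seeds: P(collision) ~ %.6f\n", n, collisionProbability(n, space))
	}
}
```

The probability grows roughly with the square of the number of draws, so how plausible this explanation is depends on how many seeds are actually generated across CI runs.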
Changes LGTM. Tests are red, I think this needs a rebase to pick up the upstream testing fix. |
This LGTM also, but I think it should also include a change to |
f82f58f
to
5215c8e
Compare
It might be interesting to discuss what RNG behavior we want,
… ultimately, at least per the log in #12155 (comment) (on macOS; I guess it’s possible that Linux uses a different order), nothing we do in If we want to control the RNG seeding, AFAICS all relevant code must use a private RNG, not the shared one. Until then, removing that line would look good but probably not change any outcome. |
So it's effectively dead-code, let's just remove it then. Avoid causing any confusion and desk-pounding to some future developer who encounters it. |
I didn't dig into it, but you may need to rebase again to fix the checkpoint tests failures (log). There was a flurry of PRs that went in right before my F35 one merged. Not sure if you picked them all up or not. Edit: nevermind, you've got them. Those are new/flakes. |
5215c8e
to
f1714da
Compare
That’s a good point; adding a commit to drop that, as suggested. |
A friendly reminder that this PR had no activity for 30 days. |
@cevich @edsantiago @mtrmac Any reason not to merge this? |
Separate the code that determines the directory and file prefix from the code that chooses and applies a UUID; we will make the second part more complex in a bit. Should not change behavior. Signed-off-by: Miloslav Trmač <[email protected]>
Add lock files and re-generate the UUID if we are not a known-unique user of the socket path. Signed-off-by: Miloslav Trmač <[email protected]>
- It probably doesn't actually make a difference: in experiments, the github.com/containers/storage/pkg/stringid RNG initialization has been happening later
- This makes the RNG caller-controlled (which we don't benefit from), but also the same on all nodes of multi-process Ginkgo execution. So, if it works at all, it may make collisions of random ID values more likely, and our tests are not robust against that. So don't go out of our way to make collisions more likely.
Signed-off-by: Miloslav Trmač <[email protected]>
f1714da
to
f6a3edd
Compare
/lgtm |
What this PR does / why we need it:
Per #12155, we see RemoteSocket collisions. So, RemoteSocket reuse.
How to verify it
CI should continue to run. (I didn’t test this manually even once.) Probably new log entries.
Which issue(s) this PR fixes:
Possibly #12155 , unknown.
Special notes for your reviewer: