
CI flake: podman run --conmon-pidfile: conmon not running? #7580

Closed

edsantiago opened this issue Sep 10, 2020 · 10 comments

@edsantiago (Member)

One of the system tests does podman run --conmon-pidfile=PATH, then reads the PID from PATH and confirms that /proc/PID/exe is a symlink pointing to .*/conmon. This test has flaked twice in the past month, with:

#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#|     FAIL: conmon pidfile (= PID 100491) points to conmon process
#| expected: '.*/conmon'
#|   actual: ''
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The two flakes:

Source code:

is "$(readlink /proc/$conmon_pid/exe)" ".*/conmon" \

This is so infrequent that it's hard to know where to start. I'm filing it anyway because:

  • Someone might think of a race condition in the conmon-pidfile code (e.g., maybe the code isn't flushing/closing the output file, so my test is reading an incomplete PID; should the PID above have been 100491X?).
  • Someone could suggest a way for me to instrument the test so the next failure is easier to diagnose (see the sketch below).
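
On the second point, one low-tech possibility: at the moment of failure, dump the raw pidfile contents and the /proc entry, so a truncated or stale PID shows up in the CI log. A minimal sketch (hypothetical, not committed code):

```bash
# Hypothetical debug instrumentation; quoting exposes truncation or whitespace.
echo "# pidfile contents: '$(cat $pidfile)'" >&2
ls -l /proc/$conmon_pid/ >&2 || echo "# PID $conmon_pid is gone" >&2
```
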
edsantiago added the flakes label Sep 10, 2020
@rhatdan (Member) commented Sep 10, 2020

This looks like the old race condition where we were not getting stdout back on the remote end. I believe that is fixed?

@edsantiago (Member, Author)

You mean #7195? That seems doubtful: one of the failures is non-remote, and regardless, the missing output is from readlink, which is not running under podman. Did I misunderstand?

@rhatdan (Member) commented Sep 10, 2020

OK, I did not read the actual code; I just saw the failure and it looked familiar.

@rhatdan (Member) commented Sep 10, 2020

Would this happen if $conmon_pid no longer existed?

@edsantiago (Member, Author)

Yes. Is that possible in a sleep infinity container?
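
For what it's worth, a vanished PID reproduces the empty actual exactly, since readlink prints nothing when the /proc entry doesn't exist:

```bash
# readlink on a nonexistent PID: no output, nonzero exit status
$ readlink /proc/999999999/exe; echo "exit status: $?"
exit status: 1
```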

edsantiago added a commit to edsantiago/libpod that referenced this issue Sep 10, 2020
- run tests: better "skip" message for docker-archive test;
  remove FIXME, document that podman-remote doesn't support it

- run tests: instrument the --conmon-pidfile test in hopes
  of tracking down flake containers#7580: cross-check pidfile against
  output of 'podman inspect', and add some debug messages
  that will only be seen on test failure.

- load tests: the pipe test: save and load a temporary tag,
  not $IMAGE. The primary reason is containers#7371, in which
  'podman load' assigns a new image ID (instead of preserving
  the saved one). This messes with our image management, and
  it turns out to be unfixable.

Signed-off-by: Ed Santiago <[email protected]>
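
The pidfile-vs-inspect cross-check mentioned above might look roughly like this (a sketch, not the committed code; podman container-inspect output exposes the conmon PID as .State.ConmonPid, and $cid here is illustrative):

```bash
# Sketch of the cross-check between the pidfile and podman inspect.
pidfile_pid=$(< $pidfile)
inspect_pid=$(podman inspect --format '{{.State.ConmonPid}}' $cid)
if [[ $pidfile_pid != "$inspect_pid" ]]; then
    echo "# pidfile says $pidfile_pid, podman inspect says $inspect_pid" >&2
fi
```
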
@rhatdan (Member) commented Sep 11, 2020

One would hope not, but that is the symptom.

lsm5 self-assigned this Sep 22, 2020
vrothberg added the In Progress label Sep 23, 2020
rhatdan added the kind/bug and kind/test-flake labels Oct 7, 2020
github-actions bot commented Nov 7, 2020

A friendly reminder that this issue had no activity for 30 days.

@rhatdan (Member) commented Nov 7, 2020

@edsantiago is this still an issue? I see a couple of merged PRs that reference it.

@edsantiago (Member, Author)

@giuseppe explicitly mentions in containers/crun#505 that it does not address this issue.

But I have not seen it since Sept 10, when I added two statements (an echo and an ls) intended to help debug the flake. My hunch is that the flake is a race condition, very likely caused by trying to read the pidfile before conmon has written it, and that my debug statements take just long enough to mask this. Absent an explicit diagnosis and fix, I'm not willing to close this, because that just seems to me like sweeping it under the rug. I will not interfere with you or someone else closing it, though. If it bites someone in the community, I trust they will comment here and someone can reopen.
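
If that hunch is right, polling the pidfile instead of reading it exactly once would both confirm the theory and harden the test. A minimal sketch, with arbitrary timeout values:

```bash
# Wait up to ~5s for the pidfile to hold a complete PID of a live process,
# rather than reading it a single time.
for i in $(seq 1 50); do
    conmon_pid=$(cat $pidfile 2>/dev/null)
    if [[ $conmon_pid =~ ^[0-9]+$ ]] && [[ -e /proc/$conmon_pid/exe ]]; then
        break
    fi
    sleep 0.1
done
```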

@rhatdan (Member) commented Nov 10, 2020

I will close and we can reopen when it happens again.

rhatdan closed this as completed Nov 10, 2020
github-actions bot added the locked - please file new issue/PR label Sep 22, 2023
github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023