Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI flake: sdnotify-conmon test: getting MAINPID=1 instead of READY #8718

Closed
edsantiago opened this issue Dec 14, 2020 · 6 comments
Closed

CI flake: sdnotify-conmon test: getting MAINPID=1 instead of READY #8718

edsantiago opened this issue Dec 14, 2020 · 6 comments
Assignees
Labels
flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. stale-issue

Comments

@edsantiago
Copy link
Member

Symptom: the sdnotify : conmon test in 260-sdnotify.bats is flaking a lot:

# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #|     FAIL: final output from sdnotify
# #| expected: 'READY=1'
# #|   actual: 'MAINPID=98229'
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is a reminder for Ed to take a look at

is "${lines[-1]}" "READY=1" "final output from sdnotify"

@edsantiago edsantiago self-assigned this Dec 14, 2020
@edsantiago edsantiago added flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. labels Dec 14, 2020
@greg-flexidao
Copy link

Didn't create issue yet for that, but it looks like it's belong here...

I just upgraded from 2.2.0 to 2.2.1, I get (timeout) error on all (6) different services running with systemd and sdnotify, that in turn makes all services to restart every 90s.

systemd config:

[Unit]
Description=Podman $SERVICE_NAME.service
Wants=network.target
After=network-online.target

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%N
Environment=XDG_RUNTIME_DIR=/run/user/$USER_ID
User=$USER_SERVICE_NAME
Group=$USER_SERVICE_NAME
Restart=always
ExecStartPre=-/usr/bin/podman rm --ignore -f $SERVICE_NAME
ExecStart=/usr/bin/podman run --log-level=debug --sdnotify=conmon \\
      --name $SERVICE_NAME \\
      $SERVICE_IMAGE_ARGS \\
      $SERVICE_IMAGE:$SERVICE_IMAGE_TAG \\
      $SERVICE_RUN_ARGS
ExecStop=/usr/bin/podman stop -t 10 $SERVICE_NAME
ExecStopPost=-/usr/bin/podman rm --ignore -f $SERVICE_NAME

Type=notify
KillMode=none
NotifyAccess=all

[Install]
WantedBy=multi-user.target

Logs are also not showing anything in particular, the services work normally for ~90s, yet all are stuck in activating (start) stage, before being restarted with following error:

Dec 14 16:27:05 ethnode podman[18730]: level=info ts=2020-12-14T16:27:05.617Z caller=main.go:694 msg="Server is ready to receive web requests."
Dec 14 16:28:35 ethnode systemd[1]: prometheus.service: start operation timed out. Terminating.
Dec 14 16:28:35 ethnode systemd[1]: prometheus.service: Failed with result 'timeout'.
Dec 14 16:28:35 ethnode systemd[1]: Failed to start Podman prometheus.service.
Dec 14 16:28:35 ethnode systemd[1]: prometheus.service: Scheduled restart job, restart counter is at 1.
Dec 14 16:28:35 ethnode systemd[1]: Stopped Podman prometheus.service.

@edsantiago
Copy link
Member Author

@bytniak could you please file this as a new issue? I'm 99.99% sure that what you're seeing is unrelated to the flake reported in this issue. sdnotify hanging is a classic failure mode with many, many different possible causes (eg wrong setting of $NOTIFY_SOCKET, wrong version of conmon). This issue is about a very particular race condition in CI testing. Thank you!

@edsantiago
Copy link
Member Author

Got lucky with #8720: the flake triggered and shows that we are seeing:

# socat log:
# MAINPID=103530
# READY=1
# MAINPID=103530

edsantiago added a commit to edsantiago/libpod that referenced this issue Dec 14, 2020
- run test: minor cleanup to .containerenv test. Basically,
  make it do only two podman-runs (they're expensive) and
  tighten up the results checks

- ps test: add ps -a --storage. Requires small tweak to
  run_podman helper, so we can have "timeout" be an expected
  result

- sdnotify test: workaround for containers#8718 (seeing MAINPID=xxx as
  last output line instead of READY=1). As found by the
  newly-added debugging echos, what we are seeing is:

      MAINPID=103530
      READY=1
      MAINPID=103530

  It's not supposed to be that way; it's supposed to be just
  the first two. But when faced with reality, we must bend
  to accommodate it, so let's accept READY=1 anywhere in
  the output stream, not just as the last line.

Signed-off-by: Ed Santiago <[email protected]>
@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Copy link
Member Author

This has not recurred since #8720 merged - but that was a sweep-it-under-the-rug workaround. The real problem may or may not persist (I don't know, because there's no way to track it); and if it does, it may or may not be serious (I don't know, because I'm not systemd-savvy enough); and if it's serious, I don't know if it's a podman problem, or conmon, or systemd itself.

I suspect our only rational option here is to close but I'd like someone else to confirm. @giuseppe ? @mheon ?

@mheon
Copy link
Member

mheon commented Jan 14, 2021

I think we can close until and unless we see something similar in the wild - chasing sdnotify bugs is difficult and does not seem to have much payoff for us in general.

@rhatdan rhatdan closed this as completed Jan 14, 2021
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. stale-issue
Projects
None yet
Development

No branches or pull requests

4 participants