Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CI flake on time of shutdown for API service #11927

Merged
merged 1 commit into from
Oct 12, 2021

Conversation

jwhonce
Copy link
Member

@jwhonce jwhonce commented Oct 12, 2021

  • To aid in debugging add PID to shutdown package logging
  • Added new message for forced service shutdown
  • Always wait for HTTP server to shutdown, duration of 0 not friendly
    to clients

Note: The log event

IdleTracker: StateClosed transition by connection marked un-managed

denotes a TCP connection has been initiated but no HTTP request was sent.

Fixes #11921

Signed-off-by: Jhon Honce [email protected]

What this PR does / why we need it:

How to verify it

Which issue(s) this PR fixes:

Special notes for your reviewer:

@jwhonce jwhonce added flakes Flakes from Continuous Integration kind/test-flake Categorizes issue or PR as related to test flakes. labels Oct 12, 2021
@jwhonce jwhonce requested review from edsantiago and mheon October 12, 2021 01:17
@jwhonce jwhonce self-assigned this Oct 12, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 12, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jwhonce

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 12, 2021
@edsantiago
Copy link
Member

Hmmmm, is this (f34) helpful?

time="2021-10-11T20:54:30-05:00" level=info msg="Received shutdown signal \"interrupt\", terminating!" PID=127323
time="2021-10-11T20:54:30-05:00" level=info msg="Invoking shutdown handler \"libpod\"" PID=127323
time="2021-10-11T20:54:36-05:00" level=debug msg="Completed shutdown handler \"service\"" PID=127341
time="2021-10-11T20:54:36-05:00" level=info msg="Invoking shutdown handler \"libpod\"" PID=127341

Or this (ubuntu)?

time="2021-10-12T01:47:47Z" level=debug msg="API service shutdown by /usr/local/go/src/sync/once.go:66, 0/4 connection(s)"
time="2021-10-12T01:47:47Z" level=debug msg="API service forced shutdown, ignoring timeout Duration"
time="2021-10-12T01:47:51Z" level=debug msg="Completed shutdown handler \"service\"" PID=70478
time="2021-10-12T01:47:51Z" level=info msg="Invoking shutdown handler \"libpod\"" PID=70478

@jwhonce
Copy link
Member Author

jwhonce commented Oct 12, 2021

@edsantiago For debugging those messages tell me the which shutdown handlers have been called and properly returned. The libpod one never returns since it calls os.Exit().

@edsantiago
Copy link
Member

I meant the 6-second delay in one, 4-second in the other

@jwhonce
Copy link
Member Author

jwhonce commented Oct 12, 2021

@edsantiago Ah, I misunderstood. I wanted to get this PR in to take care of the CI flake. There will be a follow on PR to resolve what is taking so long to shutdown.

@jwhonce jwhonce force-pushed the issues/11921 branch 2 times, most recently from a4a849f to 3d25836 Compare October 12, 2021 15:41
@edsantiago
Copy link
Member

I can reproduce this on my laptop, 100%, by changing the timeout back to 2 (or even 5) and running:

$ ~/go/bin/ginkgo -trace -debug -r -focus "are not available" test/e2e
...
time="2021-10-12T10:26:27-06:00" level=debug msg="API service shutdown request ignored as timeout Duration is UnlimitedService"
time="2021-10-12T10:26:28-06:00" level=debug msg="IdleTracker:new 0m+0h/0t connection(s)" X-Reference-Id=0xc0006aadd0
time="2021-10-12T10:26:28-06:00" level=debug msg="IdleTracker:new 0m+0h/1t connection(s)" X-Reference-Id=0xc0006aadd8
time="2021-10-12T10:26:28-06:00" level=debug msg="IdleTracker:new 0m+0h/2t connection(s)" X-Reference-Id=0xc0006aade0
time="2021-10-12T10:26:28-06:00" level=debug msg="IdleTracker:new 0m+0h/3t connection(s)" X-Reference-Id=0xc0006aade8
time="2021-10-12T10:26:28-06:00" level=info msg="Received shutdown signal \"interrupt\", terminating!" PID=3186527
time="2021-10-12T10:26:28-06:00" level=info msg="Invoking shutdown handler \"service\"" PID=3186527
time="2021-10-12T10:26:28-06:00" level=debug msg="API service forced shutdown, ignoring timeout Duration"
time="2021-10-12T10:26:28-06:00" level=debug msg="API service shutdown, 0/4 connection(s)"
time="2021-10-12T10:26:28-06:00" level=debug msg="IdleTracker:closed 0m+0h/4t connection(s)" X-Reference-Id=0xc0006aade8
time="2021-10-12T10:26:28-06:00" level=warning msg="IdleTracker: StateClosed transition by connection marked un-managed" X-Reference-Id=0xc0006aade8
time="2021-10-12T10:26:28-06:00" level=debug msg="API service forced shutdown, ignoring timeout Duration"

------------------------------
• Failure [6.735 seconds]
podman system service
/home/esm/src/atomic/2018-02.podman/libpod/test/e2e/system_service_test.go:18
  verify pprof endpoints
  /home/esm/src/atomic/2018-02.podman/libpod/test/e2e/system_service_test.go:54
    are not available [It]
    /home/esm/src/atomic/2018-02.podman/libpod/test/e2e/system_service_test.go:97

    Timed out after 5.001s.
    Expected process to exit.  It did not.

Unfortunately, the debug messages aren't helpful to me; but maybe you can try to reproduce, and if you can, track down the source of the hang?

PS I focus on the "are not available" test because it is the (much) simpler cleaner one.

* Increase timeout for tests to 10s
* To aid in debugging add PID to shutdown package logging
* Added new message for forced service shutdown
* Always wait for HTTP server to shutdown, duration of 0 not friendly
  to clients

Note: The log event

"IdleTracker: StateClosed transition by connection marked un-managed"

denotes a TCP connection has been initiated but no HTTP request was sent.
And is expected during these tests.

Fixes containers#11921

Signed-off-by: Jhon Honce <[email protected]>
@edsantiago
Copy link
Member

This is the source of the 6-second delay: https://github.com/containers/podman/blob/main/pkg/api/server/server.go#L292

And, I think it's a problem with cleanup here: https://github.com/containers/podman/blob/main/test/e2e/system_service_test.go#L121-L135

If I replace this function with time.Sleep(1 * time.Second), there is no shutdown delay. Is it possible that conn.Close() is not sufficient to fully close the connection? Or that net.http.Server is thinking that this connection is still active?

@jwhonce
Copy link
Member Author

jwhonce commented Oct 12, 2021

@edsantiago Turns out Shutdown() was waiting for the length of context.WithTimeout() rather than exiting early. For --time=0, I have changed the default to 1s and tests to Wait(2*time.Second). I have seen no flakes on my laptop. I'm leaving the extra debug logging in the code in case this becomes an issue on some upgrade of Go.

Additionally, I removed some expensive debugging calls that no longer reported meaningful information.

@edsantiago
Copy link
Member

/lgtm
/hold

@jwhonce thank you!

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2021
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 12, 2021
@rhatdan
Copy link
Member

rhatdan commented Oct 12, 2021

Nice work @jwhonce and @edsantiago

@edsantiago
Copy link
Member

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2021
@openshift-merge-robot openshift-merge-robot merged commit 03095bd into containers:main Oct 12, 2021
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. flakes Flakes from Continuous Integration kind/test-flake Categorizes issue or PR as related to test flakes. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

podman system service: takes 4-6s to clean up on interrupt
4 participants