Fix CI flake on time of shutdown for API service #11927

jwhonce · 2021-10-12T01:17:38Z

To aid in debugging add PID to shutdown package logging
Added new message for forced service shutdown
Always wait for HTTP server to shutdown, duration of 0 not friendly
to clients

Note: The log event

IdleTracker: StateClosed transition by connection marked un-managed

denotes a TCP connection has been initiated but no HTTP request was sent.

Fixes #11921

Signed-off-by: Jhon Honce [email protected]

What this PR does / why we need it:

How to verify it

Which issue(s) this PR fixes:

Special notes for your reviewer:

openshift-ci · 2021-10-12T01:17:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jwhonce

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jwhonce]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

edsantiago · 2021-10-12T11:47:28Z

Hmmmm, is this (f34) helpful?

time="2021-10-11T20:54:30-05:00" level=info msg="Received shutdown signal \"interrupt\", terminating!" PID=127323
time="2021-10-11T20:54:30-05:00" level=info msg="Invoking shutdown handler \"libpod\"" PID=127323
time="2021-10-11T20:54:36-05:00" level=debug msg="Completed shutdown handler \"service\"" PID=127341
time="2021-10-11T20:54:36-05:00" level=info msg="Invoking shutdown handler \"libpod\"" PID=127341

Or this (ubuntu)?

time="2021-10-12T01:47:47Z" level=debug msg="API service shutdown by /usr/local/go/src/sync/once.go:66, 0/4 connection(s)"
time="2021-10-12T01:47:47Z" level=debug msg="API service forced shutdown, ignoring timeout Duration"
time="2021-10-12T01:47:51Z" level=debug msg="Completed shutdown handler \"service\"" PID=70478
time="2021-10-12T01:47:51Z" level=info msg="Invoking shutdown handler \"libpod\"" PID=70478

jwhonce · 2021-10-12T14:52:33Z

@edsantiago For debugging those messages tell me the which shutdown handlers have been called and properly returned. The libpod one never returns since it calls os.Exit().

edsantiago · 2021-10-12T14:55:22Z

I meant the 6-second delay in one, 4-second in the other

jwhonce · 2021-10-12T15:04:58Z

@edsantiago Ah, I misunderstood. I wanted to get this PR in to take care of the CI flake. There will be a follow on PR to resolve what is taking so long to shutdown.

edsantiago · 2021-10-12T16:28:36Z

I can reproduce this on my laptop, 100%, by changing the timeout back to 2 (or even 5) and running:

$ ~/go/bin/ginkgo -trace -debug -r -focus "are not available" test/e2e
...
time="2021-10-12T10:26:27-06:00" level=debug msg="API service shutdown request ignored as timeout Duration is UnlimitedService"
time="2021-10-12T10:26:28-06:00" level=debug msg="IdleTracker:new 0m+0h/0t connection(s)" X-Reference-Id=0xc0006aadd0
time="2021-10-12T10:26:28-06:00" level=debug msg="IdleTracker:new 0m+0h/1t connection(s)" X-Reference-Id=0xc0006aadd8
time="2021-10-12T10:26:28-06:00" level=debug msg="IdleTracker:new 0m+0h/2t connection(s)" X-Reference-Id=0xc0006aade0
time="2021-10-12T10:26:28-06:00" level=debug msg="IdleTracker:new 0m+0h/3t connection(s)" X-Reference-Id=0xc0006aade8
time="2021-10-12T10:26:28-06:00" level=info msg="Received shutdown signal \"interrupt\", terminating!" PID=3186527
time="2021-10-12T10:26:28-06:00" level=info msg="Invoking shutdown handler \"service\"" PID=3186527
time="2021-10-12T10:26:28-06:00" level=debug msg="API service forced shutdown, ignoring timeout Duration"
time="2021-10-12T10:26:28-06:00" level=debug msg="API service shutdown, 0/4 connection(s)"
time="2021-10-12T10:26:28-06:00" level=debug msg="IdleTracker:closed 0m+0h/4t connection(s)" X-Reference-Id=0xc0006aade8
time="2021-10-12T10:26:28-06:00" level=warning msg="IdleTracker: StateClosed transition by connection marked un-managed" X-Reference-Id=0xc0006aade8
time="2021-10-12T10:26:28-06:00" level=debug msg="API service forced shutdown, ignoring timeout Duration"

------------------------------
• Failure [6.735 seconds]
podman system service
/home/esm/src/atomic/2018-02.podman/libpod/test/e2e/system_service_test.go:18
  verify pprof endpoints
  /home/esm/src/atomic/2018-02.podman/libpod/test/e2e/system_service_test.go:54
    are not available [It]
    /home/esm/src/atomic/2018-02.podman/libpod/test/e2e/system_service_test.go:97

    Timed out after 5.001s.
    Expected process to exit.  It did not.

Unfortunately, the debug messages aren't helpful to me; but maybe you can try to reproduce, and if you can, track down the source of the hang?

PS I focus on the "are not available" test because it is the (much) simpler cleaner one.

* Increase timeout for tests to 10s * To aid in debugging add PID to shutdown package logging * Added new message for forced service shutdown * Always wait for HTTP server to shutdown, duration of 0 not friendly to clients Note: The log event "IdleTracker: StateClosed transition by connection marked un-managed" denotes a TCP connection has been initiated but no HTTP request was sent. And is expected during these tests. Fixes containers#11921 Signed-off-by: Jhon Honce <[email protected]>

edsantiago · 2021-10-12T16:58:43Z

This is the source of the 6-second delay: https://github.com/containers/podman/blob/main/pkg/api/server/server.go#L292

And, I think it's a problem with cleanup here: https://github.com/containers/podman/blob/main/test/e2e/system_service_test.go#L121-L135

If I replace this function with time.Sleep(1 * time.Second), there is no shutdown delay. Is it possible that conn.Close() is not sufficient to fully close the connection? Or that net.http.Server is thinking that this connection is still active?

jwhonce · 2021-10-12T17:05:22Z

@edsantiago Turns out Shutdown() was waiting for the length of context.WithTimeout() rather than exiting early. For --time=0, I have changed the default to 1s and tests to Wait(2*time.Second). I have seen no flakes on my laptop. I'm leaving the extra debug logging in the code in case this becomes an issue on some upgrade of Go.

Additionally, I removed some expensive debugging calls that no longer reported meaningful information.

edsantiago · 2021-10-12T17:13:09Z

/lgtm
/hold

@jwhonce thank you!

rhatdan · 2021-10-12T17:32:07Z

Nice work @jwhonce and @edsantiago

edsantiago · 2021-10-12T19:07:25Z

/hold cancel

jwhonce added flakes Flakes from Continuous Integration kind/test-flake Categorizes issue or PR as related to test flakes. labels Oct 12, 2021

jwhonce requested review from edsantiago and mheon October 12, 2021 01:17

jwhonce self-assigned this Oct 12, 2021

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 12, 2021

jwhonce force-pushed the issues/11921 branch 2 times, most recently from a4a849f to 3d25836 Compare October 12, 2021 15:41

jwhonce force-pushed the issues/11921 branch from 3d25836 to 0b0cde6 Compare October 12, 2021 16:51

jwhonce force-pushed the issues/11921 branch from 0b0cde6 to 0459484 Compare October 12, 2021 16:53

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2021

openshift-ci bot assigned edsantiago Oct 12, 2021

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 12, 2021

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2021

openshift-merge-robot merged commit 03095bd into containers:main Oct 12, 2021

github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix CI flake on time of shutdown for API service #11927

Fix CI flake on time of shutdown for API service #11927

jwhonce commented Oct 12, 2021 •

edited

Loading

openshift-ci bot commented Oct 12, 2021

edsantiago commented Oct 12, 2021

jwhonce commented Oct 12, 2021

edsantiago commented Oct 12, 2021

jwhonce commented Oct 12, 2021

edsantiago commented Oct 12, 2021

edsantiago commented Oct 12, 2021

jwhonce commented Oct 12, 2021

edsantiago commented Oct 12, 2021

rhatdan commented Oct 12, 2021

edsantiago commented Oct 12, 2021

Fix CI flake on time of shutdown for API service #11927

Fix CI flake on time of shutdown for API service #11927

Conversation

jwhonce commented Oct 12, 2021 • edited Loading

What this PR does / why we need it:

How to verify it

Which issue(s) this PR fixes:

Special notes for your reviewer:

openshift-ci bot commented Oct 12, 2021

edsantiago commented Oct 12, 2021

jwhonce commented Oct 12, 2021

edsantiago commented Oct 12, 2021

jwhonce commented Oct 12, 2021

edsantiago commented Oct 12, 2021

edsantiago commented Oct 12, 2021

jwhonce commented Oct 12, 2021

edsantiago commented Oct 12, 2021

rhatdan commented Oct 12, 2021

edsantiago commented Oct 12, 2021

jwhonce commented Oct 12, 2021 •

edited

Loading