Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checkpoint/Restore test fixes #11955

Conversation

adrianreber
Copy link
Collaborator

Moving to Fedora 35 showed test failures (time outs) in the test

"podman checkpoint and restore container with different port mappings"

The test starts a container and maps the internal port 6379 to the local
port 1234 ('-p 1234:6379') and then tries to connect to localhost:1234

On Fedora 35 this failed and blocked the test because the container was
not yet ready. The test was trying to connect to localhost:1234 but
nothing was running there. So the error was not checkpointing related.

Another problem with this test and running ginkgo in parallel was that
it was possible that the port was already in use. Now for each run a
random port is selected to decrease the chance of collisions.

This tries to fix problems seen in #11795

CC: @cevich

@mheon
Copy link
Member

mheon commented Oct 13, 2021

@edsantiago PTAL

@cevich
Copy link
Member

cevich commented Oct 13, 2021

My only concern is that perhaps the 3-second timeout (w/ 5 retries) is too long. However, the code also isn't waiting for a 'ready' signal from the container (can it?), and the net.dial() docs do say this timeout includes DNS (which is "special" in case of a container). So maybe it's okay?

test/e2e/checkpoint_test.go Outdated Show resolved Hide resolved
@adrianreber adrianreber force-pushed the 2021-10-13-f35-checkpoint-test-fix branch from ee739d7 to 40ef16c Compare October 13, 2021 15:52
@adrianreber
Copy link
Collaborator Author

Switched to WaitContainerReady().

Copy link
Member

@edsantiago edsantiago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one suggestion.

test/e2e/checkpoint_test.go Outdated Show resolved Hide resolved
@cevich
Copy link
Member

cevich commented Oct 14, 2021

LGTM

@adrianreber adrianreber force-pushed the 2021-10-13-f35-checkpoint-test-fix branch from 40ef16c to b33845f Compare October 14, 2021 13:59
@cevich
Copy link
Member

cevich commented Oct 14, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 14, 2021
@cevich
Copy link
Member

cevich commented Oct 14, 2021

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 14, 2021
@cevich
Copy link
Member

cevich commented Oct 14, 2021

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 14, 2021
Copy link
Member

@jwhonce jwhonce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issues I see are mostly modifying a known good port. Those new ports could be used elsewhere which cause CI flakes. Generate all "new" port numbers using GetRandomPort()

fmt.Fprintf(os.Stderr, "Trying to connect to redis server at localhost:%d", orginalPort)
// Open a network connection to the redis server via initial port mapping
conn, err := net.DialTimeout("tcp4", fmt.Sprintf("localhost:%d", orginalPort), time.Duration(3)*time.Second)
Expect(err).To(BeNil())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Expect(err).To(BeNil())
Expect(err).ShouldNot(HaveOccurred())

}
fmt.Fprintf(os.Stderr, "Trying to reconnect to redis server at localhost:%d", orginalPort+100)
conn, err = net.DialTimeout("tcp4", fmt.Sprintf("localhost:%d", orginalPort+100), time.Duration(3)*time.Second)
Expect(err).To(BeNil())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Expect(err).To(BeNil())
Expect(err).ShouldNot(HaveOccurred())

localRunString := getRunString([]string{"-p", "1234:6379", "--rm", redis})
port, err := utils.GetRandomPort()
Expect(err).ShouldNot(HaveOccurred())
orginalPort := port + GinkgoParallelProcess()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, you get a random port (that is "safe") and then add something to it?

}

fmt.Fprintf(os.Stderr, "Trying to connect to redis server at localhost:%d", orginalPort)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debugging information?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, the test was hanging just after this point, so knowing what it was trying to do may also be helpful in the future. Ginkgo doesn't give us this detail otherwise 😕

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. It was not added for debugging just for better feedback during the test to know what is happening.

@@ -958,7 +966,7 @@ var _ = Describe("Podman checkpoint", func() {
Expect(podmanTest.NumberOfContainers()).To(Equal(0))

// Restore container with different port mapping
result = podmanTest.Podman([]string{"container", "restore", "-p", "1235:6379", "-i", fileName})
result = podmanTest.Podman([]string{"container", "restore", "-p", fmt.Sprintf("%d:6379", orginalPort+100), "-i", fileName})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, adding a number to a "safe" port?

@@ -967,13 +975,12 @@ var _ = Describe("Podman checkpoint", func() {

// Open a network connection to the redis server via initial port mapping
// This should fail
conn, err = net.Dial("tcp", "localhost:1234")
conn, err = net.DialTimeout("tcp4", fmt.Sprintf("localhost:%d", orginalPort), time.Duration(3)*time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to steal this one. :-)

test/e2e/checkpoint_test.go Outdated Show resolved Hide resolved
@jwhonce jwhonce added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 14, 2021
@cevich
Copy link
Member

cevich commented Oct 14, 2021

Those new ports could be used elsewhere which cause CI flakes

Oooohhhh, I understand. So the concern is if this test goes south/hangs/leaks, another test might also ask for a port and receive the same one (since it was port+number for this test). Very observant of you, and thanks for mentioning this. Seems like an easy enough fix too.

Moving to Fedora 35 showed test failures (time outs) in the test

"podman checkpoint and restore container with different port mappings"

The test starts a container and maps the internal port 6379 to the local
port 1234 ('-p 1234:6379') and then tries to connect to localhost:1234

On Fedora 35 this failed and blocked the test because the container was
not yet ready. The test was trying to connect to localhost:1234 but
nothing was running there. So the error was not checkpointing related.

Before trying to connect to the container the test is now waiting for
the container to be ready.

Another problem with this test and running ginkgo in parallel was that
it was possible that the port was already in use. Now for each run a
random port is selected to decrease the chance of collisions.

Signed-off-by: Adrian Reber <[email protected]>
@adrianreber adrianreber force-pushed the 2021-10-13-f35-checkpoint-test-fix branch from b33845f to 8439a6d Compare October 14, 2021 18:53
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 14, 2021
@adrianreber
Copy link
Collaborator Author

@jwhonce Thanks for the review. I adapted the PR to fix what you commented on. Besides the debug output which was added to know that the test is trying to connect to the container. I can remove it but I think it can be helpful understanding what is going on.

@jwhonce jwhonce removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 15, 2021
@jwhonce
Copy link
Member

jwhonce commented Oct 15, 2021

/lgtm

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 15, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adrianreber, cevich, jwhonce

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 15, 2021
@openshift-merge-robot openshift-merge-robot merged commit 171f7b8 into containers:main Oct 15, 2021
@cevich
Copy link
Member

cevich commented Oct 15, 2021

Thanks everyone.

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants