Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rework container.execute to use multiprocessing as watchdog #50

Merged
merged 1 commit into from
Mar 13, 2024

Conversation

jmtd
Copy link
Collaborator

@jmtd jmtd commented Nov 30, 2023

I'm going to ask for some of the other behave test users I know of to try this out since it's quite a significant change. But it seems to be critical in getting GitHub Actions CI working again for our images: an individual test that blocks will now fail without causing the whole test run to be aborted.

I've pushed this to my fork's v1 branch too, to make it easier to try out.

(commit message follows)


Workaround docker.APIClient.exec_start sometimes blocking indefinitely by running in a sub-process and throwing an exception if the sub-process does not complete within a given timeout.

Remove the existing post-exec code which polled the value of docker.APIClient.exec_inspect for 15 seconds to determine if the command had completed. This is effectively performed by the new sub-process waiting. I've set the timeout to 30 seconds, up from 15, which (from experimentation) seems to be necessary to account for the extra time it takes to invoke exec_start within the timeout period.

A future change should make this timeout configurable.

This general pattern (of watchdogging the docker library code) might be useful elsewhere, in particular for any future efforts to support parallel test execution.

Workaround `docker.APIClient.exec_start` sometimes blocking indefinitely
by running in a sub-process and throwing an exception if the sub-process
does not complete within a given timeout.

Remove the existing post-exec code which polled the value of
`docker.APIClient.exec_inspect` for 15 seconds to determine if the
command had completed. This is effectively performed by the new
sub-process waiting. I've set the timeout to 30 seconds, up from 15,
which (from experimentation) seems to be necessary to account for the
extra time it takes to invoke `exec_start` within the timeout period.

A future change should make this timeout configurable.

This general pattern (of watchdogging the docker library code) might
be useful elsewhere, in particular for any future efforts to support
parallel test execution.

Signed-off-by: Jonathan Dowland <[email protected]>
jmtd added a commit to jmtd/redhat-openjdk-container-images that referenced this pull request Nov 30, 2023
The jmtd fork of behave-test-steps has its v1 branch matching this
PR: cekit/behave-test-steps#50

This adds a multiprocess watchdog around invoking
`docker.APIClient.exec_inspect`, which will abort the current step
if that call has not returned within 30 seconds.

This means a lock-up during a step will cause that test to fail and
not the whole test run.

Signed-off-by: Jonathan Dowland <[email protected]>
@spolti
Copy link
Contributor

spolti commented Nov 30, 2023

IIRC there is a environment variable that you can set to increase the timeout, BEHAVE_TIMEOUT I guess.
@rnc .

jmtd added a commit to jmtd/redhat-openjdk-container-images that referenced this pull request Jan 8, 2024
The jmtd fork of behave-test-steps has its v1 branch matching this
PR: cekit/behave-test-steps#50

This adds a multiprocess watchdog around invoking
`docker.APIClient.exec_inspect`, which will abort the current step
if that call has not returned within 30 seconds.

This means a lock-up during a step will cause that test to fail and
not the whole test run.

Signed-off-by: Jonathan Dowland <[email protected]>
jmtd added a commit to jmtd/redhat-openjdk-container-images that referenced this pull request Jan 8, 2024
The jmtd fork of behave-test-steps has its v1 branch matching this
PR: cekit/behave-test-steps#50

This adds a multiprocess watchdog around invoking
`docker.APIClient.exec_inspect`, which will abort the current step
if that call has not returned within 30 seconds.

This means a lock-up during a step will cause that test to fail and
not the whole test run.

Signed-off-by: Jonathan Dowland <[email protected]>
@rnc
Copy link
Contributor

rnc commented Mar 12, 2024

@jmtd Is this ready to merge?

@jmtd
Copy link
Collaborator Author

jmtd commented Mar 12, 2024

yes. sorry for the delay

@rnc rnc merged commit bf40440 into cekit:v1 Mar 13, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants