DAOS-16100 test: Fix stopping daos_test during timeout #15275
Properly stop the daos_test process if the test encounters a timeout while running. Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: test_daos_management Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Ticket title is 'daos_test/suite.py:DaosCoreTest.test_daos_degraded_ec - cart_ctl error due to errored rank 0' |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: test_daos_management Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/2/execution/node/938/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: test_daos_management Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: test_daos_management Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/4/execution/node/938/log |
When stopping cmocka commands only use the executable name to find a pkill match. Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: DaosCoreTestDfs DaosCoreTestDfuse harness_cmocka test_daos_management Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/5/execution/node/800/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: DaosCoreTestDfs DaosCoreTestDfuse harness_cmocka test_daos_management Allow-unstable-test: true Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: DaosCoreTestDfs DaosCoreTestDfuse harness_cmocka test_daos_management Allow-unstable-test: true Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/7/execution/node/826/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: DaosCoreTestDfs DaosCoreTestDfuse harness_cmocka test_daos_management MultiEnginesPerSocketTest FaultDomain Allow-unstable-test: true Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: DaosCoreTestDfs DaosCoreTestDfuse harness_cmocka test_daos_management MultiEnginesPerSocketTest FaultDomain Allow-unstable-test: true Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/9/execution/node/826/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: DaosCoreTestDfs DaosCoreTestDfuse harness_cmocka test_daos_management MultiEnginesPerSocketTest FaultDomain Allow-unstable-test: true Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/10/execution/node/825/log |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/10/execution/node/1063/log |
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/10/execution/node/1126/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: DaosCoreTestDfs DaosCoreTestDfuse harness_cmocka test_daos_management MultiEnginesPerSocketTest FaultDomain Allow-unstable-test: true Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: DaosCoreTestDfs DaosCoreTestDfuse harness_cmocka test_daos_management MultiEnginesPerSocketTest FaultDomain Allow-unstable-test: true Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15275/11/display/redirect |
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/12/execution/node/826/log |
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/12/execution/node/1034/log |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/12/execution/node/1080/log |
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/18/execution/node/825/log |
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/18/execution/node/1033/log |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/18/execution/node/1087/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: pr daos_test dfuse_test test_load_mpi HarnessCmockaTest Allow-unstable-test: true Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/19/execution/node/825/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: pr daos_test dfuse_test test_load_mpi HarnessCmockaTest Allow-unstable-test: true Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/20/execution/node/825/log |
FYI, created https://daosio.atlassian.net/browse/DAOS-16825 to address using register cleanup to stop agents/servers independently of this ticket. |
Remove stopping agents when stopping servers as DAOS-6873 is resolved. Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: pr daos_test dfuse_test test_load_mpi HarnessCmockaTest ConfigGenerateRun Allow-unstable-test: true Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/21/execution/node/825/log |
Waiting on #15530 to merge before continuing with this PR. |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: pr daos_test dfuse_test test_load_mpi HarnessCmockaTest Allow-unstable-test: true Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: pr daos_test dfuse_test test_load_mpi HarnessCmockaTest Allow-unstable-test: true Required-githooks: true Signed-off-by: Phil Henderson <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/23/execution/node/826/log |
The 7 failures in https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15275/23/testReport/ are expected:
This test is not tagged to run in normal pr or timed runs, but was added as part of this PR to verify commands run by |
Sample log from
|
Just nits
killed.
executable (str): the command executable. Also the string used to search for the process
    when it is killed.
keywords (list): list of words used to mark the command as failed if any are found in
Should we call `keywords` `check_results` to be consistent?
Yeah, it's not a bad idea.
@@ -72,6 +72,9 @@ def __init__(self, namespace, command, path="", subprocess=False, check_results=
# used to check on the progress or terminate the command.
self._exe_names = [self.command]

# If set use the full command string when returning the 'command_regex' property
self.full_command_regex = False
Nit: since `command_regex` already exists and is a string, `full_command_regex` implies this is some sort of "full" version of `command_regex`. It's more verbose, but I'd recommend something like `use_full_command_regex` or similar so the variable name reflects that it's a bool and not a string / "full" version.
I do prefer `use_full_command_regex` in this context.
if self.job.full_command_regex:
    regex = f"'{str(self.job)}'"
To my previous point, the variable name implies this should be the following (which is wrong):
- if self.job.full_command_regex:
-     regex = f"'{str(self.job)}'"
+ if self.job.full_command_regex:
+     regex = self.job.full_command_regex
Right, using `self.job.use_full_command_regex` would make more sense.
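To illustrate the naming point in this thread, here is a minimal sketch of a boolean flag switching what a `command_regex` property returns. `CommandSketch` and its `arguments` attribute are hypothetical stand-ins, not the harness class, and the default branch (joining `_exe_names` with `|`) is an assumption, since only the full-command branch appears in the diff.

```python
class CommandSketch:
    """Hypothetical stand-in for the command class discussed in the review."""

    def __init__(self, command, arguments=""):
        self.command = command
        self.arguments = arguments
        # Executable names used to find the running process.
        self._exe_names = [self.command]
        # Bool flag: the 'use_' prefix signals it is not itself a regex string.
        self.use_full_command_regex = False

    def __str__(self):
        # Full command line: executable followed by any arguments.
        return " ".join(filter(None, [self.command, self.arguments]))

    @property
    def command_regex(self):
        """Return the quoted pattern used to search for the process."""
        if self.use_full_command_regex:
            # Branch shown in the diff: match the whole command string.
            return f"'{str(self)}'"
        # Assumed default: match any of the known executable names.
        return "'({})'".format("|".join(self._exe_names))
```

With the `use_` prefix, `if job.use_full_command_regex:` reads as a condition rather than as a test for a non-empty pattern string.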
@@ -542,6 +544,11 @@ def stop_processes(log, hosts, pattern, verbose=True, timeout=60, exclude=None,
force (bool, optional): if set use the KILL signal to immediately stop any running
    processes. Defaults to False which will attempt to kill w/o a signal, then with the ABRT
    signal, and finally with the KILL signal.
full_command (bool, optional): if set match the pattern using the full command with
Same recommendation
- full_command (bool, optional): if set match the pattern using the full command with
+ use_full_command (bool, optional): if set match the pattern using the full command with
In the context of this method, where `full_command` is indicating if the `pattern` is part of a command or the entire (full) command, I'm not sure we need `use_full_command`. If anything, it's more like `match_full_pattern`.
`match_full_pattern` could work too. But `full_command` isn't as misleading in this context, compared to the others.
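For readers unfamiliar with the distinction being named here, the sketch below shows one way a `full_command` flag could change how a pattern is matched. It assumes the search is done with `pgrep`, whose `-f` option matches against the full command line instead of only the process name and whose `-a` option lists the command line next to each PID. `build_pgrep_command` is a hypothetical helper, not the `stop_processes` implementation.

```python
def build_pgrep_command(pattern, full_command=False):
    """Build a pgrep argument list for the given pattern (illustrative only)."""
    # -a: print the full command line alongside each matching PID.
    command = ["pgrep", "-a"]
    if full_command:
        # -f: match the pattern against the entire command line,
        # not only the process (executable) name.
        command.append("-f")
    command.append(pattern)
    return command
```

Without `-f`, a pattern such as `daos_test` only matches processes whose name is `daos_test`, so a launcher like `orterun` or `mpirun` that merely carries `daos_test` in its arguments is left alone.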
# Catch any attempt to kill process 1.
if "1" in re.findall(r"^(\d+)\s+", result.joined_stdout, re.MULTILINE):
    raise ValueError(f"Attempting to kill process 1 as a match for {pattern}!")
Just FYI if you wanted to avoid regex. But safer to just stick with what you have
if any(map(lambda line: line == '1', list_of_things_to_search))
Wouldn't the `list_of_things_to_search` need to be just the pids from `result.joined_stdout`, which is what the regex provides?
Yeah, it would need to be a list of just the pids. It's really a matter of dealing with a regex like `r"^(\d+)\s+"` or using lambda or list comprehension. `"1" in ... r"^(\d+)\s+"` effectively means `"1" == line`.
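The two approaches compared in this thread can be put side by side as a small sketch. The sample output format (`<pid> <command line>` per line) and all function names are assumptions for illustration; only the regex itself comes from the diff.

```python
import re

# Hypothetical process-listing output: one "<pid> <command line>" per line.
sample_stdout = "1 /sbin/init\n4321 daos_test -m\n11 sleep 60\n"


def pids_via_regex(stdout):
    """Collect the leading PID of each line, as the regex in the diff does."""
    return re.findall(r"^(\d+)\s+", stdout, re.MULTILINE)


def pids_via_split(stdout):
    """Collect the first whitespace-separated field of each non-empty line."""
    return [line.split()[0] for line in stdout.splitlines() if line.strip()]


def matches_pid_one(pids):
    """Return True only if some PID is exactly "1" (not merely contains a 1)."""
    return any(pid == "1" for pid in pids)
```

Either way the comparison is against whole PID strings, so `11` or `4321` never count as process 1. The split version does not check that the first field is numeric, which is one reason to stay with the regex.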
Fix stopping timed out processes run by a JobManager class by only searching for and killing the command executable being run by clush, orterun, mpirun, etc. Add a new harness/cmocka.py test to verify the stopping of the processes with a test timeout. Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: pr daos_test dfuse_test test_load_mpi HarnessCmockaTest Allow-unstable-test: true Signed-off-by: Phil Henderson <[email protected]>
Fix stopping timed out processes run by a JobManager class by only searching for and killing the command executable being run by clush, orterun, mpirun, etc. Add a new harness/cmocka.py test to verify the stopping of the processes with a test timeout. Signed-off-by: Phil Henderson <[email protected]>
Fix stopping timed out processes run by a JobManager class by only
searching for and killing the command executable being run by clush,
orterun, mpirun, etc. Add a new harness/cmocka.py test to verify the
stopping of the processes with a test timeout.
Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: test_daos_management
Required-githooks: true