Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-16100 test: Fix stopping daos_test during timeout #15275

Merged
merged 30 commits into from
Nov 26, 2024

Conversation

phender
Copy link
Contributor

@phender phender commented Oct 8, 2024

Fix stopping timed out processes run by a JobManager class by only
searching for and killing the command executable being run by clush,
orterun, mpirun, etc. Add a new harness/cmocka.py test to verify the
stopping of the processes with a test timeout.

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: test_daos_management

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Properly dstop the daos_test process if the test encounters a timeout
while running.

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: test_daos_management

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
@phender phender requested review from a team as code owners October 8, 2024 17:44
@phender phender marked this pull request as draft October 8, 2024 17:44
Copy link

github-actions bot commented Oct 8, 2024

Ticket title is 'daos_test/suite.py:DaosCoreTest.test_daos_degraded_ec - cart_ctl error due to errored rank 0'
Status is 'In Review'
Labels: '2.6.0rc1,ci_impact,pr_test,tcp_provider,testp2'
https://daosio.atlassian.net/browse/DAOS-16100

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: test_daos_management

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/2/execution/node/938/log

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: test_daos_management

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: test_daos_management

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/4/execution/node/938/log

When stopping cmocka commands only use the executable name to find a
pkill match.

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: DaosCoreTestDfs DaosCoreTestDfuse harness_cmocka test_daos_management

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/5/execution/node/800/log

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: DaosCoreTestDfs DaosCoreTestDfuse harness_cmocka test_daos_management
Allow-unstable-test: true

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: DaosCoreTestDfs DaosCoreTestDfuse harness_cmocka test_daos_management
Allow-unstable-test: true

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/7/execution/node/826/log

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: DaosCoreTestDfs DaosCoreTestDfuse harness_cmocka test_daos_management MultiEnginesPerSocketTest FaultDomain
Allow-unstable-test: true

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: DaosCoreTestDfs DaosCoreTestDfuse harness_cmocka test_daos_management MultiEnginesPerSocketTest FaultDomain
Allow-unstable-test: true

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/9/execution/node/826/log

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: DaosCoreTestDfs DaosCoreTestDfuse harness_cmocka test_daos_management MultiEnginesPerSocketTest FaultDomain
Allow-unstable-test: true

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/10/execution/node/825/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/10/execution/node/1063/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/10/execution/node/1126/log

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: DaosCoreTestDfs DaosCoreTestDfuse harness_cmocka test_daos_management MultiEnginesPerSocketTest FaultDomain
Allow-unstable-test: true

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: DaosCoreTestDfs DaosCoreTestDfuse harness_cmocka test_daos_management MultiEnginesPerSocketTest FaultDomain
Allow-unstable-test: true

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15275/11/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/12/execution/node/826/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/12/execution/node/1034/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/12/execution/node/1080/log

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/18/execution/node/825/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/18/execution/node/1033/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/18/execution/node/1087/log

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: pr daos_test dfuse_test test_load_mpi HarnessCmockaTest
Allow-unstable-test: true

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/19/execution/node/825/log

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: pr daos_test dfuse_test test_load_mpi HarnessCmockaTest
Allow-unstable-test: true

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/20/execution/node/825/log

@phender
Copy link
Contributor Author

phender commented Nov 21, 2024

FYI, created https://daosio.atlassian.net/browse/DAOS-16825 to address using register cleanup to stop agents/servers independently of this ticket.

Remove stopping agents when stopping servers as DAOS-6873 is resolved.

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: pr daos_test dfuse_test test_load_mpi HarnessCmockaTest ConfigGenerateRun
Allow-unstable-test: true

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/21/execution/node/825/log

@phender
Copy link
Contributor Author

phender commented Nov 25, 2024

FYI, created https://daosio.atlassian.net/browse/DAOS-16825 to address using register cleanup to stop agents/servers independently of this ticket.

Waiting on #15530 to merge before continuing with this PR.

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: pr daos_test dfuse_test test_load_mpi HarnessCmockaTest
Allow-unstable-test: true

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: pr daos_test dfuse_test test_load_mpi HarnessCmockaTest
Allow-unstable-test: true

Required-githooks: true

Signed-off-by: Phil Henderson <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15275/23/execution/node/826/log

@phender
Copy link
Contributor Author

phender commented Nov 26, 2024

The 7 failures in https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15275/23/testReport/ are expected:

  • The 1-./harness/cmocka.py:HarnessCmockaTest.test_no_cmocka_xml test verifies launch.py detects a missing cmocka file (1 failure)
  • The 2-./harness/cmocka.py:HarnessCmockaTest.test_clush_manager_timeout test purposely times out and verifies launch.py detects a missing cmocka file (2 failures)
  • The 3-./harness/cmocka.py:HarnessCmockaTest.test_orterun_manager_timeout test purposely times out and verifies launch.py detects a missing cmocka file (2 failures)
  • The 4-./harness/cmocka.py:HarnessCmockaTest.test_mpirun_manager_timeout test purposely times out and verifies launch.py detects a missing cmocka file (2 failures)

This test is not tagged to run in normal pr or timed runs, but was added as part of this PR to verify commands run by JobManager based classes are properly stopped during test timeouts.

@phender phender marked this pull request as ready for review November 26, 2024 14:13
@phender phender added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Nov 26, 2024
@phender phender requested a review from mjean308 November 26, 2024 17:08
@phender
Copy link
Contributor Author

phender commented Nov 26, 2024

Sample log from 4-./harness/cmocka.py:HarnessCmockaTest.test_mpirun_manager_timeout where tearDown now stops the command being run by the JobManager class:

2024-11-26 00:21:15,659 stacktrace       L0045 ERROR| RuntimeError: Test interrupted by SIGTERM
2024-11-26 00:21:15,659 stacktrace       L0046 ERROR| 
2024-11-26 00:21:15,660 test             L0772 DEBUG| Local variables:
2024-11-26 00:21:15,718 test             L0775 DEBUG|  -> self <class 'cmocka.HarnessCmockaTest'>: 4-./harness/cmocka.py:HarnessCmockaTest.test_mpirun_manager_timeout;run-4712
2024-11-26 00:21:15,718 test             L0357 INFO | ====================================================================================================
2024-11-26 00:21:15,719 test             L0471 INFO | ==> Step 1: tearDown(): Called due to exceeding the 10s test timeout [elapsed since last step: 10.16s]
2024-11-26 00:21:15,719 test             L0369 INFO | test execution has been terminated by avocado
2024-11-26 00:21:15,719 test             L0402 DEBUG| Register: 1 cleanup methods detected
2024-11-26 00:21:15,719 test             L0409 DEBUG| [1] Register: Calling cleanup method stop_job_manager(job_manager=/usr/lib64/mpich/bin/mpirun -hostfile /tmp/hostfile_nrw0ked9 -np 1 sleep 60)
2024-11-26 00:21:15,719 job_manager_util L0085 INFO | Stopping mpirun job manager: /usr/lib64/mpich/bin/mpirun -hostfile /tmp/hostfile_nrw0ked9 -np 1 sleep 60
2024-11-26 00:21:15,720 run_utils        L0570 DEBUG| Searching for any processes on wolf-102vm1 that match 'sleep 60'
2024-11-26 00:21:15,723 run_utils        L0471 DEBUG| Running on wolf-102vm1 with a 60 second timeout: /usr/bin/pgrep --list-full --full -x 'sleep 60'
2024-11-26 00:21:16,102 run_utils        L0335 DEBUG|   wolf-102vm1 (rc=0): 42585 sleep 60
2024-11-26 00:21:16,104 run_utils        L0594 DEBUG| Killing any processes on wolf-102vm1 that match 'sleep 60' and then waiting 5 seconds
2024-11-26 00:21:16,105 run_utils        L0471 DEBUG| Running on wolf-102vm1 with a 60 second timeout: sudo /usr/bin/pkill --full -x 'sleep 60'
2024-11-26 00:21:16,500 process          L0416 DEBUG| [stdout] 
2024-11-26 00:21:16,501 process          L0416 DEBUG| [stdout] ===================================================================================
2024-11-26 00:21:16,502 process          L0416 DEBUG| [stdout] =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
2024-11-26 00:21:16,502 process          L0416 DEBUG| [stdout] =   PID 42585 RUNNING AT wolf-102vm1
2024-11-26 00:21:16,502 process          L0416 DEBUG| [stdout] =   EXIT CODE: 15
2024-11-26 00:21:16,502 process          L0416 DEBUG| [stdout] =   CLEANING UP REMAINING PROCESSES
2024-11-26 00:21:16,503 process          L0416 DEBUG| [stdout] =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
2024-11-26 00:21:16,503 process          L0416 DEBUG| [stdout] ===================================================================================
2024-11-26 00:21:16,503 process          L0416 DEBUG| [stdout] YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
2024-11-26 00:21:16,505 process          L0416 DEBUG| [stdout] This typically refers to a problem with your application.
2024-11-26 00:21:16,507 run_utils        L0333 DEBUG|   wolf-102vm1 (rc=0): <no output>
2024-11-26 00:21:16,509 process          L0416 DEBUG| [stdout] Please see the FAQ page for debugging suggestions
2024-11-26 00:21:21,513 run_utils        L0471 DEBUG| Running on wolf-102vm1 with a 60 second timeout: /usr/bin/pgrep --list-full --full -x 'sleep 60'
2024-11-26 00:21:26,903 run_utils        L0333 DEBUG|   wolf-102vm1 (rc=1): <no output>
2024-11-26 00:21:26,904 run_utils        L0605 DEBUG| All processes running on wolf-102vm1 that match 'sleep 60' have been stopped.
2024-11-26 00:21:26,905 job_manager_util L0337 INFO | ***At least one remote 'sleep 60' process needed to be killed on wolf-102vm1! Please investigate/report.***

Copy link
Contributor

@daltonbohning daltonbohning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just nits

killed.
executable (str): the command executable. Also the string used to search for the process
when it is killed.
keywords (list): list of words used to mark the command as failed if any are found in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call keywords check_results to be consistent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's not a bad idea.

@@ -72,6 +72,9 @@ def __init__(self, namespace, command, path="", subprocess=False, check_results=
# used to check on the progress or terminate the command.
self._exe_names = [self.command]

# If set use the full command string when returning the 'command_regex' property
self.full_command_regex = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: since command_regex already exists and is a string, full_command_regex implies this is some sort of "full" version of command_regex. It's more verbose but I'd recommend something like use_full_command_regex or similar so the variable name reflects that it's a bool and not a string / "full" version

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do prefer use_full_command_regex in this context.

Comment on lines +323 to +324
if self.job.full_command_regex:
regex = f"'{str(self.job)}'"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my previous point, the variable name implies this should be (which is wrong)

Suggested change
if self.job.full_command_regex:
regex = f"'{str(self.job)}'"
if self.job.full_command_regex:
regex = self.job.full_command_regex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, using self.job.use_full_command_regex would make more sense.

@@ -542,6 +544,11 @@ def stop_processes(log, hosts, pattern, verbose=True, timeout=60, exclude=None,
force (bool, optional): if set use the KILL signal to immediately stop any running
processes. Defaults to False which will attempt to kill w/o a signal, then with the ABRT
signal, and finally with the KILL signal.
full_command (bool, optional): if set match the pattern using the full command with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same recommendation

Suggested change
full_command (bool, optional): if set match the pattern using the full command with
use_full_command (bool, optional): if set match the pattern using the full command with

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the context of this method, where full_command is indicating if the pattern is part of a command or the entire (full) command, I'm not sure we need use_full_command. If anything, it's more like match_full_pattern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

match_full_pattern could work too. But full_command isn't as misleading in this context, compared to the others

Comment on lines +576 to +578
# Catch any attempt to kill process 1.
if "1" in re.findall(r"^(\d+)\s+", result.joined_stdout, re.MULTILINE):
raise ValueError(f"Attempting to kill process 1 as a match for {pattern}!")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI if you wanted to avoid regex. But safer to just stick with what you have

if any(map(lambda line: line == '1', list_of_things_to_search))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the list_of_things_to_search need to be just the pids from result.joined_stdout - which is what the regex provides?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it would need to be a list of just the pids. It's really a matter of dealing with a regex like r"^(\d+)\s+" or using lambda or list comprehension.
"1" in ... r"^(\d+)\s+" effectively means "1" == line

@phender phender requested a review from a team November 26, 2024 21:27
@daltonbohning daltonbohning merged commit 1eda125 into master Nov 26, 2024
44 of 48 checks passed
@daltonbohning daltonbohning deleted the pahender/DAOS-16100 branch November 26, 2024 21:35
phender added a commit that referenced this pull request Dec 11, 2024
Fix stopping timed out processes run by a JobManager class by only
searching for and killing the command executable being run by clush,
orterun, mpirun, etc. Add a new harness/cmocka.py test to verify the
stopping of the processes with a test timeout.

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: pr daos_test dfuse_test test_load_mpi HarnessCmockaTest
Allow-unstable-test: true

Signed-off-by: Phil Henderson <[email protected]>
phender added a commit that referenced this pull request Dec 12, 2024
Fix stopping timed out processes run by a JobManager class by only
searching for and killing the command executable being run by clush,
orterun, mpirun, etc. Add a new harness/cmocka.py test to verify the
stopping of the processes with a test timeout.

Signed-off-by: Phil Henderson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed.
Development

Successfully merging this pull request may close these issues.

4 participants