DAOS-16167 test: update soak test to use internal job scheduler #14775

Merged 49 commits into master from mjean/DAOS-16167 on Dec 11, 2024

mjean308
Contributor

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: soak_smoke

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Ticket title is 'Soak: update soak test to use internal job scheduler instead of depending on slurm'
Status is 'In Progress'
Labels: 'daos_framework,soak'
https://daosio.atlassian.net/browse/DAOS-16167

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/1/execution/node/313/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/1/execution/node/357/log

@daosbuild1
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/414/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/364/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/360/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/361/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/367/log

@daosbuild1
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/383/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/363/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/365/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/364/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/353/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/341/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/344/log

@daosbuild1
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/467/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/398/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/336/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/351/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/350/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/357/log

@daosbuild1
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/518/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/352/log

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: soak_smoke

Required-githooks: true

Signed-off-by: Maureen Jean <[email protected]>
@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/43/execution/node/936/log

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: soak_smoke

Required-githooks: true
Skipped-githooks: flake,pylint

Signed-off-by: Maureen Jean <[email protected]>
@mjean308 requested a review from daltonbohning on October 8, 2024 20:22
@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/49/execution/node/825/log

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/50/execution/node/936/log

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/52/execution/node/937/log

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/53/execution/node/936/log

if self.host_info.clients.partition.reservation:
    self.srun_params["reservation"] = self.host_info.clients.partition.reservation
# Include test node for log cleanup; remove from client list
self.job_scheduler = self.params.get("job_scheduler", "/run/*", default="slurm")
Contributor

When this is more tested and stable, I think the default should be the internal scheduler, and we should phase out slurm.

Contributor Author

agree;

Comment on lines 304 to 310
lib_path = os.getenv("LD_LIBRARY_PATH")
path = os.getenv("PATH")
v_env = os.getenv("VIRTUAL_ENV")
env = ";".join([f"export LD_LIBRARY_PATH={lib_path}",
                f"export PATH={path}"])
if v_env:
    env = ";".join([env, f"export VIRTUAL_ENV={v_env}"])
Contributor

We should do for lib_path what we do for v_env since LD_LIBRARY_PATH might not be set
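
One way to read this suggestion, as a minimal sketch: only emit an export for variables that are actually set, so an unset LD_LIBRARY_PATH does not end up exported as the literal string "None". The helper name `build_env_exports` and its `environ` parameter are hypothetical and not part of the PR.

```python
import os


def build_env_exports(names=("LD_LIBRARY_PATH", "PATH", "VIRTUAL_ENV"), environ=None):
    """Return a ';'-joined string of export statements for the variables that are set.

    Hypothetical helper for illustration only; not part of the PR.
    """
    environ = os.environ if environ is None else environ
    exports = []
    for name in names:
        value = environ.get(name)
        # Skip unset or empty variables instead of exporting the string "None"
        if value:
            exports.append(f"export {name}={value}")
    return ";".join(exports)
```

Called with no arguments it reads from os.environ; passing a dict as `environ` makes the behavior easy to check in isolation.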

src/tests/ftest/util/soak_test_base.py (resolved)
src/tests/ftest/util/soak_test_base.py (resolved)
src/tests/ftest/util/soak_utils.py (outdated, resolved)
src/tests/ftest/util/soak_utils.py (outdated, resolved)
Comment on lines +470 to +472
job_queue.put(results)
# give time to update the queue before exiting
time.sleep(0.5)
Contributor

Is this necessary? As far as I'm aware, the queue should be updated as soon as put returns
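
A small sketch of the point being made, assuming job_queue is a standard-library queue.Queue shared between threads (the actual queue type is not visible in this excerpt). Under that assumption the item is available to consumers once put() has returned, so no sleep is needed. The result fields are made up for illustration. A multiprocessing.Queue behaves differently, since it hands items to a background feeder thread, so this sketch does not cover that case.

```python
import queue
import threading


def worker(job_queue, results):
    """Put the results on the queue and return immediately, with no sleep."""
    job_queue.put(results)


job_queue = queue.Queue()
# Hypothetical result payload; the field names are illustrative only
results = {"handle": 1, "state": "COMPLETED"}
thread = threading.Thread(target=worker, args=(job_queue, results))
thread.start()
thread.join()

# put() has returned in the worker, so the item can be read without waiting
item = job_queue.get_nowait()
```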

Contributor

@phender left a comment

Just some nits.

@@ -144,7 +148,8 @@ def pre_tear_down(self):
        if self.all_failed_jobs:
            errors.append("SOAK FAILED: The following jobs failed {} ".format(
                " ,".join(str(j_id) for j_id in self.all_failed_jobs)))

        # cleanup any remaining jobs
        job_cleanup(self.log, self.hostlist_clients)
Contributor

In the future it would be nice to remove the pre_tear_down method now that we can use register_cleanup to more effectively handle tearDown operations. Just a note; not a requested change.

Comment on lines +347 to +351
job_node_list = node_list[:node_count]
debug_logging(
    self.log,
    self.enable_debug_msg,
    f"DBG: node_list before launch_job {node_list}")
Contributor

Is this supposed to be logging job_node_list? Otherwise it doesn't seem to add anything different than the previous debug message.

    return next(id_counter)


def debug_logging(log, enable_debug_msg, log_msg):
Contributor

This may be better defined as a SoakTestBase method, where calls would look like self.debug_logging("some message") or test.debug_logging("some message").
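
A rough sketch of what the method form could look like. The class `SoakTestSketch` is a stand-in for SoakTestBase, and the body of debug_logging is an assumption (log only when enable_debug_msg is set), since the function body is not shown in this excerpt. The boolean return value exists only to make the sketch easy to check.

```python
import logging


class SoakTestSketch:
    """Stand-in for SoakTestBase; only the attributes used here are modeled."""

    def __init__(self, enable_debug_msg=False):
        self.log = logging.getLogger("soak_sketch")
        self.enable_debug_msg = enable_debug_msg

    def debug_logging(self, log_msg):
        """Log a debug message only when debug messages are enabled.

        Returns True if the message was passed to the logger (illustration only).
        """
        if self.enable_debug_msg:
            self.log.debug(log_msg)
            return True
        return False
```

Callers would then write self.debug_logging("some message") instead of passing self.log and self.enable_debug_msg on every call.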

err_msg = f"Slurm failed to submit job for {script}"
job_id_list = []
raise SoakTestError(f"<<FAILED: Soak {self.test_name}: {err_msg}>>")
if self.job_scheduler == "slurm":
Contributor

It would have been nice to have job scheduler classes to avoid multiple if/else statements.
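
As a sketch of that idea, under stated assumptions: each scheduler gets its own class behind a common interface, and a lookup table replaces the repeated if/else checks on self.job_scheduler. All class and function names here are hypothetical, the submit bodies are placeholders, and the "internal" key is an assumed value (only the "slurm" default appears in the diff above).

```python
from abc import ABC, abstractmethod


class JobScheduler(ABC):
    """Common interface for job schedulers (hypothetical)."""

    @abstractmethod
    def submit(self, script):
        """Submit a job script and return a job identifier."""


class SlurmScheduler(JobScheduler):
    def submit(self, script):
        # Placeholder: real code would submit the script through slurm here
        return f"slurm:{script}"


class InternalScheduler(JobScheduler):
    def __init__(self):
        self._next_id = 1

    def submit(self, script):
        # Placeholder: real code would launch the script on the client nodes
        job_id = self._next_id
        self._next_id += 1
        return f"internal:{job_id}:{script}"


# "internal" is an assumed name for the non-slurm scheduler setting
SCHEDULERS = {"slurm": SlurmScheduler, "internal": InternalScheduler}


def get_scheduler(name):
    """Return a scheduler instance for the configured scheduler name."""
    try:
        return SCHEDULERS[name]()
    except KeyError:
        raise ValueError(f"Unknown job scheduler: {name}") from None
```

The test code would pick the scheduler once, for example get_scheduler(self.job_scheduler), and call submit() without branching on the scheduler type again.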

@@ -559,7 +701,7 @@ def run_soak(self, test_param):
         resv_bytes = self.params.get("resv_bytes", test_param + "*", 500000000)
         ignore_soak_errors = self.params.get("ignore_soak_errors", test_param + "*", False)
         self.enable_il = self.params.get("enable_intercept_lib", test_param + "*", False)
-        self.sudo_cmd = "sudo" if enable_sudo else ""
+        self.sudo_cmd = "sudo -n" if enable_sudo else ""
Contributor

We should consider replacing uses of self.sudo_cmd with command_as_user() in the future to simplify maintenance like this.

from test_utils_container import add_container

H_LOCK = threading.Lock()
id_counter = count(start=1)
Contributor

fwiw I agree with Dalton, especially if the id_counter is used in file names.

job_log (str): job std out
error_log (str): job std error
timeout (int): job timeout
test (TestObj): soak test obj
Contributor

Due to line 433 this should be:

Suggested change
- test (TestObj): soak test obj
+ test (SoakTestBase): soak test obj

Comment on lines +443 to +446
if isinstance(host_list, str):
    # assume one host in list
    hosts = host_list
    rhost = host_list
Contributor

Why do we support a list or string here? We should keep all host information in a NodeSet object.

@mjean308 requested a review from a team on December 11, 2024 15:00
@daltonbohning added the forced-landing label (The PR has known failures or has intentionally reduced testing, but should still be landed.) on Dec 11, 2024
@daltonbohning merged commit 908f20e into master on Dec 11, 2024
42 checks passed
@daltonbohning deleted the mjean/DAOS-16167 branch on December 11, 2024 19:11
daltonbohning pushed a commit that referenced this pull request Dec 11, 2024
Update soak to support using an internal job scheduler.

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: soak_smoke

Required-githooks: true

Signed-off-by: Maureen Jean <[email protected]>
phender pushed a commit that referenced this pull request Dec 12, 2024
…) (#15595)

Update soak to support using an internal job scheduler.

Signed-off-by: Maureen Jean <[email protected]>
Co-authored-by: mjean308 <[email protected]>