DAOS-16167 test: update soak test to use internal job scheduler #14775
Ticket title is 'Soak: update soak test to use internal job scheduler instead of depending on slurm' |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/1/execution/node/415/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/1/execution/node/313/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/1/execution/node/357/log |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/373/log |
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/414/log |
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/364/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/360/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/361/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/367/log |
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/383/log |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/387/log |
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/363/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/365/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/364/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/353/log |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/359/log |
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/341/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/344/log |
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/467/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/398/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/336/log |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/383/log |
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/351/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/350/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/357/log |
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/518/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/352/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Required-githooks: true Signed-off-by: Maureen Jean <[email protected]>
015b254 to 47e9f7c
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Signed-off-by: Maureen Jean <[email protected]>
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/43/execution/node/936/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Required-githooks: true Skipped-githooks: flake,pylint Signed-off-by: Maureen Jean <[email protected]>
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Required-githooks: true Skipped-githooks: flake,pylint Signed-off-by: Maureen Jean <[email protected]>
Signed-off-by: Maureen Jean <[email protected]>
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/49/execution/node/825/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/50/execution/node/936/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Signed-off-by: Maureen Jean <[email protected]>
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/52/execution/node/937/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/53/execution/node/936/log |
if self.host_info.clients.partition.reservation:
    self.srun_params["reservation"] = self.host_info.clients.partition.reservation
# Include test node for log cleanup; remove from client list
self.job_scheduler = self.params.get("job_scheduler", "/run/*", default="slurm")
Once this is better tested and stable, I think the default should be the internal scheduler, and we should phase out slurm.
Agreed.
lib_path = os.getenv("LD_LIBRARY_PATH")
path = os.getenv("PATH")
v_env = os.getenv("VIRTUAL_ENV")
env = ";".join([f"export LD_LIBRARY_PATH={lib_path}",
                f"export PATH={path}"])
if v_env:
    env = ";".join([env, f"export VIRTUAL_ENV={v_env}"])
We should do for `lib_path` what we do for `v_env`, since `LD_LIBRARY_PATH` might not be set.
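A minimal sketch of the guard suggested here, assuming a hypothetical helper `build_env_exports` (the name and signature are not from the PR). It emits an export only for variables that are actually set, so an unset `LD_LIBRARY_PATH` never becomes `export LD_LIBRARY_PATH=None`:

```python
import os


def build_env_exports(names=("LD_LIBRARY_PATH", "PATH", "VIRTUAL_ENV"), environ=None):
    """Return a ';'-joined string of export statements for the variables that are set.

    Hypothetical helper for illustration; the names and signature are assumptions.
    """
    environ = os.environ if environ is None else environ
    exports = []
    for name in names:
        value = environ.get(name)
        if value:  # skip unset or empty variables so "export X=None" is never emitted
            exports.append(f"export {name}={value}")
    return ";".join(exports)
```

Treating every variable the same way also removes the special case for `VIRTUAL_ENV`.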
job_queue.put(results)
# give time to update the queue before exiting
time.sleep(0.5)
Is this necessary? As far as I'm aware, the queue should be updated as soon as `put` returns.
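Whether the sleep is needed depends on the queue type, which the quoted snippet does not show. A minimal sketch, assuming a thread-based `queue.Queue` (an assumption; with `multiprocessing.Queue`, `put` hands the item to a background feeder thread, so the timing differs). The `worker` function and the result dict are hypothetical:

```python
import queue
import threading


def worker(job_queue, job_id):
    """Hypothetical job thread: publish a result dict and return immediately."""
    results = {"handle": job_id, "state": "COMPLETED"}
    job_queue.put(results)  # for queue.Queue the item is enqueued when put() returns


job_queue = queue.Queue()
thread = threading.Thread(target=worker, args=(job_queue, 1))
thread.start()
thread.join()  # wait for the worker instead of sleeping a fixed interval

result = job_queue.get_nowait()  # no sleep needed; the item is already there
```

Joining the worker gives a deterministic wait, whereas a fixed `time.sleep(0.5)` is either too long or not a guarantee.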
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Signed-off-by: Maureen Jean <[email protected]>
Just some nits.
@@ -144,7 +148,8 @@ def pre_tear_down(self):
if self.all_failed_jobs:
    errors.append("SOAK FAILED: The following jobs failed {} ".format(
        " ,".join(str(j_id) for j_id in self.all_failed_jobs)))

# cleanup any remaining jobs
job_cleanup(self.log, self.hostlist_clients)
In the future it would be nice to remove the pre_tear_down method now that we can use register_cleanup to more effectively handle tearDown operations. Just a note; not a requested change.
job_node_list = node_list[:node_count]
debug_logging(
    self.log,
    self.enable_debug_msg,
    f"DBG: node_list before launch_job {node_list}")
Is this supposed to be logging `job_node_list`? Otherwise it doesn't seem to add anything different from the previous debug message.
    return next(id_counter)


def debug_logging(log, enable_debug_msg, log_msg):
This may be better defined as a `SoakTestBase` method, where calls would look like `self.debug_logging("some message")` or `test.debug_logging("some message")`.
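A minimal sketch of that method form. `SoakTestBaseSketch` is a stand-in class, not the real `SoakTestBase`, and the body assumes the helper simply logs at debug level when `enable_debug_msg` is set (the original function body is not shown in this review):

```python
import logging


class SoakTestBaseSketch:
    """Stand-in for the soak test base class; only the pieces needed here."""

    def __init__(self, enable_debug_msg=False):
        self.log = logging.getLogger("soak_sketch")
        self.enable_debug_msg = enable_debug_msg

    def debug_logging(self, log_msg):
        """Log the message only when debug messages are enabled (assumed behavior)."""
        if self.enable_debug_msg:
            self.log.debug(log_msg)
```

Callers then pass only the message, since the logger and the enable flag already live on the test object.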
err_msg = f"Slurm failed to submit job for {script}"
job_id_list = []
raise SoakTestError(f"<<FAILED: Soak {self.test_name}: {err_msg}>>")
if self.job_scheduler == "slurm":
It would have been nice to have job scheduler classes to avoid multiple if/else statements.
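One way the scheduler-class idea could look, as a sketch only. The class names, the `submit` method, the `"internal"` key, and the placeholder return values are all hypothetical and not taken from the PR:

```python
from abc import ABC, abstractmethod


class JobScheduler(ABC):
    """Hypothetical common interface; not an API from the PR."""

    @abstractmethod
    def submit(self, script):
        """Submit a job script and return a job id."""


class SlurmScheduler(JobScheduler):
    def submit(self, script):
        # A real implementation would call sbatch; this placeholder only labels the job.
        return f"slurm:{script}"


class InternalScheduler(JobScheduler):
    def __init__(self):
        self._next_id = 1

    def submit(self, script):
        # A real implementation would queue the script for a worker thread.
        job_id = self._next_id
        self._next_id += 1
        return f"internal:{job_id}:{script}"


SCHEDULERS = {"slurm": SlurmScheduler, "internal": InternalScheduler}


def get_scheduler(name):
    """Select the scheduler once; callers then use scheduler.submit() without branching."""
    try:
        return SCHEDULERS[name]()
    except KeyError:
        raise ValueError(f"unknown job_scheduler: {name}") from None
```

With this shape, the `job_scheduler` parameter is checked in one place, and the repeated `if self.job_scheduler == "slurm":` branches collapse into method calls on the selected object.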
@@ -559,7 +701,7 @@ def run_soak(self, test_param):
  resv_bytes = self.params.get("resv_bytes", test_param + "*", 500000000)
  ignore_soak_errors = self.params.get("ignore_soak_errors", test_param + "*", False)
  self.enable_il = self.params.get("enable_intercept_lib", test_param + "*", False)
- self.sudo_cmd = "sudo" if enable_sudo else ""
+ self.sudo_cmd = "sudo -n" if enable_sudo else ""
We should consider replacing uses of `self.sudo_cmd` with `command_as_user()` in the future to simplify maintenance like this.
from test_utils_container import add_container

H_LOCK = threading.Lock()
id_counter = count(start=1)
FWIW, I agree with Dalton, especially if the `id_counter` is used in file names.
job_log (str): job std out
error_log (str): job std error
timeout (int): job timeout
test (TestObj): soak test obj
Due to line 433 this should be:
- test (TestObj): soak test obj
+ test (SoakTestBase): soak test obj
if isinstance(host_list, str):
    # assume one host in list
    hosts = host_list
    rhost = host_list
Why do we support a list or string here? We should keep all host information in a NodeSet object.
Update soak to support using an internal job scheduler. Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Required-githooks: true Signed-off-by: Maureen Jean <[email protected]>
…) (#15595) Update soak to support using an internal job scheduler. Signed-off-by: Maureen Jean <[email protected]> Co-authored-by: mjean308 <[email protected]>
Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: soak_smoke
Required-githooks: true
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: