DAOS-16464 test: improve online_rebuild_mdtest.py #15108
Ticket title is 'erasurecode/online_rebuild_mdtest.py:EcodOnlineRebuildMdtest.test_ec_online_rebuild_mdtest - time out waiting for mdtest after server stop'
Test stage Functional Hardware Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15108/1/testReport/
f37a473 to d628240
Test stage Functional Hardware Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15108/2/testReport/
872cbcc to 07ef8e7
a176256 to 4c1252a
Run with a stonewall and stop ranks after half of the stonewall time so the timing is more reliable than arbitrarily sleeping for 30 seconds. Test-tag: EcodOnlineRebuildMdtest Test-repeat: 3 Skip-unit-tests: true Skip-fault-injection-test: true Required-githooks: true Signed-off-by: Dalton Bohning <[email protected]>
4c1252a to 34173c2
Test-tag: EcodOnlineRebuildMdtest Test-repeat: 3 Skip-unit-tests: true Skip-fault-injection-test: true Required-githooks: true Signed-off-by: Padmanabhan <[email protected]>
@rpadma2 It looks to run out of space during archiving: https://build.hpdd.intel.com/blue/rest/organizations/jenkins/pipelines/daos-stack/pipelines/daos/branches/PR-15108/runs/12/nodes/809/log/?start=0 It looks like many other tests use either |
Yes. We can't have the DEBUG log mask... It may be an issue. Let me change it to ERR.
Test-tag: EcodOnlineRebuildMdtest Test-repeat: 3 Skip-unit-tests: true Skip-fault-injection-test: true Signed-off-by: Padmanabhan <[email protected]>
Test-tag: EcodOnlineRebuildMdtest Test-repeat: 3 Skip-unit-tests: true Skip-fault-injection-test: true
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15108/21/execution/node/925/log
src/tests/ftest/util/ec_utils.py
Outdated
"""
try:
    result = self.execute_mdtest(mdtest_result_queue)
except (CommandFailure, DaosApiError, DaosTestError):
We need to catch all exceptions because we don't know what might go wrong
- except (CommandFailure, DaosApiError, DaosTestError):
+ except Exception:  # pylint: disable=broad-except
Will update it.
done.
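The point of the suggestion above can be illustrated with a minimal, self-contained sketch. It is not the code from ec_utils.py: `run_in_thread`, `failing_job`, and the `("PASS"/"FAIL", detail)` tuple format are hypothetical names chosen for illustration. The idea is that a worker thread reports any exception through a queue, so an unexpected error type cannot leave the main thread waiting for a result that never arrives.

```python
import queue
import threading


def run_in_thread(target, result_queue):
    """Run target() and report the outcome through result_queue.

    Hypothetical helper for illustration only, not the DAOS implementation.
    """
    try:
        target()
        result_queue.put(("PASS", None))
    except Exception as error:  # pylint: disable=broad-except
        # Any exception, expected or not, is reported to the main thread.
        result_queue.put(("FAIL", repr(error)))


def failing_job():
    """Raise an error type that a narrow except clause would have missed."""
    raise KeyError("unexpected")


results = queue.Queue()
worker = threading.Thread(target=run_in_thread, args=(failing_job, results))
worker.start()
worker.join()
status, detail = results.get(timeout=1)
print(status, detail)
```

With a narrow `except (CommandFailure, DaosApiError, DaosTestError)` clause, the `KeyError` here would end the thread without anything being put on the queue.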
src/tests/ftest/util/ec_utils.py
Outdated
self.container = self.get_mdtest_container(self.pool)
if self.container is None:
    self.fail("Container Create Failed")
Default behavior of get_mdtest_container/get_container will raise an exception so we don't need to check this
- self.container = self.get_mdtest_container(self.pool)
- if self.container is None:
-     self.fail("Container Create Failed")
+ self.container = self.get_mdtest_container(self.pool)
Will update it.
done.
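A small sketch of why the `None` check is redundant when the factory raises on failure. The names `get_container` and `ContainerCreateError` below are hypothetical stand-ins, not the test framework's real helper or exception type.

```python
class ContainerCreateError(Exception):
    """Hypothetical error type standing in for the framework's exception."""


def get_container(pool, fail=False):
    """Hypothetical factory: returns a container or raises, never returns None."""
    if fail:
        raise ContainerCreateError(f"container create failed on {pool}")
    return {"pool": pool}


# Success path: the return value can be used directly, no None check needed.
container = get_container("pool0")

# Failure path: the error surfaces as an exception at the call site.
try:
    get_container("pool0", fail=True)
    outcome = "created"
except ContainerCreateError as error:
    outcome = str(error)

print(container, outcome)
```

Since the failure path never returns, a following `if container is None` branch would be unreachable.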
@daltonbohning: Can you take a look at the latest changes? If things are fine, we can move this PR to the "Ready for review" state.
num_of_files_dirs: 10000000
stonewall_timer: 10
stonewall_statusfile: stoneWallingStatusFile
num_of_files_dirs: 100000009
Why is the last digit 9? It seems random.
Must be a mistake... Will fix it.
done.
code LGTM. Thanks!
@daos-stack/daos-gatekeeper: This PR is ready for merge.
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15108/23/execution/node/925/log
@rpadma2 Did you retrigger the testing? It was retriggered and failed 1 time: https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-15108/23/tests
There is already a ticket related to this issue: https://daosio.atlassian.net/issues/DAOS-16737?jql=textfields%20~%20%22pool%20destroy%2A%22 . Looks like a known issue.
- Run with a stonewall and stop ranks after half of the stonewall time so the timing is more reliable than arbitrarily sleeping for 30 seconds.
- Catch exceptions raised in the mdtest thread.
- Reduce logging.
- Misc refactoring improvements
Signed-off-by: Dalton Bohning <[email protected]>
Signed-off-by: Padmanabhan <[email protected]>

Test-tag: EcodOnlineRebuildMdtest
Test-repeat: 3
Skip-unit-tests: true
Skip-fault-injection-test: true
- Run with a stonewall and stop ranks after half of the stonewall time so the timing is more reliable than arbitrarily sleeping for 30 seconds.
- Catch exceptions raised in the mdtest thread.
- Reduce logging.
- Misc refactoring improvements
Signed-off-by: Dalton Bohning <[email protected]>
Signed-off-by: Padmanabhan <[email protected]>
Run with a stonewall and stop ranks after half of the stonewall time so the timing is more reliable than arbitrarily sleeping for 30 seconds.
Test-tag: EcodOnlineRebuildMdtest
Test-repeat: 3
Skip-unit-tests: true
Skip-fault-injection-test: true
Required-githooks: true
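The timing approach in the description can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the test's actual code: `run_with_midpoint_fault`, `fake_mdtest`, and `stop_ranks` are hypothetical names, and the workload is simulated with `time.sleep` instead of a real mdtest run bounded by a stonewall timer.

```python
import threading
import time


def run_with_midpoint_fault(workload, inject_fault, stonewall_seconds):
    """Start workload in a thread and inject a fault at half the stonewall time.

    Hypothetical helper: workload(seconds) is assumed to run for about
    `seconds`, the way a stonewalled mdtest run would.
    """
    events = []
    start = time.monotonic()

    def timed_workload():
        workload(stonewall_seconds)
        events.append(("workload_done", time.monotonic() - start))

    thread = threading.Thread(target=timed_workload)
    thread.start()
    # Wait for half of the stonewall window so the fault lands mid-run,
    # instead of sleeping for a fixed, unrelated amount of time.
    time.sleep(stonewall_seconds / 2)
    inject_fault()
    events.append(("fault_injected", time.monotonic() - start))
    thread.join()
    return events


stopped = []


def fake_mdtest(seconds):
    """Stand-in for a stonewalled mdtest run (assumption for this sketch)."""
    time.sleep(seconds)


def stop_ranks():
    """Stand-in for stopping server ranks (assumption for this sketch)."""
    stopped.append("ranks")


events = run_with_midpoint_fault(fake_mdtest, stop_ranks, stonewall_seconds=0.4)
print([name for name, _ in events])
```

Because the delay is derived from the stonewall duration, the fault is injected while the workload is still running regardless of how long the window is configured to be.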
Before requesting gatekeeper:
- Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.

Gatekeeper: