-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-3885 test: Fix crash in shutdown and try dmg shutdown in NLT #4022
Conversation
Signed-off-by: Ashley Pittman <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
FYI: Errors found in lines not modified in the patch:
utils/node_local_test.py:440:
(lint) Remove this block when daos_server shutdown works.
utils/node_local_test.py:488:
(lint) Enable memleak checking when server shutdown works.
utils/node_local_test.py:1233:
(lint) This doesn't work with two pools, partly related to
utils/node_local_test.py:1277:
(lint) change this to something else, md5sum uses fread which isn't
utils/node_local_test.py:1482:
(pylint-line-too-long) Line too long (81/80)
utils/node_local_test.py:519:
(pylint-protected-access) Access to a protected member _daos of a client class
utils/node_local_test.py:1393:
(pylint-protected-access) Access to a protected member _cleanup of a client class
Signed-off-by: Ashley Pittman <[email protected]>
This doesn't really resolve anything, but does avoid the segfault/stack trace in the NLT output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
FYI: Errors found in lines not modified in the patch:
utils/node_local_test.py:443:
(lint) Remove this block when daos_server shutdown works.
utils/node_local_test.py:487:
(lint) Enable memleak checking when server shutdown works.
utils/node_local_test.py:1232:
(lint) This doesn't work with two pools, partly related to
utils/node_local_test.py:1276:
(lint) change this to something else, md5sum uses fread which isn't
utils/node_local_test.py:1481:
(pylint-line-too-long) Line too long (81/80)
utils/node_local_test.py:518:
(pylint-protected-access) Access to a protected member _daos of a client class
utils/node_local_test.py:1392:
(pylint-protected-access) Access to a protected member _cleanup of a client class
start = time.time() | ||
while True: | ||
time.sleep(0.5) | ||
rc = self.run_dmg(['system', 'query']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, I am changing this in my PR to use JSON output instead of doing string matching: https://github.com/daos-stack/daos/pull/4064/files#diff-74f1d143169d1655c88e727ed677f50cb035b054c6fc1ace2c615b3c91210725R392
We should probably pull the run_dmg()/json.loads() into a helper that appends the --json argument and parses the response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth knowing. This code still uses the --recreate-superblocks option so does need updating.
I still can't get daos to stop though, it seems to return "Stopping" continuously and never progress to "Stopped" so I almost left out this part of the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I think we may have a regression here that wasn't noticed because we haven't been using stop without --force. @tanabarr, please take a look at this. I tried Ashley's PR on master and on your latest events branch and in both cases the second instance on my system never moved from stopping->stopped. I can also see that there is still a daos_io_server process running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the other change in this PR daos_io_server would probably segfault, and therefore wouldn't be running afterwards, although I doubt that's how we intend it to work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of stop without force normally results in errored state but this PR catches the segfault as Ashley has mentioned, the behaviour you are describing is an artifact of suppressing that. happy to look into it further if you could raise a ticket and assign to me. I've worked with Ashley and we've got this PR working now without the use of recreate-superblocks.
Signed-off-by: Ashley Pittman <[email protected]>
start = time.time() | ||
while True: | ||
time.sleep(0.5) | ||
rc = self.run_dmg(['system', 'query']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of stop without force normally results in errored state but this PR catches the segfault as Ashley has mentioned, the behaviour you are describing is an artifact of suppressing that. happy to look into it further if you could raise a ticket and assign to me. I've worked with Ashley and we've got this PR working now without the use of recreate-superblocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
FYI: Errors found in lines not modified in the patch:
utils/node_local_test.py:458:
(lint) Remove this block when daos_server shutdown works.
utils/node_local_test.py:502:
(lint) Enable memleak checking when server shutdown works.
utils/node_local_test.py:1249:
(lint) This doesn't work with two pools, partly related to
utils/node_local_test.py:1293:
(lint) change this to something else, md5sum uses fread which isn't
utils/node_local_test.py:1498:
(pylint-line-too-long) Line too long (81/80)
utils/node_local_test.py:535:
(pylint-protected-access) Access to a protected member _daos of a client class
utils/node_local_test.py:1409:
(pylint-protected-access) Access to a protected member _cleanup of a client class
Signed-off-by: Ashley Pittman <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style warning(s) for job https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-4022/4/
Please review https://wiki.hpdd.intel.com/display/DC/Coding+Rules
FYI: Errors found in lines not modified in the patch:
utils/node_local_test.py:466:
(lint) Remove this block when daos_server shutdown works.
utils/node_local_test.py:510:
(lint) Enable memleak checking when server shutdown works.
utils/node_local_test.py:1257:
(lint) This doesn't work with two pools, partly related to
utils/node_local_test.py:1301:
(lint) change this to something else, md5sum uses fread which isn't
utils/node_local_test.py:1506:
(pylint-line-too-long) Line too long (81/80)
utils/node_local_test.py:543:
(pylint-protected-access) Access to a protected member _daos of a client class
utils/node_local_test.py:1417:
(pylint-protected-access) Access to a protected member _cleanup of a client class
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4022/4/execution/node/61/log |
Rebased, because there was a test failure I couldn't explain. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style warning(s) for job https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-4022/5/
Please review https://wiki.hpdd.intel.com/display/DC/Coding+Rules
FYI: Errors found in lines not modified in the patch:
utils/node_local_test.py:466:
(lint) Remove this block when daos_server shutdown works.
utils/node_local_test.py:510:
(lint) Enable memleak checking when server shutdown works.
utils/node_local_test.py:1257:
(lint) This doesn't work with two pools, partly related to
utils/node_local_test.py:1301:
(lint) change this to something else, md5sum uses fread which isn't
utils/node_local_test.py:1506:
(pylint-line-too-long) Line too long (81/80)
utils/node_local_test.py:543:
(pylint-protected-access) Access to a protected member _daos of a client class
utils/node_local_test.py:1417:
(pylint-protected-access) Access to a protected member _cleanup of a client class
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4022/5/execution/node/60/log |
Can I request re-review of this please? I think it's ready to land as-is, but it may conflict with the change to json that @mjmac linked to. |
This PR includes one regression in NLT however that is to be expected, it replaces a known and ignored SEGV with a error message so it is a step forwards. It also hit https://jira.hpdd.intel.com/browse/DAOS-6409 in a previous build which I retriggered and has now passed, so that might be a factor in why it's listed as having failed tests. |
…os-stack#4022) * Fix crash in shutdown and trace memleaks in server. * Remove old workaround, and do not check for leaks. * Use dmg to format, rather than --recreate-superblocks. * Handle the case where /mnt/daos is mounted, but empty. Signed-off-by: Ashley Pittman <[email protected]>
Signed-off-by: Ashley Pittman [email protected]