Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-3885 test: Fix crash in shutdown and try dmg shutdown in NLT #4022

Merged
merged 7 commits into from
Jan 5, 2021

Conversation

ashleypittman
Copy link
Contributor

Signed-off-by: Ashley Pittman [email protected]

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

FYI: Errors found in lines not modified in the patch:

utils/node_local_test.py:440:
(lint) Remove this block when daos_server shutdown works.

utils/node_local_test.py:488:
(lint) Enable memleak checking when server shutdown works.

utils/node_local_test.py:1233:
(lint) This doesn't work with two pools, partly related to

utils/node_local_test.py:1277:
(lint) change this to something else, md5sum uses fread which isn't

utils/node_local_test.py:1482:
(pylint-line-too-long) Line too long (81/80)

utils/node_local_test.py:519:
(pylint-protected-access) Access to a protected member _daos of a client class

utils/node_local_test.py:1393:
(pylint-protected-access) Access to a protected member _cleanup of a client class

@ashleypittman ashleypittman changed the title Fix crash in shutdown and trace memleaks in server. DAOS-5885 test: Fix crash in shutdown and try dmg shutdown in NLT Dec 22, 2020
@ashleypittman ashleypittman requested a review from mjmac December 22, 2020 16:01
@ashleypittman
Copy link
Contributor Author

This doesn't really resolve anything, but does avoid the segfault/stack trace in the NLT output.

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

FYI: Errors found in lines not modified in the patch:

utils/node_local_test.py:443:
(lint) Remove this block when daos_server shutdown works.

utils/node_local_test.py:487:
(lint) Enable memleak checking when server shutdown works.

utils/node_local_test.py:1232:
(lint) This doesn't work with two pools, partly related to

utils/node_local_test.py:1276:
(lint) change this to something else, md5sum uses fread which isn't

utils/node_local_test.py:1481:
(pylint-line-too-long) Line too long (81/80)

utils/node_local_test.py:518:
(pylint-protected-access) Access to a protected member _daos of a client class

utils/node_local_test.py:1392:
(pylint-protected-access) Access to a protected member _cleanup of a client class

start = time.time()
while True:
time.sleep(0.5)
rc = self.run_dmg(['system', 'query'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I am changing this in my PR to use JSON output instead of doing string matching: https://github.com/daos-stack/daos/pull/4064/files#diff-74f1d143169d1655c88e727ed677f50cb035b054c6fc1ace2c615b3c91210725R392

We should probably pull the run_dmg()/json.loads() into a helper that appends the --json argument and parses the response.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth knowing. This code still uses the --recreate-superblocks option so does need updating.

I still can't get daos to stop though, it seems to return "Stopping" continuously and never progress to "Stopped" so I almost left out this part of the PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I think we may have a regression here that wasn't noticed because we haven't been using stop without --force. @tanabarr, please take a look at this. I tried Ashley's PR on master and on your latest events branch and in both cases the second instance on my system never moved from stopping->stopped. I can also see that there is still a daos_io_server process running.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the other change in this PR daos_io_server would probably segfault, and therefore wouldn't be running afterwards, although I doubt that's how we intend it to work.

Copy link
Contributor

@tanabarr tanabarr Dec 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of stop without force normally results in errored state but this PR catches the segfault as Ashley has mentioned, the behaviour you are describing is an artifact of suppressing that. happy to look into it further if you could raise a ticket and assign to me. I've worked with Ashley and we've got this PR working now without the use of recreate-superblocks.

tanabarr
tanabarr previously approved these changes Dec 23, 2020
start = time.time()
while True:
time.sleep(0.5)
rc = self.run_dmg(['system', 'query'])
Copy link
Contributor

@tanabarr tanabarr Dec 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of stop without force normally results in errored state but this PR catches the segfault as Ashley has mentioned, the behaviour you are describing is an artifact of suppressing that. happy to look into it further if you could raise a ticket and assign to me. I've worked with Ashley and we've got this PR working now without the use of recreate-superblocks.

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

FYI: Errors found in lines not modified in the patch:

utils/node_local_test.py:458:
(lint) Remove this block when daos_server shutdown works.

utils/node_local_test.py:502:
(lint) Enable memleak checking when server shutdown works.

utils/node_local_test.py:1249:
(lint) This doesn't work with two pools, partly related to

utils/node_local_test.py:1293:
(lint) change this to something else, md5sum uses fread which isn't

utils/node_local_test.py:1498:
(pylint-line-too-long) Line too long (81/80)

utils/node_local_test.py:535:
(pylint-protected-access) Access to a protected member _daos of a client class

utils/node_local_test.py:1409:
(pylint-protected-access) Access to a protected member _cleanup of a client class

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style warning(s) for job https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-4022/4/
Please review https://wiki.hpdd.intel.com/display/DC/Coding+Rules

FYI: Errors found in lines not modified in the patch:

utils/node_local_test.py:466:
(lint) Remove this block when daos_server shutdown works.

utils/node_local_test.py:510:
(lint) Enable memleak checking when server shutdown works.

utils/node_local_test.py:1257:
(lint) This doesn't work with two pools, partly related to

utils/node_local_test.py:1301:
(lint) change this to something else, md5sum uses fread which isn't

utils/node_local_test.py:1506:
(pylint-line-too-long) Line too long (81/80)

utils/node_local_test.py:543:
(pylint-protected-access) Access to a protected member _daos of a client class

utils/node_local_test.py:1417:
(pylint-protected-access) Access to a protected member _cleanup of a client class

utils/node_local_test.py Show resolved Hide resolved
@daosbuild1
Copy link
Collaborator

@ashleypittman
Copy link
Contributor Author

Rebased, because there was a test failure I couldn't explain.

@daosbuild1 daosbuild1 dismissed their stale review December 24, 2020 09:33

Updated patch

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style warning(s) for job https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-4022/5/
Please review https://wiki.hpdd.intel.com/display/DC/Coding+Rules

FYI: Errors found in lines not modified in the patch:

utils/node_local_test.py:466:
(lint) Remove this block when daos_server shutdown works.

utils/node_local_test.py:510:
(lint) Enable memleak checking when server shutdown works.

utils/node_local_test.py:1257:
(lint) This doesn't work with two pools, partly related to

utils/node_local_test.py:1301:
(lint) change this to something else, md5sum uses fread which isn't

utils/node_local_test.py:1506:
(pylint-line-too-long) Line too long (81/80)

utils/node_local_test.py:543:
(pylint-protected-access) Access to a protected member _daos of a client class

utils/node_local_test.py:1417:
(pylint-protected-access) Access to a protected member _cleanup of a client class

utils/node_local_test.py Show resolved Hide resolved
@daosbuild1
Copy link
Collaborator

@ashleypittman
Copy link
Contributor Author

Can I request re-review of this please? I think it's ready to land as-is, but it may conflict with the change to json that @mjmac linked to.

@ashleypittman ashleypittman requested a review from a team January 5, 2021 17:00
@ashleypittman
Copy link
Contributor Author

This PR includes one regression in NLT however that is to be expected, it replaces a known and ignored SEGV with a error message so it is a step forwards.

It also hit https://jira.hpdd.intel.com/browse/DAOS-6409 in a previous build which I retriggered and has now passed, so that might be a factor in why it's listed as having failed tests.

@ashleypittman ashleypittman changed the title DAOS-5885 test: Fix crash in shutdown and try dmg shutdown in NLT DAOS-3885 test: Fix crash in shutdown and try dmg shutdown in NLT Jan 5, 2021
@jolivier23 jolivier23 merged commit b066801 into master Jan 5, 2021
@jolivier23 jolivier23 deleted the nlt-shutdown branch January 5, 2021 17:21
dsikich pushed a commit to dsikich/daos that referenced this pull request Jan 5, 2021
…os-stack#4022)

* Fix crash in shutdown and trace memleaks in server.
* Remove old workaround, and do not check for leaks.
* Use dmg to format, rather than --recreate-superblocks.
* Handle the case where /mnt/daos is mounted, but empty.

Signed-off-by: Ashley Pittman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants