-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-13548 client: enable one NLT fault injection test for libpil4dfs #12664
base: master
Are you sure you want to change the base?
Changes from all commits
cb4f615
5360139
806b59a
f144333
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -274,7 +274,7 @@ rec_free(struct d_hash_table *htable, d_list_t *rlink) | |
|
||
rc = dfs_release(hdl->oh); | ||
if (rc == ENOMEM) | ||
rc = dfs_release(hdl->oh); | ||
dfs_release(hdl->oh); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Was there a problem with the code here before? It's always a good idea to review logging and see if it's appropriate so if you're removing logging as you think it improves the end result that's fine. The presence of logging itself however will not cause failures in NLT, as long as the messages are well-formed which this one looked to be. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The log here does not cause any failure. I will put it back. Thank you! |
||
if (rc) | ||
D_ERROR("dfs_release() failed: %d (%s)\n", rc, strerror(rc)); | ||
D_FREE(hdl); | ||
|
@@ -315,7 +315,7 @@ lookup_insert_dir(struct dfs_mt *mt, const char *name, size_t len, dfs_obj_t **o | |
dfs_obj_t *oh; | ||
d_list_t *rlink; | ||
mode_t mode; | ||
int rc; | ||
int rc, rc2; | ||
|
||
/* TODO: Remove this after testing. */ | ||
D_ASSERT(strlen(name) == len); | ||
|
@@ -333,7 +333,9 @@ lookup_insert_dir(struct dfs_mt *mt, const char *name, size_t len, dfs_obj_t **o | |
|
||
if (!S_ISDIR(mode)) { | ||
/* Not a directory, return ENOENT*/ | ||
dfs_release(oh); | ||
rc = dfs_release(oh); | ||
if (rc == ENOMEM) | ||
dfs_release(oh); | ||
return ENOTDIR; | ||
} | ||
|
||
|
@@ -357,7 +359,9 @@ lookup_insert_dir(struct dfs_mt *mt, const char *name, size_t len, dfs_obj_t **o | |
return 0; | ||
|
||
out_release: | ||
dfs_release(oh); | ||
rc2 = dfs_release(oh); | ||
if (rc2 == ENOMEM) | ||
dfs_release(oh); | ||
D_FREE(hdl); | ||
return rc; | ||
} | ||
|
@@ -619,8 +623,10 @@ discover_daos_mount_with_env(void) | |
D_FATAL("DAOS_CONTAINER is not set.\n"); | ||
D_GOTO(out, rc = EINVAL); | ||
} | ||
|
||
D_STRNDUP(dfs_list[num_dfs].fs_root, fs_root, len_fs_root); | ||
/* Added to avoid issues due to fault injection. Try again in case of ENOMEM */ | ||
if (dfs_list[num_dfs].fs_root == NULL) | ||
D_STRNDUP(dfs_list[num_dfs].fs_root, fs_root, len_fs_root); | ||
Comment on lines
+627
to
+629
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This change also isn't right. The fault injection test in NLT doesn't check that a use-case or function works, but rather than if it doesn't then there's no undesired behaviour. There's no need to put in hacks for this in order to make the test complete. What happens if you simply return the error here, as the code did before? As long as there's no crash and log lines are well-formed then that's enough to pass. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In this case, returning error "ENOMEM" from discover_daos_mount_with_env() leads to calling abort(). NLT fault injection will fail. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The correct thing for pil4dfs to do in case of startup errors (of which this is an example) is to fail with an abort() and a helpful error message. If you need to then update NLT to detect that error message and not check the exit code in that case. We typically still try and resolve all memory leaks however. If you take the approach I've described then it means the product behaves in a defined way, and the testing is verifying that. Adding this kind of hack is making the product worst to appease a test which is the wrong approach. |
||
if (dfs_list[num_dfs].fs_root == NULL) | ||
D_GOTO(out, rc = ENOMEM); | ||
|
||
|
@@ -1381,6 +1387,8 @@ free_fd(int idx) | |
|
||
if (saved_obj) { | ||
rc = dfs_release(saved_obj->file); | ||
if (rc == ENOMEM) | ||
dfs_release(saved_obj->file); | ||
/** This memset() is not necessary. It is left here intended. In case of duplicated | ||
* fd exists, multiple fd could point to same struct file_obj. struct file_obj is | ||
* supposed to be freed only when reference count reaches zero. With zeroing out | ||
|
@@ -1391,8 +1399,6 @@ free_fd(int idx) | |
*/ | ||
memset(saved_obj, 0, sizeof(struct file_obj)); | ||
D_FREE(saved_obj); | ||
if (rc) | ||
D_ERROR("dfs_release() failed: %d (%s)\n", rc, strerror(rc)); | ||
} | ||
} | ||
|
||
|
@@ -1429,8 +1435,8 @@ free_dirfd(int idx) | |
D_FREE(saved_obj->path); | ||
D_FREE(saved_obj->ents); | ||
rc = dfs_release(saved_obj->dir); | ||
if (rc) | ||
D_ERROR("dfs_release() failed: %d (%s)\n", rc, strerror(rc)); | ||
if (rc == ENOMEM) | ||
dfs_release(saved_obj->dir); | ||
/** This memset() is not necessary. It is left here intended. In case of duplicated | ||
* fd exists, multiple fd could point to same struct dir_obj. struct dir_obj is | ||
* supposed to be freed only when reference count reaches zero. With zeroing out | ||
|
@@ -2641,19 +2647,19 @@ opendir(const char *path) | |
D_ALLOC_ARRAY(dir_list[idx_dirfd]->ents, READ_DIR_BATCH_SIZE); | ||
if (dir_list[idx_dirfd]->ents == NULL) { | ||
free_dirfd(idx_dirfd); | ||
D_GOTO(out_err, rc = ENOMEM); | ||
D_GOTO(out_err_ret, rc = ENOMEM); | ||
} | ||
/* allocate memory for path[]. */ | ||
D_ASPRINTF(dir_list[idx_dirfd]->path, "%s%s", dfs_mt->fs_root, full_path); | ||
if (dir_list[idx_dirfd]->path == NULL) { | ||
free_dirfd(idx_dirfd); | ||
D_GOTO(out_err, rc = ENOMEM); | ||
D_GOTO(out_err_ret, rc = ENOMEM); | ||
} | ||
if (strnlen(dir_list[idx_dirfd]->path, DFS_MAX_PATH) >= DFS_MAX_PATH) { | ||
D_DEBUG(DB_ANY, "path is longer than DFS_MAX_PATH: %d (%s)\n", ENAMETOOLONG, | ||
strerror(ENAMETOOLONG)); | ||
free_dirfd(idx_dirfd); | ||
D_GOTO(out_err, rc = ENAMETOOLONG); | ||
D_GOTO(out_err_ret, rc = ENAMETOOLONG); | ||
} | ||
|
||
FREE(parent_dir); | ||
|
@@ -3197,6 +3203,8 @@ readlink(const char *path, char *buf, size_t size) | |
if (rc) | ||
goto out_release; | ||
rc = dfs_release(obj); | ||
if (rc == ENOMEM) | ||
rc = dfs_release(obj); | ||
if (rc) | ||
goto out_err; | ||
/* The NULL at the end should not be included in the length */ | ||
|
@@ -3209,8 +3217,8 @@ readlink(const char *path, char *buf, size_t size) | |
|
||
out_release: | ||
rc2 = dfs_release(obj); | ||
if (rc2) | ||
D_ERROR("dfs_release() failed: %d (%s)\n", rc2, strerror(rc2)); | ||
if (rc2 == ENOMEM) | ||
dfs_release(obj); | ||
|
||
out_err: | ||
FREE(parent_dir); | ||
|
@@ -3627,8 +3635,8 @@ rename(const char *old_name, const char *new_name) | |
D_FREE(buff); | ||
errno_save = rc; | ||
rc = dfs_release(obj_old); | ||
if (rc) | ||
D_ERROR("dfs_release() failed: %d (%s)\n", rc, strerror(rc)); | ||
if (rc == ENOMEM) | ||
dfs_release(obj_old); | ||
errno = errno_save; | ||
return (-1); | ||
|
||
|
@@ -3639,8 +3647,8 @@ rename(const char *old_name, const char *new_name) | |
fclose(fIn); | ||
errno_save = rc; | ||
rc = dfs_release(obj_new); | ||
if (rc) | ||
D_ERROR("dfs_release() failed: %d (%s)\n", rc, strerror(rc)); | ||
if (rc == ENOMEM) | ||
dfs_release(obj_new); | ||
errno = errno_save; | ||
return (-1); | ||
|
||
|
@@ -5498,7 +5506,7 @@ finalize_dfs(void) | |
continue; | ||
} | ||
rc = daos_pool_disconnect(dfs_list[i].poh, NULL); | ||
if (rc != 0) { | ||
if (rc != -DER_SUCCESS) { | ||
D_ERROR("error in daos_pool_disconnect(%s): " DF_RC "\n", | ||
dfs_list[i].fs_root, DP_RC(rc)); | ||
continue; | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1414,6 +1414,8 @@ dc_obj_open(tse_task_t *task) | |
|
||
args = dc_task_get_args(task); | ||
D_ASSERTF(args != NULL, "Task Argument OPC does not match DC OPC\n"); | ||
/* initialize with an invalid handle */ | ||
(*args->oh).cookie = 0; | ||
Comment on lines
+1417
to
+1418
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Much better. I still can't see where this was being used undefined, can you point it out to me? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Here is the calling stack when the handle is accessed in dc_obj_close(). #0 d_hhash_link_lookup (hhash=0x55b881a7e4d0, key=) at src/gurt/hash.c:1269 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That looks to be at https://github.com/daos-stack/daos/blob/master/src/client/array/dc_array.c#L675 which is in error handling code, do you know what allocation failed to get to that point? Perhaps https://github.com/daos-stack/daos/blob/master/src/client/array/dc_array.c#L618 should just return instead? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The allocation target of fault injection is, #1 0x00007f85ce6d8234 in d_should_fail (fault_attr=0x55585b488230) at src/gurt/fault_inject.c:792 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. https://github.com/daos-stack/daos/blob/master/src/client/array/dc_array.c#L661 open_handle_cb() is triggered to call dc_obj_close(). In case of uninitialized "oh", the content is non-zero. "daos_handle_is_valid(*args->oh)" returns true. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If a non-zero rc means the handle is invalid, then the error type is not relevant. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Or rather, there's been an error so the handles not valid so it doesn't need closing. What the error is is not a factor. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since I do not know how long time this PR can be landed, I created a separate ticket DAOS-14030 and PR #12691 to address this issue. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add it to this PR to see if it passes the test? Happy to land it separately but I'm like to see it tested here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok. I will add it to this PR soon. It does help to avoid the issue encountered in close in my test on boro. |
||
|
||
obj = obj_alloc(); | ||
if (obj == NULL) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This again is an example of a bug in dfs that ideally we'd fix. I'm not keen on this style of hack but there is a bug here that we should look at.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/daos-stack/daos/blob/master/src/client/api/array.c#L118-L120
When the fault injection causes dc_task_create() returns -DER_NOMEM, dc_task_schedule() is not called. The leaked memory was allocated by "D_ALLOC_PTR(array);" at line 91 in array_alloc() in dc_array.c,
https://github.com/daos-stack/daos/blob/master/src/client/array/dc_array.c#L91
Shall we retry "dc_task_create(dc_array_close, NULL, ev, &task);" in case it returns -DER_NOMEM inside daos_array_close()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we retry "dc_task_create(dc_array_close, NULL, ev, &task);" in case it returns -DER_NOMEM inside daos_array_close(), we need to retry "dc_task_schedule(task, true)" too. How do you want to proceed? Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retrying "rc = daos_array_close(file_oh, NULL);" may be reasonable to me. It is similar to retrying dfs_release(). What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
such hacks are really getting out of hand in the code. let's not add more please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. What should we do about the caused memory leak due to fault injection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, however this is a bug and it's impossible to address it at a higher level than dfs as the context has been lost by that point. Outside the scope of this PR however.