Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-13548 client: enable one NLT fault injection test for libpil4dfs #12664

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions src/client/dfs/dfs.c
Original file line number Diff line number Diff line change
Expand Up @@ -1078,6 +1078,10 @@ entry_stat(dfs_t *dfs, daos_handle_t th, daos_handle_t oh, const char *name, siz
}

rc = daos_array_close(file_oh, NULL);
/* Try again in case of out of memory. Needed to avoid memory leak in
* NLT fault injection */
if (rc == -DER_NOMEM)
rc = daos_array_close(file_oh, NULL);
Comment on lines +1081 to +1084
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This again is an example of a bug in dfs that ideally we'd fix. I'm not keen on this style of hack but there is a bug here that we should look at.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/daos-stack/daos/blob/master/src/client/api/array.c#L118-L120
When the fault injection causes dc_task_create() returns -DER_NOMEM, dc_task_schedule() is not called. The leaked memory was allocated by "D_ALLOC_PTR(array);" at line 91 in array_alloc() in dc_array.c,
https://github.com/daos-stack/daos/blob/master/src/client/array/dc_array.c#L91

Shall we retry "dc_task_create(dc_array_close, NULL, ev, &task);" in case it returns -DER_NOMEM inside daos_array_close()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we retry "dc_task_create(dc_array_close, NULL, ev, &task);" in case it returns -DER_NOMEM inside daos_array_close(), we need to retry "dc_task_schedule(task, true)" too. How do you want to proceed? Thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retrying "rc = daos_array_close(file_oh, NULL);" may be reasonable to me. It is similar to retrying dfs_release(). What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

such hacks are really getting out of hand in the code. let's not add more please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. What should we do about the caused memory leak due to fault injection?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

such hacks are really getting out of hand in the code. let's not add more please.

I agree, however this is a bug and it's impossible to address it at a higher level than dfs as the context has been lost by that point. Outside the scope of this PR however.

if (rc)
return daos_der2errno(rc);
}
Expand Down
46 changes: 27 additions & 19 deletions src/client/dfuse/pil4dfs/int_dfs.c
Original file line number Diff line number Diff line change
Expand Up @@ -274,7 +274,7 @@ rec_free(struct d_hash_table *htable, d_list_t *rlink)

rc = dfs_release(hdl->oh);
if (rc == ENOMEM)
rc = dfs_release(hdl->oh);
dfs_release(hdl->oh);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a problem with the code here before?

It's always a good idea to review logging and see if it's appropriate so if you're removing logging as you think it improves the end result that's fine. The presence of logging itself however will not cause failures in NLT, as long as the messages are well-formed which this one looked to be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log here does not cause any failure. I will put it back. Thank you!

if (rc)
D_ERROR("dfs_release() failed: %d (%s)\n", rc, strerror(rc));
D_FREE(hdl);
Expand Down Expand Up @@ -315,7 +315,7 @@ lookup_insert_dir(struct dfs_mt *mt, const char *name, size_t len, dfs_obj_t **o
dfs_obj_t *oh;
d_list_t *rlink;
mode_t mode;
int rc;
int rc, rc2;

/* TODO: Remove this after testing. */
D_ASSERT(strlen(name) == len);
Expand All @@ -333,7 +333,9 @@ lookup_insert_dir(struct dfs_mt *mt, const char *name, size_t len, dfs_obj_t **o

if (!S_ISDIR(mode)) {
/* Not a directory, return ENOENT*/
dfs_release(oh);
rc = dfs_release(oh);
if (rc == ENOMEM)
dfs_release(oh);
return ENOTDIR;
}

Expand All @@ -357,7 +359,9 @@ lookup_insert_dir(struct dfs_mt *mt, const char *name, size_t len, dfs_obj_t **o
return 0;

out_release:
dfs_release(oh);
rc2 = dfs_release(oh);
if (rc2 == ENOMEM)
dfs_release(oh);
D_FREE(hdl);
return rc;
}
Expand Down Expand Up @@ -619,8 +623,10 @@ discover_daos_mount_with_env(void)
D_FATAL("DAOS_CONTAINER is not set.\n");
D_GOTO(out, rc = EINVAL);
}

D_STRNDUP(dfs_list[num_dfs].fs_root, fs_root, len_fs_root);
/* Added to avoid issues due to fault injection. Try again in case of ENOMEM */
if (dfs_list[num_dfs].fs_root == NULL)
D_STRNDUP(dfs_list[num_dfs].fs_root, fs_root, len_fs_root);
Comment on lines +627 to +629
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change also isn't right. The fault injection test in NLT doesn't check that a use-case or function works, but rather than if it doesn't then there's no undesired behaviour. There's no need to put in hacks for this in order to make the test complete.

What happens if you simply return the error here, as the code did before? As long as there's no crash and log lines are well-formed then that's enough to pass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, returning error "ENOMEM" from discover_daos_mount_with_env() leads to calling abort(). NLT fault injection will fail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The correct thing for pil4dfs to do in case of startup errors (of which this is an example) is to fail with an abort() and a helpful error message. If you need to then update NLT to detect that error message and not check the exit code in that case. We typically still try and resolve all memory leaks however.

If you take the approach I've described then it means the product behaves in a defined way, and the testing is verifying that. Adding this kind of hack is making the product worst to appease a test which is the wrong approach.

if (dfs_list[num_dfs].fs_root == NULL)
D_GOTO(out, rc = ENOMEM);

Expand Down Expand Up @@ -1381,6 +1387,8 @@ free_fd(int idx)

if (saved_obj) {
rc = dfs_release(saved_obj->file);
if (rc == ENOMEM)
dfs_release(saved_obj->file);
/** This memset() is not necessary. It is left here intended. In case of duplicated
* fd exists, multiple fd could point to same struct file_obj. struct file_obj is
* supposed to be freed only when reference count reaches zero. With zeroing out
Expand All @@ -1391,8 +1399,6 @@ free_fd(int idx)
*/
memset(saved_obj, 0, sizeof(struct file_obj));
D_FREE(saved_obj);
if (rc)
D_ERROR("dfs_release() failed: %d (%s)\n", rc, strerror(rc));
}
}

Expand Down Expand Up @@ -1429,8 +1435,8 @@ free_dirfd(int idx)
D_FREE(saved_obj->path);
D_FREE(saved_obj->ents);
rc = dfs_release(saved_obj->dir);
if (rc)
D_ERROR("dfs_release() failed: %d (%s)\n", rc, strerror(rc));
if (rc == ENOMEM)
dfs_release(saved_obj->dir);
/** This memset() is not necessary. It is left here intended. In case of duplicated
* fd exists, multiple fd could point to same struct dir_obj. struct dir_obj is
* supposed to be freed only when reference count reaches zero. With zeroing out
Expand Down Expand Up @@ -2641,19 +2647,19 @@ opendir(const char *path)
D_ALLOC_ARRAY(dir_list[idx_dirfd]->ents, READ_DIR_BATCH_SIZE);
if (dir_list[idx_dirfd]->ents == NULL) {
free_dirfd(idx_dirfd);
D_GOTO(out_err, rc = ENOMEM);
D_GOTO(out_err_ret, rc = ENOMEM);
}
/* allocate memory for path[]. */
D_ASPRINTF(dir_list[idx_dirfd]->path, "%s%s", dfs_mt->fs_root, full_path);
if (dir_list[idx_dirfd]->path == NULL) {
free_dirfd(idx_dirfd);
D_GOTO(out_err, rc = ENOMEM);
D_GOTO(out_err_ret, rc = ENOMEM);
}
if (strnlen(dir_list[idx_dirfd]->path, DFS_MAX_PATH) >= DFS_MAX_PATH) {
D_DEBUG(DB_ANY, "path is longer than DFS_MAX_PATH: %d (%s)\n", ENAMETOOLONG,
strerror(ENAMETOOLONG));
free_dirfd(idx_dirfd);
D_GOTO(out_err, rc = ENAMETOOLONG);
D_GOTO(out_err_ret, rc = ENAMETOOLONG);
}

FREE(parent_dir);
Expand Down Expand Up @@ -3197,6 +3203,8 @@ readlink(const char *path, char *buf, size_t size)
if (rc)
goto out_release;
rc = dfs_release(obj);
if (rc == ENOMEM)
rc = dfs_release(obj);
if (rc)
goto out_err;
/* The NULL at the end should not be included in the length */
Expand All @@ -3209,8 +3217,8 @@ readlink(const char *path, char *buf, size_t size)

out_release:
rc2 = dfs_release(obj);
if (rc2)
D_ERROR("dfs_release() failed: %d (%s)\n", rc2, strerror(rc2));
if (rc2 == ENOMEM)
dfs_release(obj);

out_err:
FREE(parent_dir);
Expand Down Expand Up @@ -3627,8 +3635,8 @@ rename(const char *old_name, const char *new_name)
D_FREE(buff);
errno_save = rc;
rc = dfs_release(obj_old);
if (rc)
D_ERROR("dfs_release() failed: %d (%s)\n", rc, strerror(rc));
if (rc == ENOMEM)
dfs_release(obj_old);
errno = errno_save;
return (-1);

Expand All @@ -3639,8 +3647,8 @@ rename(const char *old_name, const char *new_name)
fclose(fIn);
errno_save = rc;
rc = dfs_release(obj_new);
if (rc)
D_ERROR("dfs_release() failed: %d (%s)\n", rc, strerror(rc));
if (rc == ENOMEM)
dfs_release(obj_new);
errno = errno_save;
return (-1);

Expand Down Expand Up @@ -5498,7 +5506,7 @@ finalize_dfs(void)
continue;
}
rc = daos_pool_disconnect(dfs_list[i].poh, NULL);
if (rc != 0) {
if (rc != -DER_SUCCESS) {
D_ERROR("error in daos_pool_disconnect(%s): " DF_RC "\n",
dfs_list[i].fs_root, DP_RC(rc));
continue;
Expand Down
2 changes: 2 additions & 0 deletions src/object/cli_obj.c
Original file line number Diff line number Diff line change
Expand Up @@ -1414,6 +1414,8 @@ dc_obj_open(tse_task_t *task)

args = dc_task_get_args(task);
D_ASSERTF(args != NULL, "Task Argument OPC does not match DC OPC\n");
/* initialize with an invalid handle */
(*args->oh).cookie = 0;
Comment on lines +1417 to +1418
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better. I still can't see where this was being used undefined, can you point it out to me?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the calling stack when the handle is accessed in dc_obj_close().

#0 d_hhash_link_lookup (hhash=0x55b881a7e4d0, key=) at src/gurt/hash.c:1269
#1 0x00007fa691f77923 in daos_hhash_link_lookup (key=) at src/common/misc.c:604
#2 0x00007fa6922dc6e0 in obj_hdl2ptr (oh=...) at src/object/cli_obj.c:171
#3 0x00007fa6922f06bc in dc_obj_close (task=0x55b881a7b940) at src/object/cli_obj.c:1496
#4 0x00007fa691fbcedc in tse_task_schedule_with_delay (task=0x55b881a7b940, instant=instant@entry=true, delay=delay@entry=0) at src/common/tse.c:1056
#5 0x00007fa691fbd2e9 in tse_task_schedule (task=, instant=instant@entry=true) at src/common/tse.c:1070
#6 0x00007fa692393342 in open_handle_cb (task=0x55b882040be0, data=) at src/client/array/dc_array.c:675
#7 0x00007fa691fb6745 in tse_task_complete_callback (task=task@entry=0x55b882040be0) at src/common/tse.c:520
#8 0x00007fa691fb7583 in tse_task_post_process (task=task@entry=0x55b8820536f0) at src/common/tse.c:675
#9 0x00007fa691fbae62 in tse_sched_process_complete (dsp=dsp@entry=0x7fa6925fb350 <daos_sched_g+16>) at src/common/tse.c:727
#10 0x00007fa691fbc5ff in tse_task_complete (task=task@entry=0x55b8820536f0, ret=ret@entry=-1009) at src/common/tse.c:885
#11 0x00007fa6922f11fd in dc_obj_open (task=0x55b8820536f0) at src/object/cli_obj.c:1467
#12 0x00007fa691fbcedc in tse_task_schedule_with_delay (task=0x55b8820536f0, instant=instant@entry=true, delay=delay@entry=0) at src/common/tse.c:1056
#13 0x00007fa691fbd2e9 in tse_task_schedule (task=, instant=instant@entry=true) at src/common/tse.c:1070
#14 0x00007fa69239471b in dc_array_open (task=0x55b882040be0) at src/client/array/dc_array.c:768
#15 0x00007fa691fbcedc in tse_task_schedule_with_delay (task=task@entry=0x55b882040be0, instant=instant@entry=true, delay=delay@entry=0) at src/common/tse.c:1056
#16 0x00007fa691fbd2e9 in tse_task_schedule (task=task@entry=0x55b882040be0, instant=instant@entry=true) at src/common/tse.c:1070
#17 0x00007fa6922547ac in dc_task_schedule (task=0x55b882040be0, instant=instant@entry=true) at src/client/api/task.c:114
#18 0x00007fa6922369c2 in daos_array_open_with_attr (coh=..., oid=..., th=th@entry=..., mode=mode@entry=2, cell_size=, cell_size@entry=1, chunk_size=, oh=0x7ffe3f702c18, ev=0x0) at src/client/api/array.c:95
#19 0x00007fa691d1972d in entry_stat (dfs=dfs@entry=0x55b882033df0, th=th@entry=..., oh=..., name=name@entry=0x7ffe3f702d70 "file.4", len=, obj=obj@entry=0x0, get_size=, stbuf=, obj_hlc=)
at src/client/dfs/dfs.c:1069

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks to be at https://github.com/daos-stack/daos/blob/master/src/client/array/dc_array.c#L675 which is in error handling code, do you know what allocation failed to get to that point? Perhaps https://github.com/daos-stack/daos/blob/master/src/client/array/dc_array.c#L618 should just return instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allocation target of fault injection is,
https://github.com/daos-stack/daos/blob/master/src/object/cli_obj.c#L143

#1 0x00007f85ce6d8234 in d_should_fail (fault_attr=0x55585b488230) at src/gurt/fault_inject.c:792
#2 0x00007f85d080ff11 in obj_alloc () at src/object/cli_obj.c:143
#3 0x00007f85d081bb52 in dc_obj_open (task=0x55585ba71bd0) at src/object/cli_obj.c:1420
#4 0x00007f85d04eb3f4 in tse_task_schedule_with_delay (task=0x55585ba71bd0, instant=true, delay=0)
at src/common/tse.c:1056
#5 0x00007f85d04eb445 in tse_task_schedule (task=0x55585ba71bd0, instant=true) at src/common/tse.c:1070
#6 0x00007f85d08ecd5e in dc_array_open (task=0x55585ba5ef90) at src/client/array/dc_array.c:768
#7 0x00007f85d04eb3f4 in tse_task_schedule_with_delay (task=0x55585ba5ef90, instant=true, delay=0)
at src/common/tse.c:1056
#8 0x00007f85d04eb445 in tse_task_schedule (task=0x55585ba5ef90, instant=true) at src/common/tse.c:1070
#9 0x00007f85d07855c3 in dc_task_schedule (task=0x55585ba5ef90, instant=true) at src/client/api/task.c:114
#10 0x00007f85d0764ed1 in daos_array_open_with_attr (coh=..., oid=..., th=..., mode=2, cell_size=1, chunk_size=1048576,
oh=0x7ffdf170ba60, ev=0x0) at src/client/api/array.c:95
#11 0x00007f85d0242de2 in entry_stat (dfs=0x55585ba52000, th=..., oh=..., name=0x7ffdf170bbe0 "file.4", len=6, obj=0x0,
get_size=true, stbuf=0x7ffdf170bd40, obj_hlc=0x0) at src/client/dfs/dfs.c:1066
#12 0x00007f85d025f5b1 in dfs_stat (dfs=0x55585ba52000, parent=0x55585ba71a80, name=0x7ffdf170bbe0 "file.4",
stbuf=0x7ffdf170bd40) at src/client/dfs/dfs.c:4918
#13 0x00007f85d1580dfa in new_lxstat (ver=1, path=0x55585b4996e0 "/tmp/pil4dfs_mountsx0ils0g/new_dir/file.4",
stat_buf=0x7ffdf170bd40) at src/client/dfuse/pil4dfs/int_dfs.c:2242
#14 0x00007f85d1581485 in statx (dirfd=-100, path=0x7ffdf170bf10 "new_dir/file.4", flags=256, mask=606,
statx_buf=0x7ffdf170bdf0) at src/client/dfuse/pil4dfs/int_dfs.c:2357
#15 0x000055585a03183b in do_statx ()
#16 0x000055585a0356f5 in gobble_file.constprop ()
#17 0x000055585a0368af in print_dir ()
#18 0x000055585a03019d in main ()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/daos-stack/daos/blob/master/src/client/array/dc_array.c#L661

open_handle_cb() is triggered to call dc_obj_close(). In case of uninitialized "oh", the content is non-zero. "daos_handle_is_valid(*args->oh)" returns true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a non-zero rc means the handle is invalid, then the error type is not relevant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or rather, there's been an error so the handles not valid so it doesn't need closing. What the error is is not a factor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I do not know how long time this PR can be landed, I created a separate ticket DAOS-14030 and PR #12691 to address this issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add it to this PR to see if it passes the test? Happy to land it separately but I'm like to see it tested here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. I will add it to this PR soon. It does help to avoid the issue encountered in close in my test on boro.


obj = obj_alloc();
if (obj == NULL)
Expand Down
3 changes: 2 additions & 1 deletion utils/node_local_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -5516,6 +5516,7 @@ def test_alloc_pil4dfs_ls(server, conf, wf):
test_cmd.use_pil4dfs(container)
test_cmd.check_daos_stderr = False
test_cmd.check_post_stdout = False
test_cmd.ignore_busy = True

return test_cmd.launch()

Expand Down Expand Up @@ -5877,7 +5878,7 @@ def run(wf, args):
fatal_errors.add_result(test_alloc_cont_create(server, conf, wf_client))

# Disabled for now because of errors
# fatal_errors.add_result(test_alloc_pil4dfs_ls(server, conf, wf_client))
fatal_errors.add_result(test_alloc_pil4dfs_ls(server, conf, wf_client))

wf_client.close()

Expand Down