Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-13548 client: enable one NLT fault injection test for libpil4dfs #12664

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

wiliamhuang
Copy link
Contributor

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

@github-actions
Copy link

github-actions bot commented Jul 18, 2023

Bug-tracker data:
Ticket title is 'Add new fault injection test for pil4dfs'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-13548

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@wiliamhuang wiliamhuang marked this pull request as ready for review July 19, 2023 03:58
@wiliamhuang wiliamhuang requested a review from a team as a code owner July 19, 2023 03:58
Copy link
Contributor

@ashleypittman ashleypittman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try adding test_cmd.ignore_busy to the test_alloc_pil4dfs_ls function to silence the shutdown errors.

For the dfs_release() calls I think maybe we should try another approach, for this PR for now can you keep moving forward and fixing the issues but perhaps when it comes to landing we'll just take the fixes, leave the test disabled and remove the dfs_release retries.

Comment on lines 1066 to 1068
/* Added to avoid issues in daos_hhash_link_lookup due to uninitialized
* file_oh in NLT fault injection test */
memset(&file_oh, 0, sizeof(daos_handle_t));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several issues here:

  1. Testing is to find bugs, when you fix one you're not "making the test pass", you're fixing a bug so you should reference the bug, not the test in any comment.
  2. This comment includes technical information on a different part of the code-base, that might be appropriate for a commit message but it doesn't help future readers of this function understand the code in front of them.
  3. Many daos functions, daos_array_open_with_attr included return a rc value and pass out the result via a arg, functions like this should always set the arg oh on the successful case and not use it on the unsuccessful case. They should not expect the arg to be zeroed before being called.
  4. If you did want to set file_oh to zeros then the correct way would be to change the declaration to be daos_handle_t file_oh = {0} or daos_handle_t file_oh = {}, I think the compiler would expect both.
  5. Further to the 3. above daos_array_open_with_attr defines of as a param of type out so it's value at input should not matter.

What I think is happening here is that there's likely a bug in the error handling of array_open() which I think is exactly the type of bug this test is looking for.

Can you look at this again please and revert this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Initialize oh.cookie with an invalid handle in dc_obj_open() now.

rc = dfs_release(hdl->oh);
if (rc)
D_ERROR("dfs_release() failed: %d (%s)\n", rc, strerror(rc));
dfs_release(hdl->oh);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a problem with the code here before?

It's always a good idea to review logging and see if it's appropriate so if you're removing logging as you think it improves the end result that's fine. The presence of logging itself however will not cause failures in NLT, as long as the messages are well-formed which this one looked to be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log here does not cause any failure. I will put it back. Thank you!

Comment on lines +625 to +627
/* Added to avoid issues due to fault injection. Try again in case of ENOMEM */
if (dfs_list[num_dfs].fs_root == NULL)
D_STRNDUP(dfs_list[num_dfs].fs_root, fs_root, len_fs_root);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change also isn't right. The fault injection test in NLT doesn't check that a use-case or function works, but rather than if it doesn't then there's no undesired behaviour. There's no need to put in hacks for this in order to make the test complete.

What happens if you simply return the error here, as the code did before? As long as there's no crash and log lines are well-formed then that's enough to pass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, returning error "ENOMEM" from discover_daos_mount_with_env() leads to calling abort(). NLT fault injection will fail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The correct thing for pil4dfs to do in case of startup errors (of which this is an example) is to fail with an abort() and a helpful error message. If you need to then update NLT to detect that error message and not check the exit code in that case. We typically still try and resolve all memory leaks however.

If you take the approach I've described then it means the product behaves in a defined way, and the testing is verifying that. Adding this kind of hack is making the product worst to appease a test which is the wrong approach.

Comment on lines 687 to 688
if (pt_dfs_mt->fs_root == NULL)
D_STRNDUP(pt_dfs_mt->fs_root, fs_entry->mnt_dir, pt_dfs_mt->len_fs_root);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above, although in this case the error handling in the function will also be called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the retry of D_STRNDUP() here.

Comment on lines 5503 to 5504
if (rc == -DER_NOMEM)
rc = daos_cont_close(dfs_list[i].coh, NULL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't desirable. Since initially writing this test I've added logic to cover this case so that if DER_BUSY is logged then leak checking is disabled. If we enable the new flag for this test then what would happen here is that dfs_umount() would fail with -DER_NOMEM, then daos_cont_close() and daos_pool_disconnect() would fail with -DER_BUSY and NLT would not run the leak checker in this case so the test would pass without this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed as suggested. Thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashleypittman I added "test_cmd.ignore_busy = True". It looks like memory leak issues are reported if we do not retry "rc = daos_cont_close(dfs_list[i].coh, NULL);" and "rc = daos_pool_disconnect(dfs_list[i].poh, NULL);". Is there anything else should be changed? Thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashleypittman @mchaarawi
rc = daos_cont_close(dfs_list[i].coh, NULL);
if (rc != -DER_SUCCESS && rc != -DER_NONEXIST)
rc = daos_cont_close(dfs_list[i].coh, NULL);

	rc = daos_pool_disconnect(dfs_list[i].poh, NULL);
	if (rc != -DER_SUCCESS && rc != -DER_NONEXIST)
		rc = daos_pool_disconnect(dfs_list[i].poh, NULL);

Shall we retry daos_cont_close() and daos_pool_disconnect when they fail? Is there resource leak on server side without retry on client side?

The fault injections causes client-side memory leak are in tse_task_create(),
https://github.com/daos-stack/daos/blob/master/src/common/tse.c#L961

Calling stacks,
daos_cont_close() failed
DBG> 0, /scratch/leihuan1/tickets/13548/daos/install/lib64/libgurt.so.4(print_trace+0x67) [0x7f84b1ea0886]
DBG> 1, /scratch/leihuan1/tickets/13548/daos/install/lib64/libgurt.so.4(d_should_fail+0x574) [0x7f84b1ea0ee7]
DBG> 2, /scratch/leihuan1/tickets/13548/daos/install/lib64/libdaos_common.so(tse_task_create+0x5a) [0x7f84b3cb389e]
DBG> 3, /scratch/leihuan1/tickets/13548/daos/install/lib64/libdaos.so.2(dc_task_create+0x80) [0x7f84b3f4e4cd]
DBG> 4, /scratch/leihuan1/tickets/13548/daos/install/lib64/libdaos.so.2(daos_cont_close+0x1dd) [0x7f84b3f2f9df]
DBG> 5, /scratch/leihuan1/tickets/13548/daos/install/lib64/libpil4dfs.so(+0x250ba) [0x7f84b4d5f0ba]
DBG> 6, /scratch/leihuan1/tickets/13548/daos/install/lib64/libpil4dfs.so(+0x23992) [0x7f84b4d5d992]
DBG> 7, /lib64/ld-linux-x86-64.so.2(+0x151e6) [0x7f84b4f8e1e6]
DBG> 8, /lib64/libc.so.6(+0x5123c) [0x7f84b459423c]
DBG> 9, /lib64/libc.so.6(on_exit+0) [0x7f84b4594370]
DBG> 10, /scratch/leihuan1/tickets/13548/daos/install/lib64/libpil4dfs.so(+0x22015) [0x7f84b4d5c015]
DBG> 11, /lib64/libc.so.6(__libc_start_main+0xfa) [0x7f84b457dcfa]
DBG> 12, ls(+0x5e7e) [0x561f33b92e7e]

daos_pool_disconnect() failed
DBG> 0, /scratch/leihuan1/tickets/13548/daos/install/lib64/libgurt.so.4(print_trace+0x67) [0x7f315e085886]
DBG> 1, /scratch/leihuan1/tickets/13548/daos/install/lib64/libgurt.so.4(d_should_fail+0x574) [0x7f315e085ee7]
DBG> 2, /scratch/leihuan1/tickets/13548/daos/install/lib64/libdaos_common.so(tse_task_create+0x5a) [0x7f315fe9889e]
DBG> 3, /scratch/leihuan1/tickets/13548/daos/install/lib64/libdaos.so.2(dc_task_create+0x80) [0x7f31601334cd]
DBG> 4, /scratch/leihuan1/tickets/13548/daos/install/lib64/libdaos.so.2(daos_pool_disconnect+0x1dd) [0x7f316012f3ae]
DBG> 5, /scratch/leihuan1/tickets/13548/daos/install/lib64/libpil4dfs.so(+0x25269) [0x7f3160f44269]
DBG> 6, /scratch/leihuan1/tickets/13548/daos/install/lib64/libpil4dfs.so(+0x23992) [0x7f3160f42992]
DBG> 7, /lib64/ld-linux-x86-64.so.2(+0x151e6) [0x7f31611731e6]
DBG> 8, /lib64/libc.so.6(+0x5123c) [0x7f316077923c]
DBG> 9, /lib64/libc.so.6(on_exit+0) [0x7f3160779370]
DBG> 10, /scratch/leihuan1/tickets/13548/daos/install/lib64/libpil4dfs.so(+0x22015) [0x7f3160f41015]
DBG> 11, /lib64/libc.so.6(__libc_start_main+0xfa) [0x7f3160762cfa]
DBG> 12, ls(+0x5e7e) [0x55b1934ace7e]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashleypittman @mchaarawi RPC is needed in daos_cont_close() and daos_pool_disconnect(), right? How we should proceed here? It looks retries exist in several other places. Shall I also adopt retry? Thank you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i cannot do a similar "fix" that i did for object close since as you said those are not local operations and require a tse task to be attached to the operation / RPC.
I also do not think we should add more retries / hacks like this and should work on removing them. I'm not sure what can be done here other than not do fault injection on this path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mchaarawi I see. Thank you very much for your comments!

Comment on lines 5511 to 5513
if (rc == -DER_NOMEM)
rc = daos_pool_disconnect(dfs_list[i].poh, NULL);
if (rc != -DER_SUCCESS) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed as suggested. Thank you!

Comment on lines +1084 to +1087
/* Try again in case of out of memory. Needed to avoid memory leak in
* NLT fault injection */
if (rc == -DER_NOMEM)
rc = daos_array_close(file_oh, NULL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This again is an example of a bug in dfs that ideally we'd fix. I'm not keen on this style of hack but there is a bug here that we should look at.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/daos-stack/daos/blob/master/src/client/api/array.c#L118-L120
When the fault injection causes dc_task_create() returns -DER_NOMEM, dc_task_schedule() is not called. The leaked memory was allocated by "D_ALLOC_PTR(array);" at line 91 in array_alloc() in dc_array.c,
https://github.com/daos-stack/daos/blob/master/src/client/array/dc_array.c#L91

Shall we retry "dc_task_create(dc_array_close, NULL, ev, &task);" in case it returns -DER_NOMEM inside daos_array_close()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we retry "dc_task_create(dc_array_close, NULL, ev, &task);" in case it returns -DER_NOMEM inside daos_array_close(), we need to retry "dc_task_schedule(task, true)" too. How do you want to proceed? Thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retrying "rc = daos_array_close(file_oh, NULL);" may be reasonable to me. It is similar to retrying dfs_release(). What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

such hacks are really getting out of hand in the code. let's not add more please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. What should we do about the caused memory leak due to fault injection?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

such hacks are really getting out of hand in the code. let's not add more please.

I agree, however this is a bug and it's impossible to address it at a higher level than dfs as the context has been lost by that point. Outside the scope of this PR however.

initialization with an invalid handle in dc_obj_open()
add "test_cmd.ignore_busy = True" in "test_alloc_pil4dfs_ls"
remove retry daos_cont_close() and daos_pool_disconnect()

Required-githooks: true

Signed-off-by: Lei Huang <[email protected]>
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

Comment on lines +1084 to +1087
/* Try again in case of out of memory. Needed to avoid memory leak in
* NLT fault injection */
if (rc == -DER_NOMEM)
rc = daos_array_close(file_oh, NULL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

such hacks are really getting out of hand in the code. let's not add more please.

I agree, however this is a bug and it's impossible to address it at a higher level than dfs as the context has been lost by that point. Outside the scope of this PR however.

Comment on lines +625 to +627
/* Added to avoid issues due to fault injection. Try again in case of ENOMEM */
if (dfs_list[num_dfs].fs_root == NULL)
D_STRNDUP(dfs_list[num_dfs].fs_root, fs_root, len_fs_root);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The correct thing for pil4dfs to do in case of startup errors (of which this is an example) is to fail with an abort() and a helpful error message. If you need to then update NLT to detect that error message and not check the exit code in that case. We typically still try and resolve all memory leaks however.

If you take the approach I've described then it means the product behaves in a defined way, and the testing is verifying that. Adding this kind of hack is making the product worst to appease a test which is the wrong approach.

Comment on lines +1417 to +1418
/* initialize with an invalid handle */
(*args->oh).cookie = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better. I still can't see where this was being used undefined, can you point it out to me?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the calling stack when the handle is accessed in dc_obj_close().

#0 d_hhash_link_lookup (hhash=0x55b881a7e4d0, key=) at src/gurt/hash.c:1269
#1 0x00007fa691f77923 in daos_hhash_link_lookup (key=) at src/common/misc.c:604
#2 0x00007fa6922dc6e0 in obj_hdl2ptr (oh=...) at src/object/cli_obj.c:171
#3 0x00007fa6922f06bc in dc_obj_close (task=0x55b881a7b940) at src/object/cli_obj.c:1496
#4 0x00007fa691fbcedc in tse_task_schedule_with_delay (task=0x55b881a7b940, instant=instant@entry=true, delay=delay@entry=0) at src/common/tse.c:1056
#5 0x00007fa691fbd2e9 in tse_task_schedule (task=, instant=instant@entry=true) at src/common/tse.c:1070
#6 0x00007fa692393342 in open_handle_cb (task=0x55b882040be0, data=) at src/client/array/dc_array.c:675
#7 0x00007fa691fb6745 in tse_task_complete_callback (task=task@entry=0x55b882040be0) at src/common/tse.c:520
#8 0x00007fa691fb7583 in tse_task_post_process (task=task@entry=0x55b8820536f0) at src/common/tse.c:675
#9 0x00007fa691fbae62 in tse_sched_process_complete (dsp=dsp@entry=0x7fa6925fb350 <daos_sched_g+16>) at src/common/tse.c:727
#10 0x00007fa691fbc5ff in tse_task_complete (task=task@entry=0x55b8820536f0, ret=ret@entry=-1009) at src/common/tse.c:885
#11 0x00007fa6922f11fd in dc_obj_open (task=0x55b8820536f0) at src/object/cli_obj.c:1467
#12 0x00007fa691fbcedc in tse_task_schedule_with_delay (task=0x55b8820536f0, instant=instant@entry=true, delay=delay@entry=0) at src/common/tse.c:1056
#13 0x00007fa691fbd2e9 in tse_task_schedule (task=, instant=instant@entry=true) at src/common/tse.c:1070
#14 0x00007fa69239471b in dc_array_open (task=0x55b882040be0) at src/client/array/dc_array.c:768
#15 0x00007fa691fbcedc in tse_task_schedule_with_delay (task=task@entry=0x55b882040be0, instant=instant@entry=true, delay=delay@entry=0) at src/common/tse.c:1056
#16 0x00007fa691fbd2e9 in tse_task_schedule (task=task@entry=0x55b882040be0, instant=instant@entry=true) at src/common/tse.c:1070
#17 0x00007fa6922547ac in dc_task_schedule (task=0x55b882040be0, instant=instant@entry=true) at src/client/api/task.c:114
#18 0x00007fa6922369c2 in daos_array_open_with_attr (coh=..., oid=..., th=th@entry=..., mode=mode@entry=2, cell_size=, cell_size@entry=1, chunk_size=, oh=0x7ffe3f702c18, ev=0x0) at src/client/api/array.c:95
#19 0x00007fa691d1972d in entry_stat (dfs=dfs@entry=0x55b882033df0, th=th@entry=..., oh=..., name=name@entry=0x7ffe3f702d70 "file.4", len=, obj=obj@entry=0x0, get_size=, stbuf=, obj_hlc=)
at src/client/dfs/dfs.c:1069

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks to be at https://github.com/daos-stack/daos/blob/master/src/client/array/dc_array.c#L675 which is in error handling code, do you know what allocation failed to get to that point? Perhaps https://github.com/daos-stack/daos/blob/master/src/client/array/dc_array.c#L618 should just return instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allocation target of fault injection is,
https://github.com/daos-stack/daos/blob/master/src/object/cli_obj.c#L143

#1 0x00007f85ce6d8234 in d_should_fail (fault_attr=0x55585b488230) at src/gurt/fault_inject.c:792
#2 0x00007f85d080ff11 in obj_alloc () at src/object/cli_obj.c:143
#3 0x00007f85d081bb52 in dc_obj_open (task=0x55585ba71bd0) at src/object/cli_obj.c:1420
#4 0x00007f85d04eb3f4 in tse_task_schedule_with_delay (task=0x55585ba71bd0, instant=true, delay=0)
at src/common/tse.c:1056
#5 0x00007f85d04eb445 in tse_task_schedule (task=0x55585ba71bd0, instant=true) at src/common/tse.c:1070
#6 0x00007f85d08ecd5e in dc_array_open (task=0x55585ba5ef90) at src/client/array/dc_array.c:768
#7 0x00007f85d04eb3f4 in tse_task_schedule_with_delay (task=0x55585ba5ef90, instant=true, delay=0)
at src/common/tse.c:1056
#8 0x00007f85d04eb445 in tse_task_schedule (task=0x55585ba5ef90, instant=true) at src/common/tse.c:1070
#9 0x00007f85d07855c3 in dc_task_schedule (task=0x55585ba5ef90, instant=true) at src/client/api/task.c:114
#10 0x00007f85d0764ed1 in daos_array_open_with_attr (coh=..., oid=..., th=..., mode=2, cell_size=1, chunk_size=1048576,
oh=0x7ffdf170ba60, ev=0x0) at src/client/api/array.c:95
#11 0x00007f85d0242de2 in entry_stat (dfs=0x55585ba52000, th=..., oh=..., name=0x7ffdf170bbe0 "file.4", len=6, obj=0x0,
get_size=true, stbuf=0x7ffdf170bd40, obj_hlc=0x0) at src/client/dfs/dfs.c:1066
#12 0x00007f85d025f5b1 in dfs_stat (dfs=0x55585ba52000, parent=0x55585ba71a80, name=0x7ffdf170bbe0 "file.4",
stbuf=0x7ffdf170bd40) at src/client/dfs/dfs.c:4918
#13 0x00007f85d1580dfa in new_lxstat (ver=1, path=0x55585b4996e0 "/tmp/pil4dfs_mountsx0ils0g/new_dir/file.4",
stat_buf=0x7ffdf170bd40) at src/client/dfuse/pil4dfs/int_dfs.c:2242
#14 0x00007f85d1581485 in statx (dirfd=-100, path=0x7ffdf170bf10 "new_dir/file.4", flags=256, mask=606,
statx_buf=0x7ffdf170bdf0) at src/client/dfuse/pil4dfs/int_dfs.c:2357
#15 0x000055585a03183b in do_statx ()
#16 0x000055585a0356f5 in gobble_file.constprop ()
#17 0x000055585a0368af in print_dir ()
#18 0x000055585a03019d in main ()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/daos-stack/daos/blob/master/src/client/array/dc_array.c#L661

open_handle_cb() is triggered to call dc_obj_close(). In case of uninitialized "oh", the content is non-zero. "daos_handle_is_valid(*args->oh)" returns true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a non-zero rc means the handle is invalid, then the error type is not relevant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or rather, there's been an error so the handles not valid so it doesn't need closing. What the error is is not a factor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I do not know how long time this PR can be landed, I created a separate ticket DAOS-14030 and PR #12691 to address this issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add it to this PR to see if it passes the test? Happy to land it separately but I'm like to see it tested here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. I will add it to this PR soon. It does help to avoid the issue encountered in close in my test on boro.

@mchaarawi mchaarawi removed the request for review from a team July 28, 2023 17:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants