Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-10877 vos: gang allocation for huge SV #14790

Merged
merged 3 commits into from
Sep 9, 2024
Merged

DAOS-10877 vos: gang allocation for huge SV #14790

merged 3 commits into from
Sep 9, 2024

Conversation

NiuYawei
Copy link
Contributor

@NiuYawei NiuYawei commented Jul 19, 2024

To avoid allocation failure on a fragmented system, huge SV allocation will
be split into multiple smaller allocations, each allocation size is capped
to 8MB (the DMA chunk size, that could avoid huge DMA buffer allocation).

The address of such scattered SV payload is represented by 'gang address'.

Removed io_allocbuf_failure() vos unit test, it's not applicable in gang
SV mode now.

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Copy link

github-actions bot commented Jul 19, 2024

Ticket title is 'Large KV support'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-10877

@daosbuild1
Copy link
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14790/1/execution/node/571/log

@daosbuild1
Copy link
Collaborator

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14790/2/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14790/2/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14790/2/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14790/2/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14790/3/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14790/3/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14790/3/execution/node/1221/log

To avoid allocation failure on a fragmented system, huge SV allocation will
be split into multiple smaller allocations, each allocation size is capped
to 8MB (the DMA chunk size, that could avoid huge DMA buffer allocation).

The address of such scattered SV payload is represented by 'gang address'.

Removed io_allocbuf_failure() vos unit test, it's not applicable in gang
SV mode now.

Required-githooks: true

Signed-off-by: Niu Yawei <[email protected]>
@NiuYawei NiuYawei marked this pull request as ready for review July 23, 2024 06:39
@NiuYawei NiuYawei requested review from a team as code owners July 23, 2024 06:39
jolivier23
jolivier23 previously approved these changes Aug 14, 2024
Copy link
Contributor

@jolivier23 jolivier23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still wonder if we'd be better off with evtree for SV

@NiuYawei
Copy link
Contributor Author

I still wonder if we'd be better off with evtree for SV

It looks to me that evtree might be too heavy for such scattered SV. Gang address uses 13 bytes (64 bits offset, 32 bits length, 8 bits type) to represent one segment of SV, however, evtree requires at least 64 bytes (evt_node_entry + evt_desc) for each segment. Using evtree will also introduce unnecessary complexities for SV updating & aggregation. (current evtree & aggregation algorithm can't handle the SV overwriting with a smaller value size, no matter if the two update epochs are same or not).

The gang address isn't introduced for scattered SV only, it's a generic feature and could be used for other purposes in the future, for example, to reduce internal fragmentation caused by unaligned I/O size, we could make an extent consists of multiple segments on different media. (Let's assume we support QLC and the block size of QLC is 64k, then a 65k update could be split into 64k on QLC and 1k on TLC).

@NiuYawei NiuYawei requested a review from Nasf-Fan August 19, 2024 06:40
@NiuYawei
Copy link
Contributor Author

merge master & fix conflict.

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14790/6/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14790/6/execution/node/1551/log

if (gang_nr > UINT8_MAX) {
D_ERROR("Too large SV:"DF_U64", gang_nr:%u\n",
iod->iod_size, gang_nr);
return -DER_REC2BIG;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we cannot support SV if its size is larger than 8MB * 2^8 = 2GB ? That seems not reasonable. Can we force to disable gang_addr (and try to allocate continuous large space) for the SV under such case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this number (2GB) is far more larger than the restrictions of current implementation. Without this gang SV, we can't even support 200MB sized SV.

I think we shouldn't try to allocate 2GB and hope it won't fail, furthermore, there is only by default 1GB DMA buffer per-xstream, it can't support DMA transfer for a 2GB SV.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how the real use case will be, such as for HDF5, hopes 2GB is large enough. Maybe we can add something in our document to describe related restriction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that we'd document all the DAOS API restrictions well, but I think that's not a trivial work and should be done in a separate PR.

D_ASSERT(size > 0);
alloc_sz = min(size, VOS_GANG_SIZE_THRESH);

rc = reserve_space(ioc, media, alloc_sz, &off);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Means that all the sub_addrs must have the same media type? Cannot be mixed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gang address itself supports mixed media type, but here VOS SV uses same media type for this moment. If we need to support mixed media SV in the future, we could simply change this function then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do not support mixed media types, then we do not need store the type in gang address to reduce metadata a bit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment on this PR, the gang address is a common feature which could be used for other purpose in the future.

@Nasf-Fan Nasf-Fan self-requested a review September 1, 2024 06:27
@NiuYawei
Copy link
Contributor Author

NiuYawei commented Sep 4, 2024

ping reviewers.


iod_set_sv_addr(ioc, umoff, &gaddr);

for (i = 0; i < gang_nr; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to consider to yield if gang_nr is large? same for the loop in iod_gang_fetch().
MAX gang_nr is 255 now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep it simple for this moment, if it turns out 255 reserve consumes too much CPU, we could add yield then.

@NiuYawei
Copy link
Contributor Author

NiuYawei commented Sep 5, 2024

@jolivier23 @Nasf-Fan could you take look? nothing was changed since your last review.

@NiuYawei NiuYawei merged commit b95ef01 into master Sep 9, 2024
54 of 55 checks passed
@NiuYawei NiuYawei deleted the niu/DAOS-10877 branch September 9, 2024 03:07
mchaarawi added a commit that referenced this pull request Sep 11, 2024
* DAOS-16484 test: Exclude local host in default interface selection (#15049)

When including the local host in the default interface selection a
difference in ib0 speeds will cause the logic to select eth0 and then
the tcp provider.

Signed-off-by: Phil Henderson <[email protected]>

* DAOS-15800 client: create cart context on specific interface (#14804)

Cart has added the ability to select network interface on context creation. The daos_agent also added a numa-fabric map that can be queried at init time. Update the DAOS client to query from the agent a map of numa to network interface on daos_init(), and on EQ creation, select the best interface for the network context based on the numa of the calling thread.

Signed-off-by: Mohamad Chaarawi <[email protected]>

* DAOS-16445 client: Add function to cycle OIDs non-sequentially (#14999)

We've noticed that with sequential order, object placement is poor.

We get 40% fill for 8GiB files with 25 ranks and 16 targets per rank
with EC_2P1G8. With this patch, we get a much better distribution.

This patch adds the following:

1. A function for cycling oid.hi incrementing by a large prime
2. For DFS, randomize the starting value
3. Modify DFS to cycle OIDs using the new function.

Signed-off-by: Jeff Olivier <[email protected]>

* DAOS-16251 dtx: Fix dtx_req_send user-after-free (#15035)

In dtx_req_send, since the crt_req_send releases the req reference, din
may have been freed when dereferenced for the DL_CDEBUG call.

Signed-off-by: Li Wei <[email protected]>

* DAOS-16304 tools: Add daos health net-test command (#14980)

Wrap self_test to provide a simplified network test
to detect obvious client/server connectivity and
performance problems.

Signed-off-by: Michael MacDonald <[email protected]>

* DAOS-16272 dfs: fix get_info returning incorrect oclass (#15048)

If user creates a container without --file-oclass, the get_info call was
returning the default oclass of a directory on daos fs get-attr. Fix
that to properly use the enum types for default scenario.

Signed-off-by: Mohamad Chaarawi <[email protected]>

* DAOS-15863 container: fix a race for container cache (#15038)

* DAOS-15863 container: fix a race for container cache

while destroying a container, cont_child_destroy_one() releases
its own refcount before waiting, if another ULT releases its
refcount, which is the last one, wakes up the waiting ULT and frees
it ds_cont_child straightaway, because no one else has refcount.

When the waiting ULT is waken up, it will try to change the already
freed ds_cont_child.

This patch changes the LRU eviction logic and fixes this race.


Signed-off-by: Liang Zhen <[email protected]>
Signed-off-by: Jeff Olivier <[email protected]>
Co-authored-by: Jeff Olivier <[email protected]>

* DAOS-16471 test: Reduce targets for ioctl_pool_handles.py (#15063)

The dfuse/ioctl_pool_handles.py test is overloading the VM so reduce the number of engine targets.

Signed-off-by: Phil Henderson <[email protected]>

* DAOS-16483 vos: handle empty DTX when vos_tx_end (#15053)

It is possible that the DTX modified nothing when stop currnet backend
transaction. Under such case, we may not generate persistent DTX entry.
Then need to bypass such case before checking on-disk DTX entry status.

The patch makes some clean and removed redundant metrics for committed
DTX entries.

Enhance vos_dtx_deregister_record() to handle GC case.

Signed-off-by: Fan Yong <[email protected]>

* DAOS-16271 mercury: Add patch to avoid seg fault in key resolve. (#15067)

Signed-off-by: Joseph Moore <[email protected]>

* DAOS-16484 test: Support mixed speeds when selecting a default interface (#15050)

Allow selecting a default interface that is running at a different speed
on different hosts.  Primarily this is to support selecting the ib0
interface by default when the launch node has a slower ib0 interface
than the cluster hosts.

Signed-off-by: Phil Henderson <[email protected]>

* DAOS-16446 test: HDF5-VOL test - Set object class and container prope… (#15004)

In HDF5, DFS, MPIIO, or POSIX, object class and container properties are defined
during the container create. If it’s DFS, object class is also set to the IOR
parameter. However, in HDF5-VOL, object class and container properties are
defined with the following environment variables of mpirun.

HDF5_DAOS_OBJ_CLASS (Object class)
HDF5_DAOS_FILE_PROP (Container properties)

The infrastructure to set these variables are already there in run_ior_with_pool().
In file_count_test_base.py, pass in the env vars to run_ior_with_pool(env=env) as a
dictionary. Object class is the oclass variable. Container properties can be
obtained from container -> properties field in the test yaml.

This fix is discussed in PR #14964.

Signed-off-by: Makito Kano <[email protected]>

* DAOS-16447 test: set D_IL_REPORT per test (#15012)

set D_IL_REPORT per test instead of setting defaults values in
utilities. This allows running without it set.

Signed-off-by: Dalton Bohning <[email protected]>

* DAOS-16450 test: auto run dfs tests when dfs is modified (#15017)

Automatically include dfs tests when dfs files are modified in PRs.

Signed-off-by: Dalton Bohning <[email protected]>

* DAOS-16510 cq: update pylint to 3.2.7 (#15072)

update pylint to 3.2.7

Signed-off-by: Dalton Bohning <[email protected]>

* DAOS-16509 test: replace IorTestBase.execute_cmd with run_remote (#15070)

replace usage of IorTestBase.execute_cmd with run_remote

Signed-off-by: Dalton Bohning <[email protected]>

* DAOS-16458 object: fix invalid DRAM access in obj_bulk_transfer (#15026)

For EC object update via CPD RPC, when calculate the bitmap to skip
some iods for current EC data shard, we may input NULL for "*skips"
parameter. It may cause the old logic in obj_get_iods_offs_by_oid()
to generate some undefined DRAM for "skips" bitmap. Such bitmap may
be over-written by others, as to subsequent obj_bulk_transfer() may
be misguided.

The patch also fixes a bug inside obj_bulk_transfer() that cast any
input RPC as UPDATE/FETCH by force.

Signed-off-by: Fan Yong <[email protected]>

* DAOS-16486 object: return proper error on stale pool map (#15064)

Client with stale pool map may try to send RPC to a DOWN target, if the
target was brought DOWN due to faulty NVMe device, the ds_pool_child could
have been stopped on the NVMe faulty reaction, We'd ensure proper error
code is returned for such case.

Signed-off-by: Niu Yawei <[email protected]>

* DAOS-16514 vos: fix coverity issue (#15083)

Fix coverity 2555843 explict null dereferenced.

Signed-off-by: Niu Yawei <[email protected]>

* DAOS-16467 rebuild: add DAOS_POOL_RF ENV for massive failure case (#15037)

* DAOS-16467 rebuild: add DAOS_PW_RF ENV for massive failure case

Allow user to set DAOS_PW_RF as pw_rf (pool wise RF).
If SWIM detected engine failure is going to break pw_rf, don't change
pool map, also don't trigger rebuild.
With critical log message to ask administrator to bring back those
engines in top priority (just "system start --ranks=xxx", need not to
reintegrate those engines).

a few functions renamed to avoid confuse -
pool_map_find_nodes() -> pool_map_find_ranks()
pool_map_find_node_by_rank() -> pool_map_find_dom_by_rank()
pool_map_node_nr() -> pool_map_rank_nr()

Signed-off-by: Xuezhao Liu <[email protected]>

* DAOS-16508 csum: retry a few times on checksum mismatch on update (#15069)

Unlike fetch, we return DER_CSUM on update (turned into EIO by dfs) without any retry.
We should retry a few times in case it is a transient error.

The patch also prints more information about the actual checksum mismatch.

Signed-off-by: Johann Lombardi <[email protected]>

* DAOS-10877 vos: gang allocation for huge SV (#14790)

To avoid allocation failure on a fragmented system, huge SV allocation will
be split into multiple smaller allocations, each allocation size is capped
to 8MB (the DMA chunk size, that could avoid huge DMA buffer allocation).

The address of such scattered SV payload is represented by 'gang address'.

Removed io_allocbuf_failure() vos unit test, it's not applicable in gang
SV mode now.

Signed-off-by: Niu Yawei <[email protected]>

* DAOS-16304 tools: Adjust default RPC size for net-test (#15091)

The previous default of 1MiB isn't helpful at large scales.
Use a default of 1KiB to get faster results and a better
balance between raw latency and bandwidth.

Also include calculated rpc throughput and bandwidth in
JSON output.

Signed-off-by: Michael MacDonald <[email protected]>

* SRE-2408 ci: Increase timeout (to 15 minutes) for system restore (#14926)

Signed-off-by: Tomasz Gromadzki <[email protected]>

* DAOS-16251 object: Fix obj_ec_singv_split overflow (#15045)

It has been seen that obj_ec_singv_split may read beyond the end of
sgl->sg_iovs[0].iov_buf:

    iod_size=8569
    c_bytes=4288
    id_shard=0
    tgt_off=1
    iov_len=8569
    iov_buf_len=8569

The memmove read 4288 bytes from offset 4288, whereas the buffer only
had 8569 - 4288 = 4281 bytes from offset 4288. This patch fixes the
problem by adding the min(...) expression.

Signed-off-by: Li Wei <[email protected]>

---------

Signed-off-by: Phil Henderson <[email protected]>
Signed-off-by: Mohamad Chaarawi <[email protected]>
Signed-off-by: Jeff Olivier <[email protected]>
Signed-off-by: Li Wei <[email protected]>
Signed-off-by: Michael MacDonald <[email protected]>
Signed-off-by: Liang Zhen <[email protected]>
Signed-off-by: Fan Yong <[email protected]>
Signed-off-by: Joseph Moore <[email protected]>
Signed-off-by: Makito Kano <[email protected]>
Signed-off-by: Dalton Bohning <[email protected]>
Signed-off-by: Niu Yawei <[email protected]>
Signed-off-by: Xuezhao Liu <[email protected]>
Signed-off-by: Johann Lombardi <[email protected]>
Signed-off-by: Tomasz Gromadzki <[email protected]>
Co-authored-by: Phil Henderson <[email protected]>
Co-authored-by: Jeff Olivier <[email protected]>
Co-authored-by: Li Wei <[email protected]>
Co-authored-by: Michael MacDonald <[email protected]>
Co-authored-by: Liang Zhen <[email protected]>
Co-authored-by: Nasf-Fan <[email protected]>
Co-authored-by: Joseph Moore <[email protected]>
Co-authored-by: Makito Kano <[email protected]>
Co-authored-by: Dalton Bohning <[email protected]>
Co-authored-by: Niu Yawei <[email protected]>
Co-authored-by: Liu Xuezhao <[email protected]>
Co-authored-by: Johann Lombardi <[email protected]>
Co-authored-by: Tomasz Gromadzki <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants