DAOS-14261 engine: Add dss_chore for I/O forwarding #13372
LGTM. No errors found by checkpatch.
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/1/execution/node/381/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/1/execution/node/368/log
Test stage Build RPM on Leap 15.4 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/1/execution/node/356/log
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/1/execution/node/450/log
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/1/execution/node/315/log
Test stage Build on Leap 15.4 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/1/execution/node/543/log
LGTM. No errors found by checkpatch.
LGTM. No errors found by checkpatch.
Test stage NLT on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/3/execution/node/839/log
LGTM. No errors found by checkpatch.
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/4/execution/node/1377/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/4/execution/node/1331/log
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/4/execution/node/1519/log
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/4/execution/node/1473/log
LGTM. No errors found by checkpatch.
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/5/execution/node/1331/log
a4658c2 to 3325b2a
LGTM. No errors found by checkpatch.
LGTM. No errors found by checkpatch.
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/9/execution/node/352/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/9/execution/node/448/log
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/9/execution/node/564/log
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13372/14/testReport/
@@ -773,6 +787,8 @@ dss_start_one_xstream(hwloc_cpuset_t cpus, int tag, int xs_id)
    } else {
        dx->dx_main_xs = (xs_id >= dss_sys_xs_nr) && (xs_offset == 0);
    }
    /* See the DSS_XS_IOFW case in sched_ult2xs. */
    dx->dx_iofw = xs_id >= dss_sys_xs_nr && (!dx->dx_main_xs || dss_tgt_offload_xs_nr == 0);
Minor: it seems you don't regard the VOS xstream as an iofw xstream, so this could be simplified to "dx->dx_iofw = xs_id >= dss_sys_xs_nr && !dx->dx_main_xs".
@NiuYawei, I thought about the same when updating the patch last time. Ended up keeping it this way because if we make such a change to dx_iofw, then a dx_iofw xstream != a DSS_XS_IOFW xstream, potentially making our terminology even more complex (i.e., we currently talk about "I/O forward xstream", "offload xstream", "helper xstream", "DSS_XS_OFFLOAD xstream", etc., which is already too confusing).
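For readers following this thread, the two candidate expressions differ in exactly one case. The sketch below is a standalone illustration, not the engine code: the function names are hypothetical, and the parameters merely stand in for xs_id, dss_sys_xs_nr, dx->dx_main_xs, and dss_tgt_offload_xs_nr from the diff.

```c
#include <stdbool.h>

/*
 * Expression as written in the patch. All parameters are hypothetical
 * stand-ins for the engine variables named in the diff.
 */
bool
iofw_as_in_patch(int xs_id, int sys_xs_nr, bool main_xs, int offload_xs_nr)
{
	return xs_id >= sys_xs_nr && (!main_xs || offload_xs_nr == 0);
}

/* Simplification suggested in the review comment. */
bool
iofw_simplified(int xs_id, int sys_xs_nr, bool main_xs)
{
	return xs_id >= sys_xs_nr && !main_xs;
}

/* True when the two expressions give different answers for one input. */
bool
iofw_variants_differ(int xs_id, int sys_xs_nr, bool main_xs, int offload_xs_nr)
{
	return iofw_as_in_patch(xs_id, sys_xs_nr, main_xs, offload_xs_nr) !=
	       iofw_simplified(xs_id, sys_xs_nr, main_xs);
}
```

Under these assumptions the variants disagree only for a main xstream with xs_id >= dss_sys_xs_nr when dss_tgt_offload_xs_nr is 0; the patch flags that xstream as dx_iofw, while the simplified form would not.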
@Nasf-Fan, thanks for the replies.
D_ASSERT(chore->cho_status != DSS_CHORE_NEW);
if (chore->cho_status == DSS_CHORE_DONE)
    d_list_del_init(&chore->cho_link);
ABT_thread_yield();
No, list is not cleared before the splice. Have I overlooked anything?
    }
    rc = dss_chore_delegate(&dca->dca_chore, dtx_rpc_helper);
} else {
    dss_chore_diy(&dca->dca_chore, dtx_rpc_helper);
I prefer not to merge diy with delegate, if that's acceptable to you.
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13372/15/testReport/
dtx_common.c and dtx_rpc.c have changed a lot recently, so this patch needs some merge work.
Required-githooks: true
Signed-off-by: Li Wei <[email protected]> Required-githooks: true
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/16/execution/node/1387/log
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13372/16/execution/node/1574/log
else
    rc = dtx_rpc_internal(dca);
if (dss_has_enough_helper()) {
    rc = ABT_eventual_create(0, &dca->dca_chore_eventual);
Why does the non-coll case need the eventual, but the coll case does not?
Is it possible to not create the dca_chore_eventual and instead depend on dra->dra_future to check whether the related operation has completed? Then the non-coll case and the coll case could share the same mechanism or logic.
> Why does the non-coll case need the eventual, but the coll case does not?
@Nasf-Fan, the dca_chore_eventual is for dtx_rpc_post to know when it's safe to access dra_future.
> Is it possible to not create the dca_chore_eventual and instead depend on dra->dra_future to check whether the related operation has completed? Then the non-coll case and the coll case could share the same mechanism or logic.
AFAICS, it's possible, though not easy. The main trouble is that the number of slots in the future is computed by the chore, so we can't create the future until the chore has run to the point where that number is computed. Also, even if we really want to make such a change in this PR, it will take more changes to share the mechanism with dtx_coll_rpc_helper, since the two code paths use different data structures.
So, what do you think?
Then ignore my former comment and go ahead.
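The ordering constraint discussed in this thread can be modeled in a few lines. This is a deliberately simplified, single-threaded sketch with hypothetical names throughout (fwd_args, fwd_chore_step, fwd_post_poll); a boolean plays the role of dca_chore_eventual, and an integer pair plays the role of dra_future. Real code would presumably block on the eventual (for example with ABT_eventual_wait) rather than poll.

```c
#include <stdbool.h>

/* Hypothetical model of the ordering constraint; not the DAOS structures. */
struct fwd_args {
	bool chore_ready;  /* plays the role of dca_chore_eventual */
	bool future_valid; /* true once the "future" has been created */
	int  future_slots; /* slot count, known only after the chore starts */
	int  slots_filled; /* completed forwarded requests */
	int  nr_targets;   /* input the chore uses to compute the slot count */
};

/* Run one step of the chore. Returns true when the chore has finished. */
bool
fwd_chore_step(struct fwd_args *args)
{
	if (!args->future_valid) {
		/* First step: only now is the number of slots known. */
		args->future_slots = args->nr_targets;
		args->slots_filled = 0;
		args->future_valid = true;
		/* "Set the eventual": the poster may now look at the future. */
		args->chore_ready = true;
		return args->future_slots == 0;
	}
	/* Later steps: in this model, one forwarded request completes per step. */
	if (args->slots_filled < args->future_slots)
		args->slots_filled++;
	return args->slots_filled == args->future_slots;
}

/*
 * Poster side. Returns -1 if the future must not be touched yet,
 * 0 if the future exists but is incomplete, and 1 if all slots are filled.
 */
int
fwd_post_poll(const struct fwd_args *args)
{
	if (!args->chore_ready)
		return -1;
	return args->slots_filled == args->future_slots ? 1 : 0;
}
```

The point of the model is the first branch of fwd_post_poll: until the chore has computed the slot count and created the future, the poster has nothing valid to inspect, which is why a separate signal is needed in the non-coll path.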
Features: tx Required-githooks: true
Bug-tracker data: Ticket title is 'limit ULT creation on helper xstream'
As requested by the Jira ticket, add a new I/O forwarding mechanism, dss_chore, to avoid creating a ULT for every forwarding task.
- Forwarding of object I/O and DTX RPCs is converted to chores.
- Cancelation is not implemented, because the I/O forwarding tasks themselves do not support cancelation yet.
- In certain engine configurations, some xstreams do not need to initialize dx_chore_queue. This is left to future work.
Required-githooks: true
Skipped-githooks: clang
Change-Id: I8d6f9889f5562a8bc3683d26cb830672a8aa40f3
Signed-off-by: Li Wei <[email protected]>
As requested by the Jira ticket, add a new I/O forwarding mechanism,
dss_chore, to avoid creating a ULT for every forwarding task.
- Forwarding of object I/O and DTX RPCs is converted to chores.
- Cancelation is not implemented, because the I/O forwarding tasks
  themselves do not support cancelation yet.
- In certain engine configurations, some xstreams do not need to
  initialize dx_chore_queue. This is left to future work.
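The idea in the description can be sketched as a queue of resumable chores serviced by one long-lived worker, instead of one ULT per forwarding task. The following is a simplified, single-threaded model under stated assumptions: every name in it (chore_queue_pass, forward_one, and so on) is hypothetical, the queue is a plain singly linked list rather than a d_list, and the place where a real worker would yield to other ULTs is only marked by a comment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the DAOS definitions. */
enum chore_status { CHORE_NEW, CHORE_YIELD, CHORE_DONE };

struct chore {
	struct chore     *next;
	enum chore_status status;
	enum chore_status (*func)(struct chore *chore);
	int               work_left; /* example payload: remaining work units */
};

struct chore_queue {
	struct chore *head;
};

void
chore_queue_init(struct chore_queue *queue)
{
	queue->head = NULL;
}

/* Append a chore to the tail of the queue. */
void
chore_enqueue(struct chore_queue *queue, struct chore *chore)
{
	struct chore **pp = &queue->head;

	while (*pp != NULL)
		pp = &(*pp)->next;
	chore->next = NULL;
	chore->status = CHORE_NEW;
	*pp = chore;
}

/*
 * One pass of the single worker: advance every queued chore by one step,
 * unlink the ones that report CHORE_DONE, and return how many remain.
 */
size_t
chore_queue_pass(struct chore_queue *queue)
{
	struct chore **pp = &queue->head;
	size_t         remaining = 0;

	while (*pp != NULL) {
		struct chore *chore = *pp;

		chore->status = chore->func(chore);
		assert(chore->status != CHORE_NEW);
		if (chore->status == CHORE_DONE) {
			*pp = chore->next; /* unlink the finished chore */
			chore->next = NULL;
		} else {
			pp = &chore->next;
			remaining++;
		}
		/* A real worker would yield to other ULTs here. */
	}
	return remaining;
}

/* Example chore body: performs one unit of work per call. */
enum chore_status
forward_one(struct chore *chore)
{
	if (chore->work_left > 0)
		chore->work_left--;
	return chore->work_left == 0 ? CHORE_DONE : CHORE_YIELD;
}
```

A caller would enqueue chores with chore_enqueue and let the worker call chore_queue_pass repeatedly until it returns 0. The design trade-off in this model is that each chore must be written as a state machine that can return early and be re-entered, in exchange for not paying per-task ULT creation.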
Before requesting gatekeeper:
- Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: