Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCP/WIREUP: Support EP reconfiguration for non-reused p2p scenarios #9949

Merged
merged 1 commit into from
Sep 12, 2024

Conversation

shasson5
Copy link
Contributor

@shasson5 shasson5 commented Jun 11, 2024

What

Support partial EP reconfiguration: only non-reused, fully wired-up scenarios. (Lanes must not be reused and must all be p2p).

Why ?

PR 2 for EP reconfiguration.

How ?

Add a new function for replacing wireup lane by moving internal AUX EP to the new lane.
The following steps are performed:

  1. Set uct_ep to be used for new lane:
    a) for CM flow -> use CM EP.
    b) otherwise -> create a new wireup_ep.
  2. Move aux ep from old config to new wireup lane.
  3. Delete old wireup lane.
  4. Select lane to be used as new wireup lane:
    a) for CM flow -> use CM lane.
    b) otherwise -> select a random lane. (Will be modified in next PRs to look for a non-reused lane).

@shasson5 shasson5 added the WIP-DNM Work in progress / Do not review label Jun 11, 2024
@shasson5 shasson5 changed the title UCP/WIREUP: Support EP reconfiguration for non-reused wired-up scenarios. UCP/WIREUP: Support EP reconfiguration for non-reused wired-up scenarios Jun 11, 2024
@shasson5 shasson5 changed the title UCP/WIREUP: Support EP reconfiguration for non-reused wired-up scenarios UCP/WIREUP: Support EP reconfiguration for non-reused p2p scenarios Jun 11, 2024
@shasson5 shasson5 changed the title UCP/WIREUP: Support EP reconfiguration for non-reused p2p scenarios UCP/WIREUP: Support EP reconfiguration for non-reused wireup scenarios Jun 16, 2024
@shasson5 shasson5 changed the title UCP/WIREUP: Support EP reconfiguration for non-reused wireup scenarios UCP/WIREUP: Support EP reconfiguration for non-reused p2p scenarios Jun 16, 2024
@shasson5 shasson5 force-pushed the no_reuse branch 4 times, most recently from 986a5c8 to 3ccadc6 Compare July 10, 2024 11:03
@shasson5 shasson5 removed the WIP-DNM Work in progress / Do not review label Jul 10, 2024
@shasson5 shasson5 requested review from yosefe and gleon99 July 10, 2024 11:47
@shasson5
Copy link
Contributor Author

@yosefe @gleon99 please review

@gleon99
Copy link
Contributor

gleon99 commented Jul 28, 2024

@shasson5 please see clang-format output.

@shasson5 shasson5 closed this Jul 29, 2024
@shasson5 shasson5 reopened this Jul 29, 2024
test/gtest/ucp/test_ucp_reconfigure.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_reconfigure.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_reconfigure.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_reconfigure.cc Outdated Show resolved Hide resolved
test/gtest/ucp/ucp_test.h Show resolved Hide resolved
test/gtest/ucp/test_ucp_reconfigure.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_reconfigure.cc Outdated Show resolved Hide resolved
src/ucp/wireup/wireup.c Show resolved Hide resolved
src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
src/ucp/wireup/wireup_ep.c Show resolved Hide resolved
@shasson5 shasson5 requested a review from gleon99 August 1, 2024 14:25
@shasson5
Copy link
Contributor Author

shasson5 commented Aug 1, 2024

/azp run

Copy link

Azure Pipelines successfully started running 5 pipeline(s).

@shasson5 shasson5 requested review from gleon99 and removed request for gleon99 August 4, 2024 17:02
ucs_assert(reuse_lane_map[ucp_ep_get_cm_lane(ep)] != UCP_NULL_LANE);
/* wireup lane hasn't been selected by the new configuration: only this
* function should select it */
ucs_assert(new_key->wireup_msg_lane == UCP_NULL_LANE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ucs_assert(new_key->wireup_msg_lane == UCP_NULL_LANE);
ucs_assertv(new_key->wireup_msg_lane == UCP_NULL_LANE, "new_key->wireup_msg_lane=%u",
new_key->wireup_msg_lane);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's code from master.
Was marked as "new added" code for some wrong reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is indeed from master. Marked as diff because of indentation (if block).
IMO the suggestions makes sense.


/* Select CM/non-reused lane as new wireup lane */
new_lane = ucp_wireup_find_non_reused_lane(ep, key);
new_uct_eps[new_lane] = &new_ep->super.super;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can it be later overwritten in ucp_wireup_check_config_intersect? Is it possible to initialize new_uct_eps in one place? (would be less error-prone)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may complicate the flow, because you need to return [new_lane,new_ep] from this function and then add extra "if" block in ucp_wireup_check_config_intersect in order to properly initialize new_uct_eps.

And anyway the value of new_uct_eps will not be overwritten, because in ucp_wireup_check_config_intersect we only handle "reused" lanes.

@shasson5 shasson5 requested a review from brminich August 6, 2024 13:35
@shasson5
Copy link
Contributor Author

shasson5 commented Sep 1, 2024

/azp run

Copy link

Azure Pipelines successfully started running 5 pipeline(s).

src/ucp/wireup/wireup.c Show resolved Hide resolved
src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
src/ucp/wireup/wireup.c Show resolved Hide resolved
src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_reconfigure.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_reconfigure.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_reconfigure.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_reconfigure.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_reconfigure.cc Outdated Show resolved Hide resolved
src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_ep_reconfigure.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_ep_reconfigure.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_ep_reconfigure.cc Outdated Show resolved Hide resolved
@gleon99
Copy link
Contributor

gleon99 commented Sep 8, 2024

@yosefe please look, as discussed

@@ -151,6 +151,7 @@ gtest_SOURCES = \
ucp/test_ucp_mem_type.cc \
ucp/test_ucp_perf.cc \
ucp/test_ucp_proto.cc \
ucp/test_ucp_ep_reconfigure.cc \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rename to test_ucp_ep_reconfig.cc?

src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
@@ -1501,7 +1677,8 @@ ucs_status_t ucp_wireup_init_lanes(ucp_ep_h ep, unsigned ep_init_flags,
* without CM to prevent reconfiguration error.
*/
if ((ep->cfg_index != UCP_WORKER_CFG_INDEX_NULL) &&
!ucp_ep_has_cm_lane(ep)) {
!ucp_wireup_can_reconfigure(ep, ep_init_flags, &tl_bitmap,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better avoid calling init_lanes another time just for checking if we need to limit tl_bitmap,
can we do this hack (restrict tl_bitmap) only if we are going to print a fatal error that reconfig not supported?

src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
msg_ep = ucp_wireup_ep_extract_msg_ep(old_ep, &pending_queue);
ucp_wireup_ep_set_aux(new_ep, msg_ep, aux_rsc_index, is_p2p);

/* Add wireup messages to pending queue */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ucp_wireup_ep_extract_msg_ep is currently NOT returning wireup message requests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed


/* Get old wireup lane */
old_lane = ucp_wireup_get_msg_lane(ep, UCP_WIREUP_MSG_REQUEST);
old_ep = ucp_wireup_ep(ucp_ep_get_lane(ep, old_lane));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

old_wireup_ep?

ucp_ep_set_lane(ep, old_lane, NULL);

/* Select CM/non-reused lane as new wireup lane */
new_lane = ucp_wireup_find_non_reused_lane(ep, key,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new_wireup_lane?

if (uct_ep != NULL) {
uct_ep_destroy(uct_ep);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove spaceline

uct_ep_h msg_ep = ucp_wireup_ep_get_msg_ep(wireup_ep);

uct_ep_pending_purge(&wireup_ep->super.super,
ucp_wireup_ep_append_pending_cb, pending_queue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems ucp_wireup_ep_pending_purge (called from here) already calls ucp_wireup_ep_pending_req_release which decrements pending_count by 1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced

@shasson5 shasson5 requested a review from yosefe September 9, 2024 13:02
/* Get all reachable MDs from full remote address list and join with
* current ep configuration
*/
key->dst_md_cmpts = dst_md_storage;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: move it after line 1609

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean move it to be before 1609?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either before or after key->dst_version = remote_address->dst_version;

src/ucp/wireup/wireup.c Show resolved Hide resolved
Comment on lines 101 to 102
ucp_wireup_ep_pending_purge(&wireup_ep->super.super,
ucs_empty_function_do_assert_void, NULL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we can just assert that wireup_ep->pending_q length is 0. and call uct_ep_pending_purge. then no need to move the functions ucp_wireup_ep_pending_purge, ucp_wireup_ep_pending_req_release

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok but notice it requires duplication of ucs_assert(wireup_ep->pending_count == 0); as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why need to check pending_count==0? ucp_wireup_ep_pending_purge would release them anyway so it doesn't have to be empty IMO

src/ucp/wireup/wireup_ep.c Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_ep_reconfig.cc Show resolved Hide resolved
test/gtest/ucp/test_ucp_ep_reconfig.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_ep_reconfig.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_ep_reconfig.cc Show resolved Hide resolved
Comment on lines 202 to 203
/* Verify local and remote lanes are identical */
EXPECT_TRUE(has_matching_lane(ep(), lane, other));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need this? isn't it enough to test that communication send/recv is working?
same for checking num_lanes

Copy link
Contributor Author

@shasson5 shasson5 Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is it possible to check communication on all lanes without using multiple protocols/message sizes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we can do basic test like active message or tag match on several message sizes and extend it
since communication ability is something we need to test anyway to ensure configuration is working

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe better as a complement?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO in the general case - when not all transports are p2p - we do not need to require that number of lanes is the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so removed the num_lanes check

@shasson5 shasson5 requested a review from yosefe September 11, 2024 12:52
src/ucp/wireup/wireup_ep.c Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_ep_reconfig.cc Show resolved Hide resolved
test/gtest/ucp/test_ucp_ep_reconfig.cc Outdated Show resolved Hide resolved
Comment on lines 202 to 203
/* Verify local and remote lanes are identical */
EXPECT_TRUE(has_matching_lane(ep(), lane, other));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO in the general case - when not all transports are p2p - we do not need to require that number of lanes is the same.

@shasson5 shasson5 requested a review from yosefe September 11, 2024 14:52
@shasson5
Copy link
Contributor Author

@yosefe please approve

Copy link
Contributor

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls squash

@yosefe yosefe enabled auto-merge September 11, 2024 16:19
@yosefe yosefe merged commit 5b7fb19 into openucx:master Sep 12, 2024
143 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants