[rmw_connextdds] New RMW discovery options #108

Merged
27 commits merged into ros2:rolling on Apr 8, 2023

Conversation

@mxgrey (Contributor) commented Mar 10, 2023

This PR updates rmw_connextdds to be compatible with the upcoming discovery options feature for RMW. (Note that the final agreed-upon requirements matrix is the one in this comment.)

This PR depends on these upstream PRs:

With this PR, the requirements matrix for the new discovery options is almost fully satisfied. There is just one aspect that is not working correctly, and I would appreciate some help with it from Connext DDS experts: it seems that Connext DDS participants are not accepting unknown initial peers even though accept_unknown_peers is being set to true. As a result, the Same Host matrix is 100% passing, while the Other Host matrix looks like this:

  • 🆗: Required behavior is passing
  • 🔴: Required behavior is failing
Different Hosts:

| Node A setting | B: no static peer, Off | B: no static peer, Localhost | B: no static peer, Subnet | B: with static peer, Off | B: with static peer, Localhost | B: with static peer, Subnet |
| --- | --- | --- | --- | --- | --- | --- |
| No static peer, Off | 🆗 | 🆗 | 🆗 | 🆗 | 🆗 | 🆗 |
| No static peer, Localhost | 🆗 | 🆗 | 🆗 | 🆗 | 🔴 | 🔴 |
| No static peer, Subnet | 🆗 | 🆗 | 🆗 | 🆗 | 🔴 | 🆗 |
| With static peer, Off | 🆗 | 🆗 | 🆗 | 🆗 | 🆗 | 🆗 |
| With static peer, Localhost | 🆗 | 🔴 | 🔴 | 🆗 | 🆗 | 🆗 |
| With static peer, Subnet | 🆗 | 🔴 | 🆗 | 🆗 | 🆗 | 🆗 |

The gist of the current problem is:

  • when two endpoints are on different hosts
  • and one or both endpoints are in LOCALHOST mode
  • and neither endpoint is in OFF mode
  • and only one endpoint declares the other as a static peer

-> then the required behavior is for them to discover each other and connect, but they currently do not. Every other scenario is working correctly.

From my reading of the Connext DDS documentation, I believe the current implementation is supposed to work since accept_unknown_peers is set to true, but perhaps there is a detail that I'm overlooking.
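For context, here is a minimal sketch of where that flag lives in the Connext C API. This is illustrative only, not the code from this PR, and the helper name is hypothetical:

```cpp
#include "ndds/ndds_c.h"

// Hypothetical helper, for illustration only: accept announcements from
// participants that are not listed in DDS_DiscoveryQosPolicy::initial_peers.
static void enable_accept_unknown_peers(DDS_DomainParticipantQos * const qos)
{
  qos->discovery.accept_unknown_peers = DDS_BOOLEAN_TRUE;
}
```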

One observation from watching packets with Wireshark: when there's a LOCALHOST + static peer <=> LOCALHOST + no static peer connection, the endpoint that has a static peer sends DATA(p) participant information packets, while the endpoint without a static peer never sends DATA(p) packets (even though it does send various ACKNACK and HEARTBEAT packets). I'm happy to provide the Wireshark data if that would be helpful.

mxgrey added 4 commits on March 9, 2023, all signed off by Michael X. Grey.
@asorbini (Collaborator) left a comment

Thank you for this contribution, and apologies for the late reply. Things look good, except for a few minor changes (mostly "coding style" and missing error checking).

I'll review and suggest a solution to the problems with the failing cases of the matrix chart next.

@asorbini (Collaborator)
> The gist of the current problem is:
>
> * when two endpoints are on different hosts
> * and one or both endpoints are in `LOCALHOST` mode
> * and neither endpoint is in `OFF` mode
> * and only one endpoint declares the other as a static peer
>
> -> then the required behavior is for them to discover each other and connect, but they currently do not. Every other scenario is working correctly.
>
> From my reading of the Connext DDS documentation, I believe the current implementation is supposed to work since accept_unknown_peers is set to true, but perhaps there is a detail that I'm overlooking.

I've been reviewing the requirements, your implementation, and thinking about the problem. At first I thought I had spotted an issue, and started thinking about alternative implementations, but after going down that rabbit hole for a little bit, I realized I was wrong.

Your solution is the correct way to prevent the default multicast discovery from occurring while still allowing unicast communication to occur, and I have verified it with a simple hello world for Connext 6.0.1.

discovery_options_tests.tar.gz (I have only modified ShapeType_publisher.c and ShapeType_subscriber.c to customize dds.transport.UDPv4.builtin.parent.allow_multicast_interfaces_list).

I tried running the publisher on one host and the subscriber on a different one connected via LAN: by default, the processes don't discover each other, but if one declares the other's IP address via NDDS_DISCOVERY_PEERS, then communication occurs as expected.

I will continue my testing by building the branch and observing the rmw's behavior.

@sloretz changed the title from "New RMW discovery options" to "[rmw_connextdds] New RMW discovery options" on Mar 29, 2023
@asorbini (Collaborator)
@mxgrey @clalancette I have identified the issue and implemented a solution here (the commit should apply directly on top of your branch).

Basically, the problem stems from the interaction between variable NDDS_DISCOVERY_PEERS and Connext's multicast receive address, explained in this section of the Connext manual. Quoting the comment I added in the code:

> Connext looks at variable NDDS_DISCOVERY_PEERS to determine whether it should add a multicast locator to the set of locators used for discovery. If this variable is empty, or if it contains at least one multicast address, a multicast locator is used for discovery (the default 239.255.0.1 group when the variable is empty, the first multicast locator found in the list when one or more multicast addresses are present).
>
> Because of this, we must make sure that NDDS_DISCOVERY_PEERS contains some value to prevent this default behavior, otherwise the participant will be announcing the default multicast group, which in turn will cause ROS_STATIC_PEERS not to work correctly.

My proposed solution is to detect whether the variable is empty when the LOCALHOST discovery policy is used and, if so, set it to some value (127.0.0.1 seems like a safe value) so that Connext will not automatically add the multicast address. The variable must be set before the DomainParticipantFactory is initialized, which forced me to move that initialization from rmw_api_connextdds_init to rmw_context_t::initialize_node.
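A hedged sketch of that approach (not the exact commit; the helper name and the error handling are assumptions):

```cpp
#include "rcutils/env.h"
#include "rmw/ret_types.h"

// Hypothetical helper, for illustration only: make sure NDDS_DISCOVERY_PEERS is
// non-empty before the DomainParticipantFactory is created, so that Connext
// does not add the default 239.255.0.1 multicast locator for discovery.
static rmw_ret_t ensure_nonempty_discovery_peers()
{
  const char * peers = nullptr;
  const char * error = rcutils_get_env("NDDS_DISCOVERY_PEERS", &peers);
  if (error != nullptr) {
    return RMW_RET_ERROR;
  }
  if (peers == nullptr || peers[0] == '\0') {
    // 127.0.0.1 is used here only as a safe, non-multicast placeholder value.
    if (!rcutils_set_env("NDDS_DISCOVERY_PEERS", "127.0.0.1")) {
      return RMW_RET_ERROR;
    }
  }
  return RMW_RET_OK;
}
```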

Let me know if this fixes the failing test cases in the matrix, and I'll keep an eye out for updates on this PR to have it merged.

@wjwwood (Member) commented Mar 30, 2023

I cherry-picked your commit, @asorbini. I'm still working on some of my own changes; I'll ping you again when they're ready. Thanks!

@wjwwood (Member) commented Mar 30, 2023

OK, I think I'm finished addressing feedback. @asorbini, can you have another look when you have time?

@wjwwood requested a review from asorbini on March 30, 2023 at 20:53
@asorbini (Collaborator) left a comment

Looks good, I think it can be merged with a green run of CI.

I only had a question about whether we should accept invalid range enums, which I'd like to address, and two other minor things with DDS_String_alloc (which could stay as is, but it would be great to clean up).

@asorbini (Collaborator)

@wjwwood I realized that I was under the wrong impression that only NDDS_DISCOVERY_PEERS affects whether a multicast receive address is announced, and that just setting initial_peers would not achieve the same effect.

After a bit more thinking, I believe they are equivalent, and I'd like to try reimplementing the feature with that strategy.

The reason for this is that it would be a cleaner implementation that could potentially allow two DomainParticipants to be created with different discovery options. This is not a problem at the moment, because the options are process-/ROS 2 context-/DDS factory-wide, but the change would likely make the implementation more future-proof.

It would also remove the new dependency on rcutils_set_env, which isn't much of a problem per se (since we already depend on rcutils_get_env), but "fewer dependencies"/"the same number of dependencies" is always a better scenario than "one more dependency" in my book.

The change would require moving the participant factory initialization back to rmw_connextdds_init, checking whether ctx->initial_peers is empty after asserting the static peers, and, if so, filling it with "127.0.0.1" (when the policy is LOCALHOST).
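For illustration, a rough sketch of that alternative (helper name and exact types are assumptions, and, as noted further down in the thread, this approach ultimately turned out to be insufficient on its own):

```cpp
#include "ndds/ndds_c.h"
#include "rmw/discovery_options.h"
#include "rmw/ret_types.h"

// Hypothetical helper, for illustration only: seed the initial peers list with
// 127.0.0.1 when the LOCALHOST range is active and no peers were asserted, so
// that Connext would not fall back to its default multicast discovery locator.
static rmw_ret_t seed_localhost_initial_peer(
  struct DDS_StringSeq * const initial_peers,
  const rmw_automatic_discovery_range_t range)
{
  if (range != RMW_AUTOMATIC_DISCOVERY_RANGE_LOCALHOST ||
    DDS_StringSeq_get_length(initial_peers) > 0)
  {
    return RMW_RET_OK;
  }
  if (!DDS_StringSeq_ensure_length(initial_peers, 1, 1)) {
    return RMW_RET_ERROR;
  }
  // DDS_String_dup allocates a copy that the sequence takes ownership of.
  *DDS_StringSeq_get_reference(initial_peers, 0) = DDS_String_dup("127.0.0.1");
  return RMW_RET_OK;
}
```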

I can prepare the code and link a commit to cherry-pick, if that's ok with you. I should have it ready by the end of the day, and we can test/merge the PR tomorrow.

I'll start preparing it, while I wait for your input.

@wjwwood (Member) commented Mar 31, 2023

> After a bit more thinking, I believe they are equivalent, and I'd like to try reimplementing the feature with that strategy.

That is fine with me.

> It would also remove the new dependency on rcutils_set_env, which isn't much of a problem per se (since we already depend on rcutils_get_env), but "fewer dependencies"/"the same number of dependencies" is always a better scenario than "one more dependency" in my book.

Also fine with me.

> I can prepare the code and link a commit to cherry-pick, if that's ok with you. I should have it ready by the end of the day, and we can test/merge the PR tomorrow.

Sounds good to me, thank you for pushing on it so quickly! I'll wait to re-test the matrix of use cases until you have this.

@asorbini (Collaborator) commented Mar 31, 2023

@wjwwood After further testing, it turns out I was right the first time around: we must set NDDS_DISCOVERY_PEERS for the multicast receive address not to be used. Setting only initial_peers is not enough.

So I left the implementation as is, but I took the opportunity to consolidate all the changes into rmw_context.cpp so that they can be shared by rmw_connextddsmicro too (and we can think about deprecating RMW_CONNEXT_INITIAL_PEERS in favor of ROS_STATIC_PEERS in the near future).

I committed the changes in this commit.

While I was at it, I also addressed those two comments I had about DDS_String_alloc already accounting for the nul terminator, and I actually got rid of one of those two calls by remembering that DDS_String_dup exists (so we don't need DDS_String_alloc + memcpy).
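For illustration, the kind of simplification described here (a sketch, not the exact diff from the commit):

```cpp
#include "ndds/ndds_c.h"

// Sketch only: copy a peer string into Connext-managed memory.
static char * copy_peer_string(const char * const value)
{
  // Before: DDS_String_alloc(n) already reserves n + 1 bytes (it accounts for
  // the nul terminator), so the old pattern was an alloc followed by a memcpy:
  //   char * copy = DDS_String_alloc(strlen(value));
  //   memcpy(copy, value, strlen(value) + 1);

  // After: DDS_String_dup allocates and copies in a single call.
  return DDS_String_dup(value);
}
```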

I also ran the linter/uncrustify and fixed a couple of issues. Should be good for testing.

@asorbini (Collaborator)

I was just told that it might be possible to prevent the multicast receive behavior by resetting DomainParticipantQos::discovery::multicast_receive_addresses directly.

It might be a better idea, for reasons similar to those I explained earlier (it works at the participant level instead of the factory level, doesn't mess with env vars, etc.).

I apologize for the back and forth, but I'm going to take a quick look into implementing this. While the current implementation works, it has the drawback that it prevents a unit test from initializing two contexts with two different policies.
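For illustration, a minimal sketch of that idea, assuming multicast_receive_addresses is the string sequence inside DDS_DiscoveryQosPolicy (the helper name is hypothetical; this is not the actual commit):

```cpp
#include "ndds/ndds_c.h"

// Hypothetical helper, for illustration only: clear the participant's multicast
// receive addresses so that no multicast discovery locator is announced,
// without touching NDDS_DISCOVERY_PEERS or any factory-level state.
static DDS_Boolean clear_multicast_receive_addresses(DDS_DomainParticipantQos * const qos)
{
  return DDS_StringSeq_set_length(&qos->discovery.multicast_receive_addresses, 0);
}
```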

@asorbini (Collaborator)

I applied and tested the change, and it does work as expected.

@wjwwood can you apply these two commits on top of your branch and run CI/test matrix?

  • b156654 (the one I created yesterday, which refactors the logic to a separate function)
  • f5df932 (the one from this morning, which resets multicast_receive_addresses)

The commits are the two most recent ones from this branch.

@wjwwood (Member) commented Mar 31, 2023

I cherry-picked those and updated it to not accept NOT_SET explicitly. I also added a doc note about this, and we're bringing fastrtps and cyclone up to that too: ros2/rmw@2dbd6ef

Signed-off-by: William Woodall <[email protected]>
@asorbini (Collaborator) left a comment

Just a minor touch-up; otherwise I think we're looking good.

@wjwwood (Member) commented Mar 31, 2023

I'm currently debugging an issue with the PR in its current state. It's failing to create a node when using RMW_IMPLEMENTATION=rmw_connextdds ROS_AUTOMATIC_DISCOVERY_RANGE=LOCALHOST ROS_STATIC_PEERS=127.0.0.1.

@asorbini (Collaborator)

> I'm currently debugging an issue with the PR in its current state. It's failing to create a node when using RMW_IMPLEMENTATION=rmw_connextdds ROS_AUTOMATIC_DISCOVERY_RANGE=LOCALHOST ROS_STATIC_PEERS=127.0.0.1.

I'm not able to reproduce on my side. I'm rebuilding the tree after pulling all repositories to see if that makes a difference (looks like I was missing a few of the most recent commits).

Signed-off-by: William Woodall <[email protected]>
@wjwwood (Member) commented Mar 31, 2023

We figured out the issue and fixed it in 4bafd73. Thanks for your help @asorbini!

Signed-off-by: William Woodall <[email protected]>
@asorbini (Collaborator) left a comment

Good to go with green CI

"[email protected]://127.0.0.1",
"[email protected]://",
};
const auto rc2 = rmw_connextdds_extend_initial_peer_list(
Collaborator:

These peers should be added to the list only if it has not already been customized by the user via the environment variable and/or XML QoS; otherwise they might shadow custom values the user configured.

Collaborator:

We could also leave it as is, but we should document the behavior in the README so users are aware that these peers are always added when the LOCALHOST policy is used.

I would prefer the "smarter" approach because, so far, the rmw does not have any "hard-coded" QoS setting that cannot be somehow disabled and controlled via XML QoS by users. I'd like these peers not to disrupt that trend, either by having the rmw not assert them if the peers have already been customized, or by letting users explicitly disable them if they need to (e.g. through a new env var... but we have many already, and I'd rather not add another one).

Contributor:

That's a good point. The current strategy for enabling users to control these settings is to set the range to RMW_AUTOMATIC_DISCOVERY_RANGE_SYSTEM_DEFAULT. It looks like maybe line 242 of this PR should be else if (ctx->discovery_options->automatic_discovery_range != ..._SYSTEM_DEFAULT) so that we don't try to extend the peer list in that case.
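For illustration, the suggested guard could be factored as a small predicate (a sketch under the assumption that ctx->discovery_options is an rmw_discovery_options_t; not the actual code in the PR):

```cpp
#include "rmw/discovery_options.h"

// Hypothetical predicate, for illustration only: the initial peers list should
// only be extended when the user has not opted back into the vendor defaults
// via the SYSTEM_DEFAULT discovery range.
static bool should_extend_initial_peer_list(const rmw_discovery_options_t * const options)
{
  return options->automatic_discovery_range !=
         RMW_AUTOMATIC_DISCOVERY_RANGE_SYSTEM_DEFAULT;
}
```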

Contributor:

Skipped modifying initial peers list when automatic discovery range is SYSTEM_DEFAULT in c501a76

@sloretz merged commit 022cc46 into ros2:rolling on Apr 8, 2023