Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

minmdns - perform exponental backoff retry for discovering nodes #9900

Merged
merged 13 commits into from
Sep 28, 2021

Conversation

andy31415
Copy link
Contributor

Problem

A single mDNS query may be insufficient to discover nodes:

  • device may not yet be online (e.g. right after provisioning)
  • packets may be dropped

Change overview

Adds retries for resolution with exponential backoff suppport.

Testing

  • manually confirmed that chip-tool pairing works in my environment (did not before this patch)
  • added unit tests.

mspang added a commit to mspang/connectedhomeip that referenced this pull request Sep 24, 2021
As of 97d9f7e ("Address 9154 and CM/AC DNS-SD convergence (project-chip#9348)")
operational discovery during chip-tool's commissioning flow typically
fails with minimal mDNS. This occurs because this change unintentionally
broke the logic to send unsolicited mDNS announcements as required by
RFC 6762.

During commissioning, the initial mDNS query to resolve the operational
service on the network is generally sent before the device is able to
join the network. The is perfectly fine because RFC 6762 accounts for
this case and requires devices joining the network to advertise all of
their records. These unsolicited responses provide the IP address and
allow the final commissioning steps to execute.

The minimal mDNS implementation sends its unsolicited announcements in
Start() and not at any other time. In the above change, we started
calling StopPublishDevice() immediately prior to (re-)starting the
server, leaving no records to announce. The records are added
immediately afterwards, but currently adding records does not trigger
additional announcements. Thus, a response never arrives and
commissioning times out.

Fix this by adding an announcement after the records are added.
Unfortunately this will quite spammy since we announce everything and
advertise is called several times in a row. What should happen is
announcements only happen with new or updated records, or when new
interfaces are enabled.

(Note that this problem is greatly exacerbated by the lack of
retransmissions of queries, which is being addressed in
project-chip#9900).
bzbarsky-apple pushed a commit that referenced this pull request Sep 24, 2021
As of 97d9f7e ("Address 9154 and CM/AC DNS-SD convergence (#9348)")
operational discovery during chip-tool's commissioning flow typically
fails with minimal mDNS. This occurs because this change unintentionally
broke the logic to send unsolicited mDNS announcements as required by
RFC 6762.

During commissioning, the initial mDNS query to resolve the operational
service on the network is generally sent before the device is able to
join the network. The is perfectly fine because RFC 6762 accounts for
this case and requires devices joining the network to advertise all of
their records. These unsolicited responses provide the IP address and
allow the final commissioning steps to execute.

The minimal mDNS implementation sends its unsolicited announcements in
Start() and not at any other time. In the above change, we started
calling StopPublishDevice() immediately prior to (re-)starting the
server, leaving no records to announce. The records are added
immediately afterwards, but currently adding records does not trigger
additional announcements. Thus, a response never arrives and
commissioning times out.

Fix this by adding an announcement after the records are added.
Unfortunately this will quite spammy since we announce everything and
advertise is called several times in a row. What should happen is
announcements only happen with new or updated records, or when new
interfaces are enabled.

(Note that this problem is greatly exacerbated by the lack of
retransmissions of queries, which is being addressed in
#9900).
andy31415 pushed a commit that referenced this pull request Sep 24, 2021
As of 97d9f7e ("Address 9154 and CM/AC DNS-SD convergence (#9348)")
operational discovery during chip-tool's commissioning flow typically
fails with minimal mDNS. This occurs because this change unintentionally
broke the logic to send unsolicited mDNS announcements as required by
RFC 6762.

During commissioning, the initial mDNS query to resolve the operational
service on the network is generally sent before the device is able to
join the network. The is perfectly fine because RFC 6762 accounts for
this case and requires devices joining the network to advertise all of
their records. These unsolicited responses provide the IP address and
allow the final commissioning steps to execute.

The minimal mDNS implementation sends its unsolicited announcements in
Start() and not at any other time. In the above change, we started
calling StopPublishDevice() immediately prior to (re-)starting the
server, leaving no records to announce. The records are added
immediately afterwards, but currently adding records does not trigger
additional announcements. Thus, a response never arrives and
commissioning times out.

Fix this by adding an announcement after the records are added.
Unfortunately this will quite spammy since we announce everything and
advertise is called several times in a row. What should happen is
announcements only happen with new or updated records, or when new
interfaces are enabled.

(Note that this problem is greatly exacerbated by the lack of
retransmissions of queries, which is being addressed in
#9900).
hank820 pushed a commit to hank820/connectedhomeip that referenced this pull request Sep 27, 2021
As of 97d9f7e ("Address 9154 and CM/AC DNS-SD convergence (project-chip#9348)")
operational discovery during chip-tool's commissioning flow typically
fails with minimal mDNS. This occurs because this change unintentionally
broke the logic to send unsolicited mDNS announcements as required by
RFC 6762.

During commissioning, the initial mDNS query to resolve the operational
service on the network is generally sent before the device is able to
join the network. The is perfectly fine because RFC 6762 accounts for
this case and requires devices joining the network to advertise all of
their records. These unsolicited responses provide the IP address and
allow the final commissioning steps to execute.

The minimal mDNS implementation sends its unsolicited announcements in
Start() and not at any other time. In the above change, we started
calling StopPublishDevice() immediately prior to (re-)starting the
server, leaving no records to announce. The records are added
immediately afterwards, but currently adding records does not trigger
additional announcements. Thus, a response never arrives and
commissioning times out.

Fix this by adding an announcement after the records are added.
Unfortunately this will quite spammy since we announce everything and
advertise is called several times in a row. What should happen is
announcements only happen with new or updated records, or when new
interfaces are enabled.

(Note that this problem is greatly exacerbated by the lack of
retransmissions of queries, which is being addressed in
project-chip#9900).
@andy31415
Copy link
Contributor Author

@msandstedt @mspang - would you be able to review this change still? I believe exponential backoff is a desirable mdns functionality regardless of auto-broadcast working or not.

Copy link
Contributor

@msandstedt msandstedt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this mostly makes sense.

My biggest questions are around the larger overall design. I am surprised to see that the Resolver_ImplMinimalMdns.cpp does not appear to have any types of callbacks for resolution success or failure.

This seems divergent from the interfaces we have in inet/DNSResolver.h. Is there a reason these aren't more unified?

Second question: is it presumed that we only need to implement this in the sdk for Minimal mDNS? Do Avahi and Bonjour provide this type of retry behavior on their own?

@andy31415
Copy link
Contributor Author

I think this mostly makes sense.

My biggest questions are around the larger overall design. I am surprised to see that the Resolver_ImplMinimalMdns.cpp does not appear to have any types of callbacks for resolution success or failure.

This seems divergent from the interfaces we have in inet/DNSResolver.h. Is there a reason these aren't more unified?

Second question: is it presumed that we only need to implement this in the sdk for Minimal mDNS? Do Avahi and Bonjour provide this type of retry behavior on their own?

minmdns only has the 'success' part. Failure likely can be implemented only after this PR, since we are effectively saying 'we keep retrying until we give up' and we only have this now.

I expect Avahi and Bonjour to implement more of the spec than minmdns, so retries should be there, however I have not explicitly tried it so far.

@todo
Copy link

todo bot commented Sep 27, 2021

node was evicted here, if/when resolution failures are

// TODO: node was evicted here, if/when resolution failures are
// supported this could be a place for error callbacks
//
// Note however that this is NOT an actual 'timeout' it is showing
// a burst of lookups for which we cannot maintain state. A reply may
// still be received for this peer id (query was already sent on the
// network)
ChipLogError(Discovery, "Re-using pending resolve entry before reply was received.");
}
entryToUse->peerId = peerId;


This comment was generated by todo based on a TODO comment in 8bf12cc in #9900. cc @andy31415.

@woody-apple
Copy link
Contributor

@andy31415 andy31415 merged commit 083c9ad into project-chip:master Sep 28, 2021
andy31415 added a commit that referenced this pull request Sep 28, 2021
* V1: very hardcoded logic for retries

* Move the active resolve attempts class to a separate header/cpp file for better code organization

* Restyle fixes

* Remove unused include in minmdns resolver implementation

* Update code to use Optional for potentially missing values

* use a clock base for the active resolve attempts, to allow for testability

* Start adding unit tests for the resolve attempt queue

* More  unit tests

* Restyle fixes

* Fix signed/unsigned for optional, to make android compile

* Code review updates, add a more complex test for scheduling check
@andy31415 andy31415 deleted the retry_mdns_discovery branch October 28, 2021 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants