fix(update-server): Keep name deconflicted with other devices on the network #10559
Codecov Report
@@ Coverage Diff @@
## avahi_name_conflict #10559 +/- ##
======================================================
Coverage ? 73.81%
======================================================
Files ? 2144
Lines ? 57629
Branches ? 5807
======================================================
Hits ? 42541
Misses ? 13868
Partials ? 1220
won't do a real review since it's in draft but looks great!
update-server/otupdate/common/name_management/name_synchronizer.py
# (https://datatracker.ietf.org/doc/html/rfc6762#section-9).
# It prevents two machines with the same name from flipping
# which one is #1 and which one is #2 every time they reboot.
await self.set_name(new_name=alternative_name)
Note that because the new name contains the "special" characters ` ` (space) and `#`, it technically counts as "dangerous" because of #10197.
Thankfully, this appears to not cause any problems in practice. And I think this is the best option we have at the moment, especially since it will be moot when #10197 is fixed.
The alternative would be to do our own implementation of `alternative_service_name()` to avoid the "dangerous" characters, but that would bring its own edge cases that I think are worse.
Specifically:
- Appending `#2` to names that are already close to the 63-octet limit
- Dropping characters to fit in the octet limit without splitting Unicode code points or modifier sequences
- Incrementing properly after doing `#2` (i.e. `#3`, not `#2 #2`)
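To make those edge cases concrete, here is a minimal Python sketch of what a hand-rolled replacement might have to handle. The names `truncate_utf8` and `alternative_name` and the ` #N` suffix format are assumptions for illustration; this is neither Avahi's algorithm nor this PR's code. It only guarantees that code points are not split, so multi-code-point modifier sequences can still be cut in half.

```python
import re

MAX_OCTETS = 63  # limit on the encoded name, as mentioned above

# Matches names that already end in " #<number>".
_SUFFIX = re.compile(r"^(?P<base>.*) #(?P<n>[0-9]+)\Z", re.DOTALL)


def truncate_utf8(text: str, max_octets: int) -> str:
    """Drop trailing characters until the UTF-8 encoding fits in max_octets.

    This never splits a code point, but it can still split a sequence made
    of several code points (for example, an emoji followed by a modifier).
    """
    encoded = text.encode("utf-8")[:max_octets]
    # errors="ignore" discards a partial code point left at the end by the slice.
    return encoded.decode("utf-8", errors="ignore")


def alternative_name(current: str) -> str:
    """Increment an existing " #N" suffix, or append " #2" if there is none."""
    match = _SUFFIX.match(current)
    if match:
        base, number = match["base"], int(match["n"]) + 1
    else:
        base, number = current, 2
    suffix = f" #{number}"
    room_for_base = MAX_OCTETS - len(suffix.encode("utf-8"))
    return truncate_utf8(base, room_for_base) + suffix
```

Even this small version needs a regular expression, byte-level truncation, and a caveat about modifier sequences, which illustrates why reusing Avahi's built-in function was attractive.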
Code looks great and from my testing, this works completely as advertised (hah). However, when combined with the app, we've got at least a couple show stoppers:
- Avahi's name conflict resolution strategy does not agree with the naming restrictions the app gives users (alphanumeric, i.e. no spaces, no hashes)
- Hashes in robot names break the UI quite badly
  - Even with renaming, the app treats names as robot identifiers
  - These identifiers go in URLs, because the app is a web UI
  - Hashes mean something specific in URLs, so robot names with hashes are mangled when routes are parsed, and you will not be able to navigate to the robot in question to fix the name
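The URL problem can be reproduced outside the app with Python's standard `urllib.parse`. The `/robots/<name>/settings` route shape below is a hypothetical stand-in for the app's real routes, and the snippet shows generic URL parsing rather than the app router's exact behavior.

```python
from urllib.parse import quote, unquote, urlsplit

robot_name = "MyRobot #2"

# Hypothetical route shape; the app's real routes may differ.
raw_url = f"http://localhost/robots/{robot_name}/settings"
parts = urlsplit(raw_url)
# Everything after the first "#" is treated as the URL fragment, so the
# path no longer contains the full robot name.
raw_path, raw_fragment = parts.path, parts.fragment

# Percent-encoding the name keeps it inside a single path segment.
escaped_url = f"http://localhost/robots/{quote(robot_name, safe='')}/settings"
escaped_path = urlsplit(escaped_url).path
recovered_name = unquote(escaped_path.split("/")[2])
```

With the raw name, the path ends at `MyRobot ` and `2/settings` lands in the fragment; with the percent-encoded name, the full name can be recovered from the path.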
The UI behaves badly in other ways during a rename-with-a-conflict, but I think those issues are outside the scope of this PR.
I'm at a bit of a loss here. I don't relish the thought of implementing custom name deconflicting ourselves, but it's certainly a smaller lift than changing up the app's entire page routing strategy in the 11th hour.
I think to get this PR mergeable, implementing our own deconflict strategy is our most viable path. Is there something lightweight we can do that isn't necessarily trying to increment anything (e.g. append a random 4 digit number to the end of the name)?
I think appending a random 4-digit number is roughly as difficult as incrementing an integer, but yes, totally doable. 👍
could you just url-escape the name avahi gives us?
Are you saying do something like this on the server side?
# On collision...
alternative_name = avahi_client.alternative_service_name(current_name)
alternative_name = url_escape(alternative_name)
avahi_client.start_advertising(alternative_name)
persist_pretty_hostname(alternative_name)
If so:
If you mean should the Opentrons App be url-escaping these names: yes, probably.
I experimented with URL encoding the names on the front-end, and it did not go well. We run up against limitations in the react-router library that the app relies on, and I don't think that's something that can get thrown on the A&U team's plate right now.
Custom name conflict logic looks good, but the new de-conflicted name is not returned to the client, resulting in UI bugs
I don't see a great way to consistently return the de-conflicted name to the client. There are a couple of not-great options we could explore, but I think it'd be a better idea to get this PR in first.
I think we should also file a ticket so that the app knows the name that the server returns can't necessarily be trusted
I think the best solution we'd have is something like:
In a very brief (and hacky!) experiment, dropping the conflict poll interval to 1s and sleeping for 1s after the name set was enough time to get the de-conflicted name to the client.
Meaning: return the deconflicted name in the response to the initial HTTP
Yeah, in the general case, I don't think the server can do that, because name deconfliction isn't necessarily even tied to these HTTP requests in the first place. You could turn on two robots that start out with the same name and they'd silently deconflict with each other without you ever sending either of them a rename request, for example.
The way I'd put it is that the client needs to be prepared for a robot's name to change on its own at any time. The name immediately deconflicting, like this...
...is one special case of that. I think that's #10689, but let me know if that's not quite what you mean.
I agree, but in the more immediate term, the app currently expects the name returned over HTTP to be the new name of the robot, and does a UI redirect after the request succeeds to go from
If the name is deconflicted, though, this appears to happen in practice:
Oof, okay. In addition to what you pointed out, I think there's another problem with that UI redirection logic: it assumes that all robots that the app can connect to have unique names, but that's not necessarily true. Imagine one robot plugged in over USB and another robot over Wi-Fi both called "MyRobot". They're on different networks with different mDNS collision domains, so they have no idea about each other's existence and they will never deconflict from each other. But the app will try to put them both under the route
Edit: Ticketed this as #10725.
Overview
Fixes #10126.
Changelog
- Expand our code for communicating with the Avahi daemon over D-Bus, to give us the ability to detect when the Avahi service name has collided with another device on the network.
- While `update-server` is running, constantly monitor Avahi for name collisions. When one happens, automatically set a new name, like `My Robot #2`. This is the addition that fixes bug: Robot stops advertising itself when another device has the same name #10126.
- At all times, keep the machine's 3 human-readable names in sync with each other:
  - `update-server` exposes over HTTP via `GET /server/name`, `POST /server/name`, and `GET /server/update`.
  - `update-server` and winds up in `robot-server`'s `/health` endpoint.

  All 3 of these names are used by the discovery client. Keeping them in sync with each other is important to avoid confusing the discovery client, in its current implementation. See spike: Can we configure Avahi with a static .service file? #10199.
The static hostname is unaffected by this PR. It continues to follow the robot's serial number.
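The collision-monitoring behavior described in the changelog can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `FakeAvahiClient`, `is_collided()`, `monitor_collisions()`, the polling approach, and the `bump` naming rule are all hypothetical, and the real code talks to the Avahi daemon over D-Bus.

```python
import asyncio
from typing import Callable, Set


class FakeAvahiClient:
    """Hypothetical in-memory stand-in for a D-Bus Avahi client."""

    def __init__(self, names_taken_by_others: Set[str]) -> None:
        self._taken = names_taken_by_others
        self.advertised_name = ""

    async def start_advertising(self, name: str) -> None:
        self.advertised_name = name

    async def is_collided(self) -> bool:
        # A real client would learn about collisions from the Avahi daemon.
        return self.advertised_name in self._taken


async def monitor_collisions(
    client: FakeAvahiClient,
    make_alternative: Callable[[str], str],
    poll_interval: float,
    max_polls: int,
) -> str:
    """Poll for collisions and re-advertise under a new name when one occurs."""
    for _ in range(max_polls):
        if await client.is_collided():
            new_name = make_alternative(client.advertised_name)
            # A real implementation would also persist the new name so that
            # all of the machine's human-readable names stay in sync.
            await client.start_advertising(new_name)
        await asyncio.sleep(poll_interval)
    return client.advertised_name


def bump(name: str) -> str:
    """Hypothetical naming rule: increment a trailing counter (no " " or "#")."""
    stem = name.rstrip("0123456789")
    count = int(name[len(stem):] or "1") + 1
    return f"{stem}{count}"


async def demo() -> str:
    client = FakeAvahiClient(names_taken_by_others={"MyRobot", "MyRobot2"})
    await client.start_advertising("MyRobot")
    return await monitor_collisions(client, bump, poll_interval=0.0, max_polls=5)
```

Running `asyncio.run(demo())` walks the fake robot from `MyRobot` through `MyRobot2` to a free name. The bounded `max_polls` loop exists only to keep the sketch terminating; a long-running server would monitor for as long as it is up.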
Reviewing
Recommended code review order
Here is one way to ease into this PR. Working from the outside in:
- `name_management/__init__.py` ... `set_name_endpoint()` and `get_name_endpoint()` changed.
- In `name_management/name_synchronizer.py`, check out the `NameSynchronizer` interface, which the endpoints now use, and its `RealNameSynchronizer` implementation.
- In `name_management/avahi.py`, check out `AvahiClient`, a dependency of `NameSynchronizer`.
- `buildroot/__main__.py` and `buildroot/__init__.py`. Note that this is heavily duplicated with `openembedded/__main__.py` and `openembedded/__init__.py`.
Testing
What to test
We need to verify the following properties:
`GET /health` may lag behind the one in `GET /server/update/health` sometimes, because of a different bug. See bug: `GET /health` and `GET /server/health` disagree on the robot's name #10413.
How to test it
This needs at least one real OT-2.
On macOS, you can synthesize fake devices on the network to create collisions by running:
From there, there are many ways to inspect the OT2s' names:
- `dns-sd -B` command to monitor just the DNS-SD instance names. (Pay attention to the `add`/`remove` column!)
- `curl` to directly check `GET /health`, `GET /server/health`, and `GET /server/name`.
- (`cd discovery-client && yarn run discovery`).
- `dns-sd -B` command.
Interesting scenarios to try out:
- `MyRobot`, `MyRobot #2`, and `MyRobot #3`.
After testing
When you want to go back to `edge` after testing this PR, do not do `git switch edge && make -C update-server push`. It will soft-brick your robot because of #10582. Instead, download an `edge` build of `ot2-system.zip` and push it through the Opentrons App.
Risk assessment
High.
`update-server`.