Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(update-server): Keep name deconflicted with other devices on the network #10559

Merged
merged 41 commits into from
Jun 9, 2022

Conversation

SyntaxColoring
Copy link
Contributor

@SyntaxColoring SyntaxColoring commented Jun 1, 2022

Overview

Fixes #10126.

Changelog

  • Expand our code for communicating with the Avahi daemon over D-Bus, to give us the ability to detect when the Avahi service name has collided with another device on the network.

  • While update-server is running, constantly monitor Avahi for name collisions. When one happens, automatically set a new name, like My Robot #2. This is the addition that fixes bug: Robot stops advertising itself when another device has the same name #10126.

  • At all times, keep the machine's 3 human-readable names in sync with each other:

    All 3 of these names are used by the discovery client. Keeping them in sync with each other is important to avoid confusing the discovery client, in its current implementation. See spike: Can we configure Avahi with a static .service file? #10199.

  • The static hostname is unaffected by this PR. It continues to follow the robot's serial number.

Reviewing

Recommended code review order

Here is one way to ease into this PR. Working from the outside in:

  1. In name_management/__init__.py...
    1. Check out the docstring to familiarize yourself with some important terminology.
    2. Check out how the HTTP endpoint functions set_name_endpoint() and get_name_endpoint() changed.
  2. In name_managemement/name_synchronizer.py, check out the NameSynchronizer interface, which the endpoints now use, and its RealNameSynchronizer implementation.
  3. In name_management/avahi.py, check out AvahiClient, a dependency of NameSynchronizer.
  4. Check out the server initialization code in buildroot/__main__.py and buildroot/__init__.py. Note that this is heavily duplicated with openembedded/__main__.py and openembedded/__init__.py.
  5. Everything else.

Testing

What to test

We need to verify the following properties:

How to test it

This needs at least one real OT-2.

On macOS, you can synthesize fake devices on the network to create collisions by running:

dns-sd -R 'My Fake Robot Name' _http._tcp . 31950

From there, there are many ways to inspect the OT2s' names:

  • You can simply play around on the Opentrons App. You might want to dig into the app's network settings to clear the robot discovery cache to make things cleaner and make it easier to notice anomalies.
  • You can use the macOS dns-sd -B command to monitor just the DNS-SD instance names. (Pay attention to the add/remove column!)
  • You can use Postman or curl to directly check GET /health, GET /server/health, and GET /server/name.
  • You can use the Opentrons discovery client command-line interface (cd discovery-client && yarn run discovery).
  • The macOS dns-sd -B command.

Interesting scenarios to try out:

  • Two machines are both online and have different names. You change one of their names so that they match. One of them should notice the conflict and correct it.
  • Two machines are both offline and have the same name. You turn both of them on. One of them should notice the conflict and correct it.
  • You give three machines the same name. You should see something like MyRobot, MyRobot #2, and MyRobot #3.
  • You have one OT-2 connected to your computer over both a wired and wireless connection. It should not conflict with itself or change its name on its own.

After testing

When you want to go back to edge after testing this PR, do not do git switch edge && make -C update-server push. It will soft-brick your robot because of #10582. Instead, download an edge build of ot2-system.zip and push it through the Opentrons App.

Risk assessment

High.

  • Screwing stuff up in this part of the codebase can soft-brick the robot by crashing update-server.
  • Screwing stuff up in this part of the codebase can soft-brick the robot by by accidentally breaking some nuance of robot discovery.
  • This part of the codebase has spotty type-checking and unit testing coverage.
  • We're handling interactions with system processes that we don't own and that we can only test manually.

@SyntaxColoring SyntaxColoring added update server robot-svcs Falls under the purview of the Robot Services squad (formerly CPX, Core Platform Experience). labels Jun 1, 2022
@SyntaxColoring SyntaxColoring changed the base branch from edge to avahi_name_conflict June 1, 2022 13:06
@codecov
Copy link

codecov bot commented Jun 1, 2022

Codecov Report

❗ No coverage uploaded for pull request base (avahi_name_conflict@89deec4). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@                  Coverage Diff                   @@
##             avahi_name_conflict   #10559   +/-   ##
======================================================
  Coverage                       ?   73.81%           
======================================================
  Files                          ?     2144           
  Lines                          ?    57629           
  Branches                       ?     5807           
======================================================
  Hits                           ?    42541           
  Misses                         ?    13868           
  Partials                       ?     1220           
Flag Coverage Δ
update-server 73.04% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Copy link
Member

@sfoster1 sfoster1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't do a real review since it's in draft but looks great!

# (https://datatracker.ietf.org/doc/html/rfc6762#section-9).
# It prevents two machines with the same name from flipping
# which one is #1 and which one is #2 every time they reboot.
await self.set_name(new_name=alternative_name)
Copy link
Contributor Author

@SyntaxColoring SyntaxColoring Jun 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that because the new name contains the "special" characters of (space) and #, it technically counts as "dangerous" because of #10197.

Thankfully, this appears to not cause any problems in practice. And I think this is the best option we have at the moment, especially since it will be moot when #10197 is fixed.

The alternative would be to do our own implementation of alternative_service_name() to avoid the "dangerous" characters, but that would bring its own edge cases that I think are worse.

Specifically:

  • Appending #2 to names that are already close to the 63-octet limit
    • Dropping characters to fit in the octet limit without splitting Unicode code points or modifier sequences
  • Incrementing properly after doing #2 (i.e. #3, not #2 #2)

@SyntaxColoring SyntaxColoring marked this pull request as ready for review June 2, 2022 16:25
@SyntaxColoring SyntaxColoring marked this pull request as ready for review June 6, 2022 22:15
mcous
mcous previously requested changes Jun 7, 2022
Copy link
Contributor

@mcous mcous left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks great and from my testing, this works completely as advertised (hah). However, when combined with the app, we've got at least a couple show stoppers:

  1. Avahi's name conflict resolution strategy does not agree with the naming restrictions the app gives users (alphanumeric, i.e. no spaces, no hashes)
  2. Hashes in robot names break the UI quite badly
    • Even with renaming, the app treats names as robot identifiers
    • These identifiers go in URLs, because the app is a web UI
    • Hashes mean something specific in URLs, so robot names with hashes are mangled when routes are parsed, and you will not be able to navigate to the robot in question to fix the name

The UI behaves badly in other ways during a rename-with-a-conflict, but I think those issues outside the scope of this PR.

I'm at a bit of a loss here. I don't relish the thought of implementing custom name deconflicting ourselves, but it's certainly a smaller lift than changing up the app's entire page routing strategy in the 11th hour.

I think to get this PR mergeable, implementing our own deconflict strategy is our most viable path. Is there something lightweight we can do that isn't necessarily trying to increment anything (e.g. append a random 4 digit number to the end of the name)?

@SyntaxColoring
Copy link
Contributor Author

I think to get this PR mergeable, implementing our own deconflict strategy is our most viable path. Is there something lightweight we can do that isn't necessarily trying to increment anything (e.g. append a random 4 digit number to the end of the name)?

I think appending a random 4-digit number is roughly as difficult as incrementing an integer, but yes, totally doable. 👍

@sfoster1
Copy link
Member

sfoster1 commented Jun 8, 2022

could you just url-escape the name avahi gives us?

@SyntaxColoring
Copy link
Contributor Author

could you just url-escape the name avahi gives us?

Are you saying do something like this on the server side?

# On collision...
alternative_name = avahi_client.alternative_service_name(current_name)
alternative_name = url_escape(alternative_name)
avahi_client.start_advertising(alternative_name)
persist_pretty_hostname(alternative_name)

If so:

  • I can test this, but I suspect it will interfere with Avahi's integer sequencing. alternative_service_name(current_name) expects current_name to be of the form Base Name #N, not Base Name %23N or whatever. We could URL-decode the current name before passing it to alternative_service_name(), but at that point it feels more direct to me to implement our own logic instead of fixing up Avahi to get the result that we want.
  • We still have technically have a special character problem, with the %s.

If you mean should the Opentrons App be url-escaping these names: yes, probably.

@mcous
Copy link
Contributor

mcous commented Jun 8, 2022

I experimented with URL encoding the names on the front-end, and it did not go well. We run up against limitations in the react-router library that the app relies on, and I don't think that's something that can get thrown on the A&U team's plate right now

@SyntaxColoring SyntaxColoring requested a review from mcous June 8, 2022 20:43
@mcous mcous dismissed their stale review June 8, 2022 22:27

Dismissing due to changes

Copy link
Contributor

@mcous mcous left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Custom name conflict logic looks good, but the new de-conflicted name is not returned to the client, resulting in UI bugs

Copy link
Contributor

@mcous mcous left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a great way to consistently return the de-conflicted name to the client. There's a couple not great options we could explore, but I think it'd be a better idea to get this PR in, first.

I think we should also file a ticket so that the app knows the name that the server returns can't necessarily be trusted

@mcous
Copy link
Contributor

mcous commented Jun 9, 2022

I think the best solution we'd have is something like:

  1. Request new name
  2. Wait for the next conflict poll
    • Probably with some minimum wait time
  3. Check the name
  4. Return it

In a very brief (and hacky!) experiment, dropping the conflict poll interval to 1s and sleeping for 1s after the name set was enough time to get the de-conflicted name to the client

@SyntaxColoring
Copy link
Contributor Author

I don't see a great way to consistently return the de-conflicted name to the client. There's a couple not great options we could explore, but I think it'd be a better idea to get this PR in, first.

Meaning: return the deconflicted name in the response to the initial HTTP POST /server/name request? Like, "OK, you told me to change my name to 'MyRobot', but I couldn't do that; my new name is 'MyRobot2' instead"?

Yeah—in the general case, I don't think the server can do that, because name deconfliction isn't necessarily even tied to these HTTP requests in the first place. You could turn on two robots that start out with the same name and they'd silently deconflict with each other without you ever sending either of them a rename request, for example.

I think we should also file a ticket so that the app knows the name that the server returns can't necessarily be trusted

The way I'd put it is that the client needs to be prepared for a robot's name to change on its own at any time. The name immediately deconflicting, like this...

  1. `POST /server/name {"name": "MyRobot"}
  2. 200 success
  3. Robot immediately deconflicts with something else and changes name to "MyRobot2"

...is one special case of that.

I think that's #10689, but let me know if that's not quite what you mean.

@SyntaxColoring SyntaxColoring merged commit fc40a21 into avahi_name_conflict Jun 9, 2022
@mcous
Copy link
Contributor

mcous commented Jun 9, 2022

I think that's #10689, but let me know if that's not quite what you mean.

I agree, but in the more immediate term, the app currently expects the HTTP returned name to be the new name of the, robot, and does a UI redirect after the request succeeds to go from /robots/oldname to /robots/newname.

If the name is deconflicted, though, this appears to happen in practice:

  1. HTTP API responds with `{ "name": "newname" }
  2. UI redirects from /robot/oldname to /robot/newname
  3. UI is wrong in one of two possible ways
    • Conflict source: other OT-2 already named newname
      • User is redirected to the settings page of a different robot than the one they renamed
    • Conflict source: a non-OT-2
      • User is now on a completely blank robot page be the DC has no knowledge of a robot named newname

@SyntaxColoring
Copy link
Contributor Author

SyntaxColoring commented Jun 9, 2022

Oof, okay.

In addition to what you pointed out, I think there's another problem with that UI redirection logic: it assumes that all robots that the app can connect to have unique names, but that's not necessarily true.

Imagine one robot plugged in over USB and another robot over Wi-Fi both called "MyRobot". They're on different networks with different mDNS collision domains, so they have no idea about each other's existence and they will never deconflict from each other. But the app will try to put them both under the route /robot/MyRobot?

Edit: Ticketed this as #10725.

@SyntaxColoring SyntaxColoring deleted the avahi_name_conflict_fix branch June 9, 2022 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
robot-svcs Falls under the purview of the Robot Services squad (formerly CPX, Core Platform Experience). update server
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: Robot stops advertising itself when another device has the same name
6 participants