fix(app): simplify discovery state to avoid stale IPs and health data #6193

mcous · 2020-07-22T19:46:53Z

Overview

This PR switches the app to a newer implementation of the Discovery Client with improved state management to:

Avoid displaying stale health data to the user (See Discovery Client: Separate health response data and status in state #5981)
Prevent weird phantom robots from showing up with mDNS FQDNs (See Discovery Client: do not fall back to FQDN as service name #5984)
Avoid the buggy attempt to gather real-time mDNS advertisement status (See Discovery Client: Do not attempt to "refresh" MDNS advertising status of a given robot #5985)
Clear out unreachable IP addresses for a given robot when we have an address that is currently working (See bug: App floods the network trying to connect to every IP it's ever seen on a robot #5250)

Closes #5985, closes #5250

Changelog

fix(app): simplify discovery state to avoid stale health data
- The new discovery client state has given the app increased visibility into how the robot is failing to respond
- Connectivity error messages can now differentiate between error responses from the robot (e.g. 502) and fetch failures (e.g. EHOSTDOWN)
refactor(discovery-client): remove large chunks of now-unused code

Review requests

summary of changes

This PR boils down to 3 fundamental changes:

Removing the old Discovery Client from discovery-client
Updating app-shell/src/discovery.js to use the new Discovery Client's API
Updating app/src/discovery to use the new shape of robots returned by the new DC
- Updating consumers of this state involves touching a lot of files in the app

old state

Discovery Client sends an Array<Service> to the app, where:
- There's a one-to-one relationship between IP addresses and Services, if the IP is known
- There's a one-to-many relationship between a robot name and its Services, because a robot may have multiple IP addresses

export type LegacyService = {
  // robot name
  name: string,
  // ip address this entry tracks, if known
  ip: ?string,
  port: number,
  // IP address (if known) is a link-local address
  local: ?boolean,
  // GET /health response.ok === true
  ok: ?boolean,
  // GET /server/update/health response.ok === true
  serverOk: ?boolean,
  // is advertising on MDNS
  advertising: ?boolean,
  // last good /health response
  health: ?HealthResponse,
  // last good /server/update/health response
  serverHealth: ?ServerHealthResponse,
  ...
}

This state shape ended combining too many concerns and we ended up losing information that led to stale data bugs.

new state

Discovery Client sends a Array<DiscoveryClientRobot>, where:
- There is a one-to-one relationship between robot name and DiscoveryClientRobot
- A DiscoveryClientRobot contains a list of IP addresses and each IP's current health status
The app's discovery selectors take the head of the addresses array (which is the address the DC thinks is most connectable) and merges it into the object returned by the selector
- All machinery that expects a robot to have an IP and port still works
- Checks to Service.ok can be replaced with DiscoveryClientRobot.healthStatus === HEALTH_STATUS_OK

type DiscoveryClientRobot = {|
  /** unique name of the robot */
  name: string,
  /** latest /health response data from the robot */
  health: HealthResponse | null,
  /** latest /server/update/health response data from the robot */
  serverHealth: ServerHealthResponse | null,
  /** IP addresses and health state, ranked by connectability (descending) */
  addresses: $ReadOnlyArray<DiscoveryClientRobotAddress>,
|}

type DiscoveryClientRobotAddress = {|
  /** IP address */
  ip: string,
  /** Port */
  port: number,
  /** Whether this IP has been seen via mDNS or HTTP while the client has been running */
  seen: boolean,
  /** How the last GET /health responded (null if no response yet) */
  healthStatus: HealthStatus | null,
  /** How the last GET /server/update/health responded (null if no response yet) */
  serverHealthStatus: HealthStatus | null,
  /** Error status and response from /health if last request was not 200 */
  healthError: HealthErrorResponse | null,
  /** Error status and response from /server/update/health if last request was not 200 */
  serverHealthError: HealthErrorResponse | null,
|}

type HealthStatus =
  | HEALTH_STATUS_UNREACHABLE
  | HEALTH_STATUS_NOT_OK
  | HEALTH_STATUS_OK

testing

A fair amount of smoke testing is important for this PR. It's the culmination of a lot of work and we've got a lot of fun edge cases that our discovery system interacts with:

The following checklist is a grab bag of things that need to keep working that this switchover touches. See this Discovery Client test plan for a set of instructions on how to test each of these items

Robots on the network continue to show up
App correctly identifies link-local vs non-link-local robots (via the USB icon in the list)
Correct error message is displayed if robot server is down but update server is up
- "Your robot's API server is not responding..."
Correct error message is displayed if both servers are down but robot was recently seen
- "This OT-2 has been seen recently, but is not responding..."
All the HTTP client stuff (that used discovery state to find the IP address of a robot to hit by robot name) still works
Robot updates still work
Balena migrations + updates still work
Robot restarting spinner(s) still work
"Lost Connection" alert still shows if RPC WebSocket becomes disconnected

Important to note: This PR changes the format of the app's discovery.json cache file. If you run this PR, it will migrate your file to the new format. If you then open the app on an older version, it will appear as though your cache has been cleared because of the migration. Everything should, however, continue to function.

Risk assessment

Medium. The app's discovery system is a critical piece of connectivity. The following steps were taken to mitigate risk:

Requesting QA oversight + testing for this PR
Made sure there is high unit test coverage
Wired new discovery client to CLI first for E2E testing (see refactor(discovery-client): wire-up new discovery client to the CLI #6162)
Tested with Wi-Fi and U2E robots / bare RPis with various server states

Switches the app to a newer implementation of the Discovery Client with improved state management Closes #5985

codecov · 2020-07-22T20:42:41Z

Codecov Report

❗ No coverage uploaded for pull request base (edge@2cd4fd1). Click here to learn what that means.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             edge    #6193   +/-   ##
=======================================
  Coverage        ?   79.26%           
=======================================
  Files           ?      199           
  Lines           ?    18395           
  Branches        ?        0           
=======================================
  Hits            ?    14581           
  Misses          ?     3814           
  Partials        ?        0

Impacted Files	Coverage Δ
opentrons/config/robot_configs.py	`97.47% <0.00%> (ø)`
opentrons/protocols/types.py	`90.00% <0.00%> (ø)`
opentrons/tools/factory_test.py	`0.00% <0.00%> (ø)`
opentrons/data_storage/database_queries.py	`95.65% <0.00%> (ø)`
...er/robot_server/service/legacy/routers/settings.py	`100.00% <0.00%> (ø)`
opentrons/calibration_storage/helpers.py	`94.44% <0.00%> (ø)`
...-server/robot_server/service/legacy/routers/rpc.py	`100.00% <0.00%> (ø)`
opentrons/drivers/mag_deck/__init__.py	`100.00% <0.00%> (ø)`
...r/robot_server/service/legacy/models/networking.py	`100.00% <0.00%> (ø)`
opentrons/calibration_storage/file_operators.py	`83.33% <0.00%> (ø)`
... and 189 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2cd4fd1...83f09b4. Read the comment docs.

mcous · 2020-07-27T17:43:03Z

I have a strong suspicion that additional changes to the ap's fetch usage will need to be made to support the pre-3.3.0 IPv6 robots due to the fact that the discovery client no longer adds square-brackets to IPv6 addresses

nusrat813

Tested - working as intended

mcous · 2020-07-30T16:08:43Z

After testing on various Balena robots (both pre-3.3 and post-3.3) I pushed the following changes to fix observed issues:

IP address ranking was wrong for IPv6 link-local addresses, causing pre-3.3 robots (with the fd00:0:cafe:fefe::1 address) to always prefer Wi-Fi over U2E
- Pushed a fix + tests to rank IPv6LL lower than IPv4LL but higher than non-LL addresses
Health endpoint responses were latched to the first received value rather than the latest received value
- Pretty classic "put the variables in the wrong order" bug that manifested in (among other things) a broken Balena > Buildroot migration process
- Push a fix + test to ensure the health responses saved to state are always the latest values

sanni-t

LGTM!
Tested everything except balena migrations. Worked as expected.

fix(app): simplify discovery state to avoid stale health data

e578c3b

Switches the app to a newer implementation of the Discovery Client with improved state management Closes #5985

mcous added app Affects the `app` project ready for review fix PR fixes a bug qa QA input / review required discovery-client labels Jul 22, 2020

mcous requested review from a team as code owners July 22, 2020 19:46

mcous requested review from amitlissack, sanni-t and IanLondon and removed request for a team July 22, 2020 19:46

fixup: s/DiscoveryClientNext/DiscoveryClient/

0afe267

mcous requested a review from a team July 23, 2020 16:49

mcous added 2 commits July 29, 2020 16:42

Merge branch 'edge' into app_wire-new-dc

4c281e5

fixup: fix priority and requests for pre-3.3 IPv6 eth0 address

009c372

mcous requested a review from a team as a code owner July 29, 2020 21:32

fixup: fix accidental latching of health responses

83f09b4

nusrat813 approved these changes Jul 30, 2020

View reviewed changes

sanni-t approved these changes Jul 30, 2020

View reviewed changes

mcous merged commit 0e089a7 into edge Aug 3, 2020

mcous deleted the app_wire-new-dc branch August 3, 2020 20:10

SyntaxColoring linked an issue Aug 12, 2020 that may be closed by this pull request

feat: show OT-2's IP addresses even when the robot server is down #6137

Closed

mcous mentioned this pull request Apr 9, 2021

fix(discovery-client): allow discovery client to track network interfaces #7608

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(app): simplify discovery state to avoid stale IPs and health data #6193

fix(app): simplify discovery state to avoid stale IPs and health data #6193

mcous commented Jul 22, 2020 •

edited

Loading

codecov bot commented Jul 22, 2020 •

edited

Loading

mcous commented Jul 27, 2020

nusrat813 left a comment

mcous commented Jul 30, 2020

sanni-t left a comment

fix(app): simplify discovery state to avoid stale IPs and health data #6193

fix(app): simplify discovery state to avoid stale IPs and health data #6193

Conversation

mcous commented Jul 22, 2020 • edited Loading

Overview

Changelog

Review requests

summary of changes

old state

new state

testing

Risk assessment

codecov bot commented Jul 22, 2020 • edited Loading

Codecov Report

mcous commented Jul 27, 2020

nusrat813 left a comment

Choose a reason for hiding this comment

mcous commented Jul 30, 2020

sanni-t left a comment

Choose a reason for hiding this comment

mcous commented Jul 22, 2020 •

edited

Loading

codecov bot commented Jul 22, 2020 •

edited

Loading