Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(app): simplify discovery state to avoid stale IPs and health data #6193

Merged
merged 5 commits into from
Aug 3, 2020

Conversation

mcous
Copy link
Contributor

@mcous mcous commented Jul 22, 2020

Overview

This PR switches the app to a newer implementation of the Discovery Client with improved state management to:

Closes #5985, closes #5250

Changelog

  • fix(app): simplify discovery state to avoid stale health data
    • The new discovery client state has given the app increased visibility into how the robot is failing to respond
    • Connectivity error messages can now differentiate between error responses from the robot (e.g. 502) and fetch failures (e.g. EHOSTDOWN)
  • refactor(discovery-client): remove large chunks of now-unused code

Review requests

summary of changes

This PR boils down to 3 fundamental changes:

  • Removing the old Discovery Client from discovery-client
  • Updating app-shell/src/discovery.js to use the new Discovery Client's API
  • Updating app/src/discovery to use the new shape of robots returned by the new DC
    • Updating consumers of this state involves touching a lot of files in the app

old state

  • Discovery Client sends an Array<Service> to the app, where:
    • There's a one-to-one relationship between IP addresses and Services, if the IP is known
    • There's a one-to-many relationship between a robot name and its Services, because a robot may have multiple IP addresses
export type LegacyService = {
  // robot name
  name: string,
  // ip address this entry tracks, if known
  ip: ?string,
  port: number,
  // IP address (if known) is a link-local address
  local: ?boolean,
  // GET /health response.ok === true
  ok: ?boolean,
  // GET /server/update/health response.ok === true
  serverOk: ?boolean,
  // is advertising on MDNS
  advertising: ?boolean,
  // last good /health response
  health: ?HealthResponse,
  // last good /server/update/health response
  serverHealth: ?ServerHealthResponse,
  ...
}

This state shape ended combining too many concerns and we ended up losing information that led to stale data bugs.

new state

  • Discovery Client sends a Array<DiscoveryClientRobot>, where:
    • There is a one-to-one relationship between robot name and DiscoveryClientRobot
    • A DiscoveryClientRobot contains a list of IP addresses and each IP's current health status
  • The app's discovery selectors take the head of the addresses array (which is the address the DC thinks is most connectable) and merges it into the object returned by the selector
    • All machinery that expects a robot to have an IP and port still works
    • Checks to Service.ok can be replaced with DiscoveryClientRobot.healthStatus === HEALTH_STATUS_OK
type DiscoveryClientRobot = {|
  /** unique name of the robot */
  name: string,
  /** latest /health response data from the robot */
  health: HealthResponse | null,
  /** latest /server/update/health response data from the robot */
  serverHealth: ServerHealthResponse | null,
  /** IP addresses and health state, ranked by connectability (descending) */
  addresses: $ReadOnlyArray<DiscoveryClientRobotAddress>,
|}

type DiscoveryClientRobotAddress = {|
  /** IP address */
  ip: string,
  /** Port */
  port: number,
  /** Whether this IP has been seen via mDNS or HTTP while the client has been running */
  seen: boolean,
  /** How the last GET /health responded (null if no response yet) */
  healthStatus: HealthStatus | null,
  /** How the last GET /server/update/health responded (null if no response yet) */
  serverHealthStatus: HealthStatus | null,
  /** Error status and response from /health if last request was not 200 */
  healthError: HealthErrorResponse | null,
  /** Error status and response from /server/update/health if last request was not 200 */
  serverHealthError: HealthErrorResponse | null,
|}

type HealthStatus =
  | HEALTH_STATUS_UNREACHABLE
  | HEALTH_STATUS_NOT_OK
  | HEALTH_STATUS_OK

testing

A fair amount of smoke testing is important for this PR. It's the culmination of a lot of work and we've got a lot of fun edge cases that our discovery system interacts with:

The following checklist is a grab bag of things that need to keep working that this switchover touches. See this Discovery Client test plan for a set of instructions on how to test each of these items

  • Robots on the network continue to show up
  • App correctly identifies link-local vs non-link-local robots (via the USB icon in the list)
  • Correct error message is displayed if robot server is down but update server is up
    • "Your robot's API server is not responding..."
  • Correct error message is displayed if both servers are down but robot was recently seen
    • "This OT-2 has been seen recently, but is not responding..."
  • All the HTTP client stuff (that used discovery state to find the IP address of a robot to hit by robot name) still works
  • Robot updates still work
  • Balena migrations + updates still work
  • Robot restarting spinner(s) still work
  • "Lost Connection" alert still shows if RPC WebSocket becomes disconnected

Important to note: This PR changes the format of the app's discovery.json cache file. If you run this PR, it will migrate your file to the new format. If you then open the app on an older version, it will appear as though your cache has been cleared because of the migration. Everything should, however, continue to function.

Risk assessment

Medium. The app's discovery system is a critical piece of connectivity. The following steps were taken to mitigate risk:

Switches the app to a newer implementation of the Discovery Client with improved state management

Closes #5985
@mcous mcous added app Affects the `app` project ready for review fix PR fixes a bug qa QA input / review required discovery-client labels Jul 22, 2020
@mcous mcous requested review from a team as code owners July 22, 2020 19:46
@mcous mcous requested review from amitlissack, sanni-t and IanLondon and removed request for a team July 22, 2020 19:46
@codecov
Copy link

codecov bot commented Jul 22, 2020

Codecov Report

❗ No coverage uploaded for pull request base (edge@2cd4fd1). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##             edge    #6193   +/-   ##
=======================================
  Coverage        ?   79.26%           
=======================================
  Files           ?      199           
  Lines           ?    18395           
  Branches        ?        0           
=======================================
  Hits            ?    14581           
  Misses          ?     3814           
  Partials        ?        0           
Impacted Files Coverage Δ
opentrons/config/robot_configs.py 97.47% <0.00%> (ø)
opentrons/protocols/types.py 90.00% <0.00%> (ø)
opentrons/tools/factory_test.py 0.00% <0.00%> (ø)
opentrons/data_storage/database_queries.py 95.65% <0.00%> (ø)
...er/robot_server/service/legacy/routers/settings.py 100.00% <0.00%> (ø)
opentrons/calibration_storage/helpers.py 94.44% <0.00%> (ø)
...-server/robot_server/service/legacy/routers/rpc.py 100.00% <0.00%> (ø)
opentrons/drivers/mag_deck/__init__.py 100.00% <0.00%> (ø)
...r/robot_server/service/legacy/models/networking.py 100.00% <0.00%> (ø)
opentrons/calibration_storage/file_operators.py 83.33% <0.00%> (ø)
... and 189 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2cd4fd1...83f09b4. Read the comment docs.

@mcous mcous requested a review from a team July 23, 2020 16:49
@mcous
Copy link
Contributor Author

mcous commented Jul 27, 2020

I have a strong suspicion that additional changes to the ap's fetch usage will need to be made to support the pre-3.3.0 IPv6 robots due to the fact that the discovery client no longer adds square-brackets to IPv6 addresses

@mcous mcous requested a review from a team as a code owner July 29, 2020 21:32
Copy link

@nusrat813 nusrat813 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested - working as intended

@mcous
Copy link
Contributor Author

mcous commented Jul 30, 2020

After testing on various Balena robots (both pre-3.3 and post-3.3) I pushed the following changes to fix observed issues:

  • IP address ranking was wrong for IPv6 link-local addresses, causing pre-3.3 robots (with the fd00:0:cafe:fefe::1 address) to always prefer Wi-Fi over U2E
    • Pushed a fix + tests to rank IPv6LL lower than IPv4LL but higher than non-LL addresses
  • Health endpoint responses were latched to the first received value rather than the latest received value
    • Pretty classic "put the variables in the wrong order" bug that manifested in (among other things) a broken Balena > Buildroot migration process
    • Push a fix + test to ensure the health responses saved to state are always the latest values

Copy link
Member

@sanni-t sanni-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
Tested everything except balena migrations. Worked as expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
app Affects the `app` project discovery-client fix PR fixes a bug qa QA input / review required
Projects
None yet
3 participants