Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(app,robot-server): Download analyses as raw JSON documents #13425

Conversation

SyntaxColoring
Copy link
Contributor

@SyntaxColoring SyntaxColoring commented Aug 30, 2023

Overview

This dramatically improves the load time for certain views in the Opentrons App and Flex on-device display.

For example, under the circumstances described in RSS-328, it reduces load time of the Flex's home page from ~5 minutes to <5 seconds.

This closes RSS-328 and RSS-160. It probably closes some other performance-related tickets. It also helps a little bit with RSS-98.

Architecture

Protocol analysis on the robot is a major known compute bottleneck for us. One of the problems is the overhead of storing the completed analyses in the database and then extracting them. It goes through several layers of translation:

flowchart LR
    subgraph robot-server
        analysis_engine((Analysis engine))
        subgraph SQL database
            bytes
        end
        dict
        pydantic[Pydantic object]
        fastapi((FastAPI endpoint))
        json_str
    end
    analysis_engine --> pydantic
    bytes <-->|pickle| dict
    dict <--> pydantic
    pydantic --> fastapi --> json_str[JSON string] --> to_client((HTTP client))
Loading

This scheme adds a lot of overhead. Creating Pydantic objects is especially slow. (This scheme is also very brittle—see RSS-98.)

To improve this, this PR adds a second path that's basically a direct read from the database:

flowchart LR
    subgraph robot-server
        analysis_engine((Analysis engine))
        subgraph SQL database
            bytes
            json_str_db[JSON string]
        end
        dict
        pydantic[Pydantic object]
        old_fastapi((FastAPI endpoint))
        new_fastapi((FastAPI endpoint))
        json_str
    end
    analysis_engine --> pydantic
    bytes <-->|pickle| dict
    dict <--> pydantic
    pydantic --> old_fastapi --> json_str[JSON string] --> to_client((HTTP client))
    json_str_db <--- pydantic
    json_str_db --> new_fastapi --> to_client
    linkStyle 7 stroke:red
    linkStyle 8 stroke:red
Loading

We expose this through an experimental new HTTP endpoint, GET /protocols/:id/analyses/:id/asDocument. We leave the existing endpoints, such as GET /protocols/:id/analyses/:id, untouched for now.

This requires a database migration to add a new column for the JSON string, so we do that.

Detailed changelog

Server

  • Add a new endpoint. Document it as experimental.
  • Add a new column to the analysis table: completed_analysis_as_document, which stores the serialized JSON as a VARCHAR.
  • Introduce a new database migration to support the new column.
    • The migration includes copying over all existing data, so the new column should never be NULL.
    • Per previous discussion in RSS-130, we report an in-progress migration with an HTTP 503 status code on GET /health and on the endpoints that need the database.
    • Expected migration times:
      • Post-release Flexes: None, if we get this in for v7.0.0.
      • Internal Flexes: 5 minutes, from testing with the data in RSS-328.
      • OT-2s: 1 minute on average (from my old notes in RSS-130), or 10 minutes if heavily-loaded (extrapolating from RSS-328). There's some room for optimizing this.

Desktop app and ODD

@b-cooper can provide more details, but roughly, we're switching a bunch of stuff to query the new fast endpoint and avoid querying the old slow endpoints.

Helpers for the new GET /protocols/:protocolId/analyses/:analysisId/asDocument endpoint have been added api-client and react-api-client. The new useProtocolAnalysisAsDocumentQuery has been substituted across all locations in the Desktop and ODD where we were formerly requesting GET /protocols/:protocolId/analyses (old non-performant un-pickling path) and picking the latest entry.

Test Plan

@SyntaxColoring tested on an OT-2 and Flex:

  • It should be faster. :) See RSS-328 for one test case.
  • A Flex's on-device display should say "initializing..." while the migration is in progress.
  • Per prior discussion in RSS-130, the desktop app may say something like "This robot's API server is not responding correctly to requests..." while the migration is in progress. This is potentially confusing, but it's what it's historically shown while the robot boots up. The only difference now is that bootup is taking longer.
  • Power-cycle a robot in the middle of the migration and make sure it doesn't cause any errors. The migration should start from scratch.
  • You should be able to freely upgrade and downgrade across this commit without anything breaking.

Review requests

Architecturally, is there anything you think we need to do now in order to make this less risky, or avoid headaches in the long run?

Risk assessment

Medium.

@SyntaxColoring SyntaxColoring changed the base branch from edge to chore_release-7.0.0 August 30, 2023 13:54
@codecov
Copy link

codecov bot commented Aug 30, 2023

Codecov Report

Merging #13425 (6375ed1) into chore_release-7.0.0 (89c0a4f) will decrease coverage by 0.11%.
Report is 14 commits behind head on chore_release-7.0.0.
The diff coverage is 69.38%.

Additional details and impacted files

Impacted file tree graph

@@                   Coverage Diff                   @@
##           chore_release-7.0.0   #13425      +/-   ##
=======================================================
- Coverage                71.52%   71.42%   -0.11%     
=======================================================
  Files                     2431     2435       +4     
  Lines                    67807    67844      +37     
  Branches                  7865     7883      +18     
=======================================================
- Hits                     48501    48455      -46     
- Misses                   17463    17549      +86     
+ Partials                  1843     1840       -3     
Flag Coverage Δ
app 69.09% <77.27%> (-0.45%) ⬇️
react-api-client 69.30% <0.00%> (-0.82%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
...ceDisplay/RobotDashboard/RecentRunProtocolCard.tsx 81.25% <ø> (-1.11%) ⬇️
...eviceDisplay/RobotDashboard/ServerInitializing.tsx 0.00% <0.00%> (ø)
...rc/protocols/useProtocolAnalysisAsDocumentQuery.ts 0.00% <0.00%> (ø)
robot-server/robot_server/hardware.py 81.42% <ø> (ø)
...ot-server/robot_server/protocols/analysis_store.py 100.00% <ø> (ø)
robot-server/robot_server/protocols/router.py 100.00% <ø> (ø)
app/src/pages/Protocols/hooks/index.ts 68.75% <60.00%> (+1.00%) ⬆️
.../pages/OnDeviceDisplay/ProtocolDetails/Liquids.tsx 70.58% <66.66%> (+1.83%) ⬆️
...src/pages/OnDeviceDisplay/RobotDashboard/index.tsx 70.58% <71.42%> (-6.34%) ⬇️
...rc/pages/OnDeviceDisplay/ProtocolDetails/index.tsx 62.60% <75.00%> (+0.32%) ⬆️
... and 6 more

... and 29 files with indirect coverage changes

@@ -6,6 +6,7 @@ import {
import { TakeoverModal } from './TakeoverModal'
import { TakeoverModalContext } from './TakeoverModalContext'

const MAINTENANCE_RUN_POLL_MS = 10000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backing off this refetch interval to a slower poll. There are separate interval instantiated within wizards where we need more up to date info

@SyntaxColoring SyntaxColoring marked this pull request as ready for review August 31, 2023 15:46
@SyntaxColoring SyntaxColoring requested review from a team as code owners August 31, 2023 15:46
@SyntaxColoring SyntaxColoring requested review from mjhuff and a team and removed request for a team August 31, 2023 15:46
@vegano1
Copy link
Contributor

vegano1 commented Aug 31, 2023

Sweet!

Copy link
Contributor

@jbleon95 jbleon95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed part of PR in live review, Python side looks good to me.

Copy link
Contributor

@smb2268 smb2268 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the JS code changes look sound to me. I put this branch on app&ui bot and everything loaded very quickly!

@SyntaxColoring SyntaxColoring merged commit 32e90d7 into chore_release-7.0.0 Sep 1, 2023
@SyntaxColoring SyntaxColoring deleted the performance_testing_analyses_as_opaque_documents branch September 1, 2023 15:10
mjhuff pushed a commit that referenced this pull request Sep 5, 2023
sfoster1 pushed a commit that referenced this pull request Sep 11, 2023
* Allow extra time for restart.

* Deemphasize. Also, increase time to 15 minutes just to be safe.

* Remove load time note from API release notes.

* Remove note from app release notes.

* Active voice.
TamarZanzouri pushed a commit that referenced this pull request Sep 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants