perf(app,robot-server): Download analyses as raw JSON documents #13425

SyntaxColoring · 2023-08-30T13:54:19Z

Overview

This dramatically improves the load time for certain views in the Opentrons App and Flex on-device display.

For example, under the circumstances described in RSS-328, it reduces load time of the Flex's home page from ~5 minutes to <5 seconds.

This closes RSS-328 and RSS-160. It probably closes some other performance-related tickets. It also helps a little bit with RSS-98.

Architecture

Protocol analysis on the robot is a major known compute bottleneck for us. One of the problems is the overhead of storing the completed analyses in the database and then extracting them. It goes through several layers of translation:

flowchart LR
    subgraph robot-server
        analysis_engine((Analysis engine))
        subgraph SQL database
            bytes
        end
        dict
        pydantic[Pydantic object]
        fastapi((FastAPI endpoint))
        json_str
    end
    analysis_engine --> pydantic
    bytes <-->|pickle| dict
    dict <--> pydantic
    pydantic --> fastapi --> json_str[JSON string] --> to_client((HTTP client))

This scheme adds a lot of overhead. Creating Pydantic objects is especially slow. (This scheme is also very brittle—see RSS-98.)

To improve this, this PR adds a second path that's basically a direct read from the database:

flowchart LR
    subgraph robot-server
        analysis_engine((Analysis engine))
        subgraph SQL database
            bytes
            json_str_db[JSON string]
        end
        dict
        pydantic[Pydantic object]
        old_fastapi((FastAPI endpoint))
        new_fastapi((FastAPI endpoint))
        json_str
    end
    analysis_engine --> pydantic
    bytes <-->|pickle| dict
    dict <--> pydantic
    pydantic --> old_fastapi --> json_str[JSON string] --> to_client((HTTP client))
    json_str_db <--- pydantic
    json_str_db --> new_fastapi --> to_client
    linkStyle 7 stroke:red
    linkStyle 8 stroke:red

We expose this through an experimental new HTTP endpoint, GET /protocols/:id/analyses/:id/asDocument. We leave the existing endpoints, such as GET /protocols/:id/analyses/:id, untouched for now.

This requires a database migration to add a new column for the JSON string, so we do that.
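As a rough illustration of the two read paths in the diagrams above, here is a minimal sketch using only the Python standard library. The table layout, the `completed_analysis` column name, and the helper names `store`, `get_slow`, and `get_as_document` are assumptions made for this example; they are not taken from the robot-server code, and the Pydantic step of the old path is omitted.

```python
import json
import pickle
import sqlite3

# Hypothetical schema, loosely based on the changelog in this PR.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE analysis ("
    " id TEXT PRIMARY KEY,"
    " completed_analysis BLOB NOT NULL,"
    " completed_analysis_as_document VARCHAR"
    ")"
)


def store(analysis_id: str, analysis: dict) -> None:
    """Write both representations once, when the analysis completes."""
    conn.execute(
        "INSERT INTO analysis VALUES (?, ?, ?)",
        (analysis_id, pickle.dumps(analysis), json.dumps(analysis)),
    )


def get_slow(analysis_id: str) -> str:
    """Old path: bytes -> unpickle -> dict -> re-serialize to JSON."""
    (blob,) = conn.execute(
        "SELECT completed_analysis FROM analysis WHERE id = ?", (analysis_id,)
    ).fetchone()
    return json.dumps(pickle.loads(blob))


def get_as_document(analysis_id: str) -> str:
    """New path: hand back the stored JSON string without parsing it."""
    (document,) = conn.execute(
        "SELECT completed_analysis_as_document FROM analysis WHERE id = ?",
        (analysis_id,),
    ).fetchone()
    return document


store("analysis-1", {"id": "analysis-1", "status": "completed", "commands": []})
```

Both functions return the same JSON text for the same row; the second one simply skips every translation layer between the database and the response body.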

Detailed changelog

Server

- Add a new endpoint. Document it as experimental.
- Add a new column to the analysis table: completed_analysis_as_document, which stores the serialized JSON as a VARCHAR.
- Introduce a new database migration to support the new column.
  - The migration includes copying over all existing data, so the new column should never be NULL.
  - Per previous discussion in RSS-130, we report an in-progress migration with an HTTP 503 status code on GET /health and on the endpoints that need the database.
  - Expected migration times:
    - Post-release Flexes: None, if we get this in for v7.0.0.
    - Internal Flexes: 5 minutes, from testing with the data in RSS-328.
    - OT-2s: 1 minute on average (from my old notes in RSS-130), or 10 minutes if heavily loaded (extrapolating from RSS-328). There's some room for optimizing this.
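The migration described in the list above (add the column, then copy existing data into it) might look roughly like the following sketch. It assumes SQLite via the standard-library `sqlite3` module, and the function names `migrate_schema` and `migrate_data` plus the `completed_analysis` column are hypothetical; the real migration code in this PR may be structured differently.

```python
import json
import pickle
import sqlite3


def migrate_schema(conn: sqlite3.Connection) -> None:
    """Add the new column if this database does not have it yet."""
    columns = [row[1] for row in conn.execute("PRAGMA table_info(analysis)")]
    if "completed_analysis_as_document" not in columns:
        conn.execute(
            "ALTER TABLE analysis ADD COLUMN completed_analysis_as_document VARCHAR"
        )


def migrate_data(conn: sqlite3.Connection) -> int:
    """Backfill rows whose document is NULL. Safe to run on every boot."""
    rows = conn.execute(
        "SELECT id, completed_analysis FROM analysis"
        " WHERE completed_analysis_as_document IS NULL"
    ).fetchall()
    for analysis_id, blob in rows:
        conn.execute(
            "UPDATE analysis SET completed_analysis_as_document = ? WHERE id = ?",
            (json.dumps(pickle.loads(blob)), analysis_id),
        )
    conn.commit()
    return len(rows)


# An "old" database that predates the new column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE analysis (id TEXT PRIMARY KEY, completed_analysis BLOB)")
conn.execute(
    "INSERT INTO analysis VALUES (?, ?)",
    ("analysis-1", pickle.dumps({"status": "completed"})),
)
conn.commit()

migrate_schema(conn)
first_run = migrate_data(conn)   # number of rows backfilled on the first pass
second_run = migrate_data(conn)  # a repeat pass finds nothing left to do
```

Keeping the data step separate from the schema step, and keying it on NULL documents, means it can be re-run without harm after an upgrade, a downgrade, or an interrupted boot.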

Desktop app and ODD

@b-cooper can provide more details, but roughly, we're switching a bunch of stuff to query the new fast endpoint and avoid querying the old slow endpoints.

Helpers for the new GET /protocols/:protocolId/analyses/:analysisId/asDocument endpoint have been added to api-client and react-api-client. The new useProtocolAnalysisAsDocumentQuery has been substituted across all locations in the Desktop and ODD where we were formerly requesting GET /protocols/:protocolId/analyses (old non-performant un-pickling path) and picking the latest entry.

Test Plan

@SyntaxColoring tested on an OT-2 and Flex:

- It should be faster. :) See RSS-328 for one test case.
- A Flex's on-device display should say "initializing..." while the migration is in progress.
- Per prior discussion in RSS-130, the desktop app may say something like "This robot's API server is not responding correctly to requests..." while the migration is in progress. This is potentially confusing, but it's what it's historically shown while the robot boots up. The only difference now is that bootup is taking longer.
- Power-cycle a robot in the middle of the migration and make sure it doesn't cause any errors. The migration should start from scratch.
- You should be able to freely upgrade and downgrade across this commit without anything breaking.
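The "503 while migrating" behavior that the clients see during this test plan can be sketched as a small gate in front of the endpoints. Everything here is hypothetical: `PersistenceInitializer`, `status_for`, and the rule for which paths need the database are stand-ins for illustration, not the robot-server implementation.

```python
import threading
from http import HTTPStatus


class PersistenceInitializer:
    """Hypothetical stand-in for whatever tracks database startup."""

    def __init__(self) -> None:
        self._ready = threading.Event()

    def mark_ready(self) -> None:
        """Call once the migration has finished."""
        self._ready.set()

    def is_ready(self) -> bool:
        return self._ready.is_set()


def status_for(path: str, persistence: PersistenceInitializer) -> HTTPStatus:
    """Return 503 for database-backed paths until the migration is done."""
    # Assumption: /health and anything under /protocols need the database.
    needs_database = path == "/health" or path.startswith("/protocols")
    if needs_database and not persistence.is_ready():
        return HTTPStatus.SERVICE_UNAVAILABLE
    return HTTPStatus.OK
```

A client that polls GET /health would treat 503 as "still booting" and retry, which matches the "initializing..." state described above.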

Review requests

Architecturally, is there anything you think we need to do now in order to make this less risky, or avoid headaches in the long run?

Risk assessment

Medium.

Supporting both strategies simultaneously adds complexity to the server, which will be bug-prone if we let it linger.
Any kind of robot-server database migration with our homegrown system is inherently kind of scary, and this is the first real one that we're doing in production.

We have good automated tests for it (e.g. test_tables.py and test_persistence.py), but things can still sneak in. Examples:
- The transaction bug fixed in fix(robot-server,system-server): Make SQL transactions behave sanely #13424
- This thing about a mismatched column type
- This quirk in how we stamp the database with schema versions

codecov · 2023-08-30T13:59:13Z

Codecov Report

Merging #13425 (6375ed1) into chore_release-7.0.0 (89c0a4f) will decrease coverage by 0.11%.
Report is 14 commits behind head on chore_release-7.0.0.
The diff coverage is 69.38%.

Additional details and impacted files

@@                   Coverage Diff                   @@
##           chore_release-7.0.0   #13425      +/-   ##
=======================================================
- Coverage                71.52%   71.42%   -0.11%     
=======================================================
  Files                     2431     2435       +4     
  Lines                    67807    67844      +37     
  Branches                  7865     7883      +18     
=======================================================
- Hits                     48501    48455      -46     
- Misses                   17463    17549      +86     
+ Partials                  1843     1840       -3

| Flag | Coverage Δ | |
| --- | --- | --- |
| app | `69.09% <77.27%> (-0.45%)` | ⬇️ |
| react-api-client | `69.30% <0.00%> (-0.82%)` | ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files Changed | Coverage Δ | |
| --- | --- | --- |
| ...ceDisplay/RobotDashboard/RecentRunProtocolCard.tsx | `81.25% <ø> (-1.11%)` | ⬇️ |
| ...eviceDisplay/RobotDashboard/ServerInitializing.tsx | `0.00% <0.00%> (ø)` | |
| ...rc/protocols/useProtocolAnalysisAsDocumentQuery.ts | `0.00% <0.00%> (ø)` | |
| robot-server/robot_server/hardware.py | `81.42% <ø> (ø)` | |
| ...ot-server/robot_server/protocols/analysis_store.py | `100.00% <ø> (ø)` | |
| robot-server/robot_server/protocols/router.py | `100.00% <ø> (ø)` | |
| app/src/pages/Protocols/hooks/index.ts | `68.75% <60.00%> (+1.00%)` | ⬆️ |
| .../pages/OnDeviceDisplay/ProtocolDetails/Liquids.tsx | `70.58% <66.66%> (+1.83%)` | ⬆️ |
| ...src/pages/OnDeviceDisplay/RobotDashboard/index.tsx | `70.58% <71.42%> (-6.34%)` | ⬇️ |
| ...rc/pages/OnDeviceDisplay/ProtocolDetails/index.tsx | `62.60% <75.00%> (+0.32%)` | ⬆️ |

... and 6 more

... and 29 files with indirect coverage changes

b-cooper · 2023-08-30T19:32:33Z

app/src/organisms/TakeoverModal/MaintenanceRunTakeover.tsx

@@ -6,6 +6,7 @@ import {
 import { TakeoverModal } from './TakeoverModal'
 import { TakeoverModalContext } from './TakeoverModalContext'

+const MAINTENANCE_RUN_POLL_MS = 10000


Backing off this refetch interval to a slower poll. There are separate intervals instantiated within wizards where we need more up-to-date info.

vegano1 · 2023-08-31T16:44:57Z

Sweet!

robot-server/tests/protocols/test_protocols_router.py

jbleon95

Reviewed part of PR in live review, Python side looks good to me.

smb2268

All of the JS code changes look sound to me. I put this branch on app&ui bot and everything loaded very quickly!

Co-authored-by: Brian Cooper <[email protected]>

* Allow extra time for restart. * Deemphasize. Also, increase time to 15 minutes just to be safe. * Remove load time note from API release notes. * Remove note from app release notes. * Active voice.

SyntaxColoring and others added 5 commits August 24, 2023 15:54

Hack up some performance tests.

bc0abd4

temporarily remove tracking from run card, and check equipment from a…

2bb3af4

…nalysis as doc

server initializing empty state for recent run protocol card

341eaf7

remove old data fetching comments from required hardware hook

5c4bcb7

Delete benchmarking script.

6e3a261

SyntaxColoring changed the base branch from edge to chore_release-7.0.0 August 30, 2023 13:54

SyntaxColoring and others added 7 commits August 30, 2023 10:20

Merge branch 'chore_release-7.0.0' into performance_testing_analyses_…

5cb7f3d

…as_opaque_documents Resolve conflicts in: * robot-server/robot_server/protocols/completed_analysis_store.py

Add a migration to add the new column.

0db671d

Try to migrate records eagerly.

5965e92

refactor useMostRecentCompletedAnalysis to use asDocument endpoint

aeb5012

back off the top level maintenance run poll a bit

7c0f29d

Add a Tavern test for the analysis endpoints.

09eaf6e

Tweak serialization configurables to match what we normally do throug…

5d83c8b

…h FastAPI.

b-cooper reviewed Aug 30, 2023

SyntaxColoring and others added 15 commits August 30, 2023 16:06

Add endpoint docs and fix Content-Type header.

007d53e

Unrelated test fixups.

1211d1c

Add CompletedAnalysisStore unit tests.

9bc8b91

Fix edge case with potentially NULL documents.

0bcbb3e

update all instances of analyses query to as doc

11c534f

undo problematic useprotocoldetailsforrun test fix

d2c4b04

fix up test for use protocol details for run

f890c9f

fix up formatting and tests

b413c88

Import fixup.

00e9f0d

Add checks for new endpoint in persistence test.

1807fb7

Add checks for new endpoint in persistence snapshot compatibility test.

690d406

Describe confusing schema stamp behavior.

954287e

Simpler solution for the upgrade-downgrade-upgrade edge case.

67ecc5f

Separate the schema migration from the data migration. Do the data part unconditionally.

Document column data types.

8d62ad3

Format 'n lint.

2a4c36f

SyntaxColoring added 4 commits August 31, 2023 10:51

Note potential future trap with _systemd_notify().

a639e8e

Raise 404 if the given analysis is still pending.

9296728

Add unit tests for AnalysisStore.

Minor fixups to comments and formatting.

ae61097

we are all children of the machine. the machine takes care of us.

593d509

the machine understands `make format-js` as an expression of love.

SyntaxColoring marked this pull request as ready for review August 31, 2023 15:46

SyntaxColoring requested review from a team as code owners August 31, 2023 15:46

SyntaxColoring requested review from mjhuff and a team and removed request for a team August 31, 2023 15:46

SyntaxColoring added 2 commits August 31, 2023 13:57

Bump compatibility tests' startup timeouts, for CI.

9ebe737

Add missing router tests.

ab8adac

jbleon95 reviewed Aug 31, 2023

robot-server/tests/protocols/test_protocols_router.py (outdated, resolved)

jbleon95 approved these changes Aug 31, 2023

SyntaxColoring added 2 commits August 31, 2023 16:13

Fix test docstrings.

fe9adcc

It would help if I applied the increased timeout to the correct file.

6375ed1

This reverts commit 9ebe737.

SyntaxColoring mentioned this pull request Aug 31, 2023

chore: Update v7.0.0 release notes for PR #13425 #13443

Merged

smb2268 self-requested a review August 31, 2023 21:16

smb2268 approved these changes Aug 31, 2023

SyntaxColoring merged commit 32e90d7 into chore_release-7.0.0 Sep 1, 2023

SyntaxColoring deleted the performance_testing_analyses_as_opaque_documents branch September 1, 2023 15:10

mjhuff pushed a commit that referenced this pull request Sep 5, 2023

perf(app,robot-server): Download analyses as raw JSON documents (#13425)

196bd8e

Co-authored-by: Brian Cooper <[email protected]>

TamarZanzouri pushed a commit that referenced this pull request Sep 13, 2023

perf(app,robot-server): Download analyses as raw JSON documents (#13425)

9e89658

Co-authored-by: Brian Cooper <[email protected]>

SyntaxColoring mentioned this pull request Oct 11, 2023

fix(api): cancellation bug on legacy core api #13767

Merged

3 tasks

This was referenced Jan 22, 2024

refactor(robot-server): Add missing comment for v2 schema migration #14338

Merged

perf(robot-server): Store one command per row #14348

Merged

refactor(robot-server): Store Pydantic objects as JSON instead of pickles, take 2 #14355

Merged
