refactor(robot-server): Store Pydantic objects as JSON instead of pickles, take 2 #14355
Conversation
old_state_summary = old_run_row.state_summary
new_state_summary = (
    None
    if old_run_row.state_summary is None
    else pydantic_to_json(
        pydantic.parse_obj_as(StateSummary, old_state_summary)
    )
)
We could add a try: ... except: pass
around each run's migration to drop it if something goes wrong, instead of making the whole server unavailable.
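A minimal sketch of that idea, with illustrative names (not the actual robot-server migration API): wrap each run's migration in its own try/except so one corrupt row is dropped and logged instead of aborting the whole migration.

```python
# Hypothetical per-run error handling for the migration. Names like
# migrate_runs/migrate_one are illustrative, not the real API.
import logging

log = logging.getLogger(__name__)


def migrate_runs(old_run_rows, migrate_one):
    """Migrate each run independently, dropping any run whose migration
    raises, instead of making the whole server unavailable at boot."""
    migrated = []
    for row in old_run_rows:
        try:
            migrated.append(migrate_one(row))
        except Exception:
            # Dropping a run silently is bad, but it's better than
            # failing the boot. Log it so the loss is at least visible.
            log.exception("Dropping run %s: migration failed.", row.get("id"))
    return migrated
```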
Yeah I've decided I should do this. Dropping runs silently is bad, but it's significantly better than failing the boot.
I haven't gotten to this yet. We should do it in a follow-up.
Some background on this file:
This supports endpoints like GET /protocols/{id}/analyses/{id}.
Originally, we stored these records as pickled dicts, and fully validated them before returning them over HTTP.
In #13425, we introduced a parallel DB column and parallel endpoint: they store the records as JSON, and return them without validating them. This of course has downsides, but it seems to be working well.
This PR consolidates the parallel DB columns, but it leaves the parallel endpoints alone. So we're always reading JSON now. The difference is just whether we validate it before returning it to the client, which depends on which endpoint the client uses.
We can probably consolidate the parallel endpoints now, but that's beyond the scope of this PR.
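The two read paths described above can be sketched like this. This is an illustrative stand-in (plain json and a dataclass instead of the real Pydantic models and endpoint code): both paths read the same stored JSON column, and the only difference is whether the document is validated before being returned.

```python
# Illustrative sketch of the two retrieval paths, not the actual
# robot-server code. AnalysisSummary is a hypothetical stand-in model.
import json
from dataclasses import dataclass


@dataclass
class AnalysisSummary:
    id: str
    status: str


def get_analysis_validated(stored_json: str) -> AnalysisSummary:
    """Original endpoint style: parse and validate the stored document
    before returning it over HTTP."""
    raw = json.loads(stored_json)
    return AnalysisSummary(id=raw["id"], status=raw["status"])


def get_analysis_raw(stored_json: str) -> str:
    """Parallel endpoint style (#13425): return the stored JSON as-is,
    skipping validation for speed."""
    return stored_json
```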
Test results so far:
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

@@           Coverage Diff           @@
##             edge   #14355   +/-  ##
=======================================
  Coverage   68.29%   68.29%
=======================================
  Files        1623     1623
  Lines       54858    54858
  Branches     4115     4115
=======================================
+ Hits        37466    37468       +2
+ Misses      16705    16704       -1
+ Partials      687      686       -1

Flags with carried forward coverage won't be shown.
I think this looks good for moving this whole database migration forward. We should definitely keep an eye on the potential performance issues you noted, but at this point we should have this out on our internal robots and respond from there.
A few thoughts on the risks section:
- To what extent has the new migration time estimate been communicated to other parties within Opentrons? Is there a plan to implement user/public-facing prompts about the migration process to prepare users for it?
- If the solutions in perf(robot-server): Discriminate command unions #14286 and perf(robot-server): Store one command per row #14348 do not provide the desired performance improvement, we should be prepared to implement an as-yet-undesigned third option so as not to further slow communications on the system. I'm not exactly a fan of the nuclear option.
Outside of that, the changes here look appropriate and we can feel safe to move forward into more advanced testing based on the initially reported test results.
I'll make a post internally, and for the public, we'll have to add release notes.
Overview
Closes RSS-98. Takes over for #11561.
Test Plan
I'm going to spend the day putting this on various robots and smoke-testing them to make sure it doesn't break any basic flows or cause any horrific performance regressions.
Other than that, we have okay automated test coverage. But see the risk assessment below.
Changelog
- In RunStore, convert our storage of Commands and StateSummarys from pickled, dictified Pydantic objects to JSON-serialized Pydantic objects.
- In CompletedAnalysisStore, we already did that conversion in perf(app,robot-server): Download analyses as raw JSON documents #13425, in an experimental parallel column. Promote that column to be non-experimental, and delete the old one that had pickled dicts.
Review requests
None in particular.
Risk assessment
Medium.
Correctness risks
Automated tests currently cover these:
But they do not cover:
Handling of null/omitted fields is always a bit hazardous with Pydantic, and this migration introduces points where something can go wrong with that. This omission is tracked as RSS-213.
Mitigating this risk is the fact that we've been running this scheme in production, experimentally, since #13425, and it apparently hasn't caused any problems there.
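The null-vs-omitted hazard can be illustrated without Pydantic at all, using plain json as a stand-in: once a document is parsed, an explicit null and a missing field often look identical to application code, so a migration that parses and re-serializes must decide which form to write back.

```python
# Minimal illustration of the null/omitted ambiguity across a parse and
# re-serialize round trip. Plain json stands in for Pydantic here.
import json

explicit_null = json.loads('{"errors": null}')  # field present, value null
omitted = json.loads('{}')                      # field absent entirely

# Both look identical through the usual accessor...
assert explicit_null.get("errors") is None
assert omitted.get("errors") is None

# ...but the stored documents differ, so re-serializing naively can
# silently change which form is on disk.
assert "errors" in explicit_null
assert "errors" not in omitted
```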
Performance risks
Prior testing has shown parsing JSON to take about 1.7x as long as parsing our existing pickled dicts. So retrieving runs and commands may get slower. (Retrieving analyses is unaffected because of #13425).
Manual testing will have to bear out whether this is a problem in practice. I expect #14286 and #14348 to go a long way. If we need a nuclear option, we can do something like #13425 again.
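The kind of measurement behind the ~1.7x figure can be sketched as a micro-benchmark like the one below. This is illustrative, not the original test; the ratio will vary with document shape and machine, so no specific number is asserted.

```python
# Rough micro-benchmark sketch: deserializing the same document from
# pickle vs. JSON. The payload shape is made up for illustration.
import json
import pickle
import timeit

doc = {"commands": [{"id": str(i), "status": "succeeded"} for i in range(1000)]}
as_pickle = pickle.dumps(doc)
as_json = json.dumps(doc)

pickle_time = timeit.timeit(lambda: pickle.loads(as_pickle), number=100)
json_time = timeit.timeit(lambda: json.loads(as_json), number=100)
print(f"json/pickle parse-time ratio: {json_time / pickle_time:.2f}")
```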
The migration can take a long time. I think this is unavoidable to some extent and we just need to communicate it. I've done what I can here to mitigate this by combining this migration with #14348. RSS-217 makes this worse, but we can fix that separately.
Risk tolerance
We should take seriously the risks listed above, and mitigate them as much as we have time for, but we also need to weigh them against the risk of not merging this.