feat(robot-server): Use a BadRun when we can't load #14711
Conversation
Up to now, if there's a run saved in the persistence layer that cannot be loaded - which is typically because it contains data from a version of the robot server or api package that whatever's currently running can't handle - we error when trying to retrieve it. That includes both a 500 error when trying to access that particular run, which clients can broadly handle, and a 500 error when trying to list all runs, which clients cannot. Without being able to list out all runs, there's no way for clients to find the problematic run - the IDs are UUIDs and cannot be enumerated - and remove it. The only recourse is to delete all the run storage.

A different way to handle this problem is to consider a "bad run", a run whose run metadata or engine state summary cannot be loaded, as a first-class entity that can be returned from run access endpoints wherever a run could be, without an HTTP-level error. This is done everywhere for consistency in this commit, though the argument could be made that it should only be done in the list-all-runs access and other endpoints should continue to error.

This bad run contains error information about the cause of the invalid data, using a new enumerated error. The bad run will carry all the information that could be loaded - in effect, if the state summary is bad, then the run metadata will still be present, and the ID should generally be accessible.

Closes EXEC-344
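The shape of the idea can be sketched as follows. This is a hypothetical minimal model, not the actual robot-server classes - the real `Run` and `BadRun` are Pydantic models with many more fields, and the loading logic lives in the run data manager - but it shows the key property: a run that fails to load still appears in the listing, with its ID and a `dataError`, instead of breaking the whole list.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class RunDataError:
    """Enumerated error describing why stored run data could not be loaded."""
    code: str
    detail: str


@dataclass
class Run:
    """A fully loaded run."""
    id: str
    createdAt: str
    status: str


@dataclass
class BadRun:
    """A run whose stored data could not be loaded; still listable by ID."""
    id: str
    dataError: RunDataError
    createdAt: Optional[str] = None
    status: str = "stopped"


def load_all(records: List[dict]) -> List[Union[Run, BadRun]]:
    """Return every stored run, substituting a BadRun where loading fails."""
    out: List[Union[Run, BadRun]] = []
    for rec in records:
        try:
            out.append(Run(id=rec["id"], createdAt=rec["createdAt"], status=rec["status"]))
        except KeyError:
            # Whatever metadata did load is carried along; only the cause of
            # the failure is reported as a dataError.
            out.append(
                BadRun(
                    id=rec["id"],
                    createdAt=rec.get("createdAt"),
                    dataError=RunDataError(
                        code="invalidStoredData",
                        detail="Data on this run is not valid.",
                    ),
                )
            )
    return out
```

The point of the union return type is that listing never raises: clients always get every run ID back, so a problematic run can be found and deleted individually.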
With a robot that has a run using a state from the future, we get this:
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             edge   #14711      +/-   ##
==========================================
- Coverage   67.34%   67.34%   -0.01%
==========================================
  Files        2485     2485
  Lines       71355    71360       +5
  Branches     9016     9016
==========================================
+ Hits        48055    48058       +3
- Misses      21157    21159       +2
  Partials     2143     2143
```

Flags with carried forward coverage won't be shown.
```python
if run_resource.ok and isinstance(state_summary, StateSummary):
    return Run.construct(
        id=run_resource.run_id,
        protocolId=run_resource.protocol_id,
        createdAt=run_resource.created_at,
        actions=run_resource.actions,
        status=state_summary.status,
        errors=state_summary.errors,
        labware=state_summary.labware,
        labwareOffsets=state_summary.labwareOffsets,
        pipettes=state_summary.pipettes,
        modules=state_summary.modules,
        current=current,
        completedAt=state_summary.completedAt,
        startedAt=state_summary.startedAt,
        liquids=state_summary.liquids,
    )
else:
    errors: List[EnumeratedError] = []
    if isinstance(state_summary, BadStateSummary):
        state = StateSummary.construct(
            status=EngineStatus.STOPPED,
            errors=[],
            labware=[],
            labwareOffsets=[],
            pipettes=[],
            modules=[],
            liquids=[],
        )
        errors.append(state_summary.dataError)
    else:
        state = state_summary
    if not run_resource.ok:
        errors.append(run_resource.error)

    if len(errors) > 1:
        run_loading_error = RunLoadingError.from_exc(
            InvalidStoredData(
                message=(
                    "Data on this run is not valid. The run may have been "
                    "created on a future software version."
                ),
                wrapping=errors,
            )
        )
    elif errors:
        run_loading_error = RunLoadingError.from_exc(errors[0])
    else:
        # We should never get here
        run_loading_error = RunLoadingError.from_exc(
            AssertionError("Logic error in parsing invalid run.")
        )

    return BadRun.construct(
        dataError=run_loading_error,
        id=run_resource.run_id,
        protocolId=run_resource.protocol_id,
        createdAt=run_resource.created_at,
        actions=run_resource.actions,
        status=state.status,
        errors=state.errors,
        labware=state.labware,
        labwareOffsets=state.labwareOffsets,
        pipettes=state.pipettes,
        modules=state.modules,
        current=current,
        completedAt=state.completedAt,
        startedAt=state.startedAt,
        liquids=state.liquids,
    )
```
Take a look at this version of the logic. It's less nested, passes all the tests, and doesn't have a case where we get a logic error. It does make the assumption that if you have no errors, you didn't have a bad run. Is that correct?
```python
errors: List[EnumeratedError] = []
if isinstance(state_summary, BadStateSummary):
    state = StateSummary.construct(
        status=EngineStatus.STOPPED,
        errors=[],
        labware=[],
        labwareOffsets=[],
        pipettes=[],
        modules=[],
        liquids=[],
    )
    errors.append(state_summary.dataError)
else:
    state = state_summary
if not run_resource.ok:
    errors.append(run_resource.error)
if len(errors) == 0:
    return Run.construct(
        id=run_resource.run_id,
        protocolId=run_resource.protocol_id,
        createdAt=run_resource.created_at,
        actions=run_resource.actions,
        status=state_summary.status,
        errors=state_summary.errors,
        labware=state_summary.labware,
        labwareOffsets=state_summary.labwareOffsets,
        pipettes=state_summary.pipettes,
        modules=state_summary.modules,
        current=current,
        completedAt=state_summary.completedAt,
        startedAt=state_summary.startedAt,
        liquids=state_summary.liquids,
    )
if len(errors) == 1:
    run_loading_error = RunLoadingError.from_exc(errors[0])
else:
    run_loading_error = RunLoadingError.from_exc(
        InvalidStoredData(
            message=(
                "Data on this run is not valid. The run may have been "
                "created on a future software version."
            ),
            wrapping=errors,
        )
    )
return BadRun.construct(
    dataError=run_loading_error,
    id=run_resource.run_id,
    protocolId=run_resource.protocol_id,
    createdAt=run_resource.created_at,
    actions=run_resource.actions,
    status=state.status,
    errors=state.errors,
    labware=state.labware,
    labwareOffsets=state.labwareOffsets,
    pipettes=state.pipettes,
    modules=state.modules,
    current=current,
    completedAt=state.completedAt,
    startedAt=state.startedAt,
    liquids=state.liquids,
)
```
I think I'll unnest it, but I really like having the no-error case first.
Nice! From an app perspective, this does seem reasonable to work with.
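From that app perspective, handling the new union type might look like the sketch below: split the `GET /runs` listing on the presence of a `dataError` field, which is how a `BadRun` distinguishes itself in this PR. The dict shapes are simplified from the real JSON payloads, and `partition_runs` is a hypothetical helper, not an existing client function.

```python
from typing import List, Tuple


def partition_runs(runs: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split a run listing into loadable runs and bad runs.

    A bad run is identified by the presence of a `dataError` field
    (field name taken from the BadRun model in this PR).
    """
    ok = [r for r in runs if "dataError" not in r]
    bad = [r for r in runs if "dataError" in r]
    return ok, bad
```

A client could then show normal runs as usual and, for each entry in `bad`, offer a "delete this corrupted run" action using its ID - which is exactly what the old 500-on-list behavior made impossible.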
```diff
-) -> PydanticResponse[SimpleBody[Run]]:
+) -> PydanticResponse[SimpleBody[Union[Run, BadRun]]]:
```
Technically, this is a breaking HTTP API change, right? A client that was doing `response.data.actions.length`, for example, will now error. I guess the right way to deal with that would be:

- For `Opentrons-Version` ≥ n, return `BadRun`s as you are now.
- For `Opentrons-Version` < n, simply filter out `BadRun`s.
I'm also happy for this to be deemed not worthwhile to worry about.
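That gating idea could be sketched roughly like this. Everything here is illustrative: the version numbers, dict shapes, and `runs_for_client` helper are invented for the sketch, not part of robot-server's actual `Opentrons-Version` handling.

```python
from typing import List


def runs_for_client(
    all_runs: List[dict],
    client_version: int,
    min_bad_run_version: int,
) -> List[dict]:
    """Hypothetical version gate for the /runs listing.

    Clients declaring a new enough Opentrons-Version see BadRun entries
    (identified here by a `dataError` field); older clients get the
    listing with bad runs filtered out, preserving the old schema.
    """
    if client_version >= min_bad_run_version:
        return all_runs
    return [r for r in all_runs if "dataError" not in r]
```

The trade-off is that old clients silently lose visibility into the corrupted runs, which is the same blind spot the PR is trying to remove, just without the 500.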
To me that's an argument for making the actions not `Optional[]` and having it be an empty list instead.
Yeah, that works in this case. It wouldn't work if there were ever a problem reading a scalar like `createdAt`, but that's not a problem that we have right now.
```diff
 async def get_run(
     run_data: Run = Depends(get_run_data_from_url),
-) -> PydanticResponse[SimpleBody[Run]]:
+) -> PydanticResponse[SimpleBody[Union[Run, BadRun]]]:
```
> though the argument could be made that it should only be done in the list-all-runs access and other endpoints should continue to error.
Yeah, I'm of that opinion. Can `GET /runs/{id}` return an HTTP 500 whenever it returns a `BadRun`?
My thinking is that we have a lot of Python integration test code (and maybe also JS client code) that does stuff like:

```python
run_response = client.get_run()
run_response.raise_if_not_http_ok()
```

And I think it's more correct for it not to proceed as normal in these cases.
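If `BadRun`s do come back as HTTP 200, that test pattern would need a body check alongside the status check, something like the sketch below. The helper name and payload shape are illustrative, not the real `RobotClient` API; the only assumption carried over from the PR is that a bad run announces itself with a `dataError` field.

```python
class BadRunError(Exception):
    """Raised when a 200 response actually carries a BadRun payload."""


def raise_if_bad_run(response_json: dict) -> dict:
    """Hypothetical integration-test helper.

    With BadRun returned as HTTP 200, a plain raise-for-status no longer
    catches corrupt runs, so tests would also inspect the body for a
    dataError field and fail loudly if one is present.
    """
    data = response_json["data"]
    if "dataError" in data:
        raise BadRunError(data["dataError"].get("detail", "bad run"))
    return data
```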
For what it's worth, the client side doesn't currently have any conditional behavior based on an HTTP error for `GET /runs/{id}`, but there could be something more implicit that I'm missing.
I don't have any strong convictions on this either way. I think having the "all runs" and "this run" resources behave uniformly would be preferred, but if it's going to greatly interfere with existing code (or at least make us think something insidious could break), then sure, I'm of the opinion we can continue to error.
The integration test coverage is enough that I did in fact have to change some behavior, and I think that's good.
I do think that semantically, what we're doing here is adding this new kind of resource, and getting a resource successfully - which this now counts as - gives you a 200. I think we move the "there was invalid run data" problem out of the realm of an HTTP API concern and into the system that the API provides access to and models.
> My thinking is that we have a lot of Python integration test code (and maybe also JS client code) that does stuff like:
>
> ```python
> run_response = client.get_run()
> run_response.raise_if_not_http_ok()
> ```

I'm also quite happy to change all this stuff. Where is it?
> I'm also quite happy to change all this stuff. Where is it?
At least some of the uses of `RobotClient.get_run()` (haven't looked at them all yet, sorry).
Co-authored-by: Max Marrone <[email protected]>
Types look good with passing CI.
…#14723)

# Overview

Follow-ups for #14711 (comment). #14711 added safer error propagation for when robot-server encounters bad stored run data. As part of that, if it finds a run where the `state_summary` SQL column is `NULL`, it treats that as bad data and propagates the error to HTTP clients.

When you restart the robot while there is an active run, no state summary will be inserted (this only happens when the run is ended and the state moves from engine to SQL), and the run will be bad. We say that this is in fact a bad run because, to the client, there is no distinction between state summary and run. A run with an empty state summary does not have correct data and does not represent what occurred.

Add a regression test to make sure this is how we handle runs that did not have state summaries persisted.

# Testing

- [x] Create a run with this branch on a Flex and restart the Flex (or kill the robot-server process - this isn't about the details of when things are written to disk, just the lifetime of the data here) and see that the run is now bad.

---

Co-authored-by: Seth Foster <[email protected]>
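The follow-up's rule - a missing summary is treated the same as an unparseable one - can be sketched as below. This is a simplification: the real code deserializes a JSON column into a model rather than returning dicts, and the function name is invented for illustration.

```python
from typing import Optional


def summarize(state_summary_row: Optional[str]) -> dict:
    """Sketch: classify a stored state_summary column value.

    A NULL column (e.g. the server was restarted mid-run, so the summary
    was never written at run end) is reported as bad data, exactly like a
    summary that fails to parse, because to the client a run without a
    summary does not represent what occurred.
    """
    if state_summary_row is None:
        return {"ok": False, "error": "no state summary was stored for this run"}
    return {"ok": True, "summary": state_summary_row}
```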