
feat(robot-server): Use a BadRun when we cant load #14711

Merged: 9 commits merged into edge, Mar 21, 2024

Conversation

sfoster1 (Member) commented Mar 21, 2024:

Up to now, if there's a run saved in the persistence layer that cannot be loaded - which is typically because it contains data from a version of the robot server or api package that whatever's currently running can't handle - we error when trying to retrieve it. That includes both a 500 error when trying to access that particular run, which clients can broadly handle, and a 500 error when trying to list all runs, which clients cannot. Without being able to list out all runs, there's no way for clients to find the problematic run - the IDs are UUIDs and cannot be enumerated - and remove it. The only recourse is to delete all the run storage.

A different way to handle this problem is to consider a "bad run", a run whose run metadata or engine state summary cannot be loaded, as a first class entity that can be returned from run access endpoints wherever a run could be, without an HTTP level error. This is done everywhere for consistency in this commit, though the argument could be made that it should only be done in the list-all-runs access and other endpoints should continue to error.

This bad run contains error information about the cause of the invalid data using a new enumerated error. The bad run will carry all the information that could be loaded - in effect, if the state summary is bad then the run metadata will still be present, and the ID should generally be accessible.

Closes EXEC-344

Review requests

  • normal stuff
  • from an app perspective, does this seem reasonable to work with?

Testing

  • Put it on some robots and see that we get some useful data
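As a hedged illustration of what "reasonable to work with" might look like from an app perspective (not part of this PR; the function names are hypothetical), a client consuming the list-all-runs endpoint could separate good and bad runs by the presence of the `dataError` field on each run:

```python
from typing import Any, Dict, List, Tuple


def is_bad_run(run: Dict[str, Any]) -> bool:
    """A run is "bad" when the server attached a dataError explaining
    why its stored data could not be fully loaded."""
    return run.get("dataError") is not None


def partition_runs(
    runs: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split a GET /runs payload into (good, bad) runs, so a client can
    list the bad ones and offer to delete them individually."""
    good = [r for r in runs if not is_bad_run(r)]
    bad = [r for r in runs if is_bad_run(r)]
    return good, bad
```

This is exactly the recovery path the description motivates: once bad runs appear in the list response, a client can find their UUIDs and delete just those runs instead of wiping all run storage.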

@sfoster1 sfoster1 requested review from a team as code owners March 21, 2024 15:03
@sfoster1 sfoster1 requested a review from a team March 21, 2024 15:04
sfoster1 (Member Author) commented:

With a robot with a run using a state from the future, we get this:

```json
{
  "id": "312a3e2c-5156-45dd-b3de-ea84ff6734b1",
  "dataError": {
    "id": "RunLoadingError",
    "title": "Run Loading Error",
    "detail": "There was no engine state data for this run.",
    "meta": {
      "type": "InvalidStoredData",
      "code": "4008",
      "message": "There was no engine state data for this run.",
      "detail": {},
      "wrapping": []
    },
    "errorCode": "4008"
  },
  "createdAt": "2024-03-19T19:08:58.821381+00:00",
  "status": "stopped",
  "current": false,
  "actions": [],
  "errors": [],
  "pipettes": [],
  "modules": [],
  "labware": [],
  "liquids": [],
  "labwareOffsets": [],
  "protocolId": "8aa13211-066a-4b80-b24f-bac669cff09a"
}
```


codecov bot commented Mar 21, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 67.34%. Comparing base (935e84d) to head (464adb1).
Report is 42 commits behind head on edge.

Additional details and impacted files


```
@@            Coverage Diff             @@
##             edge   #14711      +/-   ##
==========================================
- Coverage   67.34%   67.34%   -0.01%     
==========================================
  Files        2485     2485              
  Lines       71355    71360       +5     
  Branches     9016     9016              
==========================================
+ Hits        48055    48058       +3     
- Misses      21157    21159       +2     
  Partials     2143     2143              
```

| Flag        | Coverage Δ                   |
| ----------- | ---------------------------- |
| shared-data | 75.93% <80.00%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
| ----- | ---------- |
| ...bot-server/robot_server/runs/router/base_router.py | 96.19% <ø> (-0.08%) ⬇️ |
| robot-server/robot_server/runs/run_models.py | 100.00% <ø> (ø) |
| robot-server/robot_server/runs/run_store.py | 100.00% <ø> (ø) |
| ...-data/python/opentrons_shared_data/errors/codes.py | 93.75% <100.00%> (+0.16%) ⬆️ |
| .../python/opentrons_shared_data/errors/exceptions.py | 60.81% <75.00%> (+0.14%) ⬆️ |

Comment on lines 34 to 104

```python
if run_resource.ok and isinstance(state_summary, StateSummary):
    return Run.construct(
        id=run_resource.run_id,
        protocolId=run_resource.protocol_id,
        createdAt=run_resource.created_at,
        actions=run_resource.actions,
        status=state_summary.status,
        errors=state_summary.errors,
        labware=state_summary.labware,
        labwareOffsets=state_summary.labwareOffsets,
        pipettes=state_summary.pipettes,
        modules=state_summary.modules,
        current=current,
        completedAt=state_summary.completedAt,
        startedAt=state_summary.startedAt,
        liquids=state_summary.liquids,
    )
else:
    errors: List[EnumeratedError] = []
    if isinstance(state_summary, BadStateSummary):
        state = StateSummary.construct(
            status=EngineStatus.STOPPED,
            errors=[],
            labware=[],
            labwareOffsets=[],
            pipettes=[],
            modules=[],
            liquids=[],
        )
        errors.append(state_summary.dataError)
    else:
        state = state_summary
    if not run_resource.ok:
        errors.append(run_resource.error)

    if len(errors) > 1:
        run_loading_error = RunLoadingError.from_exc(
            InvalidStoredData(
                message=(
                    "Data on this run is not valid. The run may have been "
                    "created on a future software version."
                ),
                wrapping=errors,
            )
        )
    elif errors:
        run_loading_error = RunLoadingError.from_exc(errors[0])
    else:
        # We should never get here
        run_loading_error = RunLoadingError.from_exc(
            AssertionError("Logic error in parsing invalid run.")
        )

    return BadRun.construct(
        dataError=run_loading_error,
        id=run_resource.run_id,
        protocolId=run_resource.protocol_id,
        createdAt=run_resource.created_at,
        actions=run_resource.actions,
        status=state.status,
        errors=state.errors,
        labware=state.labware,
        labwareOffsets=state.labwareOffsets,
        pipettes=state.pipettes,
        modules=state.modules,
        current=current,
        completedAt=state.completedAt,
        startedAt=state.startedAt,
        liquids=state.liquids,
    )
```
DerekMaggio (Contributor) commented Mar 21, 2024:
Take a look at this version of the logic.
It's less nested, passes all the tests, and doesn't have a case where we get a logic error.

It does make the assumption, if you have no errors, you didn't have a bad run. Is that correct?

Suggested change (replacing the nested version above with an unnested one):

```python
errors: List[EnumeratedError] = []
if isinstance(state_summary, BadStateSummary):
    state = StateSummary.construct(
        status=EngineStatus.STOPPED,
        errors=[],
        labware=[],
        labwareOffsets=[],
        pipettes=[],
        modules=[],
        liquids=[],
    )
    errors.append(state_summary.dataError)
else:
    state = state_summary
if not run_resource.ok:
    errors.append(run_resource.error)
if len(errors) == 0:
    return Run.construct(
        id=run_resource.run_id,
        protocolId=run_resource.protocol_id,
        createdAt=run_resource.created_at,
        actions=run_resource.actions,
        status=state_summary.status,
        errors=state_summary.errors,
        labware=state_summary.labware,
        labwareOffsets=state_summary.labwareOffsets,
        pipettes=state_summary.pipettes,
        modules=state_summary.modules,
        current=current,
        completedAt=state_summary.completedAt,
        startedAt=state_summary.startedAt,
        liquids=state_summary.liquids,
    )
if len(errors) == 1:
    run_loading_error = RunLoadingError.from_exc(errors[0])
else:
    run_loading_error = RunLoadingError.from_exc(
        InvalidStoredData(
            message=(
                "Data on this run is not valid. The run may have been "
                "created on a future software version."
            ),
            wrapping=errors,
        )
    )
return BadRun.construct(
    dataError=run_loading_error,
    id=run_resource.run_id,
    protocolId=run_resource.protocol_id,
    createdAt=run_resource.created_at,
    actions=run_resource.actions,
    status=state.status,
    errors=state.errors,
    labware=state.labware,
    labwareOffsets=state.labwareOffsets,
    pipettes=state.pipettes,
    modules=state.modules,
    current=current,
    completedAt=state.completedAt,
    startedAt=state.startedAt,
    liquids=state.liquids,
)
```

sfoster1 (Member Author) replied:
I think I'll unnest it but I really like having the no-error case first

mjhuff (Contributor) left a comment:
Nice! From an app perspective, this does seem reasonable to work with.

(Review threads on robot-server/robot_server/runs/run_models.py and robot-server/robot_server/runs/run_store.py were resolved.)
```diff
-) -> PydanticResponse[SimpleBody[Run]]:
+) -> PydanticResponse[SimpleBody[Union[Run, BadRun]]]:
```
Contributor commented:

Technically, this is a breaking HTTP API change, right? A client that was doing response.data.actions.length, for example, will now error. I guess the right way to deal with that would be:

  • For Opentrons-Version ≥ n, return BadRuns as you are now.
  • For Opentrons-Version < n, simply filter out BadRuns.

I'm also happy for this to be deemed not worthwhile to worry about.
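The version-gating idea above could be sketched roughly as follows. Everything here is hypothetical: the PR does not implement this, the cutoff value n is never specified in the thread, and the function and parameter names are invented for illustration:

```python
from typing import Any, Dict, List


def runs_for_client(
    all_runs: List[Dict[str, Any]],
    client_version: int,
    badrun_cutoff: int,
) -> List[Dict[str, Any]]:
    """Return BadRun entries only to clients new enough to understand
    them; older clients simply never see runs carrying a dataError."""
    if client_version >= badrun_cutoff:
        return all_runs
    return [r for r in all_runs if r.get("dataError") is None]
```

The trade-off, as the thread notes, is that filtering hides the problematic runs from older clients entirely, which reintroduces the original discoverability problem for them.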

sfoster1 (Member Author) replied:

To me that's an argument for making the actions not Optional[] and having it be an empty list instead

Contributor replied:

Yeah, that works in this case. It wouldn't work if there were ever a problem reading a scalar like createdAt, but that's not a problem that we have right now.

Comment on lines 255 to +257
```diff
 async def get_run(
     run_data: Run = Depends(get_run_data_from_url),
-) -> PydanticResponse[SimpleBody[Run]]:
+) -> PydanticResponse[SimpleBody[Union[Run, BadRun]]]:
```
Contributor commented:

> though the argument could be made that it should only be done in the list-all-runs access and other endpoints should continue to error.

Yeah, I'm of that opinion. Can GET /runs/{id} return an HTTP 500 whenever it returns a BadRun?

My thinking is that we have a lot of Python integration test code (and maybe also JS client code) that does stuff like:

```python
run_response = client.get_run()
run_response.raise_if_not_http_ok()
```

And I think it's more correct for it not to proceed as normal in these cases.

mjhuff (Contributor) commented Mar 21, 2024:

For what it's worth, the client side doesn't currently have any conditional behavior based on an HTTP error for GET /runs/{id}, but there could be something more implicit that I'm missing.

I don't have any strong convictions on this either way. I think having the "all runs" and "this run" resources behave uniformly would be preferred, but if it's going to greatly interfere with existing code (or at least make us think something insidious could break), then sure, I'm of the opinion we can continue to error.

sfoster1 (Member Author) replied:

The integration test coverage is enough that I did in fact have to change some behavior, and I think that's good.

I do think that semantically, what we're doing here is adding this new kind of resource, and getting a resource successfully - which this now counts as - gives you a 200. I think we move "there was invalid run data" problem out of the realm of an HTTP API concern and into the system that API provides access to and modeling of.

sfoster1 (Member Author) added:

> My thinking is that we have a lot of Python integration test code (and maybe also JS client code) that does stuff like:
>
>     run_response = client.get_run()
>     run_response.raise_if_not_http_ok()

I'm also quite happy to change all this stuff. Where is it?

Contributor replied:

> I'm also quite happy to change all this stuff. Where is it?

At least some of the uses of RobotClient.get_run() (haven't looked at them all yet, sorry)

@sfoster1 sfoster1 requested a review from a team as a code owner March 21, 2024 20:42
@sfoster1 sfoster1 requested review from smb2268 and removed request for a team March 21, 2024 20:42
mjhuff (Contributor) approved:

Types look good with passing CI.

@sfoster1 sfoster1 merged commit e265610 into edge Mar 21, 2024
37 checks passed
@sfoster1 sfoster1 deleted the exec-344-bad-run-records branch March 21, 2024 21:07
sfoster1 added a commit that referenced this pull request Mar 25, 2024
…#14723)

# Overview

Follow-ups for
#14711 (comment).

#14711 added safer error propagation for when robot-server encounters
bad stored run data. As part of that, if it finds a run where the
`state_summary` SQL column is `NULL`, it treats that as bad data and
propagates the error to HTTP clients.

If you restart the robot while there is an active run, no state summary will be inserted (this only happens when the run is ended and the state moves from the engine to SQL), and the run will be bad.

We say that this is in fact a bad run because to the client, there is no
distinction between state summary and run. A run with an empty state
summary does not have correct data and does not represent what occurred.

Add a regression test to make sure this is how we handle runs that did
not have state summaries persisted.

# Testing
- [x] create a run with this branch on a flex and restart the flex (or
kill the robot server process - this isn't about the details of when
things are written to disk, just the lifetime of the data here) and see
that the run is now bad

---------

Co-authored-by: Seth Foster <[email protected]>
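The rule the follow-up commit describes can be sketched as below. `BadStateSummary` is a real type in the PR, but this standalone dataclass and the `summary_from_row` function are illustrative assumptions, not the actual robot-server code:

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class BadStateSummary:
    """Stand-in for the PR's BadStateSummary: records why the stored
    engine state for a run could not be loaded."""
    data_error: str


def summary_from_row(
    state_summary_json: Optional[str],
) -> Union[str, BadStateSummary]:
    # A NULL state_summary column means the engine state was never
    # persisted (e.g. the server restarted while the run was active),
    # so the run is surfaced as bad rather than silently incomplete.
    if state_summary_json is None:
        return BadStateSummary(
            data_error="There was no engine state data for this run."
        )
    return state_summary_json  # the real code parses and validates here
```

This matches the regression test the commit adds: a run row whose `state_summary` column is `NULL` must come back as a bad run, not as a run with an empty summary.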
Carlos-fernandez pushed a commit that referenced this pull request May 20, 2024 (carrying the original commit message above; Co-authored-by: Max Marrone <[email protected]>).
Carlos-fernandez pushed a commit that referenced this pull request May 20, 2024 (carrying the follow-up commit message for #14723 above).
4 participants