
feat(robot-server): Use a BadRun when we cant load #14711

Merged: 9 commits merged into edge, Mar 21, 2024

Conversation

sfoster1 (Member) commented Mar 21, 2024:

Up to now, if there's a run saved in the persistence layer that cannot be loaded - which is typically because it contains data from a version of the robot server or api package that whatever's currently running can't handle - we error when trying to retrieve it. That includes both a 500 error when trying to access that particular run, which clients can broadly handle, and a 500 error when trying to list all runs, which clients cannot. Without being able to list out all runs, there's no way for clients to find the problematic run - the IDs are UUIDs and cannot be enumerated - and remove it. The only recourse is to delete all the run storage.

A different way to handle this problem is to consider a "bad run", a run whose run metadata or engine state summary cannot be loaded, as a first class entity that can be returned from run access endpoints wherever a run could be, without an HTTP level error. This is done everywhere for consistency in this commit, though the argument could be made that it should only be done in the list-all-runs access and other endpoints should continue to error.

This bad run contains error information about the cause of the invalid data using a new enumerated error. The bad run will carry all the information that could be loaded - in effect, if the state summary is bad then the run metadata will still be present, and the ID should generally be accessible.

Closes EXEC-344

Review requests

  • normal stuff
  • from an app perspective, does this seem reasonable to work with?

Testing

  • Put it on some robots and see that we get some useful data
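As a hedged illustration of what "reasonable to work with" might look like from an app perspective (not part of this PR; the function names are hypothetical), a client consuming the list-all-runs endpoint could separate good and bad runs by the presence of the `dataError` field on each run:

```python
from typing import Any, Dict, List, Tuple


def is_bad_run(run: Dict[str, Any]) -> bool:
    """A run is "bad" when the server attached a dataError explaining
    why its stored data could not be fully loaded."""
    return run.get("dataError") is not None


def partition_runs(
    runs: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split a GET /runs payload into (good, bad) runs, so a client can
    list the bad ones and offer to delete them individually."""
    good = [r for r in runs if not is_bad_run(r)]
    bad = [r for r in runs if is_bad_run(r)]
    return good, bad
```

This is exactly the recovery path the description motivates: once bad runs appear in the list response, a client can find their UUIDs and delete just those runs instead of wiping all run storage.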

@sfoster1 sfoster1 requested review from a team as code owners March 21, 2024 15:03
@sfoster1 sfoster1 requested a review from a team March 21, 2024 15:04
sfoster1 (Member Author) commented:

With a robot with a run using a state from the future, we get this:

```json
{
  "id": "312a3e2c-5156-45dd-b3de-ea84ff6734b1",
  "dataError": {
    "id": "RunLoadingError",
    "title": "Run Loading Error",
    "detail": "There was no engine state data for this run.",
    "meta": {
      "type": "InvalidStoredData",
      "code": "4008",
      "message": "There was no engine state data for this run.",
      "detail": {},
      "wrapping": []
    },
    "errorCode": "4008"
  },
  "createdAt": "2024-03-19T19:08:58.821381+00:00",
  "status": "stopped",
  "current": false,
  "actions": [],
  "errors": [],
  "pipettes": [],
  "modules": [],
  "labware": [],
  "liquids": [],
  "labwareOffsets": [],
  "protocolId": "8aa13211-066a-4b80-b24f-bac669cff09a"
}
```


codecov bot commented Mar 21, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 67.34%. Comparing base (935e84d) to head (464adb1).
Report is 42 commits behind head on edge.

Additional details and impacted files


```
@@            Coverage Diff             @@
##             edge   #14711      +/-   ##
==========================================
- Coverage   67.34%   67.34%   -0.01%     
==========================================
  Files        2485     2485              
  Lines       71355    71360       +5     
  Branches     9016     9016              
==========================================
+ Hits        48055    48058       +3     
- Misses      21157    21159       +2     
  Partials     2143     2143              
```

| Flag        | Coverage Δ                   |
| ----------- | ---------------------------- |
| shared-data | 75.93% <80.00%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
| ----- | ---------- |
| ...bot-server/robot_server/runs/router/base_router.py | 96.19% <ø> (-0.08%) ⬇️ |
| robot-server/robot_server/runs/run_models.py | 100.00% <ø> (ø) |
| robot-server/robot_server/runs/run_store.py | 100.00% <ø> (ø) |
| ...-data/python/opentrons_shared_data/errors/codes.py | 93.75% <100.00%> (+0.16%) ⬆️ |
| .../python/opentrons_shared_data/errors/exceptions.py | 60.81% <75.00%> (+0.14%) ⬆️ |

Comment on lines 34 to 104

```python
if run_resource.ok and isinstance(state_summary, StateSummary):
    return Run.construct(
        id=run_resource.run_id,
        protocolId=run_resource.protocol_id,
        createdAt=run_resource.created_at,
        actions=run_resource.actions,
        status=state_summary.status,
        errors=state_summary.errors,
        labware=state_summary.labware,
        labwareOffsets=state_summary.labwareOffsets,
        pipettes=state_summary.pipettes,
        modules=state_summary.modules,
        current=current,
        completedAt=state_summary.completedAt,
        startedAt=state_summary.startedAt,
        liquids=state_summary.liquids,
    )
else:
    errors: List[EnumeratedError] = []
    if isinstance(state_summary, BadStateSummary):
        state = StateSummary.construct(
            status=EngineStatus.STOPPED,
            errors=[],
            labware=[],
            labwareOffsets=[],
            pipettes=[],
            modules=[],
            liquids=[],
        )
        errors.append(state_summary.dataError)
    else:
        state = state_summary
    if not run_resource.ok:
        errors.append(run_resource.error)

    if len(errors) > 1:
        run_loading_error = RunLoadingError.from_exc(
            InvalidStoredData(
                message=(
                    "Data on this run is not valid. The run may have been "
                    "created on a future software version."
                ),
                wrapping=errors,
            )
        )
    elif errors:
        run_loading_error = RunLoadingError.from_exc(errors[0])
    else:
        # We should never get here
        run_loading_error = RunLoadingError.from_exc(
            AssertionError("Logic error in parsing invalid run.")
        )

    return BadRun.construct(
        dataError=run_loading_error,
        id=run_resource.run_id,
        protocolId=run_resource.protocol_id,
        createdAt=run_resource.created_at,
        actions=run_resource.actions,
        status=state.status,
        errors=state.errors,
        labware=state.labware,
        labwareOffsets=state.labwareOffsets,
        pipettes=state.pipettes,
        modules=state.modules,
        current=current,
        completedAt=state.completedAt,
        startedAt=state.startedAt,
        liquids=state.liquids,
    )
```
DerekMaggio (Contributor) commented Mar 21, 2024:
Take a look at this version of the logic.
It's less nested, passes all the tests, and doesn't have a case where we get a logic error.

It does make the assumption, if you have no errors, you didn't have a bad run. Is that correct?

Suggested change (replacing the nested version above with an unnested one):

```python
errors: List[EnumeratedError] = []
if isinstance(state_summary, BadStateSummary):
    state = StateSummary.construct(
        status=EngineStatus.STOPPED,
        errors=[],
        labware=[],
        labwareOffsets=[],
        pipettes=[],
        modules=[],
        liquids=[],
    )
    errors.append(state_summary.dataError)
else:
    state = state_summary
if not run_resource.ok:
    errors.append(run_resource.error)
if len(errors) == 0:
    return Run.construct(
        id=run_resource.run_id,
        protocolId=run_resource.protocol_id,
        createdAt=run_resource.created_at,
        actions=run_resource.actions,
        status=state_summary.status,
        errors=state_summary.errors,
        labware=state_summary.labware,
        labwareOffsets=state_summary.labwareOffsets,
        pipettes=state_summary.pipettes,
        modules=state_summary.modules,
        current=current,
        completedAt=state_summary.completedAt,
        startedAt=state_summary.startedAt,
        liquids=state_summary.liquids,
    )
if len(errors) == 1:
    run_loading_error = RunLoadingError.from_exc(errors[0])
else:
    run_loading_error = RunLoadingError.from_exc(
        InvalidStoredData(
            message=(
                "Data on this run is not valid. The run may have been "
                "created on a future software version."
            ),
            wrapping=errors,
        )
    )
return BadRun.construct(
    dataError=run_loading_error,
    id=run_resource.run_id,
    protocolId=run_resource.protocol_id,
    createdAt=run_resource.created_at,
    actions=run_resource.actions,
    status=state.status,
    errors=state.errors,
    labware=state.labware,
    labwareOffsets=state.labwareOffsets,
    pipettes=state.pipettes,
    modules=state.modules,
    current=current,
    completedAt=state.completedAt,
    startedAt=state.startedAt,
    liquids=state.liquids,
)
```

sfoster1 (Member Author) replied:
I think I'll unnest it but I really like having the no-error case first

mjhuff (Contributor) left a comment:
Nice! From an app perspective, this does seem reasonable to work with.

(Review threads on robot-server/robot_server/runs/run_models.py and robot-server/robot_server/runs/run_store.py were resolved.)
```diff
-) -> PydanticResponse[SimpleBody[Run]]:
+) -> PydanticResponse[SimpleBody[Union[Run, BadRun]]]:
```
Contributor commented:

Technically, this is a breaking HTTP API change, right? A client that was doing response.data.actions.length, for example, will now error. I guess the right way to deal with that would be:

  • For Opentrons-Version ≥ n, return BadRuns as you are now.
  • For Opentrons-Version < n, simply filter out BadRuns.

I'm also happy for this to be deemed not worthwhile to worry about.
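The version-gating idea above could be sketched roughly as follows. Everything here is hypothetical: the PR does not implement this, the cutoff value n is never specified in the thread, and the function and parameter names are invented for illustration:

```python
from typing import Any, Dict, List


def runs_for_client(
    all_runs: List[Dict[str, Any]],
    client_version: int,
    badrun_cutoff: int,
) -> List[Dict[str, Any]]:
    """Return BadRun entries only to clients new enough to understand
    them; older clients simply never see runs carrying a dataError."""
    if client_version >= badrun_cutoff:
        return all_runs
    return [r for r in all_runs if r.get("dataError") is None]
```

The trade-off, as the thread notes, is that filtering hides the problematic runs from older clients entirely, which reintroduces the original discoverability problem for them.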

sfoster1 (Member Author) replied:

To me that's an argument for making the actions not Optional[] and having it be an empty list instead

Contributor replied:

Yeah, that works in this case. It wouldn't work if there were ever a problem reading a scalar like createdAt, but that's not a problem that we have right now.

Comment on lines 255 to +257
```diff
 async def get_run(
     run_data: Run = Depends(get_run_data_from_url),
-) -> PydanticResponse[SimpleBody[Run]]:
+) -> PydanticResponse[SimpleBody[Union[Run, BadRun]]]:
```
Contributor commented:

> though the argument could be made that it should only be done in the list-all-runs access and other endpoints should continue to error.

Yeah, I'm of that opinion. Can GET /runs/{id} return an HTTP 500 whenever it returns a BadRun?

My thinking is that we have a lot of Python integration test code (and maybe also JS client code) that does stuff like:

```python
run_response = client.get_run()
run_response.raise_if_not_http_ok()
```

And I think it's more correct for it not to proceed as normal in these cases.

mjhuff (Contributor) commented Mar 21, 2024:

For what it's worth, the client side doesn't currently have any conditional behavior based on an HTTP error for GET /runs/{id}, but there could be something more implicit that I'm missing.

I don't have any strong convictions on this either way. I think having the "all runs" and "this run" resources behave uniformly would be preferred, but if it's going to greatly interfere with existing code (or at least make us think something insidious could break), then sure, I'm of the opinion we can continue to error.

sfoster1 (Member Author) replied:

The integration test coverage is enough that I did in fact have to change some behavior, and I think that's good.

I do think that semantically, what we're doing here is adding this new kind of resource, and getting a resource successfully - which this now counts as - gives you a 200. I think we move "there was invalid run data" problem out of the realm of an HTTP API concern and into the system that API provides access to and modeling of.

sfoster1 (Member Author) added:

> My thinking is that we have a lot of Python integration test code (and maybe also JS client code) that does stuff like:
>
>     run_response = client.get_run()
>     run_response.raise_if_not_http_ok()

I'm also quite happy to change all this stuff. Where is it?

Contributor replied:

> I'm also quite happy to change all this stuff. Where is it?

At least some of the uses of RobotClient.get_run() (haven't looked at them all yet, sorry)

@sfoster1 sfoster1 requested a review from a team as a code owner March 21, 2024 20:42
@sfoster1 sfoster1 requested review from smb2268 and removed request for a team March 21, 2024 20:42
mjhuff (Contributor) approved:

Types look good with passing CI.

@sfoster1 sfoster1 merged commit e265610 into edge Mar 21, 2024
37 checks passed
@sfoster1 sfoster1 deleted the exec-344-bad-run-records branch March 21, 2024 21:07
sfoster1 added a commit that referenced this pull request Mar 25, 2024
…#14723)

# Overview

Follow-ups for
#14711 (comment).

#14711 added safer error propagation for when robot-server encounters
bad stored run data. As part of that, if it finds a run where the
`state_summary` SQL column is `NULL`, it treats that as bad data and
propagates the error to HTTP clients.

If you restart the robot while there is an active run, no state summary will be inserted (this only happens when the run is ended and the state moves from the engine to SQL), and the run will be bad.

We say that this is in fact a bad run because to the client, there is no
distinction between state summary and run. A run with an empty state
summary does not have correct data and does not represent what occurred.

Add a regression test to make sure this is how we handle runs that did
not have state summaries persisted.

# Testing
- [x] create a run with this branch on a flex and restart the flex (or
kill the robot server process - this isn't about the details of when
things are written to disk, just the lifetime of the data here) and see
that the run is now bad

---------

Co-authored-by: Seth Foster <[email protected]>
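The rule the follow-up commit describes can be sketched as below. `BadStateSummary` is a real type in the PR, but this standalone dataclass and the `summary_from_row` function are illustrative assumptions, not the actual robot-server code:

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class BadStateSummary:
    """Stand-in for the PR's BadStateSummary: records why the stored
    engine state for a run could not be loaded."""
    data_error: str


def summary_from_row(
    state_summary_json: Optional[str],
) -> Union[str, BadStateSummary]:
    # A NULL state_summary column means the engine state was never
    # persisted (e.g. the server restarted while the run was active),
    # so the run is surfaced as bad rather than silently incomplete.
    if state_summary_json is None:
        return BadStateSummary(
            data_error="There was no engine state data for this run."
        )
    return state_summary_json  # the real code parses and validates here
```

This matches the regression test the commit adds: a run row whose `state_summary` column is `NULL` must come back as a bad run, not as a run with an empty summary.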
Carlos-fernandez pushed a commit that referenced this pull request May 20, 2024 (carrying the original commit message above; Co-authored-by: Max Marrone <[email protected]>).
Carlos-fernandez pushed a commit that referenced this pull request May 20, 2024 (carrying the follow-up commit message for #14723 above).
4 participants