feat(api): Pause run after recoverable errors #14646

Merged
SyntaxColoring merged 14 commits into edge from nonfatal_errors on Mar 21, 2024

@SyntaxColoring SyntaxColoring commented Mar 13, 2024

Overview

This closes most of EXEC-301 and closes EXEC-320.

Changelog

What works now (or is intended to work, anyway)

  • When running a JSON protocol on a Flex, if a pickUpTip command fails because of a missing tip, the run pauses now, instead of homing and showing an error message.
  • You can then either:
    • Cancel the run in the usual way, by issuing a stop action via POST /runs/{id}/actions.
    • Resume the run from the next command, by issuing a resume-from-recovery action via POST /runs/{id}/actions.

What doesn't work

  • After a pickUpTip command failure, the robot seems to simultaneously think it does not have a tip attached (a subsequent aspirate command will fail with a "tip not attached" error) and that it does have a tip attached (a subsequent moveTo will act like there's a tip). I haven't had a chance to look into this yet. So the practical usefulness of this is pretty limited so far: you can only meaningfully recover if the protocol doesn't have any aspirates or dispenses.
  • After recovering from an error and proceeding through the rest of the run, the run will end as failed despite the recovery. I see what's causing this, I just haven't gotten to it yet.
  • You can't issue any fixup commands yet.
  • You can't recover from errors in Python protocols yet. EXEC-339 is the next step to accomplish that.
  • This isn't remotely supported by the Opentrons App or Flex on-device display. You'll see weird things like buttons without labels.

Test Plan

Setup

Enable the hidden error recovery feature flag by manually issuing a POST /settings request with {"id": "enableErrorRecoveryExperiments", "value": true}.
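
A minimal sketch of that request, using only the Python standard library. The port (31950) and the `Opentrons-Version` header are assumptions about the robot's HTTP API that this PR does not state, and `build_flag_request` and the robot address are hypothetical names for illustration.

```python
import json
import urllib.request


def build_flag_request(host: str, enabled: bool) -> urllib.request.Request:
    """Build (but do not send) the POST /settings request for the feature flag."""
    body = {"id": "enableErrorRecoveryExperiments", "value": enabled}
    return urllib.request.Request(
        url=f"http://{host}:31950/settings",  # port is an assumption
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Opentrons-Version": "*",  # header is an assumption
        },
        method="POST",
    )


request = build_flag_request("192.168.1.50", enabled=True)  # hypothetical address
print(request.get_method(), request.full_url, request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(request)
```

Passing `enabled=False` builds the request used in the cleanup step.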

Test

  1. Upload this test protocol to a Flex: Recovery test.json.zip. This comes from Protocol Designer, with aspirates and dispenses hand-replaced by moveToWells in order to work around the tip-attached confusion that I've described above.
  2. Fill random wells in the leftmost column of the tip rack with tips.
  3. Start the run.
  4. When the run reaches a missing tip, it should pause. Resume it by issuing a {"data": {"actionType": "resume-from-recovery"}} action to POST /runs/{id}/actions, or cancel it by issuing a {"data": {"actionType": "stop"}} action.
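
The two actions in step 4 could be issued with something like the following sketch. As before, the port and the `Opentrons-Version` header are assumptions, and `build_action_request`, the robot address, and the run ID are hypothetical. Only the two request bodies come from this description.

```python
import json
import urllib.request

# The two action types this test plan uses.
ALLOWED_ACTIONS = ("resume-from-recovery", "stop")


def build_action_request(host: str, run_id: str, action_type: str) -> urllib.request.Request:
    """Build (but do not send) a POST /runs/{id}/actions request."""
    if action_type not in ALLOWED_ACTIONS:
        raise ValueError(f"Unsupported action type: {action_type!r}")
    body = {"data": {"actionType": action_type}}
    return urllib.request.Request(
        url=f"http://{host}:31950/runs/{run_id}/actions",  # port is an assumption
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Opentrons-Version": "*",  # header is an assumption
        },
        method="POST",
    )


resume = build_action_request("192.168.1.50", "my-run-id", "resume-from-recovery")
print(resume.get_method(), resume.full_url, resume.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(resume)
```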

Cleanup

  1. Delete any runs that you just created, to avoid database incompatibilities when you push a different branch. You can do this with the Opentrons App or with manually-issued DELETE /runs/{id} requests.
  2. POST /settings with {"id": "enableErrorRecoveryExperiments", "value": false}.

Review requests

None in particular.

The easiest way to review this might be to filter by commit. I've organized this into refactor commits vs. feature commits.

Risk assessment

Medium. This touches core internal Protocol Engine logic. Although it's all technically well covered by unit tests, those tests are mostly too isolated and low-level to catch actual bugs.

codecov bot commented Mar 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 67.34%. Comparing base (583dcf6) to head (4221574).
Report is 129 commits behind head on edge.

Additional details and impacted files

@@            Coverage Diff             @@
##             edge   #14646      +/-   ##
==========================================
- Coverage   67.56%   67.34%   -0.23%     
==========================================
  Files        2521     2485      -36     
  Lines       72251    71342     -909     
  Branches     9311     9014     -297     
==========================================
- Hits        48815    48042     -773     
+ Misses      21240    21157      -83     
+ Partials     2196     2143      -53     
Flag             Coverage Δ
g-code-testing   92.43% <ø> (ø)

Files                                                       Coverage Δ
api/src/opentrons/ordered_set.py                            100.00% <ø> (ø)
...i/src/opentrons/protocol_engine/actions/actions.py       100.00% <ø> (ø)
...rons/protocol_engine/execution/command_executor.py       100.00% <ø> (ø)
...s/protocol_engine/execution/create_queue_worker.py       100.00% <ø> (ø)
...i/src/opentrons/protocol_engine/protocol_engine.py       100.00% <ø> (ø)
...pi/src/opentrons/protocol_engine/state/commands.py       99.36% <ø> (-0.09%) ⬇️
...opentrons/protocol_runner/legacy_command_mapper.py       98.33% <ø> (ø)

... and 220 files with indirect coverage changes

@DerekMaggio DerekMaggio left a comment

Took a look at your new commits.
I like your todos for future tickets. I am stealing that.

Not directly related to your PR, but the logic in handle_action of protocol_engine/state/commands.py is quite complex and easy to get lost in.
I wonder if there is a way to clean that up? Maybe by moving each piece of handling logic into its own function?
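
One way that kind of split could look, sketched with `functools.singledispatchmethod` from the standard library. The action and store classes here are simplified stand-ins, not the real Protocol Engine types, and this is only an illustration of the idea, not a proposal for the actual refactor.

```python
from dataclasses import dataclass, field
from functools import singledispatchmethod
from typing import List


@dataclass(frozen=True)
class QueueCommandAction:  # stand-in action type
    command_id: str


@dataclass(frozen=True)
class FailCommandAction:  # stand-in action type
    command_id: str


@dataclass
class CommandStore:  # stand-in for the real command store
    queued_command_ids: List[str] = field(default_factory=list)
    failed_command_ids: List[str] = field(default_factory=list)

    @singledispatchmethod
    def handle_action(self, action: object) -> None:
        """Fallback: ignore action types this store does not care about."""

    @handle_action.register
    def _handle_queue(self, action: QueueCommandAction) -> None:
        self.queued_command_ids.append(action.command_id)

    @handle_action.register
    def _handle_fail(self, action: FailCommandAction) -> None:
        # Record the failure and drop everything still waiting in the queue.
        self.failed_command_ids.append(action.command_id)
        self.queued_command_ids.clear()


store = CommandStore()
store.handle_action(QueueCommandAction("cmd-1"))
store.handle_action(FailCommandAction("cmd-1"))
print(store)
```

Each handler stays small, and `handle_action` itself no longer needs a long `if`/`elif` chain.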

Comment on lines +63 to +76
def _is_recoverable(failed_command: Command, exception: Exception) -> bool:
    if (
        failed_command.commandType == "pickUpTip"
        and isinstance(exception, EnumeratedError)
        # Hack(?): It seems like this should be ErrorCodes.TIP_PICKUP_FAILED, but that's
        # not what gets raised in practice.
        and exception.code == ErrorCodes.UNEXPECTED_TIP_REMOVAL
    ):
        return True
    else:
        return False
Contributor

I can foresee this function getting huge. As we add more recoverable states, will this eventually be replaced with a more standardized approach?

Contributor Author

Yeah, I'm also worried about this, and I don't know.

The opposite way to organize this would be to distribute this choice across the command implementations. So, instead of PickUpTipImplementation returning an error, and this deciding whether that error is recoverable, PickUpTipImplementation would return an error and a recoverable boolean (or something like that).
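
A rough sketch of that alternative, where the command implementation reports recoverability alongside the error. Everything here is hypothetical: `CommandFailure`, the `execute` signature, and `should_pause_for_recovery` are illustration only, and `ErrorCodes` is a stand-in that borrows just the member names mentioned in this PR.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class ErrorCodes(Enum):
    """Stand-in enum; only the member names come from this PR."""

    UNEXPECTED_TIP_REMOVAL = auto()
    TIP_PICKUP_FAILED = auto()


@dataclass(frozen=True)
class CommandFailure:
    """Hypothetical result type: the error plus the implementation's own verdict."""

    code: ErrorCodes
    recoverable: bool


class PickUpTipImplementation:
    """Toy implementation; the real one talks to hardware."""

    def execute(self, tip_present: bool) -> Optional[CommandFailure]:
        if tip_present:
            return None  # success
        # The command itself declares that a missing tip is recoverable.
        return CommandFailure(code=ErrorCodes.UNEXPECTED_TIP_REMOVAL, recoverable=True)


def should_pause_for_recovery(failure: Optional[CommandFailure]) -> bool:
    """Engine-side check that needs no per-command knowledge."""
    return failure is not None and failure.recoverable


result = PickUpTipImplementation().execute(tip_present=False)
print(result, should_pause_for_recovery(result))
```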

Contributor

Is it true that any executed command that failed and is recoverable would throw an EnumeratedError exception?

If so, could the command implementations be required to have an is_recoverable(error_code: ErrorCodes) method?
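
For illustration, such a requirement might be expressed as an abstract method. This is a sketch under assumptions: the base class, the `HomeImplementation` example, and the `ErrorCodes` stand-in are hypothetical, and only the `is_recoverable(error_code: ErrorCodes)` shape comes from the question above.

```python
from abc import ABC, abstractmethod
from enum import Enum, auto


class ErrorCodes(Enum):
    """Stand-in enum, not the real ErrorCodes."""

    TIP_PICKUP_FAILED = auto()
    UNEXPECTED_TIP_REMOVAL = auto()
    GENERAL_ERROR = auto()


class CommandImplementation(ABC):
    """Hypothetical base class that forces every command to answer the question."""

    @abstractmethod
    def is_recoverable(self, error_code: ErrorCodes) -> bool:
        ...


class PickUpTipImplementation(CommandImplementation):
    def is_recoverable(self, error_code: ErrorCodes) -> bool:
        # Mirrors the single case this PR treats as recoverable.
        return error_code is ErrorCodes.UNEXPECTED_TIP_REMOVAL


class HomeImplementation(CommandImplementation):
    def is_recoverable(self, error_code: ErrorCodes) -> bool:
        return False  # nothing recoverable for this command in the sketch


print(PickUpTipImplementation().is_recoverable(ErrorCodes.UNEXPECTED_TIP_REMOVAL))
```

A subclass that forgets to define `is_recoverable` cannot be instantiated, which is what makes the method a requirement rather than a convention.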

Contributor Author

@SyntaxColoring SyntaxColoring Mar 19, 2024

There are probably places where that's not true today, but we can (and might have to) aim for that to become true as part of this project.

@SyntaxColoring SyntaxColoring marked this pull request as ready for review March 19, 2024 18:51
@SyntaxColoring SyntaxColoring requested a review from a team as a code owner March 19, 2024 18:51
# on the pipette.) However, this currently does the opposite,
# acting as if the command never executed.
type=self._error_recovery_policy(
# todo: should possibly be command_with_new_notes
Contributor

can you explain this todo?

Contributor Author

# todo: should possibly be command_with_new_notes

☝️ This one looks left over from development, sorry. I'll remove it.

# todo(mm, 2024-03-13):
# When a command fails recoverably, and we handle it with
# WAIT_FOR_RECOVERY or CONTINUE, we want to update our logical
# protocol state as if the command succeeded. (e.g. if a tip
# pickup failed, pretend that it succeeded and that the tip is now
# on the pipette.) However, this currently does the opposite,
# acting as if the command never executed.

☝️ This one is related to this point from EXEC-301:

  • The failed command should affect the engine's logical state as if it had succeeded, per Slack discussion [link]

Because this is a command failure, not a success, Protocol Engine will not currently accomplish that point.

@@ -288,13 +312,25 @@ def handle_action(self, action: Action) -> None: # noqa: C901
command_id=id, failed_at=action.failed_at, error_occurrence=None
)
self._state.queued_setup_command_ids.clear()
elif (
prev_entry.command.intent == CommandIntent.PROTOCOL
or prev_entry.command.intent is None
Contributor

sorry I might be confusing something but when is this set to None? I thought we use a default value for this

Contributor Author

🤷 The model defines it as defaulting to None. It might have to do with wanting HTTP commands to default to "setup" but other commands to default to "protocol".

command_id=id, failed_at=action.failed_at, error_occurrence=None
)
self._state.queued_command_ids.clear()
assert_never(prev_entry.command.intent)
Contributor

this is new to me :-) good to know about this guy!

Contributor Author

I think it's probably "new" with our recent mypy and typing-extensions updates!
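
For anyone else meeting `assert_never` for the first time, here is a small sketch of the exhaustiveness-checking pattern. `CommandIntent` is a stand-in enum and `queue_to_clear` is a hypothetical function that loosely mirrors the branch structure in the diff above.

```python
from enum import Enum
from typing import Optional

try:
    from typing import assert_never  # Python 3.11+
except ImportError:
    from typing_extensions import assert_never


class CommandIntent(Enum):  # stand-in for the real enum
    PROTOCOL = "protocol"
    SETUP = "setup"


def queue_to_clear(intent: Optional[CommandIntent]) -> str:
    """Return the name of the queue that a failed command should clear."""
    if intent is CommandIntent.SETUP:
        return "queued_setup_command_ids"
    elif intent is CommandIntent.PROTOCOL or intent is None:
        return "queued_command_ids"
    else:
        # A type checker reports an error here if a new CommandIntent member
        # is added without a matching branch above. At runtime, reaching this
        # line raises an AssertionError.
        assert_never(intent)


print(queue_to_clear(None))
```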

@TamarZanzouri TamarZanzouri left a comment

a few comments but makes total sense and it looks great! nice job!

@SyntaxColoring
Contributor Author

I've run through the test plan in the PR description, and also run a couple of other random protocols as a smoke test to make sure this didn't break anything else.

@SyntaxColoring SyntaxColoring merged commit 6b652c2 into edge Mar 21, 2024
22 checks passed
@SyntaxColoring SyntaxColoring deleted the nonfatal_errors branch March 21, 2024 15:04
4 participants