feat(api): Pause run after recoverable errors #14646

SyntaxColoring · 2024-03-13T13:57:10Z

Overview

This closes most of EXEC-301 and closes EXEC-320.

Changelog

What works now (or is intended to work, anyway)

When running a JSON protocol on a Flex, if a pickUpTip command fails because of a missing tip, the run now pauses instead of homing and showing an error message.
You can then either:
- Cancel the run in the usual way, by issuing a stop action via POST /runs/{id}/actions.
- Resume the run from the next command, by issuing a resume-from-recovery action via POST /runs/{id}/actions.
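
The two ways out of the paused state can be pictured as a tiny state machine. This is a minimal sketch, not the Protocol Engine code: `SketchStatus`, its `STOPPED` member, and `next_status` are hypothetical names made up for illustration. Only the `awaiting-recovery` status and the two action types come from this PR.

```python
import enum


class SketchStatus(enum.Enum):
    """Hypothetical, simplified run states (not the real QueueStatus enum)."""

    RUNNING = "running"
    AWAITING_RECOVERY = "awaiting-recovery"
    STOPPED = "stopped"


# The only action types this PR accepts while awaiting recovery.
_RECOVERY_TRANSITIONS = {
    "resume-from-recovery": SketchStatus.RUNNING,
    "stop": SketchStatus.STOPPED,
}


def next_status(current: SketchStatus, action_type: str) -> SketchStatus:
    """Return the status that follows an action issued while awaiting recovery."""
    if current is not SketchStatus.AWAITING_RECOVERY:
        raise ValueError("this sketch only models the awaiting-recovery state")
    if action_type not in _RECOVERY_TRANSITIONS:
        raise ValueError(f"{action_type!r} is not allowed while awaiting recovery")
    return _RECOVERY_TRANSITIONS[action_type]
```

Any other action type, such as a hypothetical `pause`, is rejected in this sketch, mirroring the "but nothing else" rule in the commit message.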

What doesn't work

- After a pickUpTip command failure, the robot seems to think both that it does not have a tip attached (a subsequent aspirate command will fail with a "tip not attached" error) and that it does (a subsequent moveTo will act like there's a tip). I haven't had a chance to look into this yet, so the practical usefulness of this is pretty limited so far: you can only meaningfully recover if the protocol doesn't have any aspirates or dispenses.
- After recovering from an error and proceeding through the rest of the run, the run will end as failed despite the recovery. I see what's causing this; I just haven't gotten to it yet.
- You can't issue any fixup commands yet.
- You can't recover from errors in Python protocols yet. EXEC-339 is the next step to accomplish that.
- This isn't remotely supported by the Opentrons App or Flex on-device display. You'll see weird things like buttons without labels.

Test Plan

Setup

Enable the hidden error recovery feature flag by manually issuing a POST /settings request with {"id": "enableErrorRecoveryExperiments", "value": true}.
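
If you'd rather script the setup step, something like the following might do. It is a sketch under assumptions: the port (31950), the `Opentrons-Version` header, and the helper name `build_feature_flag_request` are not stated in this PR, so check them against your robot. The function only builds the request, so it is safe to run without a robot.

```python
import json
import urllib.request

# Assumptions for illustration: port and version header are not from this PR.
ROBOT_PORT = 31950


def build_feature_flag_request(host: str, enabled: bool) -> urllib.request.Request:
    """Build (but do not send) the POST /settings request for the recovery flag."""
    body = json.dumps(
        {"id": "enableErrorRecoveryExperiments", "value": enabled}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{ROBOT_PORT}/settings",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "Opentrons-Version": "*"},
    )
```

To actually send it, pass the result to `urllib.request.urlopen()`. Calling the helper with `enabled=False` builds the matching request for turning the flag back off.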

Test

Upload this test protocol to a Flex: Recovery test.json.zip. This comes from Protocol Designer, with aspirates and dispenses hand-replaced by moveToWells in order to work around the tip-attached confusion that I've described above.
Fill random wells in the leftmost column of the tip rack with tips.
Start the run.
When the run reaches a missing tip, it should pause. Resume it by issuing a {"data": {"actionType": "resume-from-recovery"}} action to POST /runs/{id}/actions, or cancel it by issuing a {"data": {"actionType": "stop"}} action.
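
For the last step, here is a small sketch of building the two action requests. As before, the port, the `Opentrons-Version` header, and the helper name `build_action_request` are assumptions for illustration. The endpoint and the JSON bodies are the ones given above.

```python
import json
import urllib.request

# Assumptions for illustration: port and version header are not from this PR.
ROBOT_PORT = 31950
RECOVERY_ACTIONS = ("resume-from-recovery", "stop")


def build_action_request(
    host: str, run_id: str, action_type: str
) -> urllib.request.Request:
    """Build (but do not send) a POST /runs/{id}/actions request."""
    if action_type not in RECOVERY_ACTIONS:
        raise ValueError(f"unsupported action for this sketch: {action_type!r}")
    body = json.dumps({"data": {"actionType": action_type}}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{ROBOT_PORT}/runs/{run_id}/actions",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "Opentrons-Version": "*"},
    )
```

Send the returned request with `urllib.request.urlopen()` once the run has paused. The run ID is whatever the server assigned when the run was created.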

Cleanup

Delete any runs that you just created, to avoid database incompatibilities when you push a different branch. You can do this with the Opentrons App or with manually-issued DELETE /runs/{id} requests.
Disable the feature flag again by issuing a POST /settings request with {"id": "enableErrorRecoveryExperiments", "value": false}.

Review requests

None in particular.

The easiest way to review this might be to filter by commit. I've organized this into refactor commits vs. feature commits.

Risk assessment

Medium. This touches core internal Protocol Engine logic. Although it's all technically well covered by unit tests, those tests are mostly too isolated and low-level to catch actual bugs.

codecov · 2024-03-13T14:02:20Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 67.34%. Comparing base (583dcf6) to head (4221574).
Report is 129 commits behind head on edge.

Additional details and impacted files

@@            Coverage Diff             @@
##             edge   #14646      +/-   ##
==========================================
- Coverage   67.56%   67.34%   -0.23%     
==========================================
  Files        2521     2485      -36     
  Lines       72251    71342     -909     
  Branches     9311     9014     -297     
==========================================
- Hits        48815    48042     -773     
+ Misses      21240    21157      -83     
+ Partials     2196     2143      -53

Flag	Coverage Δ
g-code-testing	`92.43% <ø> (ø)`

Flags with carried forward coverage won't be shown.

Files	Coverage Δ
api/src/opentrons/ordered_set.py	`100.00% <ø> (ø)`
...i/src/opentrons/protocol_engine/actions/actions.py	`100.00% <ø> (ø)`
...rons/protocol_engine/execution/command_executor.py	`100.00% <ø> (ø)`
...s/protocol_engine/execution/create_queue_worker.py	`100.00% <ø> (ø)`
...i/src/opentrons/protocol_engine/protocol_engine.py	`100.00% <ø> (ø)`
...pi/src/opentrons/protocol_engine/state/commands.py	`99.36% <ø> (-0.09%)`	⬇️
...opentrons/protocol_runner/legacy_command_mapper.py	`98.33% <ø> (ø)`

... and 220 files with indirect coverage changes

api/src/opentrons/protocol_engine/actions/actions.py

DerekMaggio

Took a look at your new commits.
I like your todos for future tickets. I am stealing that.

Not directly related to your PR, but the logic in handle_action of protocol_engine/state/commands.py is quite complex and easy to get lost in.
I wonder if there is a way to clean that up? Maybe by moving the handling logic for each action into its own function?

DerekMaggio · 2024-03-19T13:52:48Z

api/src/opentrons/protocol_engine/error_recovery_policy.py

+def _is_recoverable(failed_command: Command, exception: Exception) -> bool:
+    if (
+        failed_command.commandType == "pickUpTip"
+        and isinstance(exception, EnumeratedError)
+        # Hack(?): It seems like this should be ErrorCodes.TIP_PICKUP_FAILED, but that's
+        # not what gets raised in practice.
+        and exception.code == ErrorCodes.UNEXPECTED_TIP_REMOVAL
+    ):
+        return True
+    else:
+        return False


I can foresee this function getting huge. As we add more recoverable states, will this eventually be replaced with a more standardized approach?

Yeah, I'm also worried about this, and I don't know.

The opposite way to organize this would be to distribute this choice across the command implementations. So, instead of PickUpTipImplementation returning an error, and this deciding whether that error is recoverable, PickUpTipImplementation would return an error and a recoverable boolean (or something like that).

Is it true that any executed command that fails recoverably would raise an EnumeratedError exception?

If so, could the command implementations be required to have an is_recoverable(error_code: ErrorCodes) method?

There are probably places where that's not true today, but we can (and might have to) aim for that to become true as part of this project.
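
To make the alternative concrete, here is a rough sketch of the per-command `is_recoverable(error_code)` idea. Nothing here is the PR's code: the `ErrorCodes` enum below is a two-member stand-in, and the `Protocol` class, `HomeImplementation`, the registry, and the module-level `is_recoverable` helper are all hypothetical.

```python
import enum
from typing import Dict, Protocol


class ErrorCodes(enum.Enum):
    """Hypothetical stand-in for the shared error code enum."""

    UNEXPECTED_TIP_REMOVAL = enum.auto()
    GENERAL_ERROR = enum.auto()


class SupportsRecoveryCheck(Protocol):
    """What each command implementation would be required to provide."""

    def is_recoverable(self, error_code: ErrorCodes) -> bool:
        ...


class PickUpTipImplementation:
    def is_recoverable(self, error_code: ErrorCodes) -> bool:
        # Same condition that the central _is_recoverable() checks today.
        return error_code is ErrorCodes.UNEXPECTED_TIP_REMOVAL


class HomeImplementation:
    def is_recoverable(self, error_code: ErrorCodes) -> bool:
        # In this sketch, home failures are never recoverable.
        return False


# Hypothetical registry keyed by commandType.
_IMPLEMENTATIONS: Dict[str, SupportsRecoveryCheck] = {
    "pickUpTip": PickUpTipImplementation(),
    "home": HomeImplementation(),
}


def is_recoverable(command_type: str, error_code: ErrorCodes) -> bool:
    """Ask the command's own implementation; unknown commands are not recoverable."""
    impl = _IMPLEMENTATIONS.get(command_type)
    return impl is not None and impl.is_recoverable(error_code)
```

The trade-off is the one discussed above: the central function stays small, but the policy is spread across every command implementation.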


TamarZanzouri · 2024-03-20T18:16:51Z

api/src/opentrons/protocol_engine/execution/command_executor.py

+                    # on the pipette.) However, this currently does the opposite,
+                    # acting as if the command never executed.
+                    type=self._error_recovery_policy(
+                        # todo: should possibly be command_with_new_notes


can you explain this todo?

# todo: should possibly be command_with_new_notes

☝️ This one looks left over from development, sorry. I'll remove it.

# todo(mm, 2024-03-13):
# When a command fails recoverably, and we handle it with
# WAIT_FOR_RECOVERY or CONTINUE, we want to update our logical
# protocol state as if the command succeeded. (e.g. if a tip
# pickup failed, pretend that it succeeded and that the tip is now
# on the pipette.) However, this currently does the opposite,
# acting as if the command never executed.

☝️ This one is related to this point from EXEC-301:

The failed command should affect the engine's logical state as if it had succeeded, per Slack discussion [link]

Because this is a command failure, not a success, Protocol Engine will not currently accomplish that point.
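
As a toy illustration of the difference the todo describes (not Protocol Engine's actual state handling), consider a logical state that only tracks whether a tip is attached. `ToyPipetteState` and `apply_failed_pick_up_tip` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyPipetteState:
    """Hypothetical logical state: only tracks whether a tip is attached."""

    tip_attached: bool = False


def apply_failed_pick_up_tip(
    state: ToyPipetteState, as_if_succeeded: bool
) -> ToyPipetteState:
    """Return the logical state after a recoverably failed pickUpTip.

    as_if_succeeded=True models the behavior the todo asks for.
    as_if_succeeded=False models the behavior the todo describes as current.
    """
    if as_if_succeeded:
        # Pretend the pickup worked: the tip is now on the pipette.
        return replace(state, tip_attached=True)
    # Act as if the command never executed: state is unchanged.
    return state
```

With `as_if_succeeded=False`, later commands in this toy model would still see no tip attached, which is the "acting as if the command never executed" outcome.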

TamarZanzouri · 2024-03-20T18:36:30Z

api/src/opentrons/protocol_engine/state/commands.py

@@ -288,13 +312,25 @@ def handle_action(self, action: Action) -> None:  # noqa: C901
                        command_id=id, failed_at=action.failed_at, error_occurrence=None
                    )
                self._state.queued_setup_command_ids.clear()
+            elif (
+                prev_entry.command.intent == CommandIntent.PROTOCOL
+                or prev_entry.command.intent is None


Sorry, I might be confusing something, but when is this set to None? I thought we use a default value for this.

🤷 The model defines it as defaulting to None. It might have to do with wanting HTTP commands to default to "setup" but other commands to default to "protocol".

TamarZanzouri · 2024-03-20T18:40:49Z

api/src/opentrons/protocol_engine/state/commands.py

-                        command_id=id, failed_at=action.failed_at, error_occurrence=None
-                    )
-                self._state.queued_command_ids.clear()
+                assert_never(prev_entry.command.intent)


this is new to me :-) good to know about this guy!

I think it's probably "new" with our recent mypy and typing-extensions updates!
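
For anyone else who hasn't met it, here is a small standalone sketch of the pattern. The `CommandIntent` enum below is a simplified stand-in and `queue_for_intent` is a hypothetical helper. The real branch bodies in handle_action do more than pick a queue name.

```python
import enum
from typing import Optional

try:
    from typing import assert_never  # Python 3.11+
except ImportError:
    from typing_extensions import assert_never


class CommandIntent(str, enum.Enum):
    """Simplified stand-in for the engine's intent enum."""

    PROTOCOL = "protocol"
    SETUP = "setup"


def queue_for_intent(intent: Optional[CommandIntent]) -> str:
    """Pick the queue a command belongs to, treating None like PROTOCOL."""
    if intent == CommandIntent.SETUP:
        return "queued_setup_command_ids"
    elif intent == CommandIntent.PROTOCOL or intent is None:
        return "queued_command_ids"
    else:
        # A type checker flags this line if a new enum member is added and
        # not handled above. At runtime it raises if ever reached.
        assert_never(intent)
```

The value is mostly at type-check time: mypy narrows `intent` in each branch, so the final `else` only type-checks when every case has been handled.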

TamarZanzouri

a few comments but makes total sense and it looks great! nice job!

SyntaxColoring · 2024-03-21T15:04:16Z

I've run through the test plan in the PR description, and also run a couple of other random protocols as a smoke test to make sure this didn't break anything else.

DerekMaggio reviewed Mar 13, 2024


SyntaxColoring force-pushed the nonfatal_errors branch from 73cc326 to 1c3d542 Compare March 18, 2024 22:22

DerekMaggio reviewed Mar 19, 2024


SyntaxColoring added 9 commits March 19, 2024 13:14

Refactor: Fix outdated queue_status docstring.

d8a0154

Refactor: Clarify get_command_is_final() and `get_all_commands_final()`.

7ec9bf7

Refactor: Move raise out of get_all_commands_final().

dfd2abb

This resolves an old todo, and it gives us space to extend it to exclude recoverable errors.

Add ErrorRecoveryPolicy and feature-flag-based implementation.

6ae0a74

Wire up ErrorRecoveryPolicy.

729d332

Update QueueStatus enum with new AWAITING_RECOVERY value.

4491e58

Allow resume-from-recovery and stop (but nothing else) while `awaiting-recovery`.

05c9986

Put the run into recovery mode when there's a recoverable error. Allow exiting with `resume-from-recovery` and `stop`.

641fff2

Update failed_command docstring.

44eeb21

SyntaxColoring force-pushed the nonfatal_errors branch from 1c3d542 to 44eeb21 Compare March 19, 2024 18:33

SyntaxColoring marked this pull request as ready for review March 19, 2024 18:51

SyntaxColoring requested a review from a team as a code owner March 19, 2024 18:51

SyntaxColoring requested a review from TamarZanzouri March 19, 2024 18:51

SyntaxColoring added 3 commits March 19, 2024 16:53

Update CommandStore tests.

6770571

More test boilerplate.

e0e1259

Add __repr__ to OrderedSet, for debugging.

bbb518a

SyntaxColoring force-pushed the nonfatal_errors branch from b4341be to bbb518a Compare March 19, 2024 20:57

Fix copy-pasted test docstring.

c04b36a

TamarZanzouri reviewed Mar 20, 2024


TamarZanzouri approved these changes Mar 20, 2024


Remove outdated todo.

4221574

SyntaxColoring merged commit 6b652c2 into edge Mar 21, 2024
22 checks passed

SyntaxColoring deleted the nonfatal_errors branch March 21, 2024 15:04

SyntaxColoring mentioned this pull request Apr 1, 2024

feat(api): Do not enqueue json commands on protocol load #14759

Merged

Carlos-fernandez pushed a commit that referenced this pull request May 20, 2024

feat(api): Pause run after recoverable errors (#14646)

6d926ad
