-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(api): Pause run after recoverable errors #14646
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## edge #14646 +/- ##
==========================================
- Coverage 67.56% 67.34% -0.23%
==========================================
Files 2521 2485 -36
Lines 72251 71342 -909
Branches 9311 9014 -297
==========================================
- Hits 48815 48042 -773
+ Misses 21240 21157 -83
+ Partials 2196 2143 -53
Flags with carried forward coverage won't be shown. Click here to find out more.
|
73cc326
to
1c3d542
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a look at your new commits.
I like your todos for future tickets. I am stealing that.
Not directly related to your PR, but the logic in handle_action of protocol_engine/state/commands.py is quite complex and easy to get lost in.
I wonder if there is a way to end up cleaning that up? Maybe deferring each of the handling logic to their own function?
def _is_recoverable(failed_command: Command, exception: Exception) -> bool: | ||
if ( | ||
failed_command.commandType == "pickUpTip" | ||
and isinstance(exception, EnumeratedError) | ||
# Hack(?): It seems like this should be ErrorCodes.TIP_PICKUP_FAILED, but that's | ||
# not what gets raised in practice. | ||
and exception.code == ErrorCodes.UNEXPECTED_TIP_REMOVAL | ||
): | ||
return True | ||
else: | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can foresee this function getting huge. As we add more recoverable states, will this eventually be replaced with a more standardized approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm also worried about this, and I don't know.
The opposite way to organize this would be to distribute this choice across the command implementations. So, instead of PickUpTipImplementation
returning an error, and this deciding whether that error is recoverable, PickUpTipImplementation
would return an error and a recoverable
boolean (or something like that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it true that any executed command, that failed and is recoverable, would throw an EnumeratedError exception?
If so, could the command implementations be required to have an is_recoverable(error_code: ErrorCodes)
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are probably places where that's not true today, but we can (and might have to) aim for that to become true as part of this project.
This resolves an old todo, and it gives us space to extend it to exclude recoverable errors.
…iting-recovery`.
…w exiting with `resume-from-recovery` and `stop`.
1c3d542
to
44eeb21
Compare
b4341be
to
bbb518a
Compare
# on the pipette.) However, this currently does the opposite, | ||
# acting as if the command never executed. | ||
type=self._error_recovery_policy( | ||
# todo: should possibly be command_with_new_notes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you explain this todo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# todo: should possibly be command_with_new_notes
☝️ This one looks left over from development, sorry. I'll remove it.
# todo(mm, 2024-03-13):
# When a command fails recoverably, and we handle it with
# WAIT_FOR_RECOVERY or CONTINUE, we want to update our logical
# protocol state as if the command succeeded. (e.g. if a tip
# pickup failed, pretend that it succeeded and that the tip is now
# on the pipette.) However, this currently does the opposite,
# acting as if the command never executed.
☝️ This one is related to this point from EXEC-301:
- The failed command should affect the engine's logical state as if it had succeeded, per Slack discussion [link]
Because this is a command failure, not a success, Protocol Engine will not currently accomplish that point.
@@ -288,13 +312,25 @@ def handle_action(self, action: Action) -> None: # noqa: C901 | |||
command_id=id, failed_at=action.failed_at, error_occurrence=None | |||
) | |||
self._state.queued_setup_command_ids.clear() | |||
elif ( | |||
prev_entry.command.intent == CommandIntent.PROTOCOL | |||
or prev_entry.command.intent is None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I might be confusing something but when is this set to None? I thought we use a default value for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷 The model defines it as defaulting to None
. It might have to do with wanting HTTP commands to default to "setup"
but other commands to default to "protocol"
.
command_id=id, failed_at=action.failed_at, error_occurrence=None | ||
) | ||
self._state.queued_command_ids.clear() | ||
assert_never(prev_entry.command.intent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is new to me :-) good to know about this guy!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's probably "new" with our recent mypy and typing-extensions updates!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few comments but makes total sense and it looks great! nice job!
I've run through the test plan in the PR description, and also run a couple of other random protocols as a smoke test to make sure this didn't break anything else. |
Overview
This closes most of EXEC-301 and closes EXEC-320.
Changelog
What works now (or is intended to work, anyway)
pickUpTip
command fails because of a missing tip, the run pauses now, instead of homing and showing an error message.stop
action viaPOST /runs/{id}/actions
.resume-from-recovery
action viaPOST /runs/{id}/actions
.What doesn't work
pickUpTip
command failure, the robot seems to simultaneously think it does not have a tip attached (a subsequentaspirate
command will fail with a "tip not attached" error) and that it does have a tip attached (a subsequentmoveTo
will act like there's a tip). I haven't had a chance to look into this yet. So the practical usefulness of this is pretty limited so far: you can only meaningfully recover if the protocol doesn't have any aspirates or dispenses.failed
despite the recovery. I see what's causing this, I just haven't gotten to it yet.Test Plan
Setup
Enable the hidden error recovery feature flag by manually issuing a
POST /settings
request with{"id": "enableErrorRecoveryExperiments", "value": true}
.Test
aspirate
s anddispense
s hand-replaced bymoveToWell
s in order to work around the tip-attached confusion that I've described above.{"data": {"actionType": "resume-from-recovery"}}
action toPOST /runs/{id}/actions
, or cancel it by issuing a{"data": {"actionType": "stop"}}
action.Cleanup
DELETE /runs/{id}
requests.POST /settings
with{"id": "enableErrorRecoveryExperiments", "value": false}
.Review requests
None in particular.
The easiest way to review this might be to filter by commit. I've organized this into refactor commits vs. feature commits.
Risk assessment
Medium. This touches core internal Protocol Engine logic. Although it's all technically well covered by unit tests, those tests are mostly too isolated and low-level to catch actual bugs.