feat(api): Pause when pick_up_tip() errors in a Python protocol #14753

Conversation
ParamSpec, my beloved.
Force-pushed from 34a759d to 33225f7
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             edge   #14753      +/-   ##
==========================================
- Coverage   67.24%   67.24%   -0.01%
==========================================
  Files        2495     2495
  Lines       71254    71253       -1
  Branches     8937     8937
==========================================
- Hits        47918    47917       -1
  Misses      21235    21235
  Partials     2101     2101

Flags with carried forward coverage won't be shown.
@SyntaxColoring, #14748 contains some error recovery protocols if you need any test cases.
async def add_and_execute_command_wait_for_recovery(
    self, request: commands.CommandCreate
) -> None:
I'm open to ideas for better names for this. Or maybe we should make it a wait_for_recovery: bool argument on the existing add_and_execute() method.
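For comparison, the flag-based alternative mentioned above might look like this signature sketch (mirroring the quoted method; this is not actual PR code):

# Hypothetical alternative shape: fold the recovery wait into the existing
# method behind a keyword flag, instead of adding a second entry point.
async def add_and_execute_command(
    self, request: commands.CommandCreate, wait_for_recovery: bool = False
) -> None:
    ...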
# FIX BEFORE MERGE: We should probably only set_last_location() if the
I disagree. I think there are three answers, each of which is honestly reasonable, but one of which is the best (and also the hardest). As a preamble, we should probably push last-location down to the engine.
- The best and hardest: base the decision to set the last location on when the error occurred (during which action) and what the error is. For instance, if you got a stall while moving to the tip rack, we should clear the last location, since the position is now indeterminate; but if we failed our tip-state consistency check after the pickup, we should set the last location to the rack.
- The simplest and most robust: always clear the last location on any error.
- A slightly more optimistic version: always set it on any error, since the most likely causes of failure will in fact leave the gantry in the specified location.
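To make the first option concrete, a rough sketch (all names here are illustrative placeholders, not actual engine API):

class StallOrCollision(Exception):
    """Placeholder for the engine's stall/collision error type."""

def last_location_after_error(error, failed_step, target_location):
    # Option 1 from the list above: decide per error and per failed action.
    if isinstance(error, StallOrCollision):
        return None  # position is indeterminate; clear the last location
    if failed_step == "tip-state-check":
        return target_location  # gantry is still at the tip rack
    return None  # otherwise fall back to the robust choice: clear it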
After looking at this closer, I think it's appropriate to call this set_last_location() unconditionally. See this new comment:
opentrons/api/src/opentrons/protocol_api/core/engine/instrument.py, lines 412 to 423 in 6b79625:
self._engine_client.pick_up_tip_wait_for_recovery(
    pipette_id=self._pipette_id,
    labware_id=labware_id,
    well_name=well_name,
    well_location=well_location,
)
# Set the "last location" unconditionally, even if the command failed
# and was recovered from and we don't know if the pipette is physically here.
# This isn't used for path planning, but rather for implicit destination
# selection like in `pipette.aspirate(location=None)`.
self._protocol_core.set_last_location(location=location, mount=self.get_mount())
# FIXME(mm, 2024-04-02): As of https://github.com/Opentrons/opentrons/pull/14726,
# this action is only legal if the command is running, not queued.
I think a lot of this estop() method is due for a rethink. Like, I don't really get why it needs to be so different from stop(), and why it needs to be messing with things like FailCommandActions itself.
Force-pushed from 3350202 to 895750f
Force-pushed from 895750f to 6b79625
This looks good overall, but a couple of points:
- I'm a bit concerned about the "last command to fail" logic in the command store. Can we instead save the command ID when it fails recoverably and we enter AWAITING_RECOVERY, and make that state more explicit?
- Something feels slightly off about the checks in ProtocolEngine.add_and_execute_command_wait_for_recovery, since they check two different pieces of state in tight sequence.
# which remains `queued` because of the pause.
# 3. The engine is stopped. The returned command will be `queued`
# and won't have a result.
raise ProtocolCommandFailedError(
Let's make this its own unique exception type and code so that we can programmatically differentiate it from other errors.
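A minimal sketch of what a dedicated exception type could look like (the name and code below are made up for illustration; the PR's actual type was added in 97b0ada, as noted below):

class RunStoppedBeforeCommandError(RuntimeError):
    """Hypothetical name: raised when the engine stops before the command completes."""

    error_code = "PROTOCOL_ENGINE_STOPPED"  # hypothetical stable code for programmatic matching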
Done in 97b0ada.
@@ -708,6 +708,15 @@ def get_all_commands_final(self) -> bool:

        return no_command_running and no_command_to_execute

    def get_recovery_in_progress_for_command(self, command_id: str) -> bool:
        """Return whether the given command failed and its error recovery is in progress."""
        command_is_most_recent_to_fail = (
This is a little gross to me. Could we save the command we're currently recovering from explicitly? Like, what happens if a fixit command fails? Wouldn't it instantly make this False?
Changed in a880096.
queued_command = self.add_command(request)
await self.wait_for_command(command_id=queued_command.id)
completed_command = self._state_store.commands.get(queued_command.id)
await self._state_store.wait_for_not(
It kind of feels like we want to gate this on a synchronous call to whether we're now in recovery for the command, but this in general feels a little race-condition-y because of the separation between command state and run state. Specifically:
- wait_for_command returns on the asyncio spin after a FailCommandAction or a SucceedCommandAction for this command (we can neglect the queued-and-stopping part for now)
- but the get_recovery_in_progress_for_command predicate is based on the queue status being awaiting-recovery

Do we guarantee mechanically that the queue status will be set before the asyncio spin after the command terminal action is dispatched? Are we sure this won't occasionally race and return early?
Yes, if I understand your concern correctly:
When await self.wait_for_command(command_id=queued_command.id) returns, we are guaranteed that the action that finalized the command has already been fully processed, and that get_recovery_in_progress_for_command() will see its results on the state.
When we handle an action, we send it to each of the substores in a loop. Only after that's done do we notify subscribers like this one.
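A minimal sketch of that ordering guarantee, with hypothetical names (the real dispatcher lives in Protocol Engine's state module and may differ):

class ActionDispatcher:
    """Hypothetical sketch: all substores update before any subscriber is woken."""

    def __init__(self, substores, subscribers):
        self._substores = substores
        self._subscribers = subscribers

    def dispatch(self, action) -> None:
        for substore in self._substores:
            substore.handle_action(action)  # state is fully updated first...
        for notify in self._subscribers:
            notify()  # ...so awakened waiters always read consistent state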
@@ -167,6 +167,7 @@ async def execute(self, command_id: str) -> None:
            FailCommandAction(
                error=error,
                command_id=running_command.id,
                running_command=running_command,
I guess I am wondering why we need this in the action? Don't we have the failed command stored in PE already?
It's a hack, we probably don't need it. See this comment:
opentrons/api/src/opentrons/protocol_engine/actions/actions.py, lines 177 to 180 in bd14851:
# This is a quick hack so FailCommandAction handlers can get the params of the
# command that failed. We probably want this to be a new "failure details"
# object instead, similar to how succeeded commands can send a "private result"
# to Protocol Engine internals.
Overall looks good. I added a question, and I'm a bit confused about the wait_for_not: I get what it's doing, but the logic around it is a bit confusing to me. Besides that, looks great!
recovery_target_command_id: Optional[str]
"""If we're currently recovering from a command failure, which command it was."""
Adding this as an orthogonal attribute is quick and dirty. Ideally, this ID would not exist at all when QueueStatus is anything other than AWAITING_RECOVERY. We could do that by restructuring this state so it's closer to a union of dataclasses instead of a dataclass of unions.
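For illustration, the "union of dataclasses" restructuring could look roughly like this (a sketch with hypothetical names, not the PR's actual types):

from dataclasses import dataclass
from typing import Union

@dataclass
class Running:
    """Queue is executing normally; no recovery target can exist here."""

@dataclass
class AwaitingRecovery:
    recovery_target_command_id: str  # guaranteed present in this state

QueueState = Union[Running, AwaitingRecovery]  # invalid combinations are unrepresentable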
Looks great, thank you for the changes!
Overview
Closes EXEC-346.
Test Plan

Setup
- Enable the hidden error recovery feature flag by manually issuing a POST /settings request with {"id": "enableErrorRecoveryExperiments", "value": true}. (One way to script this is sketched after this section.)

Test
- Upload and run this protocol.
- Send a stop or resume-from-recovery action to POST /runs/{id}/actions.
- After resume-from-recovery, the pick_up_tip() calls later on in the protocol should continue from the next wells in the tip rack. It should not repeatedly try to pick up tips from the same well.

Cleanup
- Disable the feature flag again with POST /settings and {"id": "enableErrorRecoveryExperiments", "value": false}.
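For example, the settings toggle above could be scripted like this (a sketch; the robot IP is a placeholder and the standard robot-server port 31950 is assumed):

import requests

ROBOT_IP = "192.168.1.100"  # replace with your robot's address

requests.post(
    f"http://{ROBOT_IP}:31950/settings",
    headers={"Opentrons-Version": "*"},  # robot-server expects a version header
    json={"id": "enableErrorRecoveryExperiments", "value": True},
)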
Changelog
- PAPI commands work by adding a command to ProtocolEngine and blocking until the command finishes executing. The main part of this PR is adding variants of that to also wait for the error recovery to finish. We reimplement ProtocolContext.pick_up_tip(), specifically, with that new variant. Other commands will happen later.
- ProtocolEngine is modified so that when a pickUpTip fails recoverably, it marks the requested tips as used. This is one way of getting the next PAPI pick_up_tip() to automatically move on to the next tip (illustrated in the sketch below). See discussion in the review comments.
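To illustrate the tip-tracking behavior described above, a hypothetical protocol (labware, pipette, and API level choices here are arbitrary examples):

from opentrons import protocol_api

metadata = {"apiLevel": "2.17"}

def run(ctx: protocol_api.ProtocolContext) -> None:
    tips = ctx.load_labware("opentrons_96_tiprack_300ul", "1")
    pipette = ctx.load_instrument("p300_single_gen2", "right", tip_racks=[tips])

    pipette.pick_up_tip()  # suppose this fails recoverably on well A1
    # ...operator recovers via POST /runs/{id}/actions...
    pipette.drop_tip()
    pipette.pick_up_tip()  # should take B1, not retry A1, since A1 was marked used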
Review requests
See my review comments.
Risk assessment
Low-risk when the enableErrorRecoveryExperiments feature flag is off. Medium-risk if it's on, since this starts venturing further into territory where Protocol Engine's understanding of the protocol history and state differs from PAPI's.