-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(robot-server): HTTP API for "Ignore error and skip to next step" #16564
Conversation
|
||
* If you've tried to recover from the error by sending your own `intent: "fixit"` | ||
commands to `POST /runs/{id}/commands`, use `"resume-from-recovery"`. It's your | ||
responsibility to ensure your `POST`ed commands leave the robot in a good-enough | ||
state to continue with the protocol. | ||
|
||
* Otherwise, use `"resume-from-recovery-assuming-false-positive"`. | ||
|
||
Do not combine `intent: "fixit"` commands with | ||
`"resume-from-recovery-assuming-false-positive"`—the robot's built-in | ||
false-positive recovery may compete with your own. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This warning against combining intent: "fixit"
commands with "resume-from-recovery-assuming-false-positive"
is because I've thought of an unfortunate danger. This is a contrived example, but bear with me and suppose:
- A
moveLabware
command fails to move something into slot A1. - We enter error recovery mode.
- A client runs a
fixit
command that places some other labware into slot A1, for some reason. - Then the client uses
"resume-from-recovery-assuming-false-positive"
. Given how feat(api): Allow treating errors as false-positives (ignore them and continue with the run) #16556 works, this applies the "missing part" of the state update from step 1, to move the original labware into slot A1. But this conflicts with thefixit
command from step 3.
This is worse than the normal brand of weird error recovery interactions, because it bypasses our usual validation layer and possibly messes up internal state. (Because the conflict happens deep, at the level of the Protocol Engine StateStore
s.) A spectrum of possible effects, from luckiest to unluckiest:
- It has no adverse effect.
- It raises some obscure internal
AssertionError
. - It somehow messes up internal state and does who-knows-what to subsequent commands.
- It makes the run hang.
This gets a lot easier if we can forbid clients from combining fixit
commands with resume-from-recovery-assuming-false-positive
run actions. As in, forbid with an error, not just forbid with documentation like I've done here. But I'm not sure we can do that—don't we home unconditionally at the beginning of error recovery?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't we home unconditionally at the beginning of error recovery?
Yes, we do. I can see a future world in which we do other clean-up fixit commands on the client side before resume-from-recovery-assuming-false-positive
, too.
In my personal opinion, it is likely better not to be overly prescriptive with what combinations of recovery commands/actions a client can take. We currently operate under that philosophy for the rest of Error Recovery, and while documenting this behavior is useful, this feels like a spot in which we can just extend that same philosophy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree with @mjhuff here. I think this is addressable with a "Please do not do this" type warning, pretty strong language about not executing fixit commands that actually do things and warnings that the buyer must beware if they're going to try - and I think that for the audience of this interface, i.e. us and people who are trying to do weird things with the API, that's a reasonable ask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if people try it and they run into really weird errors, then our goal should be to make the errors less weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, okay.
In my personal opinion, it is likely better not to be overly prescriptive with what combinations of recovery commands/actions a client can take. We currently operate under that philosophy for the rest of Error Recovery...
Right, but we've been able to operate under that philosophy for the rest of Error Recovery because we've constructed boundaries that can keep the robot decently-behaved even when weird stuff happens. e.g. the run controls will always work, and robot state won't tear, and user-visible error messages will have been written to be read by users. And what I'm talking about here is that my implementation of EXEC-676, so far, subverts those very boundaries.
I think that for the audience of this interface, i.e. us and people who are trying to do weird things with the API, that's a reasonable ask.
It's reasonable for us now, but having seen stuff like RQA-2934 and EXEC-760, doesn't this seem like a nuance that will get lost as the frontend code continues to evolve, especially under the hands of different people?
To be clear, I’m totally down to ship this in v8.2. I just want to make sure we're on the same page about accepting the mess I’ve made of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't think it's that much of a mess. Error recovery, broadly, is a thing that is very hard to reason through. It makes sense to me that are a whole ton of very important caveats about the way that you're allowed to handle problems like this. The choice that we're making is
- the backend allows a very general interface that may allow you to do dangerous or incorrect things
- structure is handled and provided by the frontend
I think that this is a good and acceptable tradeoff to allow new and surprising uses of the mechanism. If there's a thing to be improved, in my opinion it is how we signal errors that come from corruption of internal state and trying to get the actions that the robot does in that case to better logically flow from actions. We're okay with having surprising things happen sometimes to people writing error recovery flows; the goal is that after they think about it for a second, they say "okay I guess that makes sense".
RESUME_FROM_RECOVERY_ASSUMING_FALSE_POSITIVE = ( | ||
"resume-from-recovery-assuming-false-positive" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming: Anything more concise than resume-from-recovery-assuming-false-positive
?
I do like that the current name specifically claims that this is for false-positives, instead of saying something vaguer like "resume and fix up state." I think it's a helpfully clarifying constraint for how backend engineers should implement commands and their errors going forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry! im confused with the name and not really sure about its intent :-(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the big RunActionType
docstring help? Otherwise let's talk about it in-person.
..., | ||
description="How to handle errors matched by this rule.", | ||
) | ||
ifMatch: ReactionIfMatch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you remove the Field
constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Redoc, the description
here was covering up the more detailed description from the ReactionIfMatch
docstring. I'm not sure if that's a Redoc problem or FastAPI problem or Pydantic problem, but this was a quick fix.
This seems a little more friendly to UI tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
|
||
* If you've tried to recover from the error by sending your own `intent: "fixit"` | ||
commands to `POST /runs/{id}/commands`, use `"resume-from-recovery"`. It's your | ||
responsibility to ensure your `POST`ed commands leave the robot in a good-enough | ||
state to continue with the protocol. | ||
|
||
* Otherwise, use `"resume-from-recovery-assuming-false-positive"`. | ||
|
||
Do not combine `intent: "fixit"` commands with | ||
`"resume-from-recovery-assuming-false-positive"`—the robot's built-in | ||
false-positive recovery may compete with your own. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't think it's that much of a mess. Error recovery, broadly, is a thing that is very hard to reason through. It makes sense to me that are a whole ton of very important caveats about the way that you're allowed to handle problems like this. The choice that we're making is
- the backend allows a very general interface that may allow you to do dangerous or incorrect things
- structure is handled and provided by the frontend
I think that this is a good and acceptable tradeoff to allow new and surprising uses of the mechanism. If there's a thing to be improved, in my opinion it is how we signal errors that come from corruption of internal state and trying to get the actions that the robot does in that case to better logically flow from actions. We're okay with having surprising things happen sometimes to people writing error recovery flows; the goal is that after they think about it for a second, they say "okay I guess that makes sense".
Merging to unblock PR #16556. If anyone has suggestions for better names, that's an easy change to make before Monday. |
Overview
This sets the HTTP API for EXEC-676. Closes EXEC-783 and EXEC-784.
Test Plan and Hands on Testing
Nothing much to test yet, since this doesn't yet connect to anything on the frontend (EXEC-791) or backend (EXEC-785).
Changelog
Add a new action type to
POST /runs/{id}/actions
.This kind of thing theoretically requires a database schema bump. I think we're safe without doing anything special here because we're already on a bumped schema since the last release, so when we release this in v8.2, it won't break when users downgrade to v8.1. Also, the
/runs
code has atry
/except
around the part that parses these action types, so even if we didn't have the protection of a schema bump, it would, at worst, show up in the UI as a "bad run".Add a matching
reactionIfMatch
value toPUT /runs/{id}/errorRecoveryPolicy
.See the docstrings in the code for details.
Review requests
See comments below.
Risk assessment
Low.