Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(robot-server): HTTP API for "Ignore error and skip to next step" #16564

Merged
merged 9 commits into from
Oct 24, 2024

Conversation

SyntaxColoring
Copy link
Contributor

@SyntaxColoring SyntaxColoring commented Oct 22, 2024

Overview

This sets the HTTP API for EXEC-676. Closes EXEC-783 and EXEC-784.

Test Plan and Hands on Testing

Nothing much to test yet, since this doesn't yet connect to anything on the frontend (EXEC-791) or backend (EXEC-785).

Changelog

  • Add a new action type to POST /runs/{id}/actions.

    This kind of thing theoretically requires a database schema bump. I think we're safe without doing anything special here because we're already on a bumped schema since the last release, so when we release this in v8.2, it won't break when users downgrade to v8.1. Also, the /runs code has a try/except around the part that parses these action types, so even if we didn't have the protection of a schema bump, it would, at worst, show up in the UI as a "bad run".

  • Add a matching reactionIfMatch value to PUT /runs/{id}/errorRecoveryPolicy.

See the docstrings in the code for details.

Review requests

See comments below.

Risk assessment

Low.

Comment on lines +36 to +46

* If you've tried to recover from the error by sending your own `intent: "fixit"`
commands to `POST /runs/{id}/commands`, use `"resume-from-recovery"`. It's your
responsibility to ensure your `POST`ed commands leave the robot in a good-enough
state to continue with the protocol.

* Otherwise, use `"resume-from-recovery-assuming-false-positive"`.

Do not combine `intent: "fixit"` commands with
`"resume-from-recovery-assuming-false-positive"`—the robot's built-in
false-positive recovery may compete with your own.
Copy link
Contributor Author

@SyntaxColoring SyntaxColoring Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This warning against combining intent: "fixit" commands with "resume-from-recovery-assuming-false-positive" is because I've thought of an unfortunate danger. This is a contrived example, but bear with me and suppose:

  1. A moveLabware command fails to move something into slot A1.
  2. We enter error recovery mode.
  3. A client runs a fixit command that places some other labware into slot A1, for some reason.
  4. Then the client uses "resume-from-recovery-assuming-false-positive". Given how feat(api): Allow treating errors as false-positives (ignore them and continue with the run) #16556 works, this applies the "missing part" of the state update from step 1, to move the original labware into slot A1. But this conflicts with the fixit command from step 3.

This is worse than the normal brand of weird error recovery interactions, because it bypasses our usual validation layer and possibly messes up internal state. (Because the conflict happens deep, at the level of the Protocol Engine StateStores.) A spectrum of possible effects, from luckiest to unluckiest:

  • It has no adverse effect.
  • It raises some obscure internal AssertionError.
  • It somehow messes up internal state and does who-knows-what to subsequent commands.
  • It makes the run hang.

This gets a lot easier if we can forbid clients from combining fixit commands with resume-from-recovery-assuming-false-positive run actions. As in, forbid with an error, not just forbid with documentation like I've done here. But I'm not sure we can do that—don't we home unconditionally at the beginning of error recovery?

Copy link
Contributor

@mjhuff mjhuff Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we home unconditionally at the beginning of error recovery?

Yes, we do. I can see a future world in which we do other clean-up fixit commands on the client side before resume-from-recovery-assuming-false-positive, too.

In my personal opinion, it is likely better not to be overly prescriptive with what combinations of recovery commands/actions a client can take. We currently operate under that philosophy for the rest of Error Recovery, and while documenting this behavior is useful, this feels like a spot in which we can just extend that same philosophy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree with @mjhuff here. I think this is addressable with a "Please do not do this" type warning, pretty strong language about not executing fixit commands that actually do things and warnings that the buyer must beware if they're going to try - and I think that for the audience of this interface, i.e. us and people who are trying to do weird things with the API, that's a reasonable ask.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if people try it and they run into really weird errors, then our goal should be to make the errors less weird.

Copy link
Contributor Author

@SyntaxColoring SyntaxColoring Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, okay.

In my personal opinion, it is likely better not to be overly prescriptive with what combinations of recovery commands/actions a client can take. We currently operate under that philosophy for the rest of Error Recovery...

Right, but we've been able to operate under that philosophy for the rest of Error Recovery because we've constructed boundaries that can keep the robot decently-behaved even when weird stuff happens. e.g. the run controls will always work, and robot state won't tear, and user-visible error messages will have been written to be read by users. And what I'm talking about here is that my implementation of EXEC-676, so far, subverts those very boundaries.

I think that for the audience of this interface, i.e. us and people who are trying to do weird things with the API, that's a reasonable ask.

It's reasonable for us now, but having seen stuff like RQA-2934 and EXEC-760, doesn't this seem like a nuance that will get lost as the frontend code continues to evolve, especially under the hands of different people?


To be clear, I’m totally down to ship this in v8.2. I just want to make sure we're on the same page about accepting the mess I’ve made of this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't think it's that much of a mess. Error recovery, broadly, is a thing that is very hard to reason through. It makes sense to me that are a whole ton of very important caveats about the way that you're allowed to handle problems like this. The choice that we're making is

  • the backend allows a very general interface that may allow you to do dangerous or incorrect things
  • structure is handled and provided by the frontend

I think that this is a good and acceptable tradeoff to allow new and surprising uses of the mechanism. If there's a thing to be improved, in my opinion it is how we signal errors that come from corruption of internal state and trying to get the actions that the robot does in that case to better logically flow from actions. We're okay with having surprising things happen sometimes to people writing error recovery flows; the goal is that after they think about it for a second, they say "okay I guess that makes sense".

Comment on lines +53 to +55
RESUME_FROM_RECOVERY_ASSUMING_FALSE_POSITIVE = (
"resume-from-recovery-assuming-false-positive"
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming: Anything more concise than resume-from-recovery-assuming-false-positive?

I do like that the current name specifically claims that this is for false-positives, instead of saying something vaguer like "resume and fix up state." I think it's a helpfully clarifying constraint for how backend engineers should implement commands and their errors going forward.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry! im confused with the name and not really sure about its intent :-(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the big RunActionType docstring help? Otherwise let's talk about it in-person.

...,
description="How to handle errors matched by this rule.",
)
ifMatch: ReactionIfMatch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you remove the Field constructor?

Copy link
Contributor Author

@SyntaxColoring SyntaxColoring Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Redoc, the description here was covering up the more detailed description from the ReactionIfMatch docstring. I'm not sure if that's a Redoc problem or FastAPI problem or Pydantic problem, but this was a quick fix.

@SyntaxColoring SyntaxColoring marked this pull request as ready for review October 23, 2024 22:58
@SyntaxColoring SyntaxColoring requested review from a team as code owners October 23, 2024 22:58
Copy link
Member

@sfoster1 sfoster1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Comment on lines +36 to +46

* If you've tried to recover from the error by sending your own `intent: "fixit"`
commands to `POST /runs/{id}/commands`, use `"resume-from-recovery"`. It's your
responsibility to ensure your `POST`ed commands leave the robot in a good-enough
state to continue with the protocol.

* Otherwise, use `"resume-from-recovery-assuming-false-positive"`.

Do not combine `intent: "fixit"` commands with
`"resume-from-recovery-assuming-false-positive"`—the robot's built-in
false-positive recovery may compete with your own.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't think it's that much of a mess. Error recovery, broadly, is a thing that is very hard to reason through. It makes sense to me that are a whole ton of very important caveats about the way that you're allowed to handle problems like this. The choice that we're making is

  • the backend allows a very general interface that may allow you to do dangerous or incorrect things
  • structure is handled and provided by the frontend

I think that this is a good and acceptable tradeoff to allow new and surprising uses of the mechanism. If there's a thing to be improved, in my opinion it is how we signal errors that come from corruption of internal state and trying to get the actions that the robot does in that case to better logically flow from actions. We're okay with having surprising things happen sometimes to people writing error recovery flows; the goal is that after they think about it for a second, they say "okay I guess that makes sense".

@SyntaxColoring
Copy link
Contributor Author

Merging to unblock PR #16556. If anyone has suggestions for better names, that's an easy change to make before Monday.

@SyntaxColoring SyntaxColoring merged commit eb710c0 into edge Oct 24, 2024
9 checks passed
@SyntaxColoring SyntaxColoring deleted the ignore_and_continue_http_api branch October 24, 2024 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants