feat(robot-server): HTTP API for "Ignore error and skip to next step" #16564

SyntaxColoring · 2024-10-22T18:14:55Z

Overview

This sets up the HTTP API for EXEC-676. Closes EXEC-783 and EXEC-784.

Test Plan and Hands on Testing

Nothing much to test yet, since this doesn't yet connect to anything on the frontend (EXEC-791) or backend (EXEC-785).

Changelog

Add a new action type to POST /runs/{id}/actions.

This kind of thing theoretically requires a database schema bump. I think we're safe without doing anything special here because we're already on a bumped schema since the last release, so when we release this in v8.2, it won't break when users downgrade to v8.1. Also, the /runs code has a try/except around the part that parses these action types, so even if we didn't have the protection of a schema bump, it would, at worst, show up in the UI as a "bad run".
Add a matching reactionIfMatch value to PUT /runs/{id}/errorRecoveryPolicy.
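
The try/except fallback mentioned above can be pictured roughly as follows. This is a minimal sketch, not the actual robot-server code: `OlderActionType`, `load_run_actions`, and the `BAD_RUN` marker are hypothetical names, and only the two action-type strings come from this PR.

```python
from enum import Enum
from typing import List, Union


class OlderActionType(str, Enum):
    """Hypothetical stand-in for the action types an older release knows about."""

    RESUME_FROM_RECOVERY = "resume-from-recovery"


# Hypothetical marker for a run whose stored data can't be parsed.
BAD_RUN = "bad run"


def load_run_actions(stored: List[str]) -> Union[List[OlderActionType], str]:
    """Parse stored action types, degrading to BAD_RUN instead of crashing."""
    try:
        return [OlderActionType(value) for value in stored]
    except ValueError:
        # Enum lookup raises ValueError for an unrecognized value,
        # e.g. an action type written by a newer release.
        return BAD_RUN
```

Under these assumptions, a run that stored `"resume-from-recovery-assuming-false-positive"` would come back as `BAD_RUN` on the older release rather than raising.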

See the docstrings in the code for details.
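
As a rough illustration of the new action type, a client might build the request like this. The `{"data": {"actionType": ...}}` body shape and the `build_action_request` helper are assumptions made for this sketch; only the route and the action-type string appear in this PR, so the docstrings remain the reference for the real schema.

```python
import json
from typing import Tuple

# The action-type string added by this PR.
NEW_ACTION_TYPE = "resume-from-recovery-assuming-false-positive"


def build_action_request(
    run_id: str, action_type: str = NEW_ACTION_TYPE
) -> Tuple[str, str, str]:
    """Return (method, path, JSON body) for a run action request.

    The body shape is an assumption, not taken from this PR.
    """
    path = f"/runs/{run_id}/actions"
    body = json.dumps({"data": {"actionType": action_type}})
    return "POST", path, body
```

The helper only builds the request, so it can be exercised without a robot on the network.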

Review requests

See comments below.

Risk assessment

Low.

SyntaxColoring · 2024-10-22T18:23:34Z

robot-server/robot_server/runs/action_models.py

+
+    * If you've tried to recover from the error by sending your own `intent: "fixit"`
+      commands to `POST /runs/{id}/commands`, use `"resume-from-recovery"`. It's your
+      responsibility to ensure your `POST`ed commands leave the robot in a good-enough
+      state to continue with the protocol.
+
+    * Otherwise, use `"resume-from-recovery-assuming-false-positive"`.
+
+    Do not combine `intent: "fixit"` commands with
+    `"resume-from-recovery-assuming-false-positive"`—the robot's built-in
+    false-positive recovery may compete with your own.
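
Read as a decision rule, the docstring above amounts to something like the following sketch. `choose_resume_action` and its `sent_fixit_commands` flag are hypothetical; the two action-type strings are the ones from the docstring.

```python
def choose_resume_action(sent_fixit_commands: bool) -> str:
    """Pick the run action to resume with, following the docstring's guidance."""
    if sent_fixit_commands:
        # The client did its own recovery and is responsible for the robot's state.
        return "resume-from-recovery"
    # No client-side recovery: treat the error as a false positive.
    return "resume-from-recovery-assuming-false-positive"
```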


This warning against combining intent: "fixit" commands with "resume-from-recovery-assuming-false-positive" is because I've thought of an unfortunate danger. This is a contrived example, but bear with me and suppose:

1. A moveLabware command fails to move something into slot A1.

2. We enter error recovery mode.

3. A client runs a fixit command that places some other labware into slot A1, for some reason.

4. Then the client uses "resume-from-recovery-assuming-false-positive". Given how #16556 (feat(api): Allow treating errors as false-positives (ignore them and continue with the run)) works, this applies the "missing part" of the state update from step 1, to move the original labware into slot A1. But this conflicts with the fixit command from step 3.

This is worse than the normal brand of weird error recovery interactions, because it bypasses our usual validation layer and possibly messes up internal state. (Because the conflict happens deep, at the level of the Protocol Engine StateStores.) A spectrum of possible effects, from luckiest to unluckiest:

It has no adverse effect.

It raises some obscure internal AssertionError.

It somehow messes up internal state and does who-knows-what to subsequent commands.

It makes the run hang.

This gets a lot easier if we can forbid clients from combining fixit commands with resume-from-recovery-assuming-false-positive run actions. As in, forbid with an error, not just forbid with documentation like I've done here. But I'm not sure we can do that—don't we home unconditionally at the beginning of error recovery?
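
To make the hazard concrete, here is a toy model of the scenario. It is not Protocol Engine code: `ToyDeckState`, `apply_fixit_load`, and `apply_deferred_move` are invented for illustration, and a plain dict stands in for the StateStores. It shows only one of the possible outcomes listed above, the one where state is silently corrupted.

```python
from typing import Dict


class ToyDeckState:
    """Hypothetical stand-in for engine state: maps slot name to labware ID."""

    def __init__(self) -> None:
        self.labware_by_slot: Dict[str, str] = {}

    def apply_fixit_load(self, slot: str, labware_id: str) -> None:
        """A validated command: refuses to place labware in an occupied slot."""
        if slot in self.labware_by_slot:
            raise ValueError(f"Slot {slot} is already occupied.")
        self.labware_by_slot[slot] = labware_id

    def apply_deferred_move(self, slot: str, labware_id: str) -> None:
        """The failed command's "missing" state update, applied with no validation."""
        # Whatever a fixit command put here is silently overwritten.
        self.labware_by_slot[slot] = labware_id


state = ToyDeckState()
# Steps 1-2: moveLabware of "plate-1" into A1 fails; its state update is deferred.
# Step 3: a fixit command places other labware into A1.
state.apply_fixit_load("A1", "plate-2")
# Step 4: resuming as a false positive applies the deferred update.
state.apply_deferred_move("A1", "plate-1")
# The state now has no record of "plate-2", even though a fixit command put it there.
```

In this toy, the validated path would have raised an error for the second placement, while the deferred update goes straight past that check.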

> don't we home unconditionally at the beginning of error recovery?

Yes, we do. I can see a future world in which we do other clean-up fixit commands on the client side before resume-from-recovery-assuming-false-positive, too.

In my personal opinion, it is likely better not to be overly prescriptive with what combinations of recovery commands/actions a client can take. We currently operate under that philosophy for the rest of Error Recovery, and while documenting this behavior is useful, this feels like a spot in which we can just extend that same philosophy.

Yeah, I agree with @mjhuff here. I think this is addressable with a "Please do not do this" type warning, pretty strong language about not executing fixit commands that actually do things and warnings that the buyer must beware if they're going to try - and I think that for the audience of this interface, i.e. us and people who are trying to do weird things with the API, that's a reasonable ask.

And if people try it and they run into really weird errors, then our goal should be to make the errors less weird.

Hm, okay.

> In my personal opinion, it is likely better not to be overly prescriptive with what combinations of recovery commands/actions a client can take. We currently operate under that philosophy for the rest of Error Recovery...

Right, but we've been able to operate under that philosophy for the rest of Error Recovery because we've constructed boundaries that can keep the robot decently-behaved even when weird stuff happens. e.g. the run controls will always work, and robot state won't tear, and user-visible error messages will have been written to be read by users. And what I'm talking about here is that my implementation of EXEC-676, so far, subverts those very boundaries.

> I think that for the audience of this interface, i.e. us and people who are trying to do weird things with the API, that's a reasonable ask.

It's reasonable for us now, but having seen stuff like RQA-2934 and EXEC-760, doesn't this seem like a nuance that will get lost as the frontend code continues to evolve, especially under the hands of different people?

To be clear, I’m totally down to ship this in v8.2. I just want to make sure we're on the same page about accepting the mess I’ve made of this.

I really don't think it's that much of a mess. Error recovery, broadly, is a thing that is very hard to reason through. It makes sense to me that there are a whole ton of very important caveats about the way that you're allowed to handle problems like this. The choice that we're making is:

the backend allows a very general interface that may allow you to do dangerous or incorrect things

structure is handled and provided by the frontend

I think that this is a good and acceptable tradeoff to allow new and surprising uses of the mechanism. If there's a thing to be improved, in my opinion it is how we signal errors that come from corruption of internal state and trying to get the actions that the robot does in that case to better logically flow from actions. We're okay with having surprising things happen sometimes to people writing error recovery flows; the goal is that after they think about it for a second, they say "okay I guess that makes sense".

SyntaxColoring · 2024-10-22T18:29:18Z

robot-server/robot_server/runs/action_models.py

+    RESUME_FROM_RECOVERY_ASSUMING_FALSE_POSITIVE = (
+        "resume-from-recovery-assuming-false-positive"
+    )


Naming: Anything more concise than resume-from-recovery-assuming-false-positive?

I do like that the current name specifically claims that this is for false-positives, instead of saying something vaguer like "resume and fix up state." I think it's a helpfully clarifying constraint for how backend engineers should implement commands and their errors going forward.

sorry! im confused with the name and not really sure about its intent :-(

Does the big RunActionType docstring help? Otherwise let's talk about it in-person.

TamarZanzouri · 2024-10-23T15:24:21Z

robot-server/robot_server/runs/error_recovery_models.py

-        ...,
-        description="How to handle errors matched by this rule.",
-    )
+    ifMatch: ReactionIfMatch


why did you remove the Field constructor?

In Redoc, the description here was covering up the more detailed description from the ReactionIfMatch docstring. I'm not sure if that's a Redoc problem or FastAPI problem or Pydantic problem, but this was a quick fix.

sfoster1

Looks good to me!

SyntaxColoring · 2024-10-24T14:45:07Z

Merging to unblock PR #16556. If anyone has suggestions for better names, that's an easy change to make before Monday.

SyntaxColoring added 6 commits October 22, 2024 12:21

Fix status code.

36e98e2

Show full ReactionIfMatch docstring.

7541b70

Add new ReactionIfMatch.

611d1d2

Show full RunActionType docstring.

89ee3dd

Add new RunActionType.

3b8827b

Update client.

fad998f

Update frontend types.

SyntaxColoring commented Oct 22, 2024

SyntaxColoring requested review from sfoster1, TamarZanzouri and mjhuff October 22, 2024 18:29

TamarZanzouri reviewed Oct 23, 2024

SyntaxColoring added 3 commits October 23, 2024 18:41

Minor docstring tweaks.

5aa91eb

Clean up todo comments.

964d2c0

Use placeholder implementation instead of raising NotImplementedError.

c36ceef

This seems a little more friendly to UI tests.

SyntaxColoring marked this pull request as ready for review October 23, 2024 22:58

SyntaxColoring requested review from a team as code owners October 23, 2024 22:58

sfoster1 approved these changes Oct 24, 2024

SyntaxColoring merged commit eb710c0 into edge Oct 24, 2024
9 checks passed

SyntaxColoring deleted the ignore_and_continue_http_api branch October 24, 2024 14:45

SyntaxColoring mentioned this pull request Oct 24, 2024

feat(api): Allow treating errors as false-positives (ignore them and continue with the run) #16556

Merged

4 tasks
