fix(robot-server): Fix indefinite protocol cancel state #14428

mjhuff · 2024-02-06T01:12:23Z

Overview

This PR fixes a bug in which cancelling a protocol from the desktop app/ODD results in an indefinite stop-requested state, which does not resolve unless the user navigates away and back to the protocol from the desktop app. Certain EngineStatus states are internal to PE and not handled by the robot-server, including stop-requested. This means that when the app requests a STOP, robot-server doesn't internally manage the transition from stop-requested->stopped. Simply emitting refetch flags whenever the robot-server updates the state is therefore insufficient. The current solution is to do what we do for current_commands, which is to poll PE and emit to the app if there's an update. In the future, we should bubble up this event to robot-server from PE (the TODO is already there).

There's a cheeky removal of the RUN_STATUS_STOP_REQUESTED as a run status that currently disables refetching on the app side. After spending some time looking at it with @shlokamin, we were unable to determine why this conditional is here, or why things appear to work even when it is here. Let's just get rid of it.

Test Plan

Initiate a protocol run.
Cancel out of said protocol run.
The cancel modal should not be indefinite now.

Changelog

Fixed indefinite "cancelling run" desktop app/ODD state when pressing "cancel" during a protocol run.

Risk assessment

low. This PR just adds another notification push. Worst case scenario, the app refetches more often than it should.

Certain EngineStatus states are internal to PE and not handled by the robot-server, including stop-requested. This means that when the app requests a STOP, robot-server doesn't internally manage the transition from stop-requested->stopped. Simply emitting refetch flags whenever the robot-server updates the state is therefore insufficient. The current solution is to do what we do for current_commands, which is to poll PE and emit to the app if there's an update. In the future, we should bubble up this event to robot-server from PE.

codecov · 2024-02-06T01:14:22Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (2e9a581) 68.14% compared to head (292c4be) 68.13%.
Report is 5 commits behind head on edge.

Additional details and impacted files

@@            Coverage Diff             @@
##             edge   #14428      +/-   ##
==========================================
- Coverage   68.14%   68.13%   -0.01%     
==========================================
  Files        2522     2522              
  Lines       72033    72729     +696     
  Branches     9219     9571     +352     
==========================================
+ Hits        49087    49557     +470     
- Misses      20756    20895     +139     
- Partials     2190     2277      +87

Flag	Coverage Δ
app	`64.70% <ø> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
app/src/organisms/RunTimeControl/hooks.ts	`71.11% <ø> (ø)`

... and 26 files with indirect coverage changes

SyntaxColoring

Yep, makes sense, thanks.

Have you given any thought to how we can catch this kind of thing with automated tests? It seems like we need some kind of integration test; unit tests would have been insufficient because the robot-server code was doing the "intended thing," it's just that what was intended was not sufficient for correct end-to-end behavior.

robot-server's current integration tests are doing a bit of polling themselves, for things like "wait until the run has completed." We can write a notification client in Python and replace that polling with subscriptions. That won't give us coverage of every field. But maybe it's the best bang for buck nonetheless?

shlokamin · 2024-02-06T14:56:51Z

app/src/organisms/RunTimeControl/hooks.ts

-      !([
-        RUN_STATUS_STOP_REQUESTED,


still confused about how this worked lol

mjhuff · 2024-02-06T14:59:30Z

Yep, makes sense, thanks.

Have you given any thought to how we can catch this kind of thing with automated tests? It seems like we need some kind of integration test; unit tests would have been insufficient because the robot-server code was doing the "intended thing," it's just that what was intended was not sufficient for correct end-to-end behavior.

robot-server's current integration tests are doing a bit of polling themselves, for things like "wait until the run has completed." We can write a notification client in Python and replace that polling with subscriptions. That won't give us coverage of every field. But maybe it's the best bang for buck nonetheless?

Good question. Totally agree that we need integration testing here - working that out is my next ticket item. Thanks for pointing me towards the existing logic - I might see if I can repurpose it instead of inventing a new wheel entirely.

sanni-t

I haven't gone through all details of the notifications work done so far so I might be missing something, but this lgtm. Just a question about possible duplicate notifications.

sanni-t · 2024-02-06T15:02:06Z

robot-server/robot_server/service/notifications/publishers/runs_publisher.py

+            if (
+                current_state_summary_status is not None
+                and self._previous_state_summary_status != current_state_summary_status
+            ):
+                await self._publish_runs_async(run_id=run_id)
+                self._previous_state_summary_status = current_state_summary_status


There will be certain status transitions (like IDLE -> RUNNING -> PAUSED) that will be handled by the robot server and hence (I'm guessing) will have a notification sent out. How is the publisher differentiating between such a transition vs one that requires polling. Because would it not send out duplicate notifications otherwise? Or does that not matter since it's just a tiny redundancy?

Good question (and sorry to have missed this before the merge) - it's a redundancy here. As you note, it's relatively tiny compared to what the app does now.

The idea is that this polling is pretty temporary - we're planning on implementing the callback strategy we discussed yesterday as a more permanent solution to this problem, and we should revisit any redundancy and clean it up then. I think it's sufficient for now, though!

Got it. Thanks for the clarification!

SyntaxColoring · 2024-02-06T15:06:32Z

Good question. Totally agree that we need integration testing here - working that out is my next ticket item. Thanks for pointing me towards the existing logic - I might see if I can repurpose it instead of inventing a new wheel entirely.

Here's one example of where we're doing that, FYI. Grepping for max_retries in *.tavern.yaml shows ~22 instances of this kind of thing.

opentrons/robot-server/tests/integration/http_api/runs/test_protocol_run.tavern.yaml

Line 133 in d81e1c3

max_retries: 10

mjhuff added 2 commits February 5, 2024 17:29

Remove invalid enabled condition

1bb656d

mjhuff requested review from a team as code owners February 6, 2024 01:12

mjhuff requested review from TamarZanzouri and removed request for a team and TamarZanzouri February 6, 2024 01:12

mjhuff changed the title ~~Robot server fix indefinite protocol cancel state~~ fix(robot-server): Fix indefinite protocol cancel state Feb 6, 2024

SyntaxColoring approved these changes Feb 6, 2024

View reviewed changes

shlokamin approved these changes Feb 6, 2024

View reviewed changes

mjhuff merged commit 81ebe39 into edge Feb 6, 2024
33 checks passed

mjhuff deleted the robot-server_fix-indefinite-protocol-cancel-state branch February 6, 2024 15:00

sanni-t approved these changes Feb 6, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(robot-server): Fix indefinite protocol cancel state #14428

fix(robot-server): Fix indefinite protocol cancel state #14428

mjhuff commented Feb 6, 2024 •

edited by jira bot

Loading

codecov bot commented Feb 6, 2024 •

edited

Loading

SyntaxColoring left a comment

shlokamin Feb 6, 2024

mjhuff commented Feb 6, 2024

sanni-t left a comment

sanni-t Feb 6, 2024

mjhuff Feb 6, 2024 •

edited

Loading

sanni-t Feb 6, 2024 •

edited

Loading

SyntaxColoring commented Feb 6, 2024

fix(robot-server): Fix indefinite protocol cancel state #14428

fix(robot-server): Fix indefinite protocol cancel state #14428

Conversation

mjhuff commented Feb 6, 2024 • edited by jira bot Loading

Overview

Test Plan

Changelog

Risk assessment

codecov bot commented Feb 6, 2024 • edited Loading

Codecov Report

SyntaxColoring left a comment

Choose a reason for hiding this comment

shlokamin Feb 6, 2024

Choose a reason for hiding this comment

mjhuff commented Feb 6, 2024

sanni-t left a comment

Choose a reason for hiding this comment

sanni-t Feb 6, 2024

Choose a reason for hiding this comment

mjhuff Feb 6, 2024 • edited Loading

Choose a reason for hiding this comment

sanni-t Feb 6, 2024 • edited Loading

Choose a reason for hiding this comment

SyntaxColoring commented Feb 6, 2024

mjhuff commented Feb 6, 2024 •

edited by jira bot

Loading

codecov bot commented Feb 6, 2024 •

edited

Loading

mjhuff Feb 6, 2024 •

edited

Loading

sanni-t Feb 6, 2024 •

edited

Loading