Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(robot-server): Fix indefinite protocol cancel state #14428

Merged
merged 2 commits into from
Feb 6, 2024

Conversation

mjhuff
Copy link
Contributor

@mjhuff mjhuff commented Feb 6, 2024

Closes RQA-2286

Overview

This PR fixes a bug in which cancelling a protocol from the desktop app/ODD results in an indefinite stop-requested state, which does not resolve unless the user navigates away and back to the protocol from the desktop app. Certain EngineStatus states are internal to PE and not handled by the robot-server, including stop-requested. This means that when the app requests a STOP, robot-server doesn't internally manage the transition from stop-requested->stopped. Simply emitting refetch flags whenever the robot-server updates the state is therefore insufficient. The current solution is to do what we do for current_commands, which is to poll PE and emit to the app if there's an update. In the future, we should bubble up this event to robot-server from PE (the TODO is already there).

There's a cheeky removal of the RUN_STATUS_STOP_REQUESTED as a run status that currently disables refetching on the app side. After spending some time looking at it with @shlokamin, we were unable to determine why this conditional is here, or why things appear to work even when it is here. Let's just get rid of it.

Test Plan

  • Initiate a protocol run.
  • Cancel out of said protocol run.
  • The cancel modal should not be indefinite now.

Changelog

  • Fixed indefinite "cancelling run" desktop app/ODD state when pressing "cancel" during a protocol run.

Risk assessment

low. This PR just adds another notification push. Worst case scenario, the app refetches more often than it should.

Certain EngineStatus states are internal to PE and not handled by the robot-server, including
stop-requested. This means that when the app requests a STOP, robot-server doesn't internally manage
the transition from stop-requested->stopped. Simply emitting refetch flags whenever the robot-server
updates the state is therefore insufficient. The current solution is to do what we do for
current_commands, which is to poll PE and emit to the app if there's an update. In the future, we
should bubble up this event to robot-server from PE.
@mjhuff mjhuff requested review from a team as code owners February 6, 2024 01:12
@mjhuff mjhuff requested review from TamarZanzouri and removed request for a team and TamarZanzouri February 6, 2024 01:12
@mjhuff mjhuff changed the title Robot server fix indefinite protocol cancel state fix(robot-server): Fix indefinite protocol cancel state Feb 6, 2024
Copy link

codecov bot commented Feb 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (2e9a581) 68.14% compared to head (292c4be) 68.13%.
Report is 5 commits behind head on edge.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             edge   #14428      +/-   ##
==========================================
- Coverage   68.14%   68.13%   -0.01%     
==========================================
  Files        2522     2522              
  Lines       72033    72729     +696     
  Branches     9219     9571     +352     
==========================================
+ Hits        49087    49557     +470     
- Misses      20756    20895     +139     
- Partials     2190     2277      +87     
Flag Coverage Δ
app 64.70% <ø> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
app/src/organisms/RunTimeControl/hooks.ts 71.11% <ø> (ø)

... and 26 files with indirect coverage changes

Copy link
Contributor

@SyntaxColoring SyntaxColoring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, makes sense, thanks.

Have you given any thought to how we can catch this kind of thing with automated tests? It seems like we need some kind of integration test; unit tests would have been insufficient because the robot-server code was doing the "intended thing," it's just that what was intended was not sufficient for correct end-to-end behavior.

robot-server's current integration tests are doing a bit of polling themselves, for things like "wait until the run has completed." We can write a notification client in Python and replace that polling with subscriptions. That won't give us coverage of every field. But maybe it's the best bang for buck nonetheless?

Comment on lines -80 to -81
!([
RUN_STATUS_STOP_REQUESTED,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still confused about how this worked lol

@mjhuff
Copy link
Contributor Author

mjhuff commented Feb 6, 2024

Yep, makes sense, thanks.

Have you given any thought to how we can catch this kind of thing with automated tests? It seems like we need some kind of integration test; unit tests would have been insufficient because the robot-server code was doing the "intended thing," it's just that what was intended was not sufficient for correct end-to-end behavior.

robot-server's current integration tests are doing a bit of polling themselves, for things like "wait until the run has completed." We can write a notification client in Python and replace that polling with subscriptions. That won't give us coverage of every field. But maybe it's the best bang for buck nonetheless?

Good question. Totally agree that we need integration testing here - working that out is my next ticket item. Thanks for pointing me towards the existing logic - I might see if I can repurpose it instead of inventing a new wheel entirely.

@mjhuff mjhuff merged commit 81ebe39 into edge Feb 6, 2024
33 checks passed
@mjhuff mjhuff deleted the robot-server_fix-indefinite-protocol-cancel-state branch February 6, 2024 15:00
Copy link
Member

@sanni-t sanni-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't gone through all details of the notifications work done so far so I might be missing something, but this lgtm. Just a question about possible duplicate notifications.

Comment on lines +86 to +91
if (
current_state_summary_status is not None
and self._previous_state_summary_status != current_state_summary_status
):
await self._publish_runs_async(run_id=run_id)
self._previous_state_summary_status = current_state_summary_status
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be certain status transitions (like IDLE -> RUNNING -> PAUSED) that will be handled by the robot server and hence (I'm guessing) will have a notification sent out. How is the publisher differentiating between such a transition vs one that requires polling. Because would it not send out duplicate notifications otherwise? Or does that not matter since it's just a tiny redundancy?

Copy link
Contributor Author

@mjhuff mjhuff Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question (and sorry to have missed this before the merge) - it's a redundancy here. As you note, it's relatively tiny compared to what the app does now.

The idea is that this polling is pretty temporary - we're planning on implementing the callback strategy we discussed yesterday as a more permanent solution to this problem, and we should revisit any redundancy and clean it up then. I think it's sufficient for now, though!

Copy link
Member

@sanni-t sanni-t Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks for the clarification!

@SyntaxColoring
Copy link
Contributor

Good question. Totally agree that we need integration testing here - working that out is my next ticket item. Thanks for pointing me towards the existing logic - I might see if I can repurpose it instead of inventing a new wheel entirely.

Here's one example of where we're doing that, FYI. Grepping for max_retries in *.tavern.yaml shows ~22 instances of this kind of thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants