perf(api): Do not wait for the event loop to be free when reporting APIv2 commands #9238

Merged: 24 commits merged from isolate_apiv2_brokers_from_event_loop into edge on Jan 26, 2022

Conversation

@SyntaxColoring (Contributor) commented on Jan 11, 2022

Overview

This PR addresses a piece of the robot jankiness problems described in issue #9133.

The ProtocolEngine command list and the HTTP API live in the main thread and operate within the asyncio event loop. The APIv2 protocol, meanwhile, runs in its own thread, since APIv2 commands are all synchronous and blocking.

The protocol thread needs to report its ongoing progress to the ProtocolEngine in the main thread, for HTTP polls to be able to inspect it. With our current implementation, every time the protocol completes a command:

  1. The protocol thread schedules a ProtocolEngine update to happen ASAP in the main thread's event loop.
  2. The protocol thread waits for that update to complete.
  3. The protocol thread continues.

The problem is that if the main thread's event loop is, for whatever reason, occupied for a long time, the protocol thread will be stuck waiting. We think this is part of what's causing visibly janky motion when the robot falls under heavy HTTP load.
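
For illustration, the blocking hand-off described above looks roughly like this. This is a minimal sketch, not the actual Opentrons code; report_command and report_blocking are placeholder names.

import asyncio


async def report_command(command: object) -> None:
    """Placeholder for dispatching a ProtocolEngine update in the main thread."""
    ...


def report_blocking(loop: asyncio.AbstractEventLoop, command: object) -> None:
    # Runs in the protocol (worker) thread: schedule the update on the main
    # thread's event loop...
    future = asyncio.run_coroutine_threadsafe(report_command(command), loop)
    # ...and block until that loop actually gets around to running it. If the
    # loop is busy with other work, the protocol thread sits here.
    future.result()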

To fix this, this PR introduces some elasticity between the protocol thread and the main thread. Now, every time the protocol completes a command:

  1. The protocol thread schedules a ProtocolEngine update to happen ASAP in the main thread's event loop.
  2. The protocol thread immediately continues without waiting for that update to complete.

Changelog

  • Introduce a new class, ThreadAsyncQueue, to act as a generic data channel between threads and async tasks.
  • Make the APIv2 context insert its progress events into a ThreadAsyncQueue (via the event handlers defined in LegacyContextPlugin).
  • Concurrently, in a background task running in the main thread's event loop, pull events from the ThreadAsyncQueue and update the ProtocolEngine accordingly.
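
For illustration, here is a simplified analogue of that producer/consumer arrangement, using a plain asyncio.Queue in place of the new ThreadAsyncQueue class. All names here are placeholders, not the actual implementation.

import asyncio
from typing import Callable

events: "asyncio.Queue[object]" = asyncio.Queue()


def report_nonblocking(loop: asyncio.AbstractEventLoop, event: object) -> None:
    # Runs in the protocol (worker) thread: hand the event to the main
    # thread's event loop and return immediately. put_nowait() never blocks
    # because the queue is unbounded.
    loop.call_soon_threadsafe(events.put_nowait, event)


async def drain_events(apply_update: Callable[[object], None]) -> None:
    # Runs as a background task in the main thread's event loop, applying
    # each event to the ProtocolEngine as it arrives.
    while True:
        event = await events.get()
        apply_update(event)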

Review requests

Run some protocols on a real robot or a dev server. Check to see that the run log updates normally. When the protocol completes successfully, check to see that the run log isn't truncated. When the protocol is cancelled, check to see that the cancellation shows up in the right place in the run log.

Risk assessment

Medium. This only barely touches the thread that's actually executing the protocol, so protocol execution shouldn't break. However, our run and ProtocolEngine lifetimes are a bit complicated. If I got something about them wrong, we might see messed-up run logs, thread leakage, or task leakage.

@SyntaxColoring changed the title from "WIP: Isolate apiv2 brokers from event loop" to "perf(api): Do not wait for the event loop to be free when reporting APIv2 commands" on Jan 11, 2022
@SyntaxColoring changed the base branch from edge to engine_halt-race-condition on January 11, 2022 23:02
codecov bot commented on Jan 11, 2022

Codecov Report

Merging #9238 (873eeba) into edge (9406ff6) will increase coverage by 0.05%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             edge    #9238      +/-   ##
==========================================
+ Coverage   75.42%   75.47%   +0.05%     
==========================================
  Files        1824     1826       +2     
  Lines       49459    49592     +133     
  Branches     4729     4729              
==========================================
+ Hits        37303    37431     +128     
- Misses      11290    11295       +5     
  Partials      866      866              
Flag Coverage Δ
api 84.94% <100.00%> (+0.07%) ⬆️
robot-server 91.41% <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...i/src/opentrons/protocol_runner/legacy_wrappers.py 100.00% <ø> (ø)
api/src/opentrons/protocol_engine/plugins.py 91.42% <100.00%> (-0.68%) ⬇️
...i/src/opentrons/protocol_engine/protocol_engine.py 100.00% <100.00%> (ø)
...opentrons/protocol_runner/legacy_context_plugin.py 100.00% <100.00%> (ø)
...rc/opentrons/protocol_runner/thread_async_queue.py 100.00% <100.00%> (ø)
...pi/src/opentrons/hardware_control/emulation/app.py 82.85% <0.00%> (-5.15%) ⬇️
...hardware_control/emulation/scripts/run_smoothie.py 78.57% <0.00%> (ø)
api/src/opentrons/hardware_control/simulator.py 88.65% <0.00%> (+2.75%) ⬆️
...rc/opentrons/drivers/smoothie_drivers/simulator.py 79.62% <0.00%> (+3.70%) ⬆️

Base automatically changed from engine_halt-race-condition to edge January 11, 2022 23:49
_T = TypeVar("_T")


class ThreadAsyncQueue(Generic[_T]):

SyntaxColoring (author) commented:

Any better names for this?

  • The Python standard library has "queues" (queue.Queue and asyncio.Queue), which lack a close method.
  • Go and Trio add a close method and call them "channels". AnyIO does the same, but calls them "memory object streams".

I'm leaning towards something "channel"-based, like ThreadAsyncChannel.

A contributor replied:

I'm partial to "streams" myself, but channel sounds nice, too

SyntaxColoring (author) replied:

Leaving it as Queue for now because after Googling this, the terminology seems pretty messy and unstandardized, and "queue" wins the tie because it matches something that people might already know from the Python standard library.

class ThreadAsyncQueue(Generic[_T]):
"""A queue to safely pass values of type `_T` between threads and async tasks.

All methods are safe to call concurrently from any thread or task.

@SyntaxColoring (author) commented on Jan 11, 2022:

This is a stronger guarantee than we strictly require for this PR, since we only have one producer thread and one consumer async task.

Weakening the safety guarantees opens up some different implementation options. But after trying a few, I found they weren't much easier, and they took extra care to explain and understand. So I went with this.

Comment on lines 106 to 109
Warning:
A waiting `get_async()` won't be interrupted by an async cancellation.
The proper way to interrupt a waiting `get_async()` is to close the queue,
just like you have to do with `get()`.

@SyntaxColoring (author) commented on Jan 11, 2022:

I haven't looked into this too deeply. There might be a way to get async cancellations to work. But it didn't seem worthwhile, since:

  • Good producer/consumer practice is to ensure the send side closes eventually, anyway. (In my opinion.)
  • This same concern exists with the standard queue.Queue.
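
As a sketch of the close-to-unblock pattern being described here: close(), get(), get_async(), and get_until_closed() are named in this PR, but the put() method, the import path, and the exact signatures are assumptions.

from threading import Thread

# Assumed import path, based on the file added in this PR.
from opentrons.protocol_runner.thread_async_queue import ThreadAsyncQueue

queue: "ThreadAsyncQueue[int]" = ThreadAsyncQueue()


def produce() -> None:
    for i in range(3):
        queue.put(i)  # assumption: values are added with put()
    # Closing the queue is what lets a blocked get() or get_async() return,
    # rather than relying on async cancellation.
    queue.close()


Thread(target=produce).start()

# Consume values until the producer closes the queue.
for value in queue.get_until_closed():
    print(value)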

(A review thread on api/src/opentrons/protocol_runner/thread_async_queue.py was resolved as outdated.)
@SyntaxColoring force-pushed the isolate_apiv2_brokers_from_event_loop branch from 4bb753b to 9e1f889 on January 18, 2022 20:45
Comment on lines +212 to +227
# Capture the function that the plugin sets up as its labware load callback.
# Also, ensure that all subscribe calls return an actual unsubscribe callable
# (instead of Decoy's default `None`) so subject.teardown() works.
labware_handler_captor = matchers.Captor()
decoy.when(
legacy_context.broker.subscribe(topic="command", handler=matchers.Anything())
).then_return(decoy.mock())
decoy.when(
legacy_context.labware_load_broker.subscribe(callback=labware_handler_captor)
).then_return(decoy.mock())
decoy.when(
legacy_context.instrument_load_broker.subscribe(callback=matchers.Anything())
).then_return(decoy.mock())
decoy.when(
legacy_context.module_load_broker.subscribe(callback=matchers.Anything())
).then_return(decoy.mock())

SyntaxColoring (author) commented:

I'm open to ideas for reducing this duplication across tests. My attempts just made the tests harder to follow.

We could also wipe a lot of this away if we merged .labware_load_broker, .instrument_load_broker, and .module_load_broker into a single .equipment_broker.

Comment on lines -114 to +102
-def stop(self) -> None:
+async def stop(self) -> None:
     """Stop any configured plugins."""
     for p in self._plugins:
-        p.teardown()
+        await p.teardown()

SyntaxColoring (author) commented:

I took the easy way out by changing plugin.teardown() and plugin_starter.stop() to be async. I haven't totally thought this through yet, but in another PR, I'd like to refactor this to make it easier to follow the lifetime of the LegacyContextPlugin.

Currently, cleaning up the LegacyContextPlugin works like this:

  1. The protocol thread exits, either gracefully or by exception.
  2. ProtocolRunner calls ProtocolEngine.finish().
  3. ProtocolEngine.finish() calls PluginStarter.stop().
  4. PluginStarter.stop() calls .teardown() on all plugins, which is currently just the LegacyContextPlugin.

But all LegacyContextPlugin really cares about is:

  • That the protocol thread has exited, so it knows there are no more events coming in.
  • That the ProtocolEngine is still open, so it knows it can still dispatch actions into the ProtocolEngine as it drains its ThreadAsyncQueue.

So in principle, we shouldn't need to involve ProtocolEngine.finish(). And I think involving ProtocolEngine.finish() makes things a little more fragile.
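
A rough, hypothetical sketch of what the plugin's async teardown() needs to accomplish under this design; the class name and attributes here are illustrative, not the actual LegacyContextPlugin implementation.

import asyncio
from typing import Any


class SketchPlugin:
    def __init__(self, queue: Any, drain_task: "asyncio.Task[None]") -> None:
        self._queue = queue            # the queue fed by the protocol thread
        self._drain_task = drain_task  # background task pulling events from it

    async def teardown(self) -> None:
        # By the time teardown() runs, the protocol thread has exited, so no
        # more events will be produced.
        self._queue.close()
        # Wait for the drain task to dispatch any remaining events into the
        # still-open ProtocolEngine before returning.
        await self._drain_task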

@SyntaxColoring marked this pull request as ready for review on January 21, 2022 22:01
@SyntaxColoring requested a review from a team as a code owner on January 21, 2022 22:01

@mcous (Contributor) left a review:

LGTM

self._deque.append(value)
self._condition.notify()

def get(self) -> _T:

A contributor commented:

I don't think this method is used publicly; consider making it private.

@SyntaxColoring (author) replied on Jan 26, 2022:

You're correct, it's not used publicly, and I agree with the spirit of making it private. get_until_closed() is probably what clients should use instead.

I'm keeping it public for now despite this, because get() is part of the "normal" API for channels (as seen in various implementations), and conformity feels like a good thing when implementing multithreading primitives. I'm happy to delete it later if it feels noisy, and we feel like deleting it wouldn't impede understanding of the class.

# Wait for something to change, then check again.
self._condition.wait()

def get_until_closed(self) -> Iterable[_T]:

A contributor commented:

Also not used publicly; consider making it private.

@SyntaxColoring merged commit 6ab9b10 into edge on Jan 26, 2022
@SyntaxColoring deleted the isolate_apiv2_brokers_from_event_loop branch on January 26, 2022 15:07
SyntaxColoring added a commit that referenced this pull request Sep 18, 2024
## Overview

This removes a potential source of deadlocks and race conditions.

I originally thought this would help with EXEC-417. I'm not sure it
actually will at this point, but maybe we should still do it.

## Background

When the robot executes a Python protocol, the `ProtocolEngine` and
higher-level HTTP stuff live in the main thread, and the user's Python
script runs in a worker thread. For older (`apiLevel`≤2.13) protocols,
`LegacyContextPlugin` transports progress reports from the worker thread
to the main thread.

When we initially built all of this, we had [severe performance
problems](#9133) where the
main thread would get bogged down serializing a big HTTP response or
something, and then the backpressure through `LegacyContextPlugin` would
stall the protocol thread, causing visible jankiness in robot motion. We
did a bunch of stuff to try to fix this. One fix was to [insert an
infinite non-blocking
queue](#9238) between the two
threads to remove the backpressure.

I strongly suspect that that fix is unnecessary today. As evidence: it only
ever applied to older, `apiLevel`≤2.13, protocols. Newer (`apiLevel`≥2.14)
protocols have always run through a different pipeline, which still has
backpressure, and the newer pipeline is apparently not perceived to have
janky motion.

Removing the fix would remove concurrency, which would be a meaningful
simplification. For instance, I've seen hints that [this
part](https://github.com/Opentrons/opentrons/blob/45e14ca26359720740db744124b464bdcc84264c/api/src/opentrons/protocol_engine/execution/hardware_stopper.py#L70)
of run cleanup is racy, because it depends on whether all the
`pickUpTip`/`dropTip` commands got consumed from the queue by the time
we reach there.

## Test Plan and Hands on Testing

* [x] On an OT-2, run some protocols that are movement-intensive (e.g.
lots of `touch_tip()`s) and have `apiLevel` ≤ 2.13. Click around in the
app and induce some network requests. There might be some jankiness, but
make sure it's not severe. Really, I think we just need to make sure
it's not any worse than `apiLevel` ≥ 2.14.
* I'm noticing jank lasting ~0-1 seconds when you: navigate to the
device details page (understandable); navigate through the device
settings (weird); and navigate to the device list page (concerning). All
of this applies equally to `apiLevel≥2.14`, so if we're concerned about
this, retaining this non-blocking queue in `LegacyContextPlugin` is not
the correct solution.

## Changelog

* Delete the non-blocking queue between the Python protocol thread and the
main thread. Replace it with simple inter-thread function calls.

## Review requests

Is this a good idea?

## Risk assessment

Medium. There is a risk that this *is* covering up enduring jankiness
problems, and removing it will bring the problems back. This may not be
obvious to us in the office because we're more focused on Flexes, which
don't run `apiLevel`≤2.13 protocols.

But there's also risk to having the concurrency, so... 🤷
Merging this pull request may close the following issue:

bug: Protocol execution stalls when unrelated tasks hog the main thread's async event loop