fix(docs): prevent deadlocks with streams returned from docs actor #2346

Frando · 2024-06-04T21:10:28Z

Description

The iroh-docs actor loop can easily be deadlocked from the client side: if you call any RPC method that returns a stream and then call and await any other docs method while still consuming that stream, the docs actor will deadlock.
This only happens when the stream has more items than the capacity of the intermediate channel between the actor and the RPC layer, which is why it does not happen every time.

This is the case for all methods that return iterators. The solution is twofold:

Run single-threaded executor in iroh-docs actor loop
For actions returning iterators/streams, spawn a task on that executor to forward the store iterator into the stream, yielding when the receiver is not consuming fast enough
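
As a rough illustration of the second point, here is a minimal std-only sketch of forwarding an iterator into a bounded channel without ever blocking on a full channel. It is not the code from this PR, which spawns tasks on a single-threaded async executor; `ForwardTask`, `Progress`, and `list_then_open` are hypothetical names, and the async executor is replaced by a hand-rolled polling loop.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Result of one forwarding attempt (hypothetical type for this sketch).
enum Progress {
    /// The iterator is exhausted or the receiver is gone.
    Done,
    /// The channel is full; control goes back to the actor loop.
    Blocked,
}

/// Forwards an owned iterator into a bounded channel, one poll at a time.
struct ForwardTask<I: Iterator> {
    iter: I,
    tx: SyncSender<I::Item>,
    /// Item that did not fit into the channel on the previous attempt.
    pending: Option<I::Item>,
}

impl<I: Iterator> ForwardTask<I> {
    fn new(iter: I, tx: SyncSender<I::Item>) -> Self {
        ForwardTask { iter, tx, pending: None }
    }

    /// Sends items until the iterator ends or the channel is full. Never blocks.
    fn poll(&mut self) -> Progress {
        loop {
            let item = match self.pending.take().or_else(|| self.iter.next()) {
                Some(item) => item,
                None => return Progress::Done,
            };
            match self.tx.try_send(item) {
                Ok(()) => {}
                Err(TrySendError::Full(item)) => {
                    // "Yield": keep the item and let the actor do other work.
                    self.pending = Some(item);
                    return Progress::Blocked;
                }
                Err(TrySendError::Disconnected(_)) => return Progress::Done,
            }
        }
    }
}

/// Simulates a client that lists `n` ids through a channel of `capacity`
/// (must be at least 1) and issues one "open" per received id.
/// Returns the number of opens that were served.
fn list_then_open(n: u32, capacity: usize) -> u32 {
    let (tx, rx) = sync_channel(capacity);
    let mut task = ForwardTask::new(0..n, tx);
    let mut opened = 0;
    loop {
        // Actor side: forward whatever fits, then return to the loop.
        let progress = task.poll();
        // Client side: take one id and "open" it. The actor can serve this
        // because it is not stuck in a blocking send.
        match rx.try_recv() {
            Ok(_id) => opened += 1,
            Err(_) => {
                if matches!(progress, Progress::Done) {
                    break;
                }
            }
        }
    }
    opened
}
```

Calling `list_then_open(10, 2)` serves all ten opens even though the channel holds only two items at a time. With a blocking `send` in place of `try_send`, the forwarding step would never return to the loop once the channel filled up.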

To be able to spawn the iterators onto a task, they have to be 'static, which they can be, but only when operating on snapshots.

So this PR fixes the potential for deadlock. It has the downside, however, that whenever a docs client function that returns an iterator is called, the current write transaction is committed first, which has a performance penalty. This is preferable to deadlocks, IMO.

Breaking Changes

Notes & open questions

This will need tests and likely documentation of the performance implications.

Change checklist

Self-review.
Documentation updates if relevant.
Tests if relevant.
All breaking changes documented.

rklaehn · 2024-06-05T08:50:55Z

So the downside of this is that you will commit write transactions much more often than before. This makes a big difference on macOS and probably Windows, but not on Linux.

For some of the iterators (replicas and authors) it might be better to just collect into a vec. Or are we designing for millions of authors or replicas?

I guess it is a valid fix nevertheless. But: will this code be retained as we move to willow, or is it going to be replaced? If it is going to be replaced I would say this is fine if it works.

Frando · 2024-06-05T09:41:10Z

In willow we will likely have the exact same situation: Iterators from the store have to yield if the consuming stream does not process the items fast enough, so they have to be static as well.
I have not started the redb store for willow yet but my current plan would be to use the transaction/snapshot logic from the current docs store.
I agree that this is unfortunate because it defeats the purpose of the batching as soon as you are using streams at the same time. I haven't come up with a better alternative so far though.

rklaehn · 2024-06-05T09:57:35Z

I agree that this is unfortunate because it defeats the purpose of the batching as soon as you are using streams at the same time. I haven't come up with a better alternative so far though.

Not sure what can be done about it short of writing our own storage. If you want a long lived iterator, you need a snapshot. The only thing that comes to mind is to attempt to get a few items and don't create a snapshot if you get all of them.

E.g. if there are 10 authors in total, you just make a vec and use into_iter. If after 10 you are not at the end, you need a snapshot. But that's a bit of a Rube Goldberg construction that you might not want to do initially.
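
A minimal sketch of that idea, assuming a plain iterator stands in for the store iterator. `Prefetch` and `prefetch` are hypothetical names, not part of iroh-docs, and the limit of 10 is just the number from the example above.

```rust
/// Outcome of trying to read a small collection eagerly (hypothetical type).
#[derive(Debug, PartialEq)]
enum Prefetch<T> {
    /// The iterator ended within the limit; these are all the items.
    Complete(Vec<T>),
    /// More than `limit` items exist; the caller falls back to a snapshot.
    NeedsSnapshot,
}

/// Reads at most `limit` items, then checks whether anything is left.
fn prefetch<T>(iter: impl Iterator<Item = T>, limit: usize) -> Prefetch<T> {
    let mut iter = iter;
    let items: Vec<T> = iter.by_ref().take(limit).collect();
    if iter.next().is_some() {
        // Not at the end yet: the prefetched items are dropped and the
        // caller would restart from a snapshot-backed iterator.
        Prefetch::NeedsSnapshot
    } else {
        Prefetch::Complete(items)
    }
}
```

In the `Complete` case the caller can reply from `items.into_iter()` without committing the write transaction. Only the `NeedsSnapshot` case pays for a snapshot.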

rklaehn · 2024-06-05T11:02:44Z

iroh-docs/src/store/fs.rs

+    pub fn snapshot_owned(&mut self) -> Result<ReadOnlyTables> {
+        // make sure the current transaction is committed
+        self.flush()?;
+        assert!(matches!(self.transaction, CurrentTransaction::None));


What if we are already in a read transaction? (We don't use that frequently, but maybe we should).

Then flush is a no-op and this assertion will fail, as far as I can see.

E.g. you want multiple iterators without writing something.

Ah, never mind. Flush takes the transaction. But we should maybe not use flush here but just ensure that we are in read mode...

I think you should be able to just open another read txn if you are already in read mode, and then wrap that in a ReadOnlyTables...
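
A toy model of the "ensure read mode instead of always flushing" idea, to make the state transitions concrete. `TxnState`, `ToyStore`, and `ensure_read_mode` are hypothetical stand-ins; this does not touch redb or the real `CurrentTransaction` type, and a counter replaces the actual commit.

```rust
/// Simplified transaction state (stand-in for the store's real state).
#[derive(Debug, PartialEq)]
enum TxnState {
    None,
    Read,
    Write,
}

struct ToyStore {
    state: TxnState,
    /// Counts how many write transactions were committed.
    commits: u32,
}

impl ToyStore {
    fn new(state: TxnState) -> Self {
        ToyStore { state, commits: 0 }
    }

    /// Puts the store into read mode. Commits only if a write transaction
    /// is open; returns true when such a commit happened.
    fn ensure_read_mode(&mut self) -> bool {
        let committed = match self.state {
            TxnState::Write => {
                self.commits += 1; // stands in for committing the write txn
                true
            }
            // Already readable, or idle: nothing to commit.
            TxnState::Read | TxnState::None => false,
        };
        self.state = TxnState::Read;
        committed
    }
}
```

In this model, requesting several iterators in a row without writing in between costs at most one commit, since only the first call can find the store in write mode.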

rklaehn · 2024-06-05T11:06:12Z

I fully get the whole snapshot thing, unfortunate as it is. But what I don't get is why you need all that stuff with the single threaded executor.

Frando · 2024-06-05T13:58:04Z

But what I don't get is why you need all that stuff with the single threaded executor.

We have to yield from the iterator-to-channel loop if the channel is full. If we didn't, the actor can be deadlocked from the user-facing client API.

On main the following hangs at the first open call forever if the node has more docs than the capacity of the list reply channel:

let mut stream = node.docs.list().await?;
while let Some((id, _)) = stream.try_next().await.unwrap() {
    // hangs here once the list reply channel is full
    let _doc = node.docs.open(id).await?;
}

With the PR and the single threaded executor this works fine because forwarding the list iterator to the reply channel happens concurrently to processing new incoming actor messages.

rklaehn · 2024-06-05T14:10:32Z

But what I don't get is why you need all that stuff with the single threaded executor.

We have to yield from the iterator-to-channel loop if the channel is full. If we didn't, the actor can be deadlocked from the user-facing client API.

On main the following hangs at the first open call forever if the node has more docs than the capacity of the list reply channel:
let mut stream = node.docs.list().await?;
while let Some((id, _)) = stream.try_next().await.unwrap() {
   let _doc = node.docs.open(id).await?;
}
With the PR and the single threaded executor this works fine because forwarding the list iterator to the reply channel happens concurrently to processing new incoming actor messages.

Ah, now I get it. This was a thread before, no async at all. But now we need the ability to yield, hence an async executor. But we still want things to be single threaded for redb, hence a local async executor?

Frando added 2 commits June 4, 2024 23:09

fix: deadlock for list_docs

0addb83

fix: make all iterators static and spawn them on tasks

04137da

Frando force-pushed the fix-docs-deadlock branch from e53e3f5 to d9fe06e Compare June 4, 2024 22:16

Frando changed the title ~~fix(docs): deadlock in list_docs~~ fix(docs): prevent deadlocks with streams returned from docs actor Jun 4, 2024

fixup & cleanup

ada1d0a

Frando force-pushed the fix-docs-deadlock branch from d5d6057 to ada1d0a Compare June 4, 2024 22:38

Frando mentioned this pull request Jun 4, 2024

Deadlock in iroh-docs actor loop #2345

Closed

Frando marked this pull request as ready for review June 4, 2024 22:50

Frando requested a review from rklaehn June 5, 2024 08:19

rklaehn reviewed Jun 5, 2024

add test with many docs

d73f28b

rklaehn approved these changes Jun 5, 2024

Frando added this pull request to the merge queue Jun 6, 2024

Merged via the queue into main with commit 98914ee Jun 6, 2024
25 checks passed

dignifiedquire deleted the fix-docs-deadlock branch June 6, 2024 08:51
