Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(docs): prevent deadlocks with streams returned from docs actor #2346

Merged
merged 4 commits into from
Jun 6, 2024

Conversation

Frando
Copy link
Member

@Frando Frando commented Jun 4, 2024

Description

Fixes #2345

The iroh-docs actor loop can easily be deadlocked from the client side: If you call any RPC method that returns a stream, and the stream is longer than what the RPC layer buffers, and you call and await any other docs method while consuming the stream, the docs actor will deadlock.
(It will only happen though if the stream is longer than the capacity of the intermediate channel that goes from the actor to the RPC layer, which is why this does not always happen)

This is the case for all methods that return iterators. The solution is twofold:

  • Run single-threaded executor in iroh-docs actor loop
  • For actions returning iterators/streams, spawn a task on that executor to forward the store iterator into the stream, yielding when the receiver is not consuming fast enough

To be able to spawn the iterators onto a task, they have to be 'static. Which they can be - but only when operating on snapshots.

So this PR fixes the potential for deadlock. It has the downside, however, that whenever calling a docs client function that returns an iterator, the current write transaction will be committed first, which has a perfomance penalty. However this is preferable to deadlocks, IMO.

Breaking Changes

Notes & open questions

This will need tests and likely documentation of the perfomance implications.

Change checklist

  • Self-review.
  • Documentation updates if relevant.
  • Tests if relevant.
  • All breaking changes documented.

@Frando Frando changed the title fix(docs): deadlock in list_docs fix(docs): prevent deadlocks with streams returned from docs actor Jun 4, 2024
@Frando Frando marked this pull request as ready for review June 4, 2024 22:50
@Frando Frando requested a review from rklaehn June 5, 2024 08:19
@rklaehn
Copy link
Contributor

rklaehn commented Jun 5, 2024

So the downside of this is that you will commit write transactions much more often than before. This makes a big difference on mac and probably windows, but not on linux.

For some of the iterators (replicas and authors) it might be better to just collect into a vec. Or are we designing for millions of authors or replicas?

I guess it is a valid fix nevertheless. But: will this code be retained as we move to willow, or is it going to be replaced? If it is going to be replaced I would say this is fine if it works.

@Frando
Copy link
Member Author

Frando commented Jun 5, 2024

In willow we will likely have the exact same situation: Iterators from the store have to yield if the consuming stream does not process the items fast enough, so they have to be static as well.
I have not started the redb store for willow yet but my current plan would be to use the transaction/snapshot logic from the current docs store.
I agree that this is unfortunate because it defeats the purpose of the batching as soon as you are using streams at the same time. I haven't come up with a better alternative so far though.

@rklaehn
Copy link
Contributor

rklaehn commented Jun 5, 2024

I agree that this is unfortunate because it defeats the purpose of the batching as soon as you are using streams at the same time. I haven't come up with a better alternative so far though.

Not sure what can be done about it short of writing our own storage. If you want a long lived iterator, you need a snapshot. The only thing that comes to mind is to attempt to get a few items and don't create a snapshot if you get all of them.

E.g. if there are 10 authors in total, you just make a vec and use into_iter. If after 10 you are not at the end, you need a snapshot. But that's a bit of a rube goldberg construction that you might not want to do initially.

pub fn snapshot_owned(&mut self) -> Result<ReadOnlyTables> {
// make sure the current transaction is committed
self.flush()?;
assert!(matches!(self.transaction, CurrentTransaction::None));
Copy link
Contributor

@rklaehn rklaehn Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we are already in a read transaction? (We don't use that frequently, but maybe we should).

Then flush is a noop and this assertion will fail as far as I can see.

E.g. you want multiple iterators without writing something.

Ah, never mind. Flush takes the transaction. But we should maybe not use flush here but just ensure that we are in read mode...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should be able to just open another read txn if you are already in read mode, and then wrap that in a ReadonlyTables...

@rklaehn
Copy link
Contributor

rklaehn commented Jun 5, 2024

I fully get the whole snapshot thing, unfortunate as it is. But what I don't get is why you need all that stuff with the single threaded executor.

@Frando
Copy link
Member Author

Frando commented Jun 5, 2024

But what I don't get is why you need all that stuff with the single threaded executor.

We have to yield from the iterator-to-channel loop if the channel is full. If we didn't, the actor can be deadlocked from the user-facing client API.

On main the following hangs at the first open call forever if the node has more docs than the capacity of the list reply channel:

let mut stream = node.docs.list().await?;
while let Some((id, _)) = stream.try_next().await.unwrap() {
   let _doc = node.docs.open(id).await?;
}

With the PR and the single threaded executor this works fine because forwarding the list iterator to the reply channel happens concurrently to processing new incoming actor messages.

@rklaehn
Copy link
Contributor

rklaehn commented Jun 5, 2024

But what I don't get is why you need all that stuff with the single threaded executor.

We have to yield from the iterator-to-channel loop if the channel is full. If we didn't, the actor can be deadlocked from the user-facing client API.

On main the following hangs at the first open call forever if the node has more docs than the capacity of the list reply channel:

let mut stream = node.docs.list().await?;
while let Some((id, _)) = stream.try_next().await.unwrap() {
   let _doc = node.docs.open(id).await?;
}

With the PR and the single threaded executor this works fine because forwarding the list iterator to the reply channel happens concurrently to processing new incoming actor messages.

Ah, now I get it. This was a thread before, no async at all. But now we need the ability to yield, hence an async executor. But we still want things to be single threaded for redb, hence a local async executor?

@Frando Frando added this pull request to the merge queue Jun 6, 2024
Merged via the queue into main with commit 98914ee Jun 6, 2024
25 checks passed
@dignifiedquire dignifiedquire deleted the fix-docs-deadlock branch June 6, 2024 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Deadlock in iroh-docs actor loop
2 participants