Get fsm error refinement #1362
Merged
Conversation
We can now check the error to see if the remote has stopped sending data, which indicates that it does not have anything for the hash.
force-pushed from 9d6362f to 884b955
This allows us to surface the quinn::ReadError.
force-pushed from 510c34b to 4099e32
force-pushed from 4099e32 to e576a7b
This is a related PR for bao-tree that would allow us to have even more precise decode errors: n0-computer/bao-tree#27
Review comment on iroh-bytes/src/get.rs, lines 569 to 593 (outdated):
/// Decode error that you can get once you have sent the request and are
/// decoding the response, e.g. from [`AtBlobContent::next`].
///
/// This is similar to [`bao_tree::io::DecodeError`], but takes into account
/// that we are reading from a [`quinn::RecvStream`], so read errors will be
/// propagated as [`DecodeError::Read`], containing a [`quinn::ReadError`].
/// This carries more concrete information about the error than an [`io::Error`].
///
/// When the provider finds that it does not have a chunk that we requested,
/// or that the chunk is invalid, it will stop sending data without producing
/// an error. This is indicated by the [`DecodeError::ChunkNotFound`] variant,
/// which can be used to detect that data is missing, but that the connection
/// and the provider are otherwise healthy.
///
/// The [`DecodeError::ParentHashMismatch`] and [`DecodeError::LeafHashMismatch`]
/// variants indicate that the provider has sent us invalid data. A well-behaved
/// provider should never do this, so this is an indication that the provider is
/// not behaving correctly.
///
/// The [`DecodeError::InvalidQueryRange`] variant indicates that we requested
/// a range that is invalid for the current blob. E.g. we requested chunk 5 for
/// a blob that is only 2 chunks large.
///
/// The [`DecodeError::Io`] variant is just a fallback for any other io error
/// that is not actually a [`quinn::ReadError`].
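Given these documented variants, a caller will typically branch on the error to decide whether the peer simply lacks the data, is misbehaving, or whether the transport failed. Below is a minimal sketch of that triage; the enum here is a local stand-in mirroring the variants named in the docs above, with simplified placeholder payloads, not the real iroh-bytes type.

```rust
// Local stand-in for the documented `DecodeError`; payload types are
// simplified placeholders (the real `Read` variant carries a quinn::ReadError).
#[allow(dead_code)]
#[derive(Debug)]
enum DecodeError {
    ChunkNotFound,
    ParentHashMismatch,
    LeafHashMismatch,
    InvalidQueryRange,
    Read(String),
    Io(std::io::Error),
}

/// Decide how to treat a failed download attempt.
fn classify(err: &DecodeError) -> &'static str {
    match err {
        // Data is missing, but the connection and the provider are healthy.
        DecodeError::ChunkNotFound | DecodeError::InvalidQueryRange => "peer lacks the data",
        // The provider sent invalid data; a well-behaved provider never does this.
        DecodeError::ParentHashMismatch | DecodeError::LeafHashMismatch => "peer sent invalid data",
        // Transport-level trouble; possibly worth retrying elsewhere.
        DecodeError::Read(_) | DecodeError::Io(_) => "transport problem",
    }
}

fn main() {
    let err = DecodeError::ChunkNotFound;
    println!("{:?} -> {}", err, classify(&err));
}
```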
such nice docs, ty
Co-authored-by: Divma <[email protected]>
Checks for EOF are now done in bao-tree; the only thing we need to do is try downcasting to quinn::ReadError.
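As a rough illustration of the downcast mentioned in this commit message: the standard library lets you recover a concrete error type from an `io::Error` via `get_ref` plus `downcast_ref`. The sketch below uses a hypothetical `TransportError` in place of `quinn::ReadError` to stay self-contained.

```rust
use std::{error::Error, fmt, io};

// Hypothetical stand-in for quinn::ReadError, so the example compiles alone.
#[derive(Debug)]
struct TransportError(&'static str);

impl fmt::Display for TransportError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "transport error: {}", self.0)
    }
}

impl Error for TransportError {}

/// Try to see the concrete transport error inside an `io::Error`, if any.
fn transport_error(err: &io::Error) -> Option<&TransportError> {
    // `get_ref` exposes the wrapped error, `downcast_ref` checks its type.
    err.get_ref()?.downcast_ref::<TransportError>()
}

fn main() {
    let err = io::Error::new(io::ErrorKind::Other, TransportError("stream reset"));
    assert!(transport_error(&err).is_some());

    let plain = io::Error::new(io::ErrorKind::Other, "no inner transport error");
    assert!(transport_error(&plain).is_none());
}
```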
force-pushed from 87e7417 to 357e9e1
force-pushed from 3d1a6ae to a47324a
# Conflicts:
#   iroh-bytes/src/get.rs
#   iroh/tests/provide.rs

# Conflicts:
#   iroh-bytes/src/protocol.rs
dignifiedquire approved these changes on Aug 25, 2023
github-merge-queue bot pushed a commit that referenced this pull request on Sep 13, 2023:
## Description

Adds the `Downloader` as specified in #1334 plus some backchannel convos.

Features include:
- Support collections
- Add delays to downloads
- Add download retries with an incremental backoff (see the sketch after this description)
- Keeping peers for a bit longer than necessary in hopes they will be useful again
- Having the concept of intents and deduplicating download efforts
- Cancelling download intents
- Limiting how many concurrent requests are done in total
- Limiting how many concurrent requests are done per peer
- Limiting the number of open connections in total
- Basic error management in the form of deciding whether a peer should be dropped, the request should be dropped, or if the request should be retried

## Notes & open questions

### TODOs
- A remaining TODO in the code is whether something special should be done when dropping quic connections
- Should downloads have a timeout?
- ~<sup>I know I've said this a hundred times with a hundred different things but would love to test this as well under stress scenarios and a large number of peers. don't hate me</sup>~ In reality, after abstracting away all the IO, most scenarios can be simulated easily. What would remain for a _much_ later time, when the need and opportunity for a real-case testing scenario arises, is to tune the concurrency parameters.

### Future work

#### Downloading Ranges
There was the requirement of downloading a Hash, a range of a Hash, a collection and (not mentioned but potentially implied) ranges of collections. There is no support for ranges right now because of the great duplication of the `get` code in order to take advantage of the proper errors added in #1362. In principle, adding ranges should be really easy: it's an extension of the `DownloadKind` and would simply need calculating the missing ranges not based on the difference between what we have and the whole range, but on the given range. I would prefer to find a way to deduplicate the get code before doing this extension. Also, as far as I can tell, there is no need for this yet.

#### Prioritizing candidates per role: `Provider` and `Candidate`
A nice extension, as discussed at some point, is to differentiate candidates we know have the data from those that _might_ have the data. This has the added benefit that when a peer is available to perform another download under the concurrency limits, a hash we know they have could be downloaded right away instead of waiting for the delay. At this point, making this doesn't make sense because we will likely attempt a download before the peer has retrieved the data themselves. To implement this, we would need to add the notifications of fully downloaded hashes as available into gossip first.

#### Leveraging the info from gossip
When declaring that a hash `X` should be downloaded, it's also an option to query gossip for peers that are subscribed to the topic to which `X` belongs, to use them as candidates. This could be done by connecting the `ProviderMap` to `gossip`. For now I don't see the need to do this.

### Open questions about Future work
- In line with the described work from above, the registry only allows querying for peer candidates for a hash, since that's as good as it gets in terms of what we know from a remote right now. It's not clear to me if we would want to change this to have better availability information with #1413 in progress.
- More future work: downloading a large data set/blob from multiple peers would most likely require us to do a three-step process: 1. understanding the data layout/size; 2. splitting the download; 3. actually performing the separate downloads. Generally curious how this will end. My question here is whether we should do this for every download, or just on data that we expect to be big. Is there any way to obtain such a hint without relying on a query every single time?

## Change checklist
- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.
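The feature list above mentions download retries with an incremental backoff. Below is a hedged sketch of what such a backoff schedule could look like; the constants, the doubling policy, and the function name are illustrative assumptions, not taken from the actual `Downloader` code.

```rust
use std::time::Duration;

// Illustrative constants; the real Downloader tunes its own parameters.
const INITIAL_RETRY_DELAY: Duration = Duration::from_millis(500);
const MAX_RETRIES: u8 = 4;

/// Delay before the `retry_attempt`-th retry, doubling each time,
/// or `None` once the retry budget is exhausted.
fn next_retry_delay(retry_attempt: u8) -> Option<Duration> {
    if retry_attempt >= MAX_RETRIES {
        return None;
    }
    // 500ms, 1s, 2s, 4s, then give up.
    Some(INITIAL_RETRY_DELAY * 2u32.pow(retry_attempt as u32))
}

fn main() {
    for attempt in 0..=MAX_RETRIES {
        println!("attempt {attempt}: {:?}", next_retry_delay(attempt));
    }
}
```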
matheus23 pushed a commit that referenced this pull request on Nov 14, 2024:
matheus23 pushed a commit that referenced this pull request on Nov 14, 2024:
Description
The purpose of this refactoring is to make it easier to surface the underlying quinn errors, and to make it easier to handle the case where the remote does not have the data.
Notes & open questions
~This is WIP because there are still some changes needed in bao-tree DecodeError to handle the "blob found, but a part of the blob that was requested is not there or invalid" case and to surface quinn::ReadError~

Should NotFound or ChunkNotFound contain the hash for which something was not found? If you request a single thing, it is clear what was not found, but for a collection request it might be useful to know which hash was not found.
I don't think I want to add the offset of the chunk - that would require additional bookkeeping. Or maybe not? In any case these are things that can be done in a subsequent PR.
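To make the first open question concrete, here is a purely hypothetical shape for variants that carry the hash; `Hash` and `GetError` are placeholder names, not the crate's actual types.

```rust
// Placeholder hash type; the real crate has its own.
#[allow(dead_code)]
type Hash = [u8; 32];

#[allow(dead_code)]
enum GetError {
    /// The provider does not have the requested blob at all.
    NotFound { hash: Hash },
    /// The provider has the blob, but not the requested chunk. Carrying the
    /// hash would tell a collection request which child blob was affected.
    ChunkNotFound { hash: Hash },
}
```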
To see how this is meant to be used, look at the 2 new tests in provide.rs.
Change checklist
- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.

Implements #1348 and #962