Get fsm error refinement #1362
Merged
Conversation
We can now check the error to see if the remote has stopped sending data, which indicates that it does not have anything for the hash.
force-pushed from 9d6362f to 884b955
This allows us to surface the quinn::ReadError.
force-pushed from 510c34b to 4099e32
force-pushed from 4099e32 to e576a7b
This is a related PR for bao-tree that would allow us to have even more precise decode errors: n0-computer/bao-tree#27
Review comment on iroh-bytes/src/get.rs, lines 569 to 593 (outdated):
/// Decode error that you can get once you have sent the request and are
/// decoding the response, e.g. from [`AtBlobContent::next`].
///
/// This is similar to [`bao_tree::io::DecodeError`], but takes into account
/// that we are reading from a [`quinn::RecvStream`], so read errors will be
/// propagated as [`DecodeError::Read`], containing a [`quinn::ReadError`].
/// This carries more concrete information about the error than an [`io::Error`].
///
/// When the provider finds that it does not have a chunk that we requested,
/// or that the chunk is invalid, it will stop sending data without producing
/// an error. This is indicated by the [`DecodeError::ChunkNotFound`] variant,
/// which can be used to detect that data is missing, but that the connection
/// and the provider are otherwise healthy.
///
/// The [`DecodeError::ParentHashMismatch`] and [`DecodeError::LeafHashMismatch`]
/// variants indicate that the provider has sent us invalid data. A well-behaved
/// provider should never do this, so this is an indication that the provider is
/// not behaving correctly.
///
/// The [`DecodeError::InvalidQueryRange`] variant indicates that we requested
/// a range that is invalid for the current blob. E.g. we requested chunk 5 for
/// a blob that is only 2 chunks large.
///
/// The [`DecodeError::Io`] variant is just a fallback for any other io error
/// that is not actually a [`quinn::ReadError`].
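Given these documented variants, a caller will typically branch on the error to decide whether the peer simply lacks the data, is misbehaving, or whether the transport failed. Below is a minimal sketch of that triage; the enum here is a local stand-in mirroring the variants named in the docs above, with simplified placeholder payloads, not the real iroh-bytes type.

```rust
// Local stand-in for the documented `DecodeError`; payload types are
// simplified placeholders (the real `Read` variant carries a quinn::ReadError).
#[allow(dead_code)]
#[derive(Debug)]
enum DecodeError {
    ChunkNotFound,
    ParentHashMismatch,
    LeafHashMismatch,
    InvalidQueryRange,
    Read(String),
    Io(std::io::Error),
}

/// Decide how to treat a failed download attempt.
fn classify(err: &DecodeError) -> &'static str {
    match err {
        // Data is missing, but the connection and the provider are healthy.
        DecodeError::ChunkNotFound | DecodeError::InvalidQueryRange => "peer lacks the data",
        // The provider sent invalid data; a well-behaved provider never does this.
        DecodeError::ParentHashMismatch | DecodeError::LeafHashMismatch => "peer sent invalid data",
        // Transport-level trouble; possibly worth retrying elsewhere.
        DecodeError::Read(_) | DecodeError::Io(_) => "transport problem",
    }
}

fn main() {
    let err = DecodeError::ChunkNotFound;
    println!("{:?} -> {}", err, classify(&err));
}
```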
such nice docs, ty
Co-authored-by: Divma <[email protected]>
Checks for EOF are now done in bao-tree; the only thing we need to do is try downcasting to quinn::ReadError.
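As a rough illustration of the downcast mentioned in this commit message: the standard library lets you recover a concrete error type from an `io::Error` via `get_ref` plus `downcast_ref`. The sketch below uses a hypothetical `TransportError` in place of `quinn::ReadError` to stay self-contained.

```rust
use std::{error::Error, fmt, io};

// Hypothetical stand-in for quinn::ReadError, so the example compiles alone.
#[derive(Debug)]
struct TransportError(&'static str);

impl fmt::Display for TransportError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "transport error: {}", self.0)
    }
}

impl Error for TransportError {}

/// Try to see the concrete transport error inside an `io::Error`, if any.
fn transport_error(err: &io::Error) -> Option<&TransportError> {
    // `get_ref` exposes the wrapped error, `downcast_ref` checks its type.
    err.get_ref()?.downcast_ref::<TransportError>()
}

fn main() {
    let err = io::Error::new(io::ErrorKind::Other, TransportError("stream reset"));
    assert!(transport_error(&err).is_some());

    let plain = io::Error::new(io::ErrorKind::Other, "no inner transport error");
    assert!(transport_error(&plain).is_none());
}
```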
force-pushed from 87e7417 to 357e9e1
force-pushed from 3d1a6ae to a47324a
# Conflicts:
#   iroh-bytes/src/get.rs
#   iroh/tests/provide.rs

# Conflicts:
#   iroh-bytes/src/protocol.rs
dignifiedquire approved these changes on Aug 25, 2023
github-merge-queue bot pushed a commit that referenced this pull request on Sep 13, 2023:
## Description

Adds the `Downloader` as specified in #1334 plus some backchannel convos.

Features include:
- Support collections
- Add delays to downloads
- Add download retries with an incremental backoff (see the sketch after this description)
- Keeping peers for a bit longer than necessary in hopes they will be useful again
- Having the concept of intents and deduplicating download efforts
- Cancelling download intents
- Limiting how many concurrent requests are done in total
- Limiting how many concurrent requests are done per peer
- Limiting the number of open connections in total
- Basic error management in the form of deciding whether a peer should be dropped, the request should be dropped, or if the request should be retried

## Notes & open questions

### TODOs
- A remaining TODO in the code is whether something special should be done when dropping quic connections
- Should downloads have a timeout?
- ~<sup>I know I've said this a hundred times with a hundred different things but would love to test this as well under stress scenarios and a large number of peers. don't hate me</sup>~ In reality, after abstracting away all the IO, most scenarios can be simulated easily. What would remain for a _much_ later time, when the need and opportunity for a real-case testing scenario arises, is to tune the concurrency parameters.

### Future work

#### Downloading Ranges
There was the requirement of downloading a Hash, a range of a Hash, a collection and (not mentioned but potentially implied) ranges of collections. There is no support for ranges right now because of the great duplication of the `get` code in order to take advantage of the proper errors added in #1362. In principle, adding ranges should be really easy: it's an extension of the `DownloadKind` and would simply need calculating the missing ranges not based on the difference between what we have and the whole range, but on the given range. I would prefer to find a way to deduplicate the get code before doing this extension. Also, as far as I can tell, there is no need for this yet.

#### Prioritizing candidates per role: `Provider` and `Candidate`
A nice extension, as discussed at some point, is to differentiate candidates we know have the data from those that _might_ have the data. This has the added benefit that when a peer is available to perform another download under the concurrency limits, a hash we know they have could be downloaded right away instead of waiting for the delay. At this point, making this doesn't make sense because we will likely attempt a download before the peer has retrieved the data themselves. To implement this, we would need to add the notifications of fully downloaded hashes as available into gossip first.

#### Leveraging the info from gossip
When declaring that a hash `X` should be downloaded, it's also an option to query gossip for peers that are subscribed to the topic to which `X` belongs, to use them as candidates. This could be done by connecting the `ProviderMap` to `gossip`. For now I don't see the need to do this.

### Open questions about Future work
- In line with the described work from above, the registry only allows querying for peer candidates for a hash, since that's as good as it gets in terms of what we know from a remote right now. It's not clear to me if we would want to change this to have better availability information with #1413 in progress.
- More future work: downloading a large data set/blob from multiple peers would most likely require us to do a three-step process: 1. understanding the data layout/size; 2. splitting the download; 3. actually performing the separate downloads. Generally curious how this will end. My question here is whether we should do this for every download, or just on data that we expect to be big. Is there any way to obtain such a hint without relying on a query every single time?

## Change checklist
- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.
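The feature list above mentions download retries with an incremental backoff. Below is a hedged sketch of what such a backoff schedule could look like; the constants, the doubling policy, and the function name are illustrative assumptions, not taken from the actual `Downloader` code.

```rust
use std::time::Duration;

// Illustrative constants; the real Downloader tunes its own parameters.
const INITIAL_RETRY_DELAY: Duration = Duration::from_millis(500);
const MAX_RETRIES: u8 = 4;

/// Delay before the `retry_attempt`-th retry, doubling each time,
/// or `None` once the retry budget is exhausted.
fn next_retry_delay(retry_attempt: u8) -> Option<Duration> {
    if retry_attempt >= MAX_RETRIES {
        return None;
    }
    // 500ms, 1s, 2s, 4s, then give up.
    Some(INITIAL_RETRY_DELAY * 2u32.pow(retry_attempt as u32))
}

fn main() {
    for attempt in 0..=MAX_RETRIES {
        println!("attempt {attempt}: {:?}", next_retry_delay(attempt));
    }
}
```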
matheus23 pushed a commit that referenced this pull request on Nov 14, 2024:
matheus23 pushed a commit that referenced this pull request on Nov 14, 2024:
Description
The purpose of this refactoring is to make it easier to surface the underlying quinn errors, and to make it easier to handle the case where the remote does not have the data.
Notes & open questions
~This is WIP because there are still some changes needed in bao-tree DecodeError to handle the "blob found, but a part of the blob that was requested is not there or invalid" case and to surface quinn::ReadError~

Should NotFound or ChunkNotFound contain the hash for which something was not found? If you request a single thing, it is clear what was not found, but for a collection request it might be useful to know which hash was not found.
I don't think I want to add the offset of the chunk - that would require additional bookkeeping. Or maybe not? In any case these are things that can be done in a subsequent PR.
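To make the first open question concrete, here is a purely hypothetical shape for variants that carry the hash; `Hash` and `GetError` are placeholder names, not the crate's actual types.

```rust
// Placeholder hash type; the real crate has its own.
#[allow(dead_code)]
type Hash = [u8; 32];

#[allow(dead_code)]
enum GetError {
    /// The provider does not have the requested blob at all.
    NotFound { hash: Hash },
    /// The provider has the blob, but not the requested chunk. Carrying the
    /// hash would tell a collection request which child blob was affected.
    ChunkNotFound { hash: Hash },
}
```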
To see how this is meant to be used, look at the 2 new tests in provide.rs.
Change checklist
- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.

Implements #1348 and #962