-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added fixes for handling fetch_range extending beyond length of the file #247
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for a great package 🎉 I'm new to both adlfs
and fsspec
, but we had a use case which apparently adlfs
solves brilliantly.
Context: We have been facing the following challenge: Downloading a very small subset of columns from a large .parquet
file has very poor performance on adlfs>=0.3
. For adlfs<0.3
the performance is good.
At least in my case, I've pinned it down to the same thing as pointed out in #241: More bytes than necessary are downloaded from blob storage.
I have confirmed that changing the input to Azure's download_blob
from length=end
to length=end-start
gives same performance also on adlfs>=0.3
in my case, as for adlfs<0.3
. I.e. it might be that this PR also fix #57 at the same time (not 100% sure if the issue over there is similar, but it involves the same >=< 0.3
version border at least) .
adlfs/spec.py
Outdated
@@ -1769,20 +1772,23 @@ def connect_client(self): | |||
f"Unable to fetch container_client with provided params for {e}!!" | |||
) | |||
|
|||
async def _async_fetch_range(self, start: int, end: int, **kwargs): | |||
async def _async_fetch_range(self, start: int, length: int = None, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is ultimately overriding _fetch_range
in fsspec.spec.AbstractBufferedFile
right? If so, I guess the function signature needs to be kept start
and end
here, since that is what fsspec
assumes?
async with self.container_client: | ||
stream = await self.container_client.download_blob( | ||
blob=self.blob, offset=start, length=end | ||
blob=self.blob, offset=start, length=length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...and then this line instead simply becomes
blob=self.blob, offset=start, length=length | |
blob=self.blob, offset=start, length=end-start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @anders-kiaer. I appreciate the feedback, and the confirmation that this issue resolves the challenge cited in #57.
I'll align the params in fetch_range to fsspec.AbstractBufferedFile, but I also want to account for the situation where end > self.size, and also for the eventuality that length is None, which is valid for the Azure SDK. Can you take a look at this branch and provide feedback on performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a quick benchmark using a 2.2 GB .parquet
file in Azure blob storage (106145 rows x 6837 columns),
and timed how much time it took to extract one column.
Results:
adlfs==0.2.4
+azure-storage-blob==2.1.0
: 0.68 ± 0.11 secondsadlfs==0.7.6
+azure-storage-blob==12.8.1
: 47.1 ± 1.22 secondsadlfs==(this branch)
+azure-storage-blob==12.8.1
: 0.64 ± 0.09 seconds
So that is a decent 99% reduction in execution time compared to 0.7.6
, and also slightly faster (however well within 1σ though) than 0.2.4
in my unofficial benchmark + with my specific .parquet
file. The performance improvement would obviously depend on where in the blob/file you want to extract from (extracting things from the middle of the blob/file would see a huge performance benefit after this PR, extracting from the beginning or end of the file will see the same performance as before the PR).
Fix for #241