Added fixes for handling fetch_range extending beyond length of the file #247

hayesgb · 2021-06-09T15:15:10Z

Fix for #241

anders-kiaer

Thanks for a great package 🎉 I'm new to both adlfs and fsspec, but we had a use case which apparently adlfs solves brilliantly.

Context: We have been facing the following challenge: Downloading a very small subset of columns from a large .parquet file has very poor performance on adlfs>=0.3. For adlfs<0.3 the performance is good.

At least in my case, I've pinned it down to the same thing as pointed out in #241: More bytes than necessary are downloaded from blob storage.

I have confirmed that changing the input to Azure's download_blob from length=end to length=end-start gives the same performance on adlfs>=0.3 as on adlfs<0.3 in my case. So this PR might also fix #57 (I'm not 100% sure the issue over there is the same, but it at least involves the same 0.3 version boundary).
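To make the difference concrete, here is a minimal sketch using an in-memory stand-in for the blob. The `download` helper and the `BLOB` data are hypothetical and only mimic the offset/length semantics described above; this is not the Azure SDK itself.

```python
# Stand-in for a blob service: like download_blob, it takes an offset and a
# *length* (a byte count), not an absolute end position.
BLOB = bytes(range(256)) * 4  # 1024 bytes of dummy data


def download(offset: int, length: int) -> bytes:
    """Return up to `length` bytes starting at `offset` (hypothetical helper)."""
    return BLOB[offset:offset + length]


start, end = 500, 520  # the caller wants bytes [500, 520), i.e. 20 bytes

too_many = download(offset=start, length=end)            # call before this PR
just_right = download(offset=start, length=end - start)  # proposed call

print(len(too_many), len(just_right))  # 520 20
```

Both calls return the requested bytes at the front of the result, which is presumably why the problem shows up as slowness rather than wrong data.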

anders-kiaer · 2021-06-12T18:21:24Z

adlfs/spec.py

@@ -1769,20 +1772,23 @@ def connect_client(self):
                f"Unable to fetch container_client with provided params for {e}!!"
            )

-    async def _async_fetch_range(self, start: int, end: int, **kwargs):
+    async def _async_fetch_range(self, start: int, length: int = None, **kwargs):


This function ultimately overrides _fetch_range in fsspec.spec.AbstractBufferedFile, right? If so, I guess the function signature needs to keep start and end here, since that is what fsspec assumes?
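A rough sketch of why the signature matters, using simplified stand-in classes rather than fsspec itself. `BufferedFileBase`, `read_range`, and `InMemoryFile` are hypothetical names for illustration; the only assumption taken from fsspec is that the base class calls `_fetch_range(start, end)` with an absolute end position.

```python
class BufferedFileBase:
    """Simplified stand-in for a base class that calls _fetch_range(start, end)."""

    def read_range(self, start: int, end: int) -> bytes:
        # The base class always passes an absolute end position.
        return self._fetch_range(start, end)

    def _fetch_range(self, start: int, end: int) -> bytes:
        raise NotImplementedError


class InMemoryFile(BufferedFileBase):
    def __init__(self, data: bytes):
        self.data = data

    def _fetch_range(self, start: int, end: int) -> bytes:
        # Keep the (start, end) signature; convert to a length internally.
        length = end - start
        return self.data[start:start + length]


f = InMemoryFile(b"0123456789")
print(f.read_range(2, 5))  # b'234'
```

If the override instead interpreted its second argument as a length, the same call from the base class would silently return a different range.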

anders-kiaer · 2021-06-12T18:22:21Z

adlfs/spec.py

        async with self.container_client:
            stream = await self.container_client.download_blob(
-                blob=self.blob, offset=start, length=end
+                blob=self.blob, offset=start, length=length


...and then this line instead simply becomes

Suggested change

- blob=self.blob, offset=start, length=length
+ blob=self.blob, offset=start, length=end-start

hayesgb · Jun 12, 2021

Thanks @anders-kiaer. I appreciate the feedback, and the confirmation that this fix resolves the challenge cited in #57.

I'll align the params in fetch_range to fsspec.AbstractBufferedFile, but I also want to account for the situation where end > self.size, and also for the eventuality that length is None, which is valid for the Azure SDK. Can you take a look at this branch and provide feedback on performance?
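One possible way to cover both cases, sketched as a standalone helper. `download_length` is a hypothetical name and this is not necessarily what the branch does; it only illustrates clamping `end` to the file size and passing `None` through unchanged.

```python
from typing import Optional


def download_length(start: int, end: Optional[int], size: int) -> Optional[int]:
    """Translate a (start, end) range into a length argument.

    Returns None when no end is given, meaning "read to the end of the blob".
    """
    if end is None:
        return None
    end = min(end, size)        # never ask past the end of the file
    return max(end - start, 0)  # empty range if start is at or past end


print(download_length(10, 20, size=100))    # 10
print(download_length(90, 200, size=100))   # 10 (end clamped to size)
print(download_length(10, None, size=100))  # None
```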

anders-kiaer

I ran a quick benchmark using a 2.2 GB .parquet file in Azure blob storage (106145 rows x 6837 columns), and timed how long it took to extract one column.

Results:

adlfs==0.2.4 + azure-storage-blob==2.1.0: 0.68 ± 0.11 seconds
adlfs==0.7.6 + azure-storage-blob==12.8.1: 47.1 ± 1.22 seconds
adlfs==(this branch) + azure-storage-blob==12.8.1: 0.64 ± 0.09 seconds

So that is a decent 99% reduction in execution time compared to 0.7.6, and also slightly faster than 0.2.4 (though well within 1σ) in my unofficial benchmark with my specific .parquet file. The performance improvement will obviously depend on where in the blob/file you extract from: extracting from the middle of the blob/file sees a huge performance benefit after this PR, while extracting from the beginning or end of the file sees the same performance as before the PR.
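The position dependence can be sketched with plain arithmetic. This assumes the service simply truncates a request that runs past the end of the blob; `bytes_fetched` and the 1 MB `CHUNK` size are hypothetical and chosen only for illustration.

```python
def bytes_fetched(start: int, end: int, size: int, fixed: bool) -> int:
    """Bytes returned for offset=start with the old or the fixed length."""
    length = (end - start) if fixed else end
    # Assumption: a request past the end of the blob is truncated, not an error.
    return max(min(length, size - start), 0)


SIZE = 2_200_000_000  # roughly the 2.2 GB file from the benchmark
CHUNK = 1_000_000     # hypothetical 1 MB column chunk

results = {}
for label, start in [("beginning", 0), ("middle", SIZE // 2), ("end", SIZE - CHUNK)]:
    end = start + CHUNK
    before = bytes_fetched(start, end, SIZE, fixed=False)
    after = bytes_fetched(start, end, SIZE, fixed=True)
    results[label] = (before, after)
    print(f"{label:>9}: before={before:>13,} after={after:>9,}")
```

Under these assumptions only the read from the middle over-fetches (about 1.1 GB instead of 1 MB); at the beginning `end` equals the length, and at the end the blob size caps the download.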


Added fixes for handling fetch_range extending beyond length of the file

c937e69

anders-kiaer reviewed Jun 12, 2021


anders-kiaer mentioned this pull request Jun 12, 2021

azure-storage-blob >=v12 causes slow, high memory dd.read_csv #57

Closed

hayesgb added 2 commits June 12, 2021 17:07

Align fetch_range to fsspec

09233dd

Updated tests to remove underscores from container names

1dbbf55

anders-kiaer reviewed Jun 13, 2021


hayesgb and others added 12 commits June 13, 2021 09:23

Added check to mkdir on container names that validates compliance with Azure requirements

793cd42

Linting

87d18a2

Updated release version in CHANGELOG to calendar convention

ec6e21d

Update adlfs/spec.py

5f447ea

Co-authored-by: Anders Fredrik Kiær <[email protected]>

Updated docs on _fetch_range

5070722

Merge branch 'fix_fetchrange' of https://github.com/dask/adlfs into fix_fetchrange

e77cca6

rework check of container_name in mkdir

16e70a9

test mkdir with exception handling

fd75e68

Updated exception handling in get_properties method

575020b

Moved exception handling in _container_exists

2f78da1

Improved error handling in mkdir when creating container

98ebada

Updatec CHANGELOG

c82ef98

hayesgb merged commit 6cfde2d into master Jun 14, 2021

hayesgb deleted the fix_fetchrange branch June 14, 2021 12:46

hayesgb mentioned this pull request Jun 14, 2021

AzureBlobFile._fetch_range fetches too many bytes? #241

Closed
