
Optimized fsspec data transfer for remote file-systems #9265

Merged: 29 commits merged into rapidsai:branch-21.10 on Sep 22, 2021

Conversation

@rjzamora (Member) commented Sep 21, 2021:

This PR strips the pyarrow NativeFile component out of #9225 (since those changes are not yet stable). I feel it is reasonable to start by merging these fsspec-specific optimizations for 21.10, because they are stable and already result in a significant performance boost over the existing approach to remote storage. I still think it is very important that we eventually plumb NativeFile support into Python (cudf and dask_cudf), but we will likely need to target 21.12 for that improvement.

@rjzamora rjzamora requested review from a team as code owners September 21, 2021 15:47
@rjzamora rjzamora self-assigned this Sep 21, 2021
@github-actions bot added the "Python" label (Affects Python cuDF API) Sep 21, 2021
@rjzamora added the "dask", "improvement", "non-breaking", and "Performance" labels Sep 21, 2021
@rjzamora (Member, Author) left a comment:

Adding some notes.

Comment on lines 63 to 67
    iotypes=(BytesIO, StringIO),
    byte_ranges=[byte_range] if byte_range else None,
    clip_dummy_buffer=True if byte_range else False,
    **kwargs,
)
@rjzamora (Member, Author):
Notes:

  • When byte_range is specified in a read_csv call, we only need to transfer that byte range from remote storage (see the usage sketch after this list).
  • By default, the byte ranges that are transferred from remote storage will be copied into a local "dummy buffer". We call this a dummy buffer because it is likely to contain many "empty" bytes that libcudf will ultimately ignore.
  • We use clip_dummy_buffer=True to avoid actually allocating those "empty" bytes when byte_range= is specified. This option informs get_filepath_or_buffer that the local dummy buffer does not need to be the size of the entire remote file, and can be clipped down to the exact byte_range size.
  • When clip_dummy_buffer=True, the byte_range argument passed down to libcudf must be adjusted to a zero offset (see the code block below).
  • The clip_dummy_buffer=True optimization cannot be used for parquet (for now), because the footer metadata includes many file-specific column-chunk offsets.
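
A minimal usage sketch (the bucket/object below are hypothetical, and s3fs is assumed to be installed): with byte_range specified, only that slice of the remote CSV should be transferred into the local dummy buffer, and clip_dummy_buffer keeps that buffer at exactly the requested size.

    import cudf

    # Read 1 MiB starting at offset 0 of a (hypothetical) remote file;
    # byte_range is (offset, size) in bytes.
    df = cudf.read_csv(
        "s3://my-bucket/data.csv",       # hypothetical bucket/object
        byte_range=(0, 1_048_576),
        storage_options={"anon": True},  # fsspec-style credentials/options
    )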

Comment on lines 69 to 73
# Adjust byte_range for clipped dummy buffers
use_byte_range = byte_range
if byte_range and isinstance(filepath_or_buffer, BytesIO):
    if byte_range[1] == filepath_or_buffer.getbuffer().nbytes:
        use_byte_range = (0, byte_range[1])
@rjzamora (Member, Author):
As discussed above, this is where we reset byte_range to a zero offset when we are using a clipped local dummy buffer. For example, if the user requested byte_range=(1000, 500) and the dummy buffer was clipped to 500 bytes, libcudf must be passed (0, 500).

Comment on lines 165 to 205
def _process_row_groups(paths, fs, filters=None, row_groups=None):

    # Deal with case that the user passed in a directory name
    file_list = paths
    if len(paths) == 1 and ioutils.is_directory(paths[0]):
        paths = ioutils.stringify_pathlike(paths[0])

    # Convert filters to ds.Expression
    if filters is not None:
        filters = pq._filters_to_expression(filters)

    # Initialize ds.FilesystemDataset
    dataset = ds.dataset(
        paths, filesystem=fs, format="parquet", partitioning="hive",
    )
    file_list = dataset.files
    if len(file_list) == 0:
        raise FileNotFoundError(f"{paths} could not be resolved to any files")

    if filters is not None:
        # Load IDs of filtered row groups for each file in dataset
        filtered_rg_ids = defaultdict(list)
        for fragment in dataset.get_fragments(filter=filters):
            for rg_fragment in fragment.split_by_row_group(filters):
                for rg_info in rg_fragment.row_groups:
                    filtered_rg_ids[rg_fragment.path].append(rg_info.id)

        # Initialize row_groups to be selected
        if row_groups is None:
            row_groups = [None for _ in dataset.files]

        # Store IDs of selected row groups for each file
        for i, file in enumerate(dataset.files):
            if row_groups[i] is None:
                row_groups[i] = filtered_rg_ids[file]
            else:
                # Materialize the intersection eagerly; a lazy filter()
                # would capture row_groups[i] after it is reassigned
                selected = set(row_groups[i])
                row_groups[i] = [
                    rg_id
                    for rg_id in filtered_rg_ids[file]
                    if rg_id in selected
                ]

    return file_list, row_groups
@rjzamora (Member, Author):
This PR copies the existing filtering logic into a dedicated helper function. The general purpose of this function is to (1) expand directory input into a list of paths (using the pyarrow dataset API), and (2) apply row-group filters. A hypothetical usage sketch follows below.

Main take-away: This is just a re-organization of existing logic.
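
A minimal usage sketch of the helper above (the remote paths, filesystem, and filter values are illustrative assumptions, not taken from this PR):

    import fsspec

    fs = fsspec.filesystem("s3", anon=True)  # hypothetical remote filesystem
    file_list, row_groups = _process_row_groups(
        ["my-bucket/dataset-dir"],           # a single directory input
        fs,
        filters=[("year", "==", 2020)],      # DNF-like list of tuples
    )
    # file_list: every parquet file discovered under the directory
    # row_groups: per-file lists of row-group IDs that pass the filter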

Contributor:
The general purpose of this function is to (1) expand directory input into a list of paths (using the pyarrow dataset API), and (2) to apply row-group filters.

Perhaps this could be a helpful comment to add at the top of this function?

Comment on lines 208 to 269
def _get_byte_ranges(file_list, row_groups, columns, fs):

    if row_groups is None:
        if columns is None:
            return None, None, None  # No reason to construct this
        row_groups = [None for path in file_list]

    # Construct a list of required byte-ranges for every file
    all_byte_ranges, all_footers, all_sizes = [], [], []
    for path, rgs in zip(file_list, row_groups):

        # Step 0 - Get size of file
        if fs is None:
            file_size = path.size
        else:
            file_size = fs.size(path)

        # Step 1 - Get 32 KB from tail of file.
        #
        # This "sample size" can be tunable, but should
        # always be >= 8 bytes (so we can read the footer size)
        tail_size = min(32_000, file_size)
        if fs is None:
            path.seek(file_size - tail_size)
            footer_sample = path.read(tail_size)
        else:
            footer_sample = fs.tail(path, tail_size)

        # Step 2 - Read the footer size and re-read a larger
        # tail if necessary
        footer_size = int.from_bytes(footer_sample[-8:-4], "little")
        if tail_size < (footer_size + 8):
            if fs is None:
                path.seek(file_size - (footer_size + 8))
                footer_sample = path.read(footer_size + 8)
            else:
                footer_sample = fs.tail(path, footer_size + 8)

        # Step 3 - Collect required byte ranges
        byte_ranges = []
        md = pq.ParquetFile(io.BytesIO(footer_sample)).metadata
        for r in range(md.num_row_groups):
            # Skip this row-group if we are targeting
            # specific row-groups
            if rgs is None or r in rgs:
                row_group = md.row_group(r)
                for c in range(row_group.num_columns):
                    column = row_group.column(c)
                    name = column.path_in_schema
                    # Skip this column if we are targeting
                    # specific columns
                    if columns is None or name in columns:
                        file_offset0 = column.dictionary_page_offset
                        if file_offset0 is None:
                            file_offset0 = column.data_page_offset
                        num_bytes = column.total_uncompressed_size
                        byte_ranges.append((file_offset0, num_bytes))

        all_byte_ranges.append(byte_ranges)
        all_footers.append(footer_sample)
        all_sizes.append(file_size)

    return all_byte_ranges, all_footers, all_sizes
@rjzamora (Member, Author):
The _get_byte_ranges utility is new logic. This is where we collect a footer-metadata sample from each parquet file, and then use that metadata to define the exact byte ranges needed to read the target column-chunks from the file. This utility is only used for remote storage. The calculated byte-range information is used within cudf.io.ioutils.get_filepath_or_buffer (which uses _fsspec_data_transfer to convert non-local fsspec file objects into local byte buffers). A sketch of the footer layout this relies on follows below.
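
For illustration, a condensed restatement of Steps 1-3 above (footer_metadata is a hypothetical helper name; the layout itself is standard parquet):

    import io
    import pyarrow.parquet as pq

    def footer_metadata(footer_sample: bytes):
        # The last 8 bytes of a parquet file are a 4-byte little-endian
        # footer length followed by the 4-byte magic b"PAR1".
        assert footer_sample[-4:] == b"PAR1", "not a parquet tail"
        footer_size = int.from_bytes(footer_sample[-8:-4], "little")
        assert len(footer_sample) >= footer_size + 8, "tail sample too small"
        return pq.ParquetFile(io.BytesIO(footer_sample)).metadata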

Contributor:

Ditto as above -- this is super helpful and could perhaps be useful in the code as a comment!

codecov bot commented Sep 21, 2021

Codecov Report

Merging #9265 (13c8f5b) into branch-21.10 (3ee3ecf) will increase coverage by 0.01%.
The diff coverage is 11.96%.


@@               Coverage Diff                @@
##           branch-21.10    #9265      +/-   ##
================================================
+ Coverage         10.85%   10.87%   +0.01%     
================================================
  Files               115      116       +1     
  Lines             19158    19328     +170     
================================================
+ Hits               2080     2102      +22     
- Misses            17078    17226     +148     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/_lib/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/io/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/text.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/parquet.py 89.61% <89.28%> (+0.34%) ⬆️
python/dask_cudf/dask_cudf/accessors.py 89.74% <0.00%> (-1.93%) ⬇️
python/cudf/cudf/utils/dtypes.py 0.00% <0.00%> (ø)
... and 3 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

python/cudf/cudf/io/csv.py: outdated review thread (resolved)

# Convert filters to ds.Expression
if filters is not None:
    filters = pq._filters_to_expression(filters)
Contributor:

Are we OK using a non-public arrow API here? Is there a public alternative, or is it worth vendoring?

@rjzamora (Member, Author):

Good question! This same issue came up in Dask a few times (where we are also using this "private" utility).
@jorisvandenbossche - Is this still the recommended way to translate filters?

Note that this particular line was not actually added in this PR, but it would be good to change it if there is a new "public" API.

@jorisvandenbossche:
Although private, it's knowingly being used in several places (such as Dask), so we won't be removing it without an alternative or a deprecation cycle, just as with a public API. So I think it is fine to use.
There is an issue about making this public (https://issues.apache.org/jira/browse/ARROW-9672); I can look into that for the coming release, but you will still need to keep using the private one for some time if you want to support older versions.
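
For reference, a minimal sketch of the private helper under discussion (the filter values are hypothetical, and the call is an assumption about current pyarrow behavior rather than a stable public API):

    import pyarrow.parquet as pq

    # Translate a DNF-like list of (column, op, value) tuples into a
    # pyarrow dataset Expression, e.g. (year == 2020)
    expr = pq._filters_to_expression([("year", "==", 2020)])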

@rjzamora (Member, Author):

I appreciate the confirmation, Joris. I'll keep my eye on that issue in case something changes. Note that the ideal solution from the perspective of Dask and RAPIDS is for pyarrow to start recognizing filters specified as the DNF-like list of tuples :)

@shwina (Contributor) left a comment:

Looks great! A couple of minor changes requested.

@rjzamora added the "5 - Ready to Merge" label (Testing and reviews complete, ready to merge) Sep 22, 2021
@rjzamora (Member, Author):

@gpucibot merge

@rjzamora (Member, Author):

rerun tests

@rapids-bot rapids-bot bot merged commit 8dea0b1 into rapidsai:branch-21.10 Sep 22, 2021
@rjzamora rjzamora deleted the fsspec-optimized-transfer branch September 23, 2021 13:26
rapids-bot bot pushed a commit that referenced this pull request Oct 6, 2021
This is a simple follow-up to #9304 and #9265 meant to achieve the following:

- After this PR, the default behavior of `cudf.read_csv` will be to convert fsspec-based `AbstractBufferedFile` objects to Arrow `PythonFile` objects for non-local file systems (see the sketch after this commit message). Since `PythonFile` objects inherit from `NativeFile` objects, libcudf can seek/read distinct byte ranges without requiring the entire file to be read into host memory (i.e. the default behavior enables proper partial IO from remote storage).

- #9265 recently added an fsspec-based optimization for transferring csv byte ranges into local memory. That optimization already allowed us to avoid a full file transfer when a specific `byte_range` is specified in the `cudf.read_csv` call. However, the simpler approach introduced in this PR is (1) more general, (2) easier to maintain, and (3) demonstrates comparable performance. Therefore, this PR also rolls back one of the less-maintainable optimizations added in #9265 (local buffer clipping).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - https://github.com/brandon-b-miller

URL: #9376
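
A rough sketch of the AbstractBufferedFile-to-PythonFile conversion described in that follow-up (the bucket/object are hypothetical, and s3fs is assumed to be installed):

    import fsspec
    import pyarrow as pa

    # Open a remote file as an fsspec AbstractBufferedFile, then wrap it
    # as a pyarrow NativeFile so a reader can seek/read distinct byte ranges.
    with fsspec.open("s3://my-bucket/data.csv", mode="rb", anon=True) as f:
        arrow_file = pa.PythonFile(f, mode="r")
        arrow_file.seek(0)
        header = arrow_file.read(4096)  # partial read; no full-file transfer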