Implement cudf-polars chunked parquet reading #16944

Conversation

@brandon-b-miller (Contributor) commented Sep 27, 2024

This PR provides access to the libcudf chunked parquet reader through the cudf-polars GPU engine, inspired by the cuDF Python implementation.

Closes #16818
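
For illustration, a query can opt into the chunked reader through the GPU engine configuration roughly as sketched below. The option names ("parquet_options", "chunked", "chunk_read_limit", "pass_read_limit") follow the cudf-polars engine documentation, but treat the exact names and defaults as assumptions; they may differ between releases.

import polars as pl

q = pl.scan_parquet("test.parquet").filter(pl.col("a") > 1)

result = q.collect(
    engine=pl.GPUEngine(
        parquet_options={
            "chunked": True,  # use the libcudf chunked parquet reader
            "chunk_read_limit": 0,  # 0 means no limit on output chunk size
            "pass_read_limit": 16 * 1024**3,  # cap intermediate decompression memory at 16 GiB
        }
    )
)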

@brandon-b-miller added the feature request, non-breaking, and cudf.polars labels on Sep 27, 2024
@brandon-b-miller self-assigned this on Sep 27, 2024
@github-actions (bot) added the Python label on Sep 27, 2024
@brandon-b-miller marked this pull request as ready for review on October 30, 2024 03:31
@brandon-b-miller requested a review from a team as a code owner on October 30, 2024 03:31
@wence- (Contributor) left a comment:

Minor tweaks; overall, I think the implementation side looks in good shape.

There's an open question about whether we want to read all the chunks and do a concatenate at the end, but I will wait for benchmarking on that.

Inline review comments (outdated, resolved) on python/cudf_polars/cudf_polars/dsl/ir.py and python/cudf_polars/cudf_polars/dsl/translate.py.
@wence- (Contributor) commented Nov 5, 2024

We need to put some documentation about the chunked reading somewhere. As a minimum, can you add a new md file in cudf_polars/docs and talk through what settings are available, and what they mean?

@brandon-b-miller (Contributor, Author) commented:

In addition to the slicing issue, switching to chunked reading by default seems to shake out another chunked parquet reader bug when n_rows > 0:

import pylibcudf as plc
import pyarrow as pa
import pyarrow.parquet as pq

data = {
    "a": [1, 2, 3, None, 4, 5],
    "b": ["ẅ", "x", "y", "z", "123", "abcd"],
    "c": [None, None, 4, 5, -1, 0],
}

path = "./test.parquet"
pq.write_table(pa.Table.from_pydict(data), path)

reader = plc.io.parquet.ChunkedParquetReader(
    plc.io.SourceInfo([path]),
    columns=["a", "b", "c"],
    nrows=2,
    skip_rows=0,
    chunk_read_limit=0,
    pass_read_limit=17179869184,  # 16 GiB
)

# Read the first chunk, then concatenate any remaining chunks column by column.
chunk = reader.read_chunk()
names = chunk.column_names()
concatenated_columns = chunk.tbl.columns()
while reader.has_next():
    next_columns = reader.read_chunk().tbl.columns()
    for i, col in enumerate(next_columns):
        concatenated_columns[i] = plc.concatenate.concatenate(
            [concatenated_columns[i], col]
        )

gpu_result = plc.interop.to_arrow(plc.Table(concatenated_columns))
cpu_result = pq.read_table(path)[:2]

print(cpu_result.column(1).to_pylist())
print(gpu_result.column(1).to_pylist())

This yields:

['ẅ', 'x']
['ẅ', 'x\x00\x00\x00\x00\x00\x00\x00\x00\x00']

I suppose this is a separate issue from #17158, since trunk already includes that fix now, I believe.

Comment on lines 352 to 355:

-        if self.typ == "csv" and self.skip_rows != 0:  # pragma: no cover
+        if self.typ in {"csv", "parquet"} and self.skip_rows != 0:  # pragma: no cover
             # This comes from slice pushdown, but that
             # optimization doesn't happen right now
-            raise NotImplementedError("skipping rows in CSV reader")
+            raise NotImplementedError("skipping rows in CSV or Parquet reader")
Contributor:

I don't like this, because it turns off yet more routes into running a query on device. Particularly, since parquet ingest is the primary way we recommend people do things, we need it to work in basically all cases.

Things we can do:

  • turn off chunked reading if skip_rows != 0
  • fail if skip_rows != 0 (as here)
  • manually handle skip_rows != 0 in the chunked reader by reading full chunks and slicing the skipped rows away afterwards

I think I like the third option best.

Contributor:

OK, I pushed an implementation of option 3.

Rather than falling back to CPU for chunked read + skip_rows, just
read chunks and skip manually after the fact.

Simplify the parquet scan tests a bit and add better coverage of both
chunked and unchunked behaviour.
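
For reference, here is a minimal sketch of that approach (read every chunk, concatenate, then slice away skip_rows/n_rows after the fact) using pylibcudf directly. It is illustrative only, not the PR's actual implementation, and the helper name is made up.

# Illustrative only, not the PR's code: read all chunks, concatenate, then
# handle skip_rows/n_rows by slicing the concatenated table after the fact.
import pylibcudf as plc


def read_parquet_then_skip(path, skip_rows=0, n_rows=-1):
    reader = plc.io.parquet.ChunkedParquetReader(plc.io.SourceInfo([path]))
    columns = reader.read_chunk().tbl.columns()
    while reader.has_next():
        next_columns = reader.read_chunk().tbl.columns()
        columns = [
            plc.concatenate.concatenate([left, right])
            for left, right in zip(columns, next_columns)
        ]
    # Slice off the skipped rows (and trim to n_rows) on the concatenated result.
    total = columns[0].size()
    start = min(skip_rows, total)
    end = total if n_rows == -1 else min(total, start + n_rows)
    (sliced,) = plc.copying.slice(plc.Table(columns), [start, end])
    return sliced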
@vyasr (Contributor) left a comment:

I have some suggestions to improve the docs and a couple of minor questions on implementation, but nothing blocking. I think that this is good to merge when you are happy.

Inline review comments (resolved) on docs/cudf/source/cudf_polars/engine_options.md, python/cudf_polars/cudf_polars/callback.py, python/cudf_polars/cudf_polars/dsl/ir.py, and python/cudf_polars/tests/test_scan.py.
@brandon-b-miller (Contributor, Author) commented:

/merge

@rapids-bot (bot) merged commit aa8c0c4 into rapidsai:branch-24.12 on Nov 15, 2024 (102 checks passed).
@brandon-b-miller deleted the cudf-polars-chunked-parquet-reader branch on November 15, 2024 13:56.
@@ -208,8 +214,9 @@ def evaluate(self, *, cache: MutableMapping[int, DataFrame]) -> DataFrame:
         translation phase should fail earlier.
         """
         return self.do_evaluate(
+            config,
Member:

@wence- just a note: pretty sure this means we will need to pass a config object to every single task in the task graph for multi-GPU.

Contributor:

Argh, ok, painful. Let's try and figure something out (especially because the config object can contain a memory resource).

Member:

Would it make sense for config to be a required IR constructor argument, and not require it as an argument to do_evaluate (unless necessary)?

Contributor:

Ah, we could pass the config options we need into the Scan node during translate, and then it never needs to be in do_evaluate at all.
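
To illustrate the idea only (the class and option names below are hypothetical, not the actual cudf-polars code): the translation step pulls just the reader settings out of the engine config and stores them on the node, so do_evaluate no longer needs a config argument.

# Hypothetical sketch of the suggestion above, not the real cudf-polars classes.
from dataclasses import dataclass


@dataclass(frozen=True)
class ParquetOptions:
    chunked: bool = True
    chunk_read_limit: int = 0
    pass_read_limit: int = 16 * 1024**3


class Scan:
    def __init__(self, paths: list[str], parquet_options: ParquetOptions):
        self.paths = paths
        # Captured once at translation time.
        self.parquet_options = parquet_options

    def do_evaluate(self):
        # No per-task `config` argument needed; the node already carries the
        # reader settings it needs via self.parquet_options.
        ...


def translate_scan(paths: list[str], engine_config: dict) -> Scan:
    # Extract only what the Scan node needs from the full engine configuration.
    opts = ParquetOptions(**engine_config.get("parquet_options", {}))
    return Scan(paths, opts)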

Member:

Possible revision proposed here: #17339

@rapids-bot (bot) pushed a commit that referenced this pull request on Nov 20, 2024:
Follow up to #16944
That PR added `config: GPUEngine` to the arguments of every `IR.do_evaluate` function. In order to simplify future multi-GPU development, this PR extracts the necessary configuration argument at `IR` translation time instead.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - https://github.com/brandon-b-miller
  - Lawrence Mitchell (https://github.com/wence-)

URL: #17339
Labels: cudf.polars (Issues specific to cudf.polars), feature request (New feature or request), non-breaking (Non-breaking change), Python (Affects Python cuDF API)
Projects: Status: Done
Development: Successfully merging this pull request may close: [FEA] Investigate the chunked parquet reader for Polars GPU engine
5 participants