Implement cudf-polars chunked parquet reading #16944

Merged
Changes from all commits (35 commits)
- 146ac45 access and config chunked parquet reader (brandon-b-miller, Sep 10, 2024)
- 0242495 do not early return df (brandon-b-miller, Sep 16, 2024)
- e257242 Merge branch 'branch-24.12' into cudf-polars-chunked-parquet-reader (brandon-b-miller, Oct 9, 2024)
- 95ebf4d fix nrows (brandon-b-miller, Oct 9, 2024)
- 7533ed3 Merge branch 'branch-24.12' into cudf-polars-chunked-parquet-reader (brandon-b-miller, Oct 22, 2024)
- 43acc47 Merge branch 'branch-24.12' into cudf-polars-chunked-parquet-reader (brandon-b-miller, Oct 28, 2024)
- 6ddf128 updates, set defaults (brandon-b-miller, Oct 29, 2024)
- fea77d7 pass config through evaluate (brandon-b-miller, Oct 30, 2024)
- 53b0b2a a trial commit to test a different concatenation strategy (brandon-b-miller, Oct 31, 2024)
- ec298d3 merge/resolve (brandon-b-miller, Nov 5, 2024)
- 310f8c2 adjust for IR changes / pass tests (brandon-b-miller, Nov 6, 2024)
- 62c277b address reviews (brandon-b-miller, Nov 7, 2024)
- 13df5aa revert translate.py changes (brandon-b-miller, Nov 7, 2024)
- 4aee59f Revert "a trial commit to test a different concatenation strategy" (brandon-b-miller, Nov 7, 2024)
- 50add3a Merge branch 'branch-24.12' into cudf-polars-chunked-parquet-reader (brandon-b-miller, Nov 8, 2024)
- a113737 add docs (brandon-b-miller, Nov 8, 2024)
- 828a3bb merge/resolve (brandon-b-miller, Nov 12, 2024)
- 48d3edc add test coverage (brandon-b-miller, Nov 12, 2024)
- eacc73d merge/resolve (brandon-b-miller, Nov 13, 2024)
- 9c4c1bf raise on fail true for default testing engine (brandon-b-miller, Nov 13, 2024)
- d6aa668 Apply suggestions from code review (brandon-b-miller, Nov 13, 2024)
- a06f0ae reword Parquet Reader Options (brandon-b-miller, Nov 13, 2024)
- 9930d2e partially address reviews (brandon-b-miller, Nov 13, 2024)
- b2530a4 Apply suggestions from code review (brandon-b-miller, Nov 13, 2024)
- 9958fe9 chunk on by default (brandon-b-miller, Nov 13, 2024)
- d33ec5e turn OFF chunking in existing parquet tests (brandon-b-miller, Nov 13, 2024)
- b69eaa6 Merge branch 'branch-24.12' into cudf-polars-chunked-parquet-reader (galipremsagar, Nov 13, 2024)
- 2be2847 disable slice pushdown with parquet (brandon-b-miller, Nov 13, 2024)
- e72215b Merge branch 'branch-24.12' into HEAD (wence-, Nov 14, 2024)
- c23afd9 Test parquet filters with chunking off and on (wence-, Nov 14, 2024)
- df341ea Implement workaround for #16186 (wence-, Nov 14, 2024)
- beb2462 xfail a polars test (wence-, Nov 14, 2024)
- b398172 Apply suggestions from code review (brandon-b-miller, Nov 15, 2024)
- 2116d94 Merge branch 'branch-24.12' into cudf-polars-chunked-parquet-reader (galipremsagar, Nov 15, 2024)
- e67614a Remove commented code (wence-, Nov 15, 2024)
25 changes: 25 additions & 0 deletions docs/cudf/source/cudf_polars/engine_options.md
@@ -0,0 +1,25 @@
# GPUEngine Configuration Options

The `polars.GPUEngine` object may be configured in several different ways.

## Parquet Reader Options
Reading large parquet files can use a large amount of memory, especially when the files are compressed. This may lead to out-of-memory errors in some workflows. To mitigate this, the "chunked" parquet reader may be selected. When enabled, parquet files are read in chunks, limiting peak memory usage at the cost of a small drop in performance.

To configure the parquet reader, we provide a dictionary of options to the `parquet_options` keyword of the `GPUEngine` object. Valid keys and values are:
- `chunked` indicates that chunked parquet reading is to be used. By default, chunked reading is turned on.
- [`chunk_read_limit`](https://docs.rapids.ai/api/libcudf/legacy/classcudf_1_1io_1_1chunked__parquet__reader#aad118178b7536b7966e3325ae1143a1a) controls the maximum size per chunk. By default, the maximum chunk size is unlimited.
- [`pass_read_limit`](https://docs.rapids.ai/api/libcudf/legacy/classcudf_1_1io_1_1chunked__parquet__reader#aad118178b7536b7966e3325ae1143a1a) controls the maximum memory used for decompression. The default pass read limit is 16GiB.

For example, to select the chunked reader with custom values for `pass_read_limit` and `chunk_read_limit`:
```python
engine = GPUEngine(
    parquet_options={
        'chunked': True,
        'chunk_read_limit': int(1e9),
        'pass_read_limit': int(4e9)
    }
)
result = query.collect(engine=engine)
```
Note that passing `chunked: False` disables chunked reading entirely, and thus `chunk_read_limit` and `pass_read_limit` will have no effect.
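The limits above are plain byte counts, which are easy to misread; a minimal sketch (the `gib` helper is hypothetical, not part of cudf-polars) shows one way to spell them out explicitly:

```python
# Hypothetical helper for expressing read limits in GiB;
# cudf-polars itself simply takes integer byte counts.
def gib(n: float) -> int:
    return int(n * (1 << 30))

parquet_options = {
    "chunked": True,
    "chunk_read_limit": gib(1),  # ~1 GiB per chunk
    "pass_read_limit": gib(4),   # ~4 GiB for decompression
}
```

In this notation the documented 16 GiB default for `pass_read_limit` would be `gib(16)`.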
6 changes: 6 additions & 0 deletions docs/cudf/source/cudf_polars/index.rst
@@ -39,3 +39,9 @@ Launch on Google Colab
:target: https://colab.research.google.com/github/rapidsai-community/showcase/blob/main/accelerated_data_processing_examples/polars_gpu_engine_demo.ipynb

Try out the GPU engine for Polars in a free GPU notebook environment. Sign in with your Google account and `launch the demo on Colab <https://colab.research.google.com/github/rapidsai-community/showcase/blob/main/accelerated_data_processing_examples/polars_gpu_engine_demo.ipynb>`__.

.. toctree::
   :maxdepth: 1
   :caption: Engine Config Options:

   engine_options
40 changes: 34 additions & 6 deletions python/cudf_polars/cudf_polars/callback.py
@@ -129,6 +129,7 @@ def set_device(device: int | None) -> Generator[int, None, None]:

def _callback(
    ir: IR,
    config: GPUEngine,
    with_columns: list[str] | None,
    pyarrow_predicate: str | None,
    n_rows: int | None,
@@ -145,7 +146,30 @@ def _callback(
        set_device(device),
        set_memory_resource(memory_resource),
    ):
        return ir.evaluate(cache={}).to_polars()
        return ir.evaluate(cache={}, config=config).to_polars()


def validate_config_options(config: dict) -> None:
    """
    Validate the configuration options for the GPU engine.

    Parameters
    ----------
    config
        Configuration options to validate.

    Raises
    ------
    ValueError
        If the configuration contains unsupported options.
    """
    if unsupported := (config.keys() - {"raise_on_fail", "parquet_options"}):
        raise ValueError(
            f"Engine configuration contains unsupported settings: {unsupported}"
        )
    assert {"chunked", "chunk_read_limit", "pass_read_limit"}.issuperset(
        config.get("parquet_options", {})
    )


def execute_with_cudf(nt: NodeTraverser, *, config: GPUEngine) -> None:
@@ -174,10 +198,8 @@ def execute_with_cudf(nt: NodeTraverser, *, config: GPUEngine) -> None:
    device = config.device
    memory_resource = config.memory_resource
    raise_on_fail = config.config.get("raise_on_fail", False)
    if unsupported := (config.config.keys() - {"raise_on_fail"}):
        raise ValueError(
            f"Engine configuration contains unsupported settings {unsupported}"
        )
    validate_config_options(config.config)

    with nvtx.annotate(message="ConvertIR", domain="cudf_polars"):
        translator = Translator(nt)
        ir = translator.translate_ir()
@@ -200,5 +222,11 @@ def execute_with_cudf(nt: NodeTraverser, *, config: GPUEngine) -> None:
            raise exception
    else:
        nt.set_udf(
            partial(_callback, ir, device=device, memory_resource=memory_resource)
            partial(
                _callback,
                ir,
                config,
                device=device,
                memory_resource=memory_resource,
            )
        )
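Taken together, the validation and callback-binding changes in this diff can be sketched standalone as follows. This is a simplified mirror of the PR's logic: the real `_callback` takes additional arguments such as `with_columns`, `pyarrow_predicate`, and `n_rows`, and the bound function is handed to `NodeTraverser.set_udf` rather than called directly.

```python
from functools import partial

def validate_config_options(config: dict) -> None:
    """Simplified mirror of the validator added in this PR."""
    if unsupported := (config.keys() - {"raise_on_fail", "parquet_options"}):
        raise ValueError(
            f"Engine configuration contains unsupported settings: {unsupported}"
        )
    # Parquet options must come from the supported set.
    assert {"chunked", "chunk_read_limit", "pass_read_limit"}.issuperset(
        config.get("parquet_options", {})
    )

# A supported configuration passes silently.
validate_config_options({"raise_on_fail": True, "parquet_options": {"chunked": True}})

# An unknown top-level key is rejected.
try:
    validate_config_options({"chunked": True})  # misplaced: belongs in parquet_options
    rejected = False
except ValueError:
    rejected = True

# The IR and engine config are then bound into the callback with partial(),
# mirroring the updated nt.set_udf call (signature simplified here).
def _callback(ir, config, *, device, memory_resource):
    return (ir, config, device, memory_resource)

bound = partial(_callback, "ir-node", {"parquet_options": {}},
                device=0, memory_resource=None)
```

Note that the top-level `chunked` key is rejected precisely because chunking options must be nested under `parquet_options`, as the docs added in this PR describe.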