Add multi-partition `DataFrameScan` support to cuDF-Polars #17441
Conversation
```python
):
    raise ValueError(
        f"Engine configuration contains unsupported settings: {unsupported}"
    )
assert {"chunked", "chunk_read_limit", "pass_read_limit"}.issuperset(
    config.get("parquet_options", {})
)
assert {"num_rows_threshold"}.issuperset(config.get("parallel_options", {}))
```
I'd like to nest all multi-gpu options within the `"parallel_options"` key moving forward (to avoid adding more top-level keys).
We might imagine that these options are executor-specific. Does it make sense to have a nesting like:

`executor: str | tuple[str, dict]`

so that the executor argument is either a name, or a `("name", name-specific-options)` pair?
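A minimal sketch of how that `str | tuple[str, dict]` form might be normalized internally (hypothetical helper, not code from this PR):

```python
# Hypothetical normalization of the "name or (name, options)" executor form
# suggested above; requires Python 3.10+ for the union syntax.
ExecutorSpec = str | tuple[str, dict]

def normalize_executor(executor: ExecutorSpec) -> tuple[str, dict]:
    """Return a (name, options) pair regardless of which form was passed."""
    if isinstance(executor, str):
        return executor, {}
    name, options = executor
    return name, dict(options)
```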
That seems fine to me. Any opinion on this, @pentschev?
I do think it's a good idea to consider how the number of these options will inevitably grow over time (and that they will probably be executor-specific).
Hmm. The `str | tuple[str, dict]` logic actually feels a bit clumsy when I think about how to implement it. How about we just rename `"parallel_options"` to `"executor_options"` (to make it clear that the options are executor-specific)? This still allows us to validate that the specified arguments are actually supported by the "active" executor.
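A hedged sketch of what validating `"executor_options"` against the active executor could look like (the executor names and per-executor option sets are illustrative, not the real supported sets):

```python
# Illustrative mapping of executor name -> supported option keys; both the
# names and the option sets are assumptions for demonstration only.
SUPPORTED_EXECUTOR_OPTIONS = {
    "dask-experimental": {"num_rows_threshold"},
    "pylibcudf": set(),
}

def validate_executor_options(executor: str, options: dict) -> None:
    supported = SUPPORTED_EXECUTOR_OPTIONS.get(executor, set())
    unsupported = set(options) - supported
    if unsupported:
        raise ValueError(
            f"Executor {executor!r} does not support options: {sorted(unsupported)}"
        )
```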
As much as I agree that it is indeed clumsy, it feels like we'll soon need nested options and will inevitably make `"executor_options"` require accepting `str | tuple[str, dict]`, so we may as well just do that in `executor` and thereby allow as many levels of nested options as needed as part of `executor`. I think a better alternative may be an abstract base class `Executor` that we can specialize with the options we need for each executor.
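A minimal sketch of the abstract-base-class idea (class names and default values are hypothetical, not taken from the cuDF-Polars codebase):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

class Executor(ABC):
    """Hypothetical base class; each concrete executor carries its own options."""

    @property
    @abstractmethod
    def name(self) -> str: ...

@dataclass
class DaskExecutor(Executor):
    # "num_rows_threshold" is the option name from the diff above; the default
    # value here is made up.
    num_rows_threshold: int = 1_000_000

    @property
    def name(self) -> str:
        return "dask-experimental"
```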
> I think a better alternative may be an abstract base class `Executor` that we can specialize with the options we need for each executor.

I do think this is the best long-term solution, but I also don't think it will be difficult to migrate from the `"executor_options"` approach currently used in this PR.

I don't think I understand why it is inevitable that `"executor_options"` would need to accept `str | tuple[str, dict]`. However, I do see why it would be useful to attach all executor-specific options to an `Executor` object. That said, I don't really want to deal with serialization/etc. in this PR :)
> I don't think I understand why it is inevitable that `"executor_options"` would need to accept `str | tuple[str, dict]`.

It's possible I'm overestimating the number of options we'll end up introducing here, but once we need nested options we'll need something more complex like the `tuple[str, dict]`, or the abstract base class. That is why I think it's inevitable.
…a/cudf into cudf-polars-multi-dataframe-scan
cc @wence- - Interested to know how you feel about the pattern used here to define/use
Co-authored-by: Lawrence Mitchell <[email protected]>
…-multi-dataframe-scan
…a/cudf into cudf-polars-multi-dataframe-scan
I think this is basically good, I think my comments are a request for a bit more documentation on the rationale for certain choices.
```python
@lower_ir_node.register(IR)
def _(ir: IR, rec: LowerIRTransformer) -> tuple[IR, MutableMapping[IR, PartitionInfo]]:
    # Single-partition default (see: _lower_ir_single)
    return rec.state["default_mapper"](ir)
```
So we have two recursive transformers:

- `lower_ir_node` (can handle multi-partitions)
- this "default" mapper (cannot handle multi-partitions)

The idea is that we want a single-partition fallback for nodes where we're already defining a multi-partition handler. However, once we enter the "single-partition" state through this fallback, we can never leave it.

I think I understood why we needed to split between single- and multi-partition handlers, but can you explain it here please?
Ah - I just realized I misunderstood your earlier suggestion and messed this up a bit.

> The idea is that we want a single-partition fallback for nodes where we're already defining a multi-partition handler.

Sort of. I just want a clean/intuitive way to fall back to "common" logic for any IR type. When we don't have a multi-partition handler defined for the IR type in question, I'd like to fall back to single-partition logic that is defined in one place. That logic would raise an error if there is not actually one partition. If we do have a multi-partition handler defined, it may still make sense for that handler to call that same single-partition logic in some cases (e.g. when support for specific options is missing, or there is only one partition).

A similar pattern will emerge for "partition-wise" operations. We are not going to want to repeat this logic all over the place; instead, we are going to want to call a `_lower_ir_partitionwise` function from the multi-partition handler for the IR type in question (e.g. `Select`).
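A rough, self-contained sketch of the two shared helpers described above (simplified signatures and a stand-in `PartitionInfo`; the merged implementation differs):

```python
from dataclasses import dataclass

@dataclass
class PartitionInfo:  # stand-in for the PartitionInfo type used in this PR
    count: int

def _lower_ir_single(ir, children, partition_info):
    """Fallback for IR types with no multi-partition handler."""
    if any(partition_info[c].count > 1 for c in children):
        raise NotImplementedError(
            f"{type(ir).__name__} does not support multiple partitions."
        )
    partition_info[ir] = PartitionInfo(count=1)
    return ir, partition_info

def _lower_ir_partitionwise(ir, children, partition_info):
    """Shared logic for operations applied independently to each partition."""
    counts = {partition_info[c].count for c in children}
    if len(counts) > 1:
        raise NotImplementedError("Mismatched partition counts are not handled here.")
    partition_info[ir] = PartitionInfo(count=counts.pop() if counts else 1)
    return ir, partition_info
```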
Note that I rolled back the "default_mapper" change for now (in favor of more-explicit handling for the fallback case). Perhaps we can iron out our answer to this question in a PR that focuses on a "tricky" case like `Select`.
```python
assert_gpu_result_equal(df, engine=engine)

# Check partitioning
qir = Translator(df._ldf.visit(), engine).translate_ir()
```
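A hedged sketch of how the partitioning check might continue; the `lower_ir_graph` helper and the expected-count variable are assumptions about the test internals, not verified API:

```python
# Assumed continuation of the test above (names are assumptions, not verified):
# lower the translated IR and confirm the resulting partition count.
lowered_ir, partition_info = lower_ir_graph(qir)
assert partition_info[lowered_ir].count == expected_partition_count
```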
nit: we need to remember to check that we didn't get any errors. I will try and open a PR that does this automatically.
…-multi-dataframe-scan
…-multi-dataframe-scan
/merge
Adds multi-partition (partition-wise) `Select` support following the same design as #17441

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)

URL: #17495

Adds multi-partition `Scan` support following the same design as #17441

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)

URL: #17494
Description

Follow-up to #17262. Adds support for parallel `DataFrameScan` operations.

Checklist