Add multi-partition DataFrameScan support to cuDF-Polars #17441

Merged

Conversation

@rjzamora rjzamora commented Nov 25, 2024

Description

Follow-up to #17262

Adds support for parallel DataFrameScan operations.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added the feature request (New feature or request), 2 - In Progress (Currently a work in progress), non-breaking (Non-breaking change), and cudf.polars (Issues specific to cudf.polars) labels on Nov 25, 2024
@rjzamora rjzamora self-assigned this Nov 25, 2024
@github-actions github-actions bot added the Python (Affects Python cuDF API.) label on Nov 25, 2024
):
    raise ValueError(
        f"Engine configuration contains unsupported settings: {unsupported}"
    )
assert {"chunked", "chunk_read_limit", "pass_read_limit"}.issuperset(
    config.get("parquet_options", {})
)
assert {"num_rows_threshold"}.issuperset(config.get("parallel_options", {}))

rjzamora (Member Author):
I'd like to nest all multi-gpu options within the "parallel_options" moving forward (to avoid adding more top-level keys).
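
For illustration, an engine configuration following this suggestion might look like the sketch below. Only "parallel_options" and "num_rows_threshold" appear in the validation code above; the executor name and the idea of passing these settings as GPUEngine keyword arguments are assumptions for this example, not necessarily the PR's exact API.

import polars as pl

# Hypothetical sketch: every multi-GPU setting nests under a single
# "parallel_options" key instead of becoming a new top-level option.
q = pl.LazyFrame({"x": range(10)}).select(pl.col("x").sum())
engine = pl.GPUEngine(
    raise_on_fail=True,
    executor="dask-experimental",  # assumed executor name
    parallel_options={"num_rows_threshold": 1_000},
)
result = q.collect(engine=engine)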

Contributor:
We might imagine that these options are executor-specific; does it make sense to have a nesting that is:

executor: str | tuple[str, dict]

So the executor argument is either a name, or a ("name", name-specific-options)?
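
For concreteness, such an argument could be normalized roughly as in the sketch below; normalize_executor is an invented name for illustration, not a cudf-polars function.

# Illustrative sketch only: accept either a bare executor name or a
# ("name", options) pair and return a uniform (name, options) tuple.
def normalize_executor(executor: str | tuple[str, dict]) -> tuple[str, dict]:
    if isinstance(executor, str):
        return executor, {}
    name, options = executor
    return name, dict(options)

# normalize_executor("dask") == ("dask", {})
# normalize_executor(("dask", {"num_rows_threshold": 1_000_000}))
#     == ("dask", {"num_rows_threshold": 1_000_000})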

rjzamora (Member Author):
That seems fine to me. Any opinion on this @pentschev?

I do think it's a good idea to consider how the number of these options will inevitably grow over time (and that they will probably be executor-specific).

rjzamora (Member Author):
Hmm. The str | tuple[str, dict] logic actually feels a bit clumsy when I think about how to implement it.

How about we just rename "parallel_options" to "executor_options" (to make it clear that the options are executor-specific)? This still allows us to validate that the specified arguments are actually supported by the "active" executor.
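
A rough sketch of what per-executor validation could look like; the executor names and supported-option sets below are placeholders, not the values used in the PR.

# Illustrative only: map each executor name to the "executor_options" keys it
# understands, then reject anything else for the active executor.
SUPPORTED_EXECUTOR_OPTIONS: dict[str, set[str]] = {
    "pylibcudf": set(),
    "dask-experimental": {"num_rows_threshold"},
}

def validate_executor_options(executor: str, options: dict) -> None:
    supported = SUPPORTED_EXECUTOR_OPTIONS.get(executor, set())
    unsupported = set(options) - supported
    if unsupported:
        raise ValueError(
            f"Executor {executor!r} does not support options: {sorted(unsupported)}"
        )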

Member:
As much as I agree that it is indeed clumsy, it feels like we'll soon need nested options and will inevitably have to make "executor_options" accept str | tuple[str, dict], so we may as well just do that in executor and thereby allow as many levels of nested options as needed. I think a better alternative may be an abstract base class Executor that we can specialize with the options we need for each executor.
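
For concreteness, the base-class idea might look roughly like the sketch below; the class and field names are invented for illustration and do not come from the PR.

from abc import ABC
from dataclasses import dataclass

class Executor(ABC):
    """Marker base class for executor-specific configuration."""

@dataclass
class SingleGPUExecutor(Executor):
    """The default executor needs no extra options in this sketch."""

@dataclass
class DaskExecutor(Executor):
    """Options understood by a hypothetical multi-partition executor."""

    num_rows_threshold: int = 1_000_000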

rjzamora (Member Author):
> I think a better alternative may be an abstract base class Executor that we can specialize with the options we need for each executor.

I do think this is the best long-term solution, but I also don't think it will be difficult to migrate from the "executor_options" approach currently used in this PR.

I don't think I understand why it is inevitable that "executor_options" would need to accept str | tuple[str, dict]. However, I do see why it would be useful to attach all executor-specific options to an Executor object. That said, I don't really want to deal with serialization/etc in this PR :)

Member:
> I don't think I understand why it is inevitable that "executor_options" would need to accept str | tuple[str, dict].

It's possible I'm overestimating the number of options we'll end up introducing here, but once we need nested options we'll need something more complex like the tuple[str, dict] or the abstract base class. That's why I think it's inevitable.

@rjzamora rjzamora added the 3 - Ready for Review (Ready for review by team) label and removed the 2 - In Progress (Currently a work in progress) label on Nov 26, 2024
@rjzamora rjzamora marked this pull request as ready for review November 26, 2024 14:58
@rjzamora rjzamora requested a review from a team as a code owner November 26, 2024 14:58
@rjzamora rjzamora requested review from vyasr and mroeschke November 26, 2024 14:58

rjzamora (Member Author):
cc @wence-: Interested to know how you feel about the pattern used here to define/use ParDataFrameScan (since it's the first of many "parallel" IR-node extensions I plan to introduce).
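
The sketch below gives a loose, self-contained picture of what a multi-partition DataFrameScan does conceptually: split one in-memory scan into row slices, one per output partition. The helper is invented for this description and is not the PR's code.

# Illustrative helper only: compute (offset, length) row slices so that a
# single in-memory DataFrame scan can be evaluated as several partitions.
def dataframe_scan_slices(nrows: int, num_rows_threshold: int) -> list[tuple[int, int]]:
    if nrows <= num_rows_threshold:
        return [(0, nrows)]  # small enough: keep a single partition
    count = -(-nrows // num_rows_threshold)  # ceiling division: partition count
    stride = -(-nrows // count)              # rows per partition (last may be shorter)
    return [(start, min(stride, nrows - start)) for start in range(0, nrows, stride)]

# dataframe_scan_slices(10, 3) == [(0, 3), (3, 3), (6, 3), (9, 1)]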

@wence- wence- (Contributor) left a review comment:
I think this is basically good; my comments are a request for a bit more documentation on the rationale for certain choices.

python/cudf_polars/cudf_polars/callback.py (outdated review thread, resolved)
python/cudf_polars/cudf_polars/experimental/parallel.py (outdated review thread, resolved)
python/cudf_polars/cudf_polars/experimental/parallel.py (outdated review thread, resolved)
Comment on lines 148 to 151
@lower_ir_node.register(IR)
def _(ir: IR, rec: LowerIRTransformer) -> tuple[IR, MutableMapping[IR, PartitionInfo]]:
    # Single-partition default (see: _lower_ir_single)
    return rec.state["default_mapper"](ir)

Contributor:
So we have two recursive transformers:

  1. lower_ir_node (can handle multi-partitions)
  2. this "default" mapper (cannot handle multi-partitions)

The idea is that we want a single-partition fallback for nodes where we're already defining a multi-partition handler.

However, once we enter the "single-partition" state through this fallback, we can never leave it.

I think I understood why we needed to split between single and multi-partition handlers, but can you explain it here please?

rjzamora (Member Author):
Ah - I just realized I misunderstood your earlier suggestion and messed this up a bit.

> The idea is that we want a single-partition fallback for nodes where we're already defining a multi-partition handler.

Sort of. I just want a clean/intuitive way to fall back to "common" logic for any IR type. When we don't have a multi-partition handler defined for the IR type in question, I'd like to fall back to single-partition logic that is defined in one place. That logic would raise an error if there is not actually one partition. If we do have a multi-partition handler defined, it may still make sense for that handler to call that same single-partition logic in some cases (e.g. when support for specific options is missing, or there is only one partition).

A similar pattern will emerge for "partition-wise" operations. We are not going to want to repeat this logic all over the place; instead, we are going to want to call a _lower_ir_partitionwise function from the multi-partition handler for the IR type in question (e.g. Select).
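
Stated as code, the dispatch pattern being described reads roughly like the sketch below. The real code registers handlers on lower_ir_node with singledispatch (as in the excerpt above); the dict and function names here are just schematic.

def lower_ir(ir, multi_partition_handlers, lower_ir_single):
    handler = multi_partition_handlers.get(type(ir))
    if handler is not None:
        # A multi-partition handler exists for this IR type. It may still
        # call lower_ir_single itself, e.g. when a specific option is
        # unsupported or there is only one partition.
        return handler(ir)
    # No multi-partition handler registered: fall back to the shared
    # single-partition logic, which raises if the node does not in fact
    # have exactly one partition.
    return lower_ir_single(ir)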

rjzamora (Member Author):
Note that I rolled back the "default_mapper" change for now (in favor of more-explicit handling for the fall-back case). Perhaps we can iron out our answer to this question in a PR that focuses on a "tricky" case like Select.

python/cudf_polars/cudf_polars/experimental/parallel.py (outdated review thread, resolved)
python/cudf_polars/cudf_polars/experimental/parallel.py (outdated review thread, resolved)
assert_gpu_result_equal(df, engine=engine)

# Check partitioning
qir = Translator(df._ldf.visit(), engine).translate_ir()

Contributor:
nit: we need to remember to check that we didn't get any errors. I will try and open a PR that does this automatically.

@rjzamora rjzamora added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label and removed the 3 - Ready for Review (Ready for review by team) label on Dec 3, 2024

rjzamora commented Dec 3, 2024

/merge

@rapids-bot rapids-bot bot merged commit 3785a48 into rapidsai:branch-25.02 Dec 3, 2024
107 checks passed
@rjzamora rjzamora deleted the cudf-polars-multi-dataframe-scan branch December 3, 2024 17:17
rapids-bot bot pushed a commit that referenced this pull request Dec 18, 2024
Adds multi-partition (partition-wise) `Select` support following the same design as #17441

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #17495
rapids-bot bot pushed a commit that referenced this pull request Dec 19, 2024
Adds multi-partition `Scan` support following the same design as #17441

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #17494
Labels
5 - Ready to Merge (Testing and reviews complete, ready to merge), cudf.polars (Issues specific to cudf.polars), feature request (New feature or request), non-breaking (Non-breaking change), Python (Affects Python cuDF API.)