Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce NamedColumn concept in cudf-polars #15914

Merged

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Jun 4, 2024

Description

Simplify name tracking in expression evaluation by only requiring names for columns when putting them in to a DataFrame. At the same time, this allows us to have one place where we broadcast-expand Scalars to the size of the DataFrame, so we can expunge tracking them in the DataFrame itself.

Additionally, adapt to minor changes on the polars side in terms of translating the DSL: we no longer need to handle CSE expressions specially, and sorting by multiple keys takes a list of descending flags, rather than a single bool as previously.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested a review from a team as a code owner June 4, 2024 15:35
@wence- wence- requested review from mroeschke and lithomas1 June 4, 2024 15:35
@github-actions github-actions bot added Python Affects Python cuDF API. cudf.polars Issues specific to cudf.polars labels Jun 4, 2024
@wence- wence- added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jun 4, 2024
class NamedExpr(Expr):
__slots__ = ("name", "children")
_non_child = ("dtype", "name")
class NamedExpr:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to deliberately not make this one an Expr (it should not appear when evaluating expressions themselves, only when constructing return values in dataframe (IR) nodes)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be helpful to leave as a comment.

Suggested change
class NamedExpr:
class NamedExpr:
# NamedExpr does not inherit from Expr because it should not appear when
# evaluating expressions themselves, only when constructing return values
# in dataframe (IR) nodes).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

wence- added 4 commits June 5, 2024 17:43
Names in the result dataframe only appear from PyExprIR and thence
NamedExpr nodes. To avoid name tracking issues, only require a name
when translating a NamedExpr.
Expressions must now be translated with the node which is to provide
the schema active.
@wence- wence- force-pushed the wence/fea/expunge-scalar-dataframe branch from 769b248 to 9b87759 Compare June 5, 2024 17:44
We can't decide expression-by-expression whether the result should be
broadcast to the size of the context DataFrame. It is only when we
return "out" to construct a new DataFrame (i.e. when we are evaluating
an IR node) that we have the necessary information.
Copy link
Contributor

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally the Python code looks good

@lithomas1
Copy link
Contributor

This looks good (to the best of my knowledge).

Maybe @vyasr or @brandon-b-miller should double check this though.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked over this PR to acquaint myself with more of the internals of cudf-polars. I have just a couple comments. Thanks!

class NamedExpr(Expr):
__slots__ = ("name", "children")
_non_child = ("dtype", "name")
class NamedExpr:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be helpful to leave as a comment.

Suggested change
class NamedExpr:
class NamedExpr:
# NamedExpr does not inherit from Expr because it should not appear when
# evaluating expressions themselves, only when constructing return values
# in dataframe (IR) nodes).

python/cudf_polars/cudf_polars/dsl/expr.py Show resolved Hide resolved
python/cudf_polars/cudf_polars/dsl/translate.py Outdated Show resolved Hide resolved
python/cudf_polars/docs/overview.md Show resolved Hide resolved
python/cudf_polars/docs/overview.md Outdated Show resolved Hide resolved
@wence-
Copy link
Contributor Author

wence- commented Jun 6, 2024

/merge

@rapids-bot rapids-bot bot merged commit d1e511e into rapidsai:branch-24.08 Jun 6, 2024
70 checks passed
@wence- wence- deleted the wence/fea/expunge-scalar-dataframe branch June 6, 2024 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf.polars Issues specific to cudf.polars improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants