Introduce `NamedColumn` concept in cudf-polars #15914

wence- · 2024-06-04T15:35:16Z

Description

Simplify name tracking in expression evaluation by only requiring names for columns when putting them into a DataFrame. At the same time, this allows us to have one place where we broadcast-expand Scalars to the size of the DataFrame, so we can expunge tracking them in the DataFrame itself.
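As a rough illustration of the naming idea described above, the following sketch separates anonymous column data from named columns. The classes here (`Column`, `NamedColumn`, `DataFrame`, and the `add` method) are simplified stand-ins written for this example and assume plain Python tuples as storage; they are not the cudf-polars implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Anonymous data: what evaluating an expression yields (hypothetical)."""

    values: tuple

    def add(self, other: "Column") -> "Column":
        # Intermediate results never carry or propagate a name.
        return Column(tuple(a + b for a, b in zip(self.values, other.values)))


@dataclass(frozen=True)
class NamedColumn(Column):
    """A Column plus the name it takes inside a DataFrame (hypothetical)."""

    name: str


class DataFrame:
    """Accepts only NamedColumn, so naming happens in exactly one place."""

    def __init__(self, columns):
        if not all(isinstance(c, NamedColumn) for c in columns):
            raise TypeError("DataFrame columns must be NamedColumn")
        self.columns = {c.name: c for c in columns}


a = Column((1, 2, 3))
b = Column((10, 20, 30))
total = a.add(b)  # still anonymous at this point
df = DataFrame([NamedColumn(total.values, "total")])
```

In this sketch the expression-level code never has to decide what a result is called; the name is attached only at the moment the result enters the `DataFrame`.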

Additionally, adapt to minor changes on the polars side in terms of translating the DSL: we no longer need to handle CSE expressions specially, and sorting by multiple keys takes a list of descending flags, rather than a single bool as previously.
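To make the sorting change concrete, here is a small pure-Python sketch of a multi-key sort that takes one descending flag per key. The function `sort_rows` and its row-as-dict representation are assumptions made for this example; it does not reproduce the polars or cudf-polars API.

```python
def sort_rows(rows, keys, descending):
    """Sort dict rows by several keys, with one descending flag per key."""
    if len(descending) != len(keys):
        raise ValueError("need one descending flag per sort key")
    result = list(rows)
    # list.sort is stable (also with reverse=True), so sorting by the least
    # significant key first and the most significant key last gives a
    # correct multi-key ordering.
    for key, desc in reversed(list(zip(keys, descending))):
        result.sort(key=lambda row: row[key], reverse=desc)
    return result


rows = [{"a": 1, "b": 2}, {"a": 1, "b": 3}, {"a": 2, "b": 1}]
# Ascending on "a", descending on "b" within equal "a".
ordered = sort_rows(rows, ["a", "b"], [False, True])
```

With a single bool, both keys would have to share one direction; a list of flags lets each key choose its own.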

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

wence- · 2024-06-04T15:47:09Z

python/cudf_polars/cudf_polars/dsl/expr.py

-class NamedExpr(Expr):
-    __slots__ = ("name", "children")
-    _non_child = ("dtype", "name")
+class NamedExpr:


Decided to deliberately not make this one an Expr (it should not appear when evaluating expressions themselves, only when constructing return values in dataframe (IR) nodes)
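A minimal sketch of that design choice, under the assumption of a heavily simplified expression tree: `Expr`, `Col`, `Add`, and `select` below are hypothetical stand-ins written for this example, and only the idea that `NamedExpr` wraps an expression without being one is taken from the comment above.

```python
class Expr:
    """Base class for expression nodes (hypothetical, simplified)."""

    def evaluate(self, columns):
        raise NotImplementedError


class Col(Expr):
    def __init__(self, name):
        self.name = name

    def evaluate(self, columns):
        return columns[self.name]


class Add(Expr):
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def evaluate(self, columns):
        lhs = self.left.evaluate(columns)
        rhs = self.right.evaluate(columns)
        return [a + b for a, b in zip(lhs, rhs)]


class NamedExpr:
    """Pairs an output name with an expression; deliberately not an Expr."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def evaluate(self, columns):
        return self.name, self.value.evaluate(columns)


def select(columns, named_exprs):
    """Stand-in for a dataframe (IR) node: the only consumer of NamedExpr."""
    return dict(ne.evaluate(columns) for ne in named_exprs)


out = select({"x": [1, 2], "y": [3, 4]}, [NamedExpr("s", Add(Col("x"), Col("y")))])
```

Because `NamedExpr` is not a subclass of `Expr`, it cannot be nested inside another expression by accident; it can only sit at the boundary where a node builds its result.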

python/cudf_polars/cudf_polars/containers/column.py

We can't decide expression-by-expression whether the result should be broadcast to the size of the context DataFrame. It is only when we return "out" to construct a new DataFrame (i.e. when we are evaluating an IR node) that we have the necessary information.
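The point about deferring the broadcast can be sketched as follows. The helper `broadcast` and its dict-of-lists representation are assumptions made for this example (the real code works on GPU columns); it only shows that the target length is known once all outputs of a node are in hand.

```python
def broadcast(columns, target_length=None):
    """Expand length-1 columns so every column has the same length."""
    if target_length is None:
        # Infer the length from the non-scalar columns, if there are any.
        non_scalar = {len(values) for values in columns.values()} - {1}
        if len(non_scalar) > 1:
            raise ValueError(f"mismatching column lengths: {sorted(non_scalar)}")
        target_length = non_scalar.pop() if non_scalar else 1
    out = {}
    for name, values in columns.items():
        if len(values) == target_length:
            out[name] = list(values)
        elif len(values) == 1:
            # A scalar-like result is repeated to fill the frame.
            out[name] = list(values) * target_length
        else:
            raise ValueError(f"cannot broadcast column {name!r}")
    return out


# Evaluated outputs of one node: a full column and a literal.
result = broadcast({"x": [1, 2, 3], "lit": [7]})
```

Looking at the literal alone, there is no way to tell whether it should stay length 1 or be expanded to 3; that only becomes clear when it is placed next to `x`, or when an explicit `target_length` is supplied by the surrounding context.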

mroeschke

Generally the Python code looks good

lithomas1 · 2024-06-05T17:56:44Z

This looks good (to the best of my knowledge).

Maybe @vyasr or @brandon-b-miller should double check this though.

bdice

I looked over this PR to acquaint myself with more of the internals of cudf-polars. I have just a couple comments. Thanks!

bdice · 2024-06-05T18:03:38Z

python/cudf_polars/cudf_polars/dsl/expr.py

-class NamedExpr(Expr):
-    __slots__ = ("name", "children")
-    _non_child = ("dtype", "name")
+class NamedExpr:


This could be helpful to leave as a comment.

Suggested change

-class NamedExpr:
+class NamedExpr:
+    # NamedExpr does not inherit from Expr because it should not appear when
+    # evaluating expressions themselves, only when constructing return values
+    # in dataframe (IR) nodes).

wence- · 2024-06-06T10:07:47Z

/merge

Descending is now a sequence for multiple sort keys

e320d38

wence- requested a review from a team as a code owner June 4, 2024 15:35

wence- requested review from mroeschke and lithomas1 June 4, 2024 15:35

github-actions bot added the Python (Affects Python cuDF API) and cudf.polars (Issues specific to cudf.polars) labels Jun 4, 2024

wence- added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Jun 4, 2024

wence- commented Jun 4, 2024

mroeschke reviewed Jun 5, 2024

wence- commented Jun 5, 2024

wence- added 4 commits June 5, 2024 17:43

Separate Column and NamedColumn

ae29794

Names in the result dataframe only appear from PyExprIR and thence NamedExpr nodes. To avoid name tracking issues, only require a name when translating a NamedExpr.

Expunge scalars property from DataFrame

533b1b9

No more CSE exprs

b2dfef0

Expressions must now be translated with the node which is to provide the schema active.

Update docs for new structure

9b87759

wence- force-pushed the wence/fea/expunge-scalar-dataframe branch from 769b248 to 9b87759 Compare June 5, 2024 17:44

lithomas1 approved these changes Jun 5, 2024

mroeschke approved these changes Jun 5, 2024

bdice approved these changes Jun 5, 2024

wence- added 2 commits June 6, 2024 10:06

Correct numpydoc convention for Raises section

3fcc556

Typo in numbered list

d46e4b8

rapids-bot bot merged commit d1e511e into rapidsai:branch-24.08 Jun 6, 2024
70 checks passed

wence- deleted the wence/fea/expunge-scalar-dataframe branch June 6, 2024 14:36
