feat: extend dataframe `drop` method #773

FBruzzesi · 2024-08-10T21:03:00Z

What type of PR is this? (check all applicable)

Related issues

Related issue [Enh]: Extend signature of supported methods #742

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

If you have comments or can explain your changes, please do so below.

FBruzzesi · 2024-08-10T21:04:42Z

narwhals/_arrow/dataframe.py

+        if strict:
+            for d in to_drop:
+                if d not in cols:
+                    msg = f'"{d}" not found'
+                    raise ColumnNotFoundError(msg)


The logic is the same everywhere. I wonder if we should move it directly to BaseFrame.

Also, I noticed that Polars is "greedy" when raising: if two columns in the drop list are missing, the error only mentions the first one.

I think Polars changed behaviour here in 1.0 too (before it would just ignore missing columns)

so, fine by me to just implement it in BaseFrame and let everyone benefit

Cool, I just noticed CI failing for polars pre 1. Refactoring into base now :)

Somehow I keep f*cking up parsing *columns - it's probably just late 🙈

OK, it is fixed now, but it is worth a second look, as the expected types along the chain of calls differ a bit.
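For reference, a minimal sketch of what a shared check could look like once it is factored out of the individual backends. The helper name `columns_to_drop` and the local `ColumnNotFoundError` class are hypothetical stand-ins for illustration, not the code in this PR. The strict branch mirrors the snippet above (it raises on the first missing column only), and the non-strict branch assumes missing columns are silently skipped.

```python
from __future__ import annotations

from typing import Iterable


class ColumnNotFoundError(Exception):
    """Stand-in for the real error type (hypothetical, for illustration only)."""


def columns_to_drop(
    available: Iterable[str], requested: Iterable[str], *, strict: bool
) -> list[str]:
    """Return the columns that should actually be dropped.

    With strict=True, raise on the first requested column that is missing.
    With strict=False, silently skip requested columns that are missing.
    """
    cols = set(available)
    to_drop = list(requested)
    if strict:
        for d in to_drop:
            if d not in cols:
                # "Greedy" raise: only the first missing column is reported.
                msg = f'"{d}" not found'
                raise ColumnNotFoundError(msg)
        return to_drop
    # Non-strict: keep the requested order, drop only what exists.
    return [d for d in to_drop if d in cols]
```

Each backend could then call such a helper with its own column names and pass the result to its native `drop`.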

FBruzzesi · 2024-08-10T21:06:22Z

narwhals/_arrow/dataframe.py

@@ -285,8 +286,18 @@ def join(
            ),
        )

-    def drop(self, *columns: str) -> Self:
-        return self._from_native_dataframe(self._native_dataframe.drop(list(columns)))
+    def drop(self: Self, *columns: str, strict: bool = True) -> Self:


OK, hear me out: I know that we don't add defaults in here, but we sometimes use the compliant dataframe's drop method internally (e.g. in join). I can add it in the call there.

FBruzzesi · 2024-08-11T15:12:25Z

Aside: I noticed that polars allows selectors other than column names. I think it should not be too complex to integrate.
Thoughts?

MarcoGorelli · 2024-08-11T20:29:20Z

narwhals/_pandas_like/dataframe.py

-        return self._from_native_dataframe(
-            self._native_dataframe.drop(columns=list(columns))
-        )
+    def drop(self: Self, columns: str | list[str]) -> Self:


does it need to be str | list[str]? didn't it get flattened one level above?

I will check with mypy for internal use
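To illustrate the question, here is a rough sketch of flattening at the public layer so that the backend method only ever receives a flat `list[str]`. The function name `flatten_columns` is hypothetical and is not the helper used in the codebase.

```python
from __future__ import annotations

from typing import Iterable


def flatten_columns(*columns: str | Iterable[str]) -> list[str]:
    """Flatten one level: accept names and iterables of names, return a flat list."""
    flat: list[str] = []
    for item in columns:
        if isinstance(item, str):
            # A bare string is a single column name, not an iterable of characters.
            flat.append(item)
        else:
            flat.extend(item)
    return flat
```

If the public `drop(*columns)` flattens like this before delegating, the inner signature could plausibly be narrowed to `columns: list[str]`.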

MarcoGorelli · 2024-08-11T20:42:21Z

> Aside: I noticed that polars allows selectors other than column names. I think it should not be too complex to integrate.
> Thoughts?

if it's not too complex - sure, I do like these 👍

FBruzzesi · 2024-08-11T20:47:07Z

narwhals/_arrow/dataframe.py

-            ).drop(key_token)
+                )
+                .drop(key_token),


Avoids a few function calls and attribute lookups; same below.

FBruzzesi · 2024-08-11T20:50:27Z

> > Aside: I noticed that polars allows selectors other than column names. I think it should not be too complex to integrate.
> > Thoughts?
>
> if it's not too complex - sure, I do like these 👍

Happy to keep it as a follow-up, honestly. Other methods would benefit from that as well, and it could be worth factoring out some sort of "parser" for selectors and column names.

MarcoGorelli

looks good, just got a comment on the polars lazyframe case, will need to think about it

MarcoGorelli · 2024-08-12T18:54:58Z

narwhals/dataframe.py

+    def drop(self, *columns: Iterable[str], strict: bool) -> Self:
+        cols = set(self.collect_schema().names())


this is the part I'm a little unsure about

I'm not sure we should be triggering collect_schema for Polars LazyFrame - I'll check what they're doing

We can let polars do what polars does natively.
I can factor out the parsing into a utils function and call it downstream instead of in BaseFrame.

I moved the logic here to avoid rewriting the same snippet everywhere, but we should not affect Polars performance beyond the extra function calls.

thanks! sure, but at the moment we still reach this collect_schema call for Polars, right? that's the part i think we should avoid

yes, at the moment we do. I will rework this PR a bit to avoid that in the versions where Polars natively supports .drop(..., strict=...)

maybe we can just avoid it completely for Polars, and note in the docstring that for Polars<1 dropping non-existent columns silently passes?

At least in the lazy case. In the eager case, getting column names is free in Polars, so we can intercept drop without issues

so I would expect implicitly to collect those?!

yup, but only at LazyFrame.collect time, not at LazyFrame.drop time

    In [8]: df = pl.LazyFrame({'a': [1,2,3]})

    In [9]: df.drop('b')
    Out[9]: <LazyFrame at 0x7FE016F33510>

    In [10]: df.drop('b').collect()
    ---------------------------------------------------------------------------
    ColumnNotFoundError                       Traceback (most recent call last)
    Cell In[10], line 1
    ----> 1 df.drop('b').collect()

    File ~/scratch/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:1942, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
       1939 # Only for testing purposes atm.
       1940 callback = _kwargs.get("post_opt_callback")
    -> 1942 return wrap_df(ldf.collect(callback))

    ColumnNotFoundError: b

Just to make sure to be on the same page before making other changes:

Eager and lazy post v1 we let polars do its thing

Eager pre v1, we can explicitly check column names as per other backend

Lazy pre v1, ignore the strict argument and raise a warning saying something along the lines of "please go eager here" if passed as True?

thanks for checking, i'd say:

Eager and lazy post v1 we let polars do its thing: agree

Eager pre v1, we can explicitly check column names as per other backend: agree

Lazy pre v1, ignore the strict argument and raise a warning saying something along the lines of "please go eager here" if passed as True? I was thinking more: just let Polars do its thing, but leave a note in the docs about the pre-v1 behaviour difference. So, a docs note, not a runtime warning (which might be what you meant, just emphasising to be sure)

I meant a runtime warning, especially because strict=True is now the default, and that's the thing we ignore

Reckon a runtime warning risks being annoying, as it may be unnecessary? And going eager might represent a significant performance degradation?
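The three cases discussed in this thread can be summarised as a small decision function. This is only a sketch of the plan as described above, not the merged implementation: the function name `drop_plan` and the returned labels are hypothetical, and the lazy pre-v1 branch follows the "let Polars do its thing" suggestion without settling the runtime warning versus docs note question.

```python
from __future__ import annotations


def drop_plan(polars_version: tuple[int, ...], *, lazy: bool) -> str:
    """Pick how `drop(..., strict=...)` could be handled for a Polars frame.

    Returns one of three hypothetical labels:
    - "delegate": pass `strict` straight through to Polars (v1 and later).
    - "check_names": eager pre-v1, column names are free, so validate ourselves.
    - "delegate_unchecked": lazy pre-v1, avoid resolving the schema; `strict`
      is effectively not enforced at drop time.
    """
    if polars_version >= (1, 0, 0):
        return "delegate"
    if not lazy:
        return "check_names"
    return "delegate_unchecked"
```

Comparing version tuples keeps the check independent of how the installed version string is parsed, which is assumed to happen elsewhere.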

MarcoGorelli

thanks @FBruzzesi !

FBruzzesi added 4 commits August 10, 2024 21:21

feat: extend drop method

c8acfbf

polars pre 1

6e7895d

refactor into baseframe

d346b7f

pyarrow

aed61e8

github-actions bot added the enhancement New feature or request label Aug 10, 2024

FBruzzesi changed the title ~~feat: extend drop method~~ feat: extend dataframe drop method Aug 10, 2024

FBruzzesi commented Aug 10, 2024

FBruzzesi added 4 commits August 11, 2024 21:27

type hints

c4fcdb0

old arrow

1eead0c

Merge branch 'main' into feat/extend-drop-method

ddd8bed

merge main

3332e50

MarcoGorelli reviewed Aug 11, 2024

FBruzzesi commented Aug 11, 2024

MarcoGorelli reviewed Aug 12, 2024

FBruzzesi added 5 commits August 13, 2024 10:06

Merge branch 'main' into feat/extend-drop-method

130b0a8

refactor

d442c4b

add warning

b787e9d

feedbacks

b814dce

merge main

8056163

MarcoGorelli approved these changes Aug 13, 2024

MarcoGorelli merged commit 6522a62 into main Aug 13, 2024
24 checks passed

FBruzzesi deleted the feat/extend-drop-method branch August 13, 2024 19:04

FBruzzesi mentioned this pull request Aug 14, 2024

[Enh]: Extend signature of supported methods #742

Closed
