-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: extend dataframe drop
method
#773
Conversation
drop
method
narwhals/_arrow/dataframe.py
Outdated
if strict: | ||
for d in to_drop: | ||
if d not in cols: | ||
msg = f'"{d}" not found' | ||
raise ColumnNotFoundError(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic is same everywhere. I wonder if we should move it directly to BaseFrame
.
Also, I noticed that polars is "greedy" in the raise, meaning that if 2 columns are in the drop list but missing, the error will only mention the first one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Polars changed behaviour here in 1.0 too (before it would just ignore missing columns)
so, fine by me to just implement it in BaseFrame and let everyone benefit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I just noticed CI failing for polars pre 1. Refactoring into base now :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow I keep f*cking up parsing *columns
- it's probably just late π
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok it is somehow fixed but worth taking a double look at it as the expected types along the chain of calls are a bit different
narwhals/_arrow/dataframe.py
Outdated
@@ -285,8 +286,18 @@ def join( | |||
), | |||
) | |||
|
|||
def drop(self, *columns: str) -> Self: | |||
return self._from_native_dataframe(self._native_dataframe.drop(list(columns))) | |||
def drop(self: Self, *columns: str, strict: bool = True) -> Self: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok hear me out, I know that we don't add defaults in here, but we sometimes used compliant dataframe drop method (e.g. in join
). I can add it in the call there
Aside: I noticed that polars allows selectors other than column names. I think it should not be too complex to integrate. |
narwhals/_pandas_like/dataframe.py
Outdated
return self._from_native_dataframe( | ||
self._native_dataframe.drop(columns=list(columns)) | ||
) | ||
def drop(self: Self, columns: str | list[str]) -> Self: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it need to be str | list[str]
? didn't it get flattened one level above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will check with mypy for internal use
if it's not too complex - sure, I do like these π |
narwhals/_arrow/dataframe.py
Outdated
).drop(key_token) | ||
) | ||
.drop(key_token), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoids a few functions and attributes calls, same below
Happy to keep it as a follow up honestly. Other methods would benefit from that as well and it could be worth to factor out some sort of "parser" for selectors and column names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, just got a comment on the polars lazyframe case, will need to think about it
narwhals/dataframe.py
Outdated
def drop(self, *columns: Iterable[str], strict: bool) -> Self: | ||
cols = set(self.collect_schema().names()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the part I'm a little unsure about
I'm not sure we should be triggering collect_schema
for Polars LazyFrame - I'll check what they're doing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can let polars do what polars does natively.
I can factor out the parsing into a utils function and call it downstream instead of in BaseFrame
.
I moved the logic here to avoid rewriting the same snippet everywhere, but we should not affect polars performances other than the function calls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! sure, but at the moment we still reach this collect_schema
call for Polars, right? that's the part i think we should avoid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes at the moment we do. I will rework this PR a bit to avoid that in the versions that polars natively supports .drop(..., strict=...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can just avoid it completely for Polars, and note in the docstring that for Polars<1 dropping non-existent columns silently passes?
At least in the lazy case. In the eager case, getting column names is free in Polars, so we can intercept drop
without issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I would expect implicitly to collect those?!
yup, but only a LazyFrame.collect
time, not at LazyFrame.drop
time
In [8]: df = pl.LazyFrame({'a': [1,2,3]})
In [9]: df.drop('b')
Out[9]: <LazyFrame at 0x7FE016F33510>
In [10]: df.drop('b').collect()
---------------------------------------------------------------------------
ColumnNotFoundError Traceback (most recent call last)
Cell In[10], line 1
----> 1 df.drop('b').collect()
File ~/scratch/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:1942, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
1939 # Only for testing purposes atm.
1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))
ColumnNotFoundError: b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure to be on the same page before making other changes:
- Eager and lazy post v1 we let polars do its thing
- Eager pre v1, we can explicitly check column names as per other backend
- Lazy pre v1, ignore the strict argument and raise a warning saying something along the line of "please go eager here" if passed as True?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for checking, i'd say:
- Eager and lazy post v1 we let polars do its thing: agree
- Eager pre v1, we can explicitly check column names as per other backend agree
- Lazy pre v1, ignore the strict argument and raise a warning saying something along the line of "please go eager here" if passed as True? I was thinking more: just let Polars do its thinking, but leave a note in the docs about the pre-v1 behaviour difference. So, a docs note, not a runtime warning (which might be what you meant, just emphasising to be sure)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant a runtime warning, especially because strict=True
is now the default, and that's the thing we ignore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reckon a runtime warning risks being annoying, as it may be unnecessary? And going eager might represent a significant performance degradation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @FBruzzesi !
What type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below.