Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: extend dataframe drop method #773

Merged
merged 13 commits into from
Aug 13, 2024
Merged

Conversation

FBruzzesi
Copy link
Member

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

@github-actions github-actions bot added the enhancement New feature or request label Aug 10, 2024
@FBruzzesi FBruzzesi changed the title feat: extend drop method feat: extend dataframe drop method Aug 10, 2024
Comment on lines 292 to 296
if strict:
for d in to_drop:
if d not in cols:
msg = f'"{d}" not found'
raise ColumnNotFoundError(msg)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is same everywhere. I wonder if we should move it directly to BaseFrame.

Also, I noticed that polars is "greedy" in the raise, meaning that if 2 columns are in the drop list but missing, the error will only mention the first one

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Polars changed behaviour here in 1.0 too (before it would just ignore missing columns)

so, fine by me to just implement it in BaseFrame and let everyone benefit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I just noticed CI failing for polars pre 1. Refactoring into base now :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow I keep f*cking up parsing *columns - it's probably just late πŸ™ˆ

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok it is somehow fixed but worth taking a double look at it as the expected types along the chain of calls are a bit different

@@ -285,8 +286,18 @@ def join(
),
)

def drop(self, *columns: str) -> Self:
return self._from_native_dataframe(self._native_dataframe.drop(list(columns)))
def drop(self: Self, *columns: str, strict: bool = True) -> Self:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok hear me out, I know that we don't add defaults in here, but we sometimes used compliant dataframe drop method (e.g. in join). I can add it in the call there

@FBruzzesi
Copy link
Member Author

Aside: I noticed that polars allows selectors other than column names. I think it should not be too complex to integrate.
Thoughts?

return self._from_native_dataframe(
self._native_dataframe.drop(columns=list(columns))
)
def drop(self: Self, columns: str | list[str]) -> Self:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it need to be str | list[str]? didn't it get flattened one level above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will check with mypy for internal use

@MarcoGorelli
Copy link
Member

Aside: I noticed that polars allows selectors other than column names. I think it should not be too complex to integrate.
Thoughts?

if it's not too complex - sure, I do like these πŸ‘

Comment on lines 276 to 277
).drop(key_token)
)
.drop(key_token),
Copy link
Member Author

@FBruzzesi FBruzzesi Aug 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoids a few functions and attributes calls, same below

@FBruzzesi
Copy link
Member Author

Aside: I noticed that polars allows selectors other than column names. I think it should not be too complex to integrate.
Thoughts?

if it's not too complex - sure, I do like these πŸ‘

Happy to keep it as a follow up honestly. Other methods would benefit from that as well and it could be worth to factor out some sort of "parser" for selectors and column names

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, just got a comment on the polars lazyframe case, will need to think about it

Comment on lines 141 to 142
def drop(self, *columns: Iterable[str], strict: bool) -> Self:
cols = set(self.collect_schema().names())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the part I'm a little unsure about

I'm not sure we should be triggering collect_schema for Polars LazyFrame - I'll check what they're doing

Copy link
Member Author

@FBruzzesi FBruzzesi Aug 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can let polars do what polars does natively.
I can factor out the parsing into a utils function and call it downstream instead of in BaseFrame.

I moved the logic here to avoid rewriting the same snippet everywhere, but we should not affect polars performances other than the function calls

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! sure, but at the moment we still reach this collect_schema call for Polars, right? that's the part i think we should avoid

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes at the moment we do. I will rework this PR a bit to avoid that in the versions that polars natively supports .drop(..., strict=...)

Copy link
Member

@MarcoGorelli MarcoGorelli Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can just avoid it completely for Polars, and note in the docstring that for Polars<1 dropping non-existent columns silently passes?

At least in the lazy case. In the eager case, getting column names is free in Polars, so we can intercept drop without issues

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I would expect implicitly to collect those?!

yup, but only a LazyFrame.collect time, not at LazyFrame.drop time

In [8]: df = pl.LazyFrame({'a': [1,2,3]})

In [9]: df.drop('b')
Out[9]: <LazyFrame at 0x7FE016F33510>

In [10]: df.drop('b').collect()
---------------------------------------------------------------------------
ColumnNotFoundError                       Traceback (most recent call last)
Cell In[10], line 1
----> 1 df.drop('b').collect()

File ~/scratch/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:1942, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))

ColumnNotFoundError: b

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure to be on the same page before making other changes:

  • Eager and lazy post v1 we let polars do its thing
  • Eager pre v1, we can explicitly check column names as per other backend
  • Lazy pre v1, ignore the strict argument and raise a warning saying something along the line of "please go eager here" if passed as True?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for checking, i'd say:

  • Eager and lazy post v1 we let polars do its thing: agree
  • Eager pre v1, we can explicitly check column names as per other backend agree
  • Lazy pre v1, ignore the strict argument and raise a warning saying something along the line of "please go eager here" if passed as True? I was thinking more: just let Polars do its thinking, but leave a note in the docs about the pre-v1 behaviour difference. So, a docs note, not a runtime warning (which might be what you meant, just emphasising to be sure)

Copy link
Member Author

@FBruzzesi FBruzzesi Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant a runtime warning, especially because strict=True is now the default, and that's the thing we ignore

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reckon a runtime warning risks being annoying, as it may be unnecessary? And going eager might represent a significant performance degradation?

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @FBruzzesi !

@MarcoGorelli MarcoGorelli merged commit 6522a62 into main Aug 13, 2024
24 checks passed
@FBruzzesi FBruzzesi deleted the feat/extend-drop-method branch August 13, 2024 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants