Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add Dask Expr.count #731

Merged
merged 4 commits into from
Aug 8, 2024
Merged

feat: add Dask Expr.count #731

merged 4 commits into from
Aug 8, 2024

Conversation

anopsy
Copy link
Member

@anopsy anopsy commented Aug 7, 2024

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

I left if "dask" in the count_test as it gave me AttributeError: 'Scalar' object has no attribute 'name' . Let me know if that's okay.

I just noticed that is_between uses dask's - "between" and returns among other things "is_between"-string, while fill_nulls uses dask's fill_na and returns a string "fill_na". So should the returned string be the name of the method name from narwhals API or from native API. Let me know what's the logic behind this string

@github-actions github-actions bot added the enhancement New feature or request label Aug 7, 2024
@FBruzzesi
Copy link
Member

Hey thanks for the PR.

I left if "dask" in the count_test as it gave me AttributeError: 'Scalar' object has no attribute 'name'

Yes we are trying to fix this! Yet I am not sure how coverage is not complaining?!

I just noticed that is_between uses dask's - "between" and returns among other things "is_between"-string, while fill_nulls uses dask's fill_na and returns a string "fill_na". So should the returned string be the name of the method name from narwhals API or from native API. Let me know what's the logic behind this string

expr_name should be "fill_nulls" for fill_nulls method

@anopsy
Copy link
Member Author

anopsy commented Aug 7, 2024

100% coverage also caught my attention, I checked if there are no tests covering it in some reduction and scalar related tests, but apparently not. I also double checked if the DaskExpr.count works on separate tests. I'm reading about coverage.py and Smother to track which test covers it. If I succeed I'll let you know.

BTW Any suggestion how to fix Ci check , it complains about first_valid_index which isn't yet implemented in Dask dask/dask#1259 If I understand the issue correctly

@anopsy anopsy changed the title feat: add DaskExpr-count and DaskExpr-drop-nulls formatted feat: add Dask Expr.count and Dask Expr.drop_nulls Aug 7, 2024
@anopsy
Copy link
Member Author

anopsy commented Aug 7, 2024

I tried to add returns_scalar = False for Dask drop_nulls and removing if "dask" and it seems that drop_nulls stopped working (as I'm pretty sure I checked the output with my own tests πŸ™ˆ and it was indeed removing None from Dask DataFrame). I dived into dask API and it does have dropna for both DataFrame and Series. If you happen to have any idea what could be the reason, let me know

@FBruzzesi
Copy link
Member

FBruzzesi commented Aug 7, 2024

I tried to add returns_scalar = False for Dask drop_nulls and removing if "dask" and it seems that drop_nulls stopped working (as I'm pretty sure I checked the output with my own tests πŸ™ˆ and it was indeed removing None from Dask DataFrame). I dived into dask API and it does have dropna for both DataFrame and Series. If you happen to have any idea what could be the reason, let me know

@MarcoGorelli this is probably because of select line 109 which assigns from the original dataframe, yet the new series has a different length.

If we change it and create a frame from scratch, then it is not an issue:

- df = self._native_dataframe.assign(**new_series).loc[:, list(new_series.keys())]
+ df = dd.from_pandas(
+        pd.DataFrame(), npartitions=self._native_dataframe.npartitions
+        ).assign(**new_series)

(I am not super confident about the partitions part, does the series bring a partition itself?)

Edit: Just to confirm, I was able to run all the tests except one on reduction, because if the leftmost series is a scalar then we have issues as the first assignment will create a dataframe of len 1

@FBruzzesi
Copy link
Member

FBruzzesi commented Aug 7, 2024

Ok I have a hotfix and for now I could not come up with anything nicer πŸ™ˆ

The idea is to create the dataframe starting with the left-most non-scalar (and we know there is at least one otherwise we return earlier), and then re-order columns:

- df = self._native_dataframe.assign(**new_series).loc[:, list(new_series.keys())]
+ pd = get_pandas()
+ de = get_dask_expr()
+ col_order = list(new_series.keys())
+ new_series = dict(
+     sorted(
+        new_series.items(),
+        key=lambda item: isinstance(item[1], de._collection.Scalar),
+     )
+ )
+ return self._from_native_dataframe(
+     dd.from_pandas(pd.DataFrame()).assign(**new_series).loc[:, col_order]
+ )

In case of two columns with different index or length, this will yield unexpected/wrong results. Example:

import dask.dataframe as dd
import narwhals as nw

df_dd = nw.from_native(
    dd.from_dict({"a": [1, 2, None, 3], "b": [None, "x", None, "y"]}, npartitions=1)
)

nw.to_native(df_dd.select(nw.col("a").drop_nulls(), nw.col("b").drop_nulls()).collect())
     a     b
0  1.0  <NA>
1  2.0     x
3  3.0     y

While it should raise.

Maybe we really shouldn't change the index as mentioned in the issue (#637).

Thoughts?

@MarcoGorelli
Copy link
Member

Thanks for your PR!

Indeed, I think for now we should avoid Expr.drop_nulls for Dask as it changes the index

But Expr.count we can add πŸ‘

@anopsy
Copy link
Member Author

anopsy commented Aug 8, 2024

I kinda felt it was bad idea to push both of them at the same time πŸ˜…

@MarcoGorelli
Copy link
Member

How about, for Expr.drop_nulls, for Dask, we keep it, but we just raise NotImplementedError? something like

# We can't (yet?) allow methods which modify the index
msg = "`Expr.drop_nulls` is not supported for the Dask backend. Please use `LazyFrame.drop_nulls` instead"
raise NotImplementedError(msg)

?

@anopsy
Copy link
Member Author

anopsy commented Aug 8, 2024

Shall we do it for other methods changing index, at least the ones you mentioned in the #637 or should we keep a low profile?

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Aug 8, 2024

I'd say, let's do it for those too 😎 not necessarily as part of the same PR, OK to keep things small if it's easier (up to you, all together is fine too!)

@anopsy anopsy changed the title feat: add Dask Expr.count and Dask Expr.drop_nulls feat: add Dask Expr.count Aug 8, 2024
Copy link
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat! Thanks!

NarwhalChubbiwhalGIF

@FBruzzesi FBruzzesi merged commit d71f911 into narwhals-dev:main Aug 8, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants