feat: add Dask `Expr.count` #731

anopsy · 2024-08-07T07:47:53Z

What type of PR is this? (check all applicable)

Related issues

Related issue feat: add methods to Dask backend #637
Closes #

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

If you have comments or can explain your changes, please do so below.

I left if "dask" in the count_test as it gave me AttributeError: 'Scalar' object has no attribute 'name' . Let me know if that's okay.

I just noticed that is_between uses dask's - "between" and returns among other things "is_between"-string, while fill_nulls uses dask's fill_na and returns a string "fill_na". So should the returned string be the name of the method name from narwhals API or from native API. Let me know what's the logic behind this string

FBruzzesi · 2024-08-07T08:01:01Z

Hey thanks for the PR.

I left if "dask" in the count_test as it gave me AttributeError: 'Scalar' object has no attribute 'name'

Yes we are trying to fix this! Yet I am not sure how coverage is not complaining?!

I just noticed that is_between uses dask's - "between" and returns among other things "is_between"-string, while fill_nulls uses dask's fill_na and returns a string "fill_na". So should the returned string be the name of the method name from narwhals API or from native API. Let me know what's the logic behind this string

expr_name should be "fill_nulls" for fill_nulls method

anopsy · 2024-08-07T08:45:13Z

100% coverage also caught my attention, I checked if there are no tests covering it in some reduction and scalar related tests, but apparently not. I also double checked if the DaskExpr.count works on separate tests. I'm reading about coverage.py and Smother to track which test covers it. If I succeed I'll let you know.

BTW Any suggestion how to fix Ci check , it complains about first_valid_index which isn't yet implemented in Dask dask/dask#1259 If I understand the issue correctly

anopsy · 2024-08-07T14:50:10Z

I tried to add returns_scalar = False for Dask drop_nulls and removing if "dask" and it seems that drop_nulls stopped working (as I'm pretty sure I checked the output with my own tests 🙈 and it was indeed removing None from Dask DataFrame). I dived into dask API and it does have dropna for both DataFrame and Series. If you happen to have any idea what could be the reason, let me know

FBruzzesi · 2024-08-07T15:05:34Z

I tried to add returns_scalar = False for Dask drop_nulls and removing if "dask" and it seems that drop_nulls stopped working (as I'm pretty sure I checked the output with my own tests 🙈 and it was indeed removing None from Dask DataFrame). I dived into dask API and it does have dropna for both DataFrame and Series. If you happen to have any idea what could be the reason, let me know

@MarcoGorelli this is probably because of select line 109 which assigns from the original dataframe, yet the new series has a different length.

If we change it and create a frame from scratch, then it is not an issue:

- df = self._native_dataframe.assign(**new_series).loc[:, list(new_series.keys())]
+ df = dd.from_pandas(
+        pd.DataFrame(), npartitions=self._native_dataframe.npartitions
+        ).assign(**new_series)

(I am not super confident about the partitions part, does the series bring a partition itself?)

Edit: Just to confirm, I was able to run all the tests except one on reduction, because if the leftmost series is a scalar then we have issues as the first assignment will create a dataframe of len 1

FBruzzesi · 2024-08-07T18:57:53Z

Ok I have a hotfix and for now I could not come up with anything nicer 🙈

The idea is to create the dataframe starting with the left-most non-scalar (and we know there is at least one otherwise we return earlier), and then re-order columns:

- df = self._native_dataframe.assign(**new_series).loc[:, list(new_series.keys())]
+ pd = get_pandas()
+ de = get_dask_expr()
+ col_order = list(new_series.keys())
+ new_series = dict(
+     sorted(
+        new_series.items(),
+        key=lambda item: isinstance(item[1], de._collection.Scalar),
+     )
+ )
+ return self._from_native_dataframe(
+     dd.from_pandas(pd.DataFrame()).assign(**new_series).loc[:, col_order]
+ )

In case of two columns with different index or length, this will yield unexpected/wrong results. Example:

import dask.dataframe as dd
import narwhals as nw

df_dd = nw.from_native(
    dd.from_dict({"a": [1, 2, None, 3], "b": [None, "x", None, "y"]}, npartitions=1)
)

nw.to_native(df_dd.select(nw.col("a").drop_nulls(), nw.col("b").drop_nulls()).collect())

     a     b
0  1.0  <NA>
1  2.0     x
3  3.0     y

While it should raise.

Maybe we really shouldn't change the index as mentioned in the issue (#637).

Thoughts?

MarcoGorelli · 2024-08-07T21:26:45Z

Thanks for your PR!

Indeed, I think for now we should avoid Expr.drop_nulls for Dask as it changes the index

But Expr.count we can add 👍

anopsy · 2024-08-08T06:41:21Z

I kinda felt it was bad idea to push both of them at the same time 😅

MarcoGorelli · 2024-08-08T07:19:14Z

How about, for Expr.drop_nulls, for Dask, we keep it, but we just raise NotImplementedError? something like

# We can't (yet?) allow methods which modify the index
msg = "`Expr.drop_nulls` is not supported for the Dask backend. Please use `LazyFrame.drop_nulls` instead"
raise NotImplementedError(msg)

?

anopsy · 2024-08-08T07:52:04Z

Shall we do it for other methods changing index, at least the ones you mentioned in the #637 or should we keep a low profile?

MarcoGorelli · 2024-08-08T08:07:07Z

I'd say, let's do it for those too 😎 not necessarily as part of the same PR, OK to keep things small if it's easier (up to you, all together is fine too!)

FBruzzesi

Neat! Thanks!

add DaskExpr-count and DaskExpr-drop-nulls formatted

6a4eb50

github-actions bot added the enhancement New feature or request label Aug 7, 2024

anopsy changed the title ~~feat: add DaskExpr-count and DaskExpr-drop-nulls formatted~~ feat: add Dask Expr.count and Dask Expr.drop_nulls Aug 7, 2024

Merge remote-tracking branch 'upstream/main' into dask-count

0804e36

anopsy added 2 commits August 8, 2024 11:15

add dask Expr.count

59f174c

Merge remote-tracking branch 'upstream/main' into dask-count

7af550a

anopsy changed the title ~~feat: add Dask Expr.count and Dask Expr.drop_nulls~~ feat: add Dask Expr.count Aug 8, 2024

FBruzzesi approved these changes Aug 8, 2024

View reviewed changes

FBruzzesi merged commit d71f911 into narwhals-dev:main Aug 8, 2024
22 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add Dask `Expr.count` #731

feat: add Dask `Expr.count` #731

anopsy commented Aug 7, 2024

FBruzzesi commented Aug 7, 2024

anopsy commented Aug 7, 2024 •

edited

Loading

anopsy commented Aug 7, 2024

FBruzzesi commented Aug 7, 2024 •

edited

Loading

FBruzzesi commented Aug 7, 2024 •

edited

Loading

MarcoGorelli commented Aug 7, 2024

anopsy commented Aug 8, 2024

MarcoGorelli commented Aug 8, 2024

anopsy commented Aug 8, 2024

MarcoGorelli commented Aug 8, 2024 •

edited

Loading

FBruzzesi left a comment

feat: add Dask Expr.count #731

feat: add Dask Expr.count #731

Conversation

anopsy commented Aug 7, 2024

What type of PR is this? (check all applicable)

Related issues

Checklist

If you have comments or can explain your changes, please do so below.

FBruzzesi commented Aug 7, 2024

anopsy commented Aug 7, 2024 • edited Loading

anopsy commented Aug 7, 2024

FBruzzesi commented Aug 7, 2024 • edited Loading

FBruzzesi commented Aug 7, 2024 • edited Loading

MarcoGorelli commented Aug 7, 2024

anopsy commented Aug 8, 2024

MarcoGorelli commented Aug 8, 2024

anopsy commented Aug 8, 2024

MarcoGorelli commented Aug 8, 2024 • edited Loading

FBruzzesi left a comment

Choose a reason for hiding this comment

feat: add Dask `Expr.count` #731

feat: add Dask `Expr.count` #731

anopsy commented Aug 7, 2024 •

edited

Loading

FBruzzesi commented Aug 7, 2024 •

edited

Loading

FBruzzesi commented Aug 7, 2024 •

edited

Loading

MarcoGorelli commented Aug 8, 2024 •

edited

Loading