feat: add Dask Expr.count
#731
Conversation
Hey thanks for the PR.
Yes, we are trying to fix this! Though I am not sure why coverage is not complaining?!
100% coverage also caught my attention. I checked whether there were tests covering it in some of the reduction- and scalar-related tests, but apparently not. I also double checked if the …

BTW, any suggestion on how to fix the CI check? It complains about `Expr.count` and Dask `Expr.drop_nulls`. I tried adding `returns_scalar = False` for Dask `drop_nulls` and removing …
@MarcoGorelli this is probably because of … If we change it and create a frame from scratch, then it is not an issue:

```diff
- df = self._native_dataframe.assign(**new_series).loc[:, list(new_series.keys())]
+ df = dd.from_pandas(
+     pd.DataFrame(), npartitions=self._native_dataframe.npartitions
+ ).assign(**new_series)
```

(I am not super confident about the partitions part; does the series bring a partition itself?)

Edit: just to confirm, I was able to run all the tests except one on reduction, because if the leftmost series is a scalar then we have issues, as the first assignment will create a dataframe of len 1.
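(Added note, not from the thread:) the reason assigning back onto the original native dataframe is fragile is that pandas-style `assign` aligns on the index. A minimal pandas sketch of that alignment behaviour:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, None, 3.0]})
dropped = df["a"].dropna()  # a Series with index [0, 1, 3]

# assign aligns `dropped` on df's original index, so the dropped
# row reappears as NaN instead of the frame shrinking to length 3
print(df.assign(a=dropped))
#      a
# 0  1.0
# 1  2.0
# 2  NaN
# 3  3.0
```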
Ok I have a hotfix, and for now I could not come up with anything nicer 😅 The idea is to create the dataframe starting with the left-most non-scalar (and we know there is at least one, otherwise we return earlier), and then re-order the columns (see the sketch after this comment for how the reordering works):

```diff
- df = self._native_dataframe.assign(**new_series).loc[:, list(new_series.keys())]
+ pd = get_pandas()
+ de = get_dask_expr()
+ col_order = list(new_series.keys())
+ new_series = dict(
+     sorted(
+         new_series.items(),
+         key=lambda item: isinstance(item[1], de._collection.Scalar),
+     )
+ )
+ return self._from_native_dataframe(
+     dd.from_pandas(pd.DataFrame()).assign(**new_series).loc[:, col_order]
+ )
```

In case of two columns with a different index or length, this will yield unexpected/wrong results. Example:

```python
import dask.dataframe as dd
import narwhals as nw

df_dd = nw.from_native(
    dd.from_dict({"a": [1, 2, None, 3], "b": [None, "x", None, "y"]}, npartitions=1)
)
nw.to_native(df_dd.select(nw.col("a").drop_nulls(), nw.col("b").drop_nulls()).collect())
```

While it should raise. Maybe we really shouldn't change the index, as mentioned in the issue (#637). Thoughts?
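An aside (my note, not from the thread): the column reordering in the hotfix relies on `sorted` being stable and on `False` sorting before `True`, so all non-scalar entries come first while keeping their relative order. A toy illustration, with `int` standing in for the dask-expr `Scalar` type:

```python
# `sorted` is stable and orders False before True, so entries whose
# key function returns False (the non-scalars here) come first, in order
new_series = {"a": [1, 2], "n": 42, "b": [3, 4]}
reordered = dict(
    sorted(new_series.items(), key=lambda item: isinstance(item[1], int))
)
print(list(reordered))  # ['a', 'b', 'n']
```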
Thanks for your PR! Indeed, I think for now we should avoid Dask `Expr.drop_nulls`. But `Expr.count` we can add 😄
I kinda felt it was a bad idea to push both of them at the same time 😅
How about, for `drop_nulls`, …?
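(Added sketch, not from the thread:) if the idea here is to have the Dask backend refuse `drop_nulls` rather than return misaligned results, it might look roughly like this; the class and method names are illustrative, not the actual narwhals internals:

```python
class DaskExpr:
    def drop_nulls(self):
        # drop_nulls changes the row index/length, which the current
        # assign-based Dask implementation cannot represent safely,
        # so refuse instead of silently returning wrong results
        raise NotImplementedError(
            "`Expr.drop_nulls` is not supported for the Dask backend"
        )
```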
Shall we do it for the other methods changing the index, at least the ones you mentioned in #637, or should we keep a low profile?
I'd say, let's do it for those too 😄 Not necessarily as part of the same PR; OK to keep things small if it's easier (up to you, all together is fine too!)
What type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below.
I left `if "dask"` in the count_test, as it gave me `AttributeError: 'Scalar' object has no attribute 'name'`. Let me know if that's okay.

I just noticed that `is_between` uses dask's `between` and returns, among other things, the string `"is_between"`, while `fill_nulls` uses dask's `fill_na` and returns the string `"fill_na"`. So should the returned string be the method name from the narwhals API or from the native API? Let me know what the logic behind this string is.
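(Added note, not from the thread:) the `AttributeError` fits the earlier discussion: reductions like `count` produce a scalar rather than a series, and scalars carry no `.name`. A pandas analogy of the same shape of failure:

```python
import pandas as pd

s = pd.Series([1, 2, None], name="a")

print(s.dropna().name)  # "a": a Series keeps its name through dropna
total = s.count()       # a reduction returns a plain scalar (here, 2)
# total.name            # would raise AttributeError: scalars have no .name
```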