feat: dask with_row_index and rename #692
Conversation
narwhals/_dask/dataframe.py (Outdated)

```python
return self._from_native_dataframe(
    self._native_dataframe.assign(**{name: 1}).assign(
        **{name: lambda t: t[name].cumsum() - 1}
    )
)
```
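To illustrate the trick in the snippet above, here is a minimal sketch using plain pandas (whose API dask mirrors): assign a constant column of ones, then replace it with its cumulative sum minus one. The column name `name` is just an illustrative placeholder.

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y", "z"]})
name = "index"  # hypothetical row-index column name

# Constant 1s, then cumsum - 1 gives 0-based positions.
df = df.assign(**{name: 1}).assign(**{name: lambda t: t[name].cumsum() - 1})
print(df[name].tolist())  # [0, 1, 2]
```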
(I'm not an expert in Dask, but I've worked quite a bit with Spark, so my interpretation/way of thinking comes from there.)

I am wondering how the performance would be in a distributed setting with partitioned data. Would `cumsum` require the calculations to run on a single node?

Spark with the pandas API has various ways to set an index: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type

Could we maybe use this as an inspiration? Or is it better to create an array and add it as a column, as we do for pandas-like dataframes? 🤔
> I am wondering how the performance would be in a distributed setting with partitioned data. Would `cumsum` require the calculations to run on a single node?

I am not sure; I find the dask documentation not very explicit on this topic. I based the implementation on a Stack Overflow answer by TomAugspurger.

> Spark with the pandas API has various ways to set an index: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type
> Could we maybe use this as an inspiration? Or is it better to create an array and add it as a column, as we do for pandas-like dataframes? 🤔

Thanks, I will take a closer look and see if that's feasible.
Ah, I see.

TBH I'm not 100% sure that it is a problem; I just wanted to mention it since the row index in Spark was a bit tricky. 😄

We could also decide to investigate this in a follow-up.
Dask has implemented some parallel algorithms for `cumsum`/`cumprod` based on parallel prefix scan algorithms. I don't really know the details, but it's cool stuff :)

Here's a link to the PR for reference: dask/dask#6675
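To make the idea concrete, here is a hypothetical pure-Python sketch of how a cumulative sum can be parallelised across partitions (the general idea behind prefix-scan approaches, not dask's actual implementation): each partition is scanned locally, and each partition's running total becomes the offset for the next one.

```python
from itertools import accumulate

def partitioned_cumsum(partitions):
    # Step 1: local cumulative sum within each partition
    # (each partition could be scanned in parallel).
    local = [list(accumulate(p)) for p in partitions]
    # Step 2: an exclusive scan of partition totals gives each
    # partition's offset, which is added to its local results.
    offset = 0
    result = []
    for scanned in local:
        result.extend(v + offset for v in scanned)
        offset += scanned[-1] if scanned else 0
    return result

print(partitioned_cumsum([[1, 2], [3, 4], [5]]))  # [1, 3, 6, 10, 15]
```

The second step is sequential here, but it only touches one value per partition, which is why the scheme scales with the number of partitions rather than the number of rows.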
(Sorry for the late reply.) Very interesting!

Should we add `method='blelloch'` to use this fancy algorithm? https://docs.dask.org/en/stable/generated/dask.array.cumsum.html

We can also add a comment to say that the implementation comes from that SO answer.
I also didn't come back to this!

> Should we add `method='blelloch'` to use this fancy algorithm?

The docs state "More benchmarking is necessary.", but the PR was merged almost 4 years ago, so I am not sure.

> We can also add a comment to say that the implementation comes from that SO answer

Sure! Adding that right away.
Thanks @FBruzzesi, and @EdAbati + @aidoskanapyanov for reviewing!
* feat: dask with_row_index and rename
* note on implementation and cumsum method