
feat: dask with_row_index and rename #692

Merged: 4 commits into main from feat/dask-with_row_index on Aug 5, 2024
Conversation

FBruzzesi (Member)

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

github-actions bot added the enhancement (New feature or request) label on Jul 31, 2024
Comment on lines 99 to 102
```python
return self._from_native_dataframe(
    self._native_dataframe.assign(**{name: 1}).assign(
        **{name: lambda t: t[name].cumsum() - 1}
    )
```
Collaborator


(I'm not an expert in Dask, but I worked quite a bit with Spark so my interpretation/way of thinking comes from there)

I am wondering how the performance would be in a distributed setting with partitioned data. Would cumsum require the computation to happen on a single node?

Spark with the Pandas API has various ways to set an index: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type

Could we maybe use this as an inspiration? Or is it better to create an array and add it as a column as we do for pandas-like dfs? πŸ€”
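
For illustration, a minimal sketch of the pandas-on-Spark index option mentioned above (the option name and values come from the linked docs; the example data is made up):

```python
import pyspark.pandas as ps

# "sequence" (the default) builds a contiguous 0, 1, 2, ... index on a single node,
# "distributed-sequence" builds the same contiguous index but computes it in a
# distributed fashion, and "distributed" gives monotonically increasing (but not
# contiguous) values without a global ordering step.
ps.set_option("compute.default_index_type", "distributed-sequence")

psdf = ps.DataFrame({"a": [10, 20, 30]})  # index built with the chosen strategy
print(psdf)
```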

FBruzzesi (Member Author)


> I am wondering how the performance would be in a distributed setting with partitioned data. Would cumsum require the computation to happen on a single node?

I am not sure; I find the Dask documentation rather vague on this topic. I based the implementation on a StackOverflow answer by TomAugspurger

> Spark with the Pandas API has various ways to set an index: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type

> Could we maybe use this as an inspiration? Or is it better to create an array and add it as a column as we do for pandas-like dfs? πŸ€”

Thanks, I will take a closer look and see if that's feasible
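
For illustration, a small self-contained sketch of that approach (constant column, cumulative sum, minus one) outside narwhals; the data and column names here are made up:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": ["w", "x", "y", "z"]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Assign a column of ones, then cumsum - 1 yields 0, 1, 2, ... for the whole frame;
# the cumulative sum runs across partition boundaries, so the index is global.
ddf = ddf.assign(index=1)
ddf = ddf.assign(index=ddf["index"].cumsum() - 1)

print(ddf.compute())
```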

Collaborator


Ah I see.

TBH I'm not 100% sure that it is a problem, just wanted to mention it since row index in Spark was a bit tricky. πŸ™‚

We could also decide to investigate this in a follow-up

Contributor


Dask has implemented some parallel algorithms for cumsum / cumprod based on parallel prefix scan algorithms. I don't really know the details, but it's cool stuff :)

Here's a link to the PR for reference: dask/dask#6675
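
For reference, the prefix-scan variant is exposed through a method keyword on dask.array.cumsum (a minimal sketch; the array itself is arbitrary):

```python
import dask.array as da

x = da.ones(1_000_000, chunks=100_000)

# method="sequential" (the default) chains the per-block sums one after another;
# method="blelloch" uses a work-efficient parallel prefix scan that takes two
# passes over the blocks but exposes more parallelism.
y = da.cumsum(x, method="blelloch")
print(y[-1].compute())  # 1000000.0
```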

Collaborator


(sorry for the late reply) Very interesting!

Should we add method='blelloch' to use this fancy algorithm? https://docs.dask.org/en/stable/generated/dask.array.cumsum.html

We can also add a comment to say that the implementation comes from that SO answer

FBruzzesi (Member Author)


I also didn't come back to this!

> Should we add method='blelloch' to use this fancy algorithm?

The docs state "More benchmarking is necessary.", but the PR was merged almost 4 years ago, so I am not sure

> We can also add a comment to say that the implementation comes from that SO answer

Sure! Adding that right away
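
A hedged sketch of how the snippet from the diff might look with both suggestions folded in. The comment mirrors what was discussed above; whether Dask's DataFrame-level cumsum accepts the same method= keyword as dask.array is an assumption here, not something confirmed in this thread:

```python
def with_row_index(self, name: str):
    # Based on TomAugspurger's StackOverflow answer: assign a column of ones,
    # then cumsum - 1 gives a 0-based row index.
    return self._from_native_dataframe(
        self._native_dataframe.assign(**{name: 1}).assign(
            # Assumption: the DataFrame-level cumsum forwards method= to the
            # array-level prefix scan; drop the keyword if it is not supported.
            **{name: lambda t: t[name].cumsum(method="blelloch") - 1}
        )
    )
```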

MarcoGorelli (Member) left a comment


thanks @FBruzzesi , and @EdAbati + @aidoskanapyanov for reviewing!

MarcoGorelli merged commit f776e9a into main on Aug 5, 2024
23 checks passed
FBruzzesi deleted the feat/dask-with_row_index branch on August 5, 2024 at 14:13
aivanoved pushed a commit to aivanoved/narwhals that referenced this pull request Aug 6, 2024
* feat: dask with_row_index and rename

* note on implementation and cumsum method
Labels: enhancement (New feature or request)
Projects: None yet
4 participants