
feat: dask with_row_index and rename #692

Merged: 4 commits into main from feat/dask-with_row_index on Aug 5, 2024
Conversation

FBruzzesi (Member)

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

github-actions bot added the enhancement (New feature or request) label on Jul 31, 2024
Comment on lines 99 to 102
```python
return self._from_native_dataframe(
    self._native_dataframe.assign(**{name: 1}).assign(
        **{name: lambda t: t[name].cumsum() - 1}
    )
```
Collaborator


(I'm not an expert in Dask, but I worked quite a bit with Spark so my interpretation/way of thinking comes from there)

I am wondering how the performance would be in a distributed setting with partitioned data. Would cumsum require the computation to happen on a single node?

Spark with the Pandas API has various ways to set an index: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type

Could we maybe use this as an inspiration? Or is it better to create an array and add it as a column as we do for pandas-like dfs? πŸ€”
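
For illustration, a minimal sketch of the pandas-on-Spark index option mentioned above (the option name and values come from the linked docs; the example data is made up):

```python
import pyspark.pandas as ps

# "sequence" (the default) builds a contiguous 0, 1, 2, ... index on a single node,
# "distributed-sequence" builds the same contiguous index but computes it in a
# distributed fashion, and "distributed" gives monotonically increasing (but not
# contiguous) values without a global ordering step.
ps.set_option("compute.default_index_type", "distributed-sequence")

psdf = ps.DataFrame({"a": [10, 20, 30]})  # index built with the chosen strategy
print(psdf)
```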

FBruzzesi (Member Author)


> I am wondering how the performance would be in a distributed setting with partitioned data. Would cumsum require the computation to happen on a single node?

I am not sure; I find the Dask documentation rather vague on this topic. I based the implementation on a StackOverflow answer by TomAugspurger

> Spark with the Pandas API has various ways to set an index: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type

> Could we maybe use this as an inspiration? Or is it better to create an array and add it as a column as we do for pandas-like dfs? πŸ€”

Thanks, I will take a closer look and see if that's feasible
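
For illustration, a small self-contained sketch of that approach (constant column, cumulative sum, minus one) outside narwhals; the data and column names here are made up:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": ["w", "x", "y", "z"]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Assign a column of ones, then cumsum - 1 yields 0, 1, 2, ... for the whole frame;
# the cumulative sum runs across partition boundaries, so the index is global.
ddf = ddf.assign(index=1)
ddf = ddf.assign(index=ddf["index"].cumsum() - 1)

print(ddf.compute())
```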

Collaborator


Ah I see.

TBH I'm not 100% sure that it is a problem, just wanted to mention it since row index in Spark was a bit tricky. πŸ™‚

We could also decide to investigate this in a follow-up

Contributor


Dask has implemented some parallel algorithms for cumsum / cumprod based on parallel prefix scan algorithms. I don't really know the details, but it's cool stuff :)

Here's a link to the PR for reference: dask/dask#6675
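
For reference, the prefix-scan variant is exposed through a method keyword on dask.array.cumsum (a minimal sketch; the array itself is arbitrary):

```python
import dask.array as da

x = da.ones(1_000_000, chunks=100_000)

# method="sequential" (the default) chains the per-block sums one after another;
# method="blelloch" uses a work-efficient parallel prefix scan that takes two
# passes over the blocks but exposes more parallelism.
y = da.cumsum(x, method="blelloch")
print(y[-1].compute())  # 1000000.0
```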

Collaborator


(sorry for the late reply) Very interesting!

Should we add method='blelloch' to use this fancy algorithm? https://docs.dask.org/en/stable/generated/dask.array.cumsum.html

We can also add a comment to say that the implementation comes from that SO answer

FBruzzesi (Member Author)


I also didn't come back to this!

> Should we add method='blelloch' to use this fancy algorithm?

The docs state "More benchmarking is necessary.", but the PR was merged almost 4 years ago, so I am not sure

> We can also add a comment to say that the implementation comes from that SO answer

Sure! Adding that right away
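
A hedged sketch of how the snippet from the diff might look with both suggestions folded in. The comment mirrors what was discussed above; whether Dask's DataFrame-level cumsum accepts the same method= keyword as dask.array is an assumption here, not something confirmed in this thread:

```python
def with_row_index(self, name: str):
    # Based on TomAugspurger's StackOverflow answer: assign a column of ones,
    # then cumsum - 1 gives a 0-based row index.
    return self._from_native_dataframe(
        self._native_dataframe.assign(**{name: 1}).assign(
            # Assumption: the DataFrame-level cumsum forwards method= to the
            # array-level prefix scan; drop the keyword if it is not supported.
            **{name: lambda t: t[name].cumsum(method="blelloch") - 1}
        )
    )
```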

MarcoGorelli (Member) left a comment


thanks @FBruzzesi , and @EdAbati + @aidoskanapyanov for reviewing!

MarcoGorelli merged commit f776e9a into main on Aug 5, 2024
23 checks passed
FBruzzesi deleted the feat/dask-with_row_index branch on August 5, 2024 at 14:13
aivanoved pushed a commit to aivanoved/narwhals that referenced this pull request Aug 6, 2024
* feat: dask with_row_index and rename

* note on implementation and cumsum method
Labels: enhancement (New feature or request)
Projects: None yet
4 participants