DOCS-#6871: Update Modin on Ray cluster tutorial #6872

Retribution98 · 2024-01-21T18:20:03Z

What do these changes do?

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves Update Modin on Ray cluster tutorial #6871
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/README.md

YarShev · 2024-01-22T13:07:26Z

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/README.md

+
+**This exercise instructs the user on how to start a 700+ core cluster, and it is not shut down until the end of Exercise 6. Read instructions carefully.**
+
+Often in practice we have a need to exceed the capabilities of a single machine. Modin works and performs well in both local mode and in a cluster environment. The key advantage of Modin is that your notebook does not change between local development and cluster execution. Users are not required to think about how many workers exist or how to distribute and partition their data; Modin handles all of this seamlessly and transparently.


Suggested change

-Often in practice we have a need to exceed the capabilities of a single machine. Modin works and performs well in both local mode and in a cluster environment. The key advantage of Modin is that your notebook does not change between local development and cluster execution. Users are not required to think about how many workers exist or how to distribute and partition their data; Modin handles all of this seamlessly and transparently.
+Often in practice we have a need to exceed the capabilities of a single machine.
+Modin works and performs well in both local mode and in a cluster environment.
+The key advantage of Modin is that your notebook does not change between
+local development and cluster execution. Users are not required to think about
+how many workers exist or how to distribute and partition their data;
+Modin handles all of this seamlessly and transparently.

The key advantage of Modin is that your notebook does not change between
local development and cluster execution.

To live up to this statement, it would be better to use a notebook for the tutorial rather than a Python script. @Retribution98 could you describe in more detail what problem you hit when launching from the notebook?

@RehanSD maybe you can give us some advice, since you are one of the last authors of the notebook.

@Retribution98 what effort will it require from you to run benchmarking on a 120 GB dataset as it was originally?

@anmyachev Unfortunately, we cannot use Modin on Ray cluster from Jupyter Notebook without using Ray Client. I also thought about running a Jupyter server on the cluster head node, but I think this is not the best idea. Do you agree with me?
You can suggest using large files, but this will take a lot of time and, I think, it will not be convenient for the user.
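
For context, a minimal sketch of what connecting through Ray Client could look like. The helper names `ray_client_address` and `read_csv_on_cluster` and the host `head-node.example` are hypothetical and not part of this PR; 10001 is Ray's default client server port. The sketch assumes Modin picks up a Ray connection that is already initialized, and that `csv_path` is reachable from the cluster.

```python
def ray_client_address(head_node_host: str, port: int = 10001) -> str:
    """Build a Ray Client URI; 10001 is Ray's default client server port."""
    return f"ray://{head_node_host}:{port}"


def read_csv_on_cluster(head_node_host: str, csv_path: str):
    """Connect through Ray Client first, then run a Modin read on the cluster.

    Hypothetical helper: assumes Modin reuses the Ray connection made here.
    """
    # Imported lazily so ray_client_address works without Ray or Modin installed.
    import ray
    import modin.config as modin_cfg

    ray.init(address=ray_client_address(head_node_host))
    modin_cfg.Engine.put("ray")

    import modin.pandas as pd

    return pd.read_csv(csv_path)


if __name__ == "__main__":
    # "head-node.example" is a placeholder host, not a real machine.
    print(ray_client_address("head-node.example"))
```

Only the address helper runs without a cluster; `read_csv_on_cluster` needs a reachable head node with the Ray Client server running.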

Unfortunately, we cannot use Modin on Ray cluster from Jupyter Notebook without using Ray Client.

ok

I also thought about running a Jupyter server on the cluster head node, but I think this is not the best idea.

@Retribution98 why?

You can suggest using large files, but this will take a lot of time and, I think, it will not be convenient for the user.

We can add a note that this example takes a long time to complete, and indicate a place where the amount of data can be reduced (preferably a one-line change) by user.
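
One way the one-line change could look, as a sketch only: the constant `NROWS` and the helper `load_dataset` are hypothetical names, not taken from the tutorial script. It relies on `read_csv` accepting an `nrows` argument, which pandas does and Modin mirrors.

```python
# The single line a user would edit: how many rows to read from the CSV.
# None means "read the whole file".
NROWS = 1_000_000


def load_dataset(path, read_csv):
    """Read `path` with the given read_csv function, capped at NROWS rows."""
    return read_csv(path, nrows=NROWS)
```

In the tutorial script this would be called as `load_dataset(path, modin.pandas.read_csv)`; passing the reader in keeps the sketch independent of which pandas implementation is installed.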

@anmyachev Deploying a Jupyter notebook is quite a complex task and requires a separate tutorial. It can be added later, but for now I just added a note about this feature.

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/README.md

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_6.py

YarShev · 2024-01-22T13:21:24Z

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_6.py

@@ -0,0 +1,41 @@
+import os


Maybe we should rename this file to cluster_exercise.py?

I think the exercise number can help users understand in what order the exercises should be done because some of the tips from the tutorial for a single node can be useful for a cluster as well.

docs/getting_started/using_modin/using_modin_cluster.rst

YarShev · 2024-02-09T09:43:42Z

docs/getting_started/using_modin/using_modin_cluster.rst

-.. _`Ray's cluster docs`: https://docs.ray.io/en/latest/cluster/getting-started.html
-.. _`NYC Taxi dataset`: https://modin-datasets.intel.com/testing/yellow_tripdata_2015-01.csv
-.. _`Modin's cluster setup config`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml
+Using Modin


Maybe we should make the following structure to avoid duplication in the folder and file names?

./using_modin
    index.rst
    ./local
        index.rst
        modin_on_*.rst
    ./cluster
        index.rst
        modin_on_*.rst

We can, but I'm not sure we should because the URL paths will be changed and some references to Modin docs may be broken.

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml

docs/getting_started/using_modin/using_modin_cluster.rst

docs/getting_started/using_modin/using_modin_cluster/using_modin_ray_cluster.rst

YarShev · 2024-02-09T10:50:49Z

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.md

@@ -0,0 +1,124 @@
+![LOGO](../../../img/MODIN_ver2_hrz.png)


Can we avoid text duplication we have here and in rst file? Can we just refer to rst doc here?

I'm afraid not. Before this PR, we had a Jupyter notebook that had the same content as the RST documentation. We can leave only the RST file, but then we should delete the tutorial folder altogether.

Signed-off-by: Kirill Suvorov <[email protected]>

Co-authored-by: Iaroslav Igoshev <[email protected]>

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.py

docs/getting_started/using_modin/using_modin_cluster/using_modin_ray_cluster.rst

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.md

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.py

Co-authored-by: Iaroslav Igoshev <[email protected]>

docs/getting_started/using_modin/using_modin_cluster.rst

YarShev · 2024-02-14T20:47:01Z

docs/getting_started/using_modin/using_modin_cluster.rst

-
-   %%time
-   apply_result = df.map(str)
+In this tutorial, we provide the `exercise_5.py`_ script, which reads the data from the


The link to exercise_5.py doesn't work in the docs. Recheck.

This is because this PR hasn't been merged into master yet, and the link points to the version of the file in master.

docs/getting_started/using_modin/using_modin_cluster.rst

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/README.md

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.py

YarShev

Overall, LGTM! @anmyachev, any other comments?

docs/getting_started/using_modin/using_modin_cluster.rst

YarShev · 2024-02-19T11:59:00Z

docs/getting_started/using_modin/using_modin_cluster.rst

-   %%time
-   groupby_result = df.groupby("passenger_count").count()
+.. note::
+   Some Dataframe functions are executed asynchronously, so to correctly measure execution time 


Not relevant anymore.

Right, this note has been removed.

I still see the note.

Please check it again, I repushed my changes.

docs/getting_started/using_modin/using_modin_cluster.rst

Co-authored-by: Iaroslav Igoshev <[email protected]>

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/README.md

anmyachev

LGTM!

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/README.md

Retribution98 requested review from devin-petersohn, mvashishtha, RehanSD, YarShev, vnlitvinov, anmyachev, dchigarev and a team as code owners January 21, 2024 18:20

YarShev reviewed Jan 22, 2024

Retribution98 force-pushed the jupyter_ray_cluster branch from 277e26b to e282fbf Compare February 8, 2024 17:02

YarShev reviewed Feb 9, 2024

Retribution98 added 4 commits February 12, 2024 10:41

DOCS-modin-project#6871: Update Modin on Ray cluster tutorial

968bde5

Signed-off-by: Kirill Suvorov <[email protected]>

Fix docs

224a120

Fix docs build

7b20d89

fix img path

572afe1

Retribution98 force-pushed the jupyter_ray_cluster branch from 927ba4f to 572afe1 Compare February 12, 2024 10:53

Apply suggestions from code review

2911a37

Co-authored-by: Iaroslav Igoshev <[email protected]>

github-advanced-security bot found potential problems Feb 13, 2024

Added minor adjustments

f1bf7b1

Retribution98 force-pushed the jupyter_ray_cluster branch from e8bb146 to f1bf7b1 Compare February 13, 2024 15:22

YarShev reviewed Feb 14, 2024

Retribution98 and others added 2 commits February 14, 2024 17:19

Apply suggestions from code review

f2e0952

Co-authored-by: Iaroslav Igoshev <[email protected]>

Remove dublication

ebe94b0

YarShev reviewed Feb 14, 2024

Fix aes after code review

d0377ba

YarShev reviewed Feb 19, 2024

Apply suggestions from code review

b1c2e76

Co-authored-by: Iaroslav Igoshev <[email protected]>

YarShev previously approved these changes Feb 19, 2024

Remove a note

b9bbb7d

Retribution98 dismissed YarShev’s stale review via b9bbb7d February 19, 2024 14:18

anmyachev reviewed Feb 19, 2024

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/README.md (outdated)

anmyachev previously approved these changes Feb 19, 2024

YarShev reviewed Feb 19, 2024

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/README.md (outdated)

YarShev dismissed anmyachev’s stale review via 96652bc February 19, 2024 14:38

Update examples/tutorial/jupyter/execution/pandas_on_ray/cluster/READ…

96652bc

…ME.md

YarShev approved these changes Feb 19, 2024

YarShev merged commit 3615811 into modin-project:master Feb 19, 2024
10 checks passed

Retribution98 deleted the jupyter_ray_cluster branch February 19, 2024 14:53

dchigarev mentioned this pull request Feb 19, 2024

FIX-#6944: Apply 'isort' formatting for scripts from tutorials #6945

Merged

7 tasks
