DOCS-#6871: Update Modin on Ray cluster tutorial #6872
examples/tutorial/jupyter/execution/pandas_on_ray/cluster/README.md
**This exercise instructs the user on how to start a 700+ core cluster, and it is not shut down until the end of Exercise 6. Read instructions carefully.**

Often in practice we have a need to exceed the capabilities of a single machine. Modin works and performs well in both local mode and in a cluster environment. The key advantage of Modin is that your notebook does not change between local development and cluster execution. Users are not required to think about how many workers exist or how to distribute and partition their data; Modin handles all of this seamlessly and transparently.

Suggested change (same text, wrapped):

Often in practice we have a need to exceed the capabilities of a single machine.
Modin works and performs well in both local mode and in a cluster environment.
The key advantage of Modin is that your notebook does not change between
local development and cluster execution. Users are not required to think about
how many workers exist or how to distribute and partition their data;
Modin handles all of this seamlessly and transparently.

> The key advantage of Modin is that your notebook does not change between
> local development and cluster execution.

To live up to this statement, it would be better to use a notebook for the tutorial rather than a Python script. @Retribution98 could you describe in more detail what the problem was when launching from the notebook?
@RehanSD maybe you can give us some advice, since you are one of the last authors of the notebook.
@Retribution98 how much effort would it take to run the benchmark on a 120 GB dataset, as was done originally?

@anmyachev Unfortunately, we cannot use Modin on Ray cluster from Jupyter Notebook without using Ray Client. I also thought about running a Jupyter server on the cluster head node, but I think this is not the best idea. Do you agree with me?
You can suggest using large files, but this will take a lot of time and, I think, it will not be convenient for the user.

> Unfortunately, we cannot use Modin on Ray cluster from Jupyter Notebook without using Ray Client.

ok

> I also thought about running a Jupyter server on the cluster head node, but I think this is not the best idea.

@Retribution98 why?

> You can suggest using large files, but this will take a lot of time and, I think, it will not be convenient for the user.

We can add a note that this example takes a long time to complete, and point to the place where the user can reduce the amount of data (preferably with a one-line change).

@anmyachev Deploying a Jupyter notebook is quite a complex task and requires a separate tutorial. It can be added later, but for now I just added a note about this feature.
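
The "one-line change" idea from this thread could look roughly like the sketch below. It is an illustration only and is not taken from the exercise script: the constant `N_COPIES`, the `TUTORIAL_N_COPIES` environment variable, the helper names, and the `data.csv` path are all hypothetical, and it assumes the script builds its large input by reading one file several times and concatenating the results.

```python
import os

# Hypothetical knob: the one line a user edits to shrink or grow the run.
N_COPIES = 8


def resolve_copies(default=N_COPIES, env_var="TUTORIAL_N_COPIES"):
    """Return the number of dataset copies, letting an env var override the default."""
    raw = os.environ.get(env_var)
    if raw is None:
        return default
    copies = int(raw)
    if copies < 1:
        raise ValueError(f"{env_var} must be a positive integer, got {raw!r}")
    return copies


def input_paths(source, copies):
    """Repeat one source path so the caller can read and concatenate `copies` frames."""
    return [source] * copies


if __name__ == "__main__":
    # Placeholder path; the real tutorial input is not assumed here.
    paths = input_paths("data.csv", resolve_copies())
    print(f"Reading {len(paths)} copies")
```

Keeping the size behind a single named constant means the long-running default stays documented while a user can cut the run time without touching the rest of the script.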
examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_6.py
@@ -0,0 +1,41 @@
import os
Maybe we should rename this file to cluster_exercise.py?
I think the exercise number can help users understand in what order the exercises should be done because some of the tips from the tutorial for a single node can be useful for a cluster as well.
Force-pushed from 277e26b to e282fbf
.. _`Ray's cluster docs`: https://docs.ray.io/en/latest/cluster/getting-started.html
.. _`NYC Taxi dataset`: https://modin-datasets.intel.com/testing/yellow_tripdata_2015-01.csv
.. _`Modin's cluster setup config`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml
Using Modin
Maybe we should use the following structure to avoid duplication in the folder and file names?

    ./using_modin
        index.rst
        ./local
            index.rst
            modin_on_*.rst
        ./cluster
            index.rst
            modin_on_*.rst

We can, but I'm not sure we should because the URL paths will be changed and some references to Modin docs may be broken.
@@ -0,0 +1,124 @@
Can we avoid duplicating the text between this file and the rst file? Can we just refer to the rst doc here?
I'm afraid not. Before this PR, we had a Jupyter notebook that had the same content as the RST documentation. We can leave only the RST file, but then we should delete the tutorial folder altogether.
Signed-off-by: Kirill Suvorov <[email protected]>
Force-pushed from 927ba4f to 572afe1
Co-authored-by: Iaroslav Igoshev <[email protected]>
Force-pushed from e8bb146 to f1bf7b1
Co-authored-by: Iaroslav Igoshev <[email protected]>
%%time
apply_result = df.map(str)
In this tutorial, we provide the `exercise_5.py`_ script, which reads the data from the
The link to exercise_5.py doesn't work in the docs. Recheck.
This is because this PR hasn't been merged into master yet; the link points to the version of the file on master.
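
The `%%time` cell magic in the diff context above is only available in IPython/Jupyter, so a plain Python script needs another way to time each step. The sketch below is one possible approach and is not taken from `exercise_5.py`; the `timed` helper and the list-comprehension workload are hypothetical stand-ins.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f} s")


# Stand-in workload; in the tutorial this would be a dataframe call such as df.map(str).
timings = {}
with timed("map", timings):
    mapped = [str(value) for value in range(1000)]
```

Note that a timer like this only covers work that finishes inside the `with` block, so an operation that returns before its computation completes would be under-measured.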
Overall, LGTM! @anmyachev, any other comments?
%%time
groupby_result = df.groupby("passenger_count").count()
.. note::
   Some Dataframe functions are executed asynchronously, so to correctly measure execution time
Not relevant anymore.
Right, this note has been removed.
I still see the note.

> I still see the note.

Please check it again, I repushed my changes.

Co-authored-by: Iaroslav Igoshev <[email protected]>
LGTM!
What do these changes do?
- passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- signed commit with git commit -s
- module layout described at docs/development/architecture.rst is up-to-date