Skip to content

Commit

Permalink
DOCS-#6819: Update Modin on cluster documentation (#6678)
Browse files Browse the repository at this point in the history
Co-authored-by: Iaroslav Igoshev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
  • Loading branch information
anmyachev and YarShev authored Dec 13, 2023
1 parent acfcf34 commit 7f2dc36
Show file tree
Hide file tree
Showing 3 changed files with 46 additions and 15 deletions.
37 changes: 29 additions & 8 deletions docs/getting_started/using_modin/using_modin_cluster.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ Using Modin in a Cluster

.. note::
| *Estimated Reading Time: 15 minutes*
| You can follow along in a Jupyter notebook in this two-part tutorial: [`Part 1 <https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.ipynb>`_], [`Part 2 <https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_6.ipynb>`_].
| You can follow along in a Jupyter notebook in this two-part tutorial: `Part 1`_, `Part 2`_.
Often in practice we have a need to exceed the capabilities of a single machine. Modin
works and performs well in both local mode and in a cluster environment. The key
Expand All @@ -15,8 +15,9 @@ transparently.

Starting up a Ray Cluster
-------------------------
Modin is able to utilize Ray's built-in autoscaled cluster. To launch a Ray cluster using Amazon Web Service (AWS), you can use `this file <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml>`_
as the config file.
Modin is able to utilize Ray's built-in autoscaled cluster. To launch a Ray cluster
using Amazon Web Service (AWS), you can use `Modin's cluster setup config`_
(`Ray's autoscaler options`_).

.. code-block:: bash
Expand All @@ -31,8 +32,11 @@ To start up the Ray cluster, run the following command in your terminal:
This configuration script starts 1 head node (m5.24xlarge) and 7 workers (m5.24xlarge),
768 total CPUs. For more information on how to launch a Ray cluster across different
cloud providers or on-premise, you can also refer to the Ray documentation `here <https://docs.ray.io/en/latest/cluster/cloud.html>`_.
cloud providers or on-premise, you can also refer to the `Ray's cluster docs`_.

.. note::
By default, Modin on Ray uses 60% of the system memory. It is recommended to use the same
amount, when using your own cluster (for each node).

Connecting to a Ray Cluster
---------------------------
Expand All @@ -56,13 +60,20 @@ Modin:
Congratualions! You have successfully connected to the Ray cluster.

.. note::
Be careful when using the Ray client to connect to a remote cluster.
This connection mode may not work. Known bugs:
- https://github.com/ray-project/ray/issues/38713,
- https://github.com/modin-project/modin/issues/6641.

Using Modin on a Ray Cluster
----------------------------

Now that we have a Ray cluster up and running, we can use Modin to perform pandas
operation as if we were working with pandas on a single machine. We test Modin's
performance on the 200MB `NYC Taxi dataset <https://modin-datasets.s3.amazonaws.com/testing/yellow_tripdata_2015-01.csv>`_ that was provided as part of our `cluster setup script <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml>`_. We can time the following operation
in a Jupyter notebook:
performance on the 200MB `NYC Taxi dataset`_ that was provided as part of our
`Modin's cluster setup config`_. We can time the following operation in a Jupyter
notebook:

.. code-block:: python
Expand All @@ -78,6 +89,10 @@ in a Jupyter notebook:
%%time
apply_result = df.map(str)
.. note::
When using local paths, make sure that they are available on all nodes in the
cluster, for example using distributed file system like NFS.

Modin performance scales as the number of nodes and cores increases. The following
chart shows the performance of the above operations with 2, 4, and 8 nodes, with
improvements in performance as we increase the number of resources Modin can use.
Expand All @@ -93,7 +108,7 @@ Advanced: Configuring your Ray Environment
In some cases, it may be useful to customize your Ray environment. Below, we have listed
a few ways you can solve common problems in data management with Modin by customizing
your Ray environment. It is possible to use any of Ray's initialization parameters,
which are all found in `Ray's documentation`_.
which are all found in `Ray's API docs`_.

.. code-block:: python
Expand All @@ -108,4 +123,10 @@ you can customize your Ray environment for use in Modin!
.. _`DataFrame`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
.. _`pandas`: https://pandas.pydata.org/pandas-docs/stable/
.. _`open an issue`: https://github.com/modin-project/modin/issues
.. _`Ray's documentation`: https://ray.readthedocs.io/en/latest/api.html
.. _`Ray's API docs`: https://ray.readthedocs.io/en/latest/api.html
.. _`Part 1`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.ipynb
.. _`Part 2`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_6.ipynb
.. _`Ray's autoscaler options`: https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html#cluster-config
.. _`Ray's cluster docs`: https://docs.ray.io/en/latest/cluster/getting-started.html
.. _`NYC Taxi dataset`: https://modin-datasets.s3.amazonaws.com/testing/yellow_tripdata_2015-01.csv
.. _`Modin's cluster setup config`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml
22 changes: 15 additions & 7 deletions docs/getting_started/using_modin/using_modin_locally.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,14 +4,16 @@ Using Modin Locally

.. note::
| *Estimated Reading Time: 5 minutes*
| You can follow along this tutorial in a Jupyter notebook `here <hhttps://github.com/modin-project/modin/tree/master/examples/quickstart.ipynb>`.
| You can follow along this tutorial in the `Jupyter notebook`_.
In our quickstart example, we have already seen how you can achieve considerable
speedup from Modin, even on a single machine. Users do not need to know how many cores their system has, nor do they need to specify how to distribute the data. In fact,
speedup from Modin, even on a single machine. Users do not need to know how many
cores their system has, nor do they need to specify how to distribute the data. In fact,
users can **continue using their existing pandas code** while experiencing a
considerable speedup from Modin, even on a single machine.

To use Modin on a single machine, only a modification of the import statement is needed. Once you've changed your import statement, you're ready to use Modin
To use Modin on a single machine, only a modification of the import statement is needed.
Once you've changed your import statement, you're ready to use Modin
just like you would pandas, since the API is identical to pandas.

.. code-block:: python
Expand Down Expand Up @@ -66,7 +68,8 @@ cluster for you:
Finally, if you already have an Ray or Dask engine initialized, Modin will
automatically attach to whichever engine is available. If you are interested in using
Modin with HDK engine, please refer to :doc:`these instructions </development/using_hdk>`. For additional information on other settings you can configure, see
Modin with HDK engine, please refer to :doc:`these instructions </development/using_hdk>`.
For additional information on other settings you can configure, see
:doc:`Modin's config </flow/modin/config>` page for more details.

Advanced: Configuring the resources Modin uses
Expand All @@ -81,8 +84,8 @@ the following code:
import modin
print(modin.config.NPartitions.get()) #prints 16 on a laptop with 16 physical cores
Modin fully utilizes the resources on your machine. To read more about how this works, see :doc:`Why Modin? </getting_started/why_modin/pandas/>`
page for more details.
Modin fully utilizes the resources on your machine. To read more about how this works,
see :doc:`Why Modin? </getting_started/why_modin/pandas/>` page for more details.

Since Modin will use all of the resources available on your machine by default, at
times, it is possible that you may like to limit the amount of resources Modin uses to
Expand Down Expand Up @@ -116,4 +119,9 @@ specify more processors than you have available on your machine; however this wi
improve the performance (and might end up hurting the performance of the system).

.. note::
Make sure to update the ``MODIN_CPUS`` configuration and initialize your preferred engine before you start working with the first operation using Modin! Otherwise, Modin will opt for the default setting.
Make sure to update the ``MODIN_CPUS`` configuration and initialize your preferred
engine before you start working with the first operation using Modin! Otherwise,
Modin will opt for the default setting.


.. _`Jupyter notebook`: https://github.com/modin-project/modin/tree/master/examples/quickstart.ipynb
2 changes: 2 additions & 0 deletions modin/core/execution/ray/common/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,8 @@
_RAY_IGNORE_UNHANDLED_ERRORS_VAR = "RAY_IGNORE_UNHANDLED_ERRORS"

ObjectIDType = ray.ObjectRef
# TODO: Minimum version of Ray - 1.13
# `if` branch can be deleted
if version.parse(ray.__version__) >= version.parse("1.2.0"):
from ray.util.client.common import ClientObjectRef

Expand Down

0 comments on commit 7f2dc36

Please sign in to comment.