[1/n][torch/elastic] Move torchelastic docs *.rst (pytorch#148)
Summary:
Pull Request resolved: pytorch/elastic#148

Pull Request resolved: pytorch#56811

Moves the Sphinx docs `*.rst` files from the torchelastic repository to torch. Note: this only moves the rst files; the next step is to link them to the main pytorch `index.rst` and write a new `examples.rst`.

Reviewed By: H-Huang

Differential Revision: D27974751

fbshipit-source-id: 8ff9f242aa32e0326c37da3916ea0633aa068fc5
Kiuk Chung authored and Kushashwa Shrimali committed May 18, 2021
1 parent 111c439 commit 152817c
Showing 21 changed files with 559 additions and 5 deletions.
2 changes: 2 additions & 0 deletions docs/requirements.txt
@@ -4,3 +4,5 @@ docutils==0.16
sphinxcontrib.katex
matplotlib
tensorboard
# required to build torch.distributed.elastic.rendezvous.etcd* docs
python-etcd>=0.4.5
42 changes: 42 additions & 0 deletions docs/source/distributed.elastic.rst
@@ -0,0 +1,42 @@
Torch Distributed Elastic
============================

Makes distributed PyTorch fault-tolerant and elastic.

Get Started
---------------
.. toctree::
:maxdepth: 1
:caption: Usage

elastic/quickstart
elastic/train_script
elastic/examples

Documentation
---------------

.. toctree::
:maxdepth: 1
:caption: API

elastic/run
elastic/agent
elastic/multiprocessing
elastic/errors
elastic/rendezvous
elastic/timer
elastic/metrics
elastic/events

.. toctree::
:maxdepth: 1
:caption: Advanced

elastic/customization

.. toctree::
:maxdepth: 1
:caption: Plugins

elastic/kubernetes
61 changes: 61 additions & 0 deletions docs/source/elastic/agent.rst
@@ -0,0 +1,61 @@
Elastic Agent
==============

.. automodule:: torch.distributed.elastic.agent
.. currentmodule:: torch.distributed.elastic.agent

Server
--------

.. automodule:: torch.distributed.elastic.agent.server

Below is a diagram of an agent that manages a local group of workers.

.. image:: agent_diagram.jpg

Concepts
--------

This section describes the high-level classes and concepts that
are relevant to understanding the role of the ``agent`` in torchelastic.

.. currentmodule:: torch.distributed.elastic.agent.server

.. autoclass:: ElasticAgent
:members:

.. autoclass:: WorkerSpec
:members:

.. autoclass:: WorkerState
:members:

.. autoclass:: Worker
:members:

.. autoclass:: WorkerGroup
:members:

Implementations
-------------------

Below are the agent implementations provided by torchelastic.

.. currentmodule:: torch.distributed.elastic.agent.server.local_elastic_agent
.. autoclass:: LocalElasticAgent


Extending the Agent
---------------------

To extend the agent you can implement ``ElasticAgent`` directly; however,
we recommend that you extend ``SimpleElasticAgent`` instead, which provides
most of the scaffolding and leaves you with a few specific abstract methods
to implement.
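
For illustration, a minimal sketch of a custom agent built on ``SimpleElasticAgent`` is
shown below. The hook names and their bodies are assumptions for the purpose of the
example; consult the ``SimpleElasticAgent`` API reference below for the authoritative
set of abstract methods and signatures.

.. code-block:: python

    # my_agent.py -- hypothetical sketch; method names and signatures should be
    # verified against the SimpleElasticAgent API documented below
    from torch.distributed.elastic.agent.server import SimpleElasticAgent, WorkerSpec

    class MyElasticAgent(SimpleElasticAgent):
        def __init__(self, spec: WorkerSpec):
            super().__init__(spec)

        def _start_workers(self, worker_group):
            # launch one worker per local rank, return {local_rank: worker_id}
            ...

        def _stop_workers(self, worker_group):
            # terminate all workers in the group
            ...

        def _monitor_workers(self, worker_group):
            # poll worker health and return a RunResult for the group
            ...

        def _shutdown(self):
            # release any resources held by the agent
            ...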

.. currentmodule:: torch.distributed.elastic.agent.server
.. autoclass:: SimpleElasticAgent
:members:
:private-members:

.. autoclass:: torch.distributed.elastic.agent.server.api.RunResult
Binary file added docs/source/elastic/agent_diagram.jpg
118 changes: 118 additions & 0 deletions docs/source/elastic/customization.rst
@@ -0,0 +1,118 @@
Customization
=============

This section describes how to customize TorchElastic to fit your needs.

Launcher
------------------------

The launcher program that ships with TorchElastic
should be sufficient for most use-cases (see :ref:`launcher-api`).
You can implement a custom launcher by
programmatically creating an agent and passing it specs for your workers as
shown below.

.. code-block:: python

    # my_launcher.py

    if __name__ == "__main__":
        args = parse_args(sys.argv[1:])
        rdzv_handler = RendezvousHandler(...)
        spec = WorkerSpec(
            local_world_size=args.nproc_per_node,
            fn=trainer_entrypoint_fn,
            args=(args.fn_args, ...),
            rdzv_handler=rdzv_handler,
            max_restarts=args.max_restarts,
            monitor_interval=args.monitor_interval,
        )
        agent = LocalElasticAgent(spec, start_method="spawn")
        try:
            run_result = agent.run()
            if run_result.is_failed():
                print(f"worker 0 failed with: {run_result.failures[0]}")
            else:
                print(f"worker 0 return value is: {run_result.return_values[0]}")
        except Exception as ex:
            # handle exception
            ...

Rendezvous Handler
------------------------

To implement your own rendezvous, extend ``torch.distributed.elastic.rendezvous.RendezvousHandler``
and implement its methods.

.. warning:: Rendezvous handlers are tricky to implement. Before you begin
make sure you completely understand the properties of rendezvous.
Please refer to :ref:`rendezvous-api` for more information.
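
For illustration only, a rough sketch of a custom handler is shown below. The method
names are assumptions based on the rendezvous API; refer to :ref:`rendezvous-api` for
the complete and authoritative interface.

.. code-block:: python

    # my_rendezvous.py -- hypothetical sketch; see the RendezvousHandler API
    # for the complete set of required methods
    from torch.distributed.elastic.rendezvous import RendezvousHandler

    class MyRendezvousHandler(RendezvousHandler):
        def __init__(self, params):
            self._params = params

        def next_rendezvous(self):
            # block until the next rendezvous round completes, then return
            # the store and this node's (rank, world_size) for the round
            ...

        def is_closed(self):
            # return True once the rendezvous is permanently closed
            ...

        def set_closed(self):
            # mark the rendezvous closed so no new nodes can join
            ...

        def num_nodes_waiting(self):
            # number of nodes waiting to join the next round
            ...

        def get_run_id(self):
            # unique id of this job's rendezvous
            ...

        def shutdown(self):
            # release resources; return True on a clean shutdown
            ...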

Once implemented you can pass your custom rendezvous handler to the worker
spec when creating the agent.

.. code-block:: python

    spec = WorkerSpec(
        rdzv_handler=MyRendezvousHandler(params),
        ...
    )
    elastic_agent = LocalElasticAgent(spec, start_method=start_method)
    elastic_agent.run(spec.role)

Metric Handler
-----------------------------

TorchElastic emits platform level metrics (see :ref:`metrics-api`).
By default metrics are emitted to `/dev/null` so you will not see them.
To have the metrics pushed to a metric handling service in your infrastructure,
implement a `torch.distributed.elastic.metrics.MetricHandler` and `configure` it in your
custom launcher.

.. code-block:: python

    # my_launcher.py

    import torch.distributed.elastic.metrics as metrics

    class MyMetricHandler(metrics.MetricHandler):
        def emit(self, metric_data: metrics.MetricData):
            # push metric_data to your metric sink
            ...

    def main():
        metrics.configure(MyMetricHandler())

        spec = WorkerSpec(...)
        agent = LocalElasticAgent(spec)
        agent.run()

Events Handler
-----------------------------

TorchElastic supports events recording (see :ref:`events-api`).
The events module defines an API that allows you to record events and
implement a custom ``EventHandler``. An ``EventHandler`` publishes events
produced during torchelastic execution to different destinations, e.g. AWS CloudWatch.
By default, the `torch.distributed.elastic.events.NullEventHandler` is used, which
ignores events. To configure a custom events handler, implement the
`torch.distributed.elastic.events.EventHandler` interface and `configure` it
in your custom launcher.

.. code-block:: python

    # my_launcher.py

    import torch.distributed.elastic.events as events

    class MyEventHandler(events.EventHandler):
        def record(self, event: events.Event):
            # process event
            ...

    def main():
        events.configure(MyEventHandler())

        spec = WorkerSpec(...)
        agent = LocalElasticAgent(spec)
        agent.run()
17 changes: 17 additions & 0 deletions docs/source/elastic/errors.rst
@@ -0,0 +1,17 @@
Error Propagation
==================

.. automodule:: torch.distributed.elastic.multiprocessing.errors

Methods and Classes
---------------------

.. currentmodule:: torch.distributed.elastic.multiprocessing.errors

.. autofunction:: torch.distributed.elastic.multiprocessing.errors.record

.. autoclass:: ChildFailedError

.. autoclass:: ErrorHandler

.. autoclass:: ProcessFailure
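
A minimal usage sketch is shown below: the ``record`` decorator wraps the per-worker
entrypoint so that an unhandled exception is captured and reported back to the agent
(the entrypoint name is hypothetical).

.. code-block:: python

    # trainer.py -- minimal sketch of decorating the worker entrypoint with @record
    from torch.distributed.elastic.multiprocessing.errors import record

    @record
    def main():
        # training loop; any unhandled exception raised here is recorded
        # and surfaces as a ProcessFailure on the agent side
        ...

    if __name__ == "__main__":
        main()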
Binary file added docs/source/elastic/etcd_rdzv_diagram.png
24 changes: 24 additions & 0 deletions docs/source/elastic/events.rst
@@ -0,0 +1,24 @@
.. _events-api:

Events
============================

.. automodule:: torch.distributed.elastic.events

API Methods
------------

.. autofunction:: torch.distributed.elastic.events.record

.. autofunction:: torch.distributed.elastic.events.get_logging_handler

Event Objects
-----------------

.. currentmodule:: torch.distributed.elastic.events.api

.. autoclass:: torch.distributed.elastic.events.api.Event

.. autoclass:: torch.distributed.elastic.events.api.EventSource

.. autoclass:: torch.distributed.elastic.events.api.EventMetadataValue
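
As an illustration, a small sketch of recording an event follows. The ``Event``
constructor arguments and the availability of ``Event``/``EventSource`` at the package
top level are assumptions inferred from the classes listed above.

.. code-block:: python

    # sketch only; constructor arguments are assumptions, see the Event class above
    import torch.distributed.elastic.events as events

    event = events.Event(
        name="my_event",
        source=events.EventSource.AGENT,
        metadata={"run_id": "my_job"},
    )
    events.record(event)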
4 changes: 4 additions & 0 deletions docs/source/elastic/examples.rst
@@ -0,0 +1,4 @@
Examples
==========================

Please refer to the `elastic/examples README <https://github.com/pytorch/elastic/tree/master/examples>`_.
5 changes: 5 additions & 0 deletions docs/source/elastic/kubernetes.rst
@@ -0,0 +1,5 @@
TorchElastic Kubernetes
==========================

Please refer to the `Kubernetes README <https://github.com/pytorch/elastic/tree/master/kubernetes>`_
in our GitHub repository for more information on the Elastic Job Controller and custom resource definition.
31 changes: 31 additions & 0 deletions docs/source/elastic/metrics.rst
@@ -0,0 +1,31 @@
.. _metrics-api:

Metrics
=========

.. automodule:: torch.distributed.elastic.metrics


Metric Handlers
-----------------

.. currentmodule:: torch.distributed.elastic.metrics.api

Below are the metric handlers included with torchelastic.

.. autoclass:: MetricHandler

.. autoclass:: ConsoleMetricHandler

.. autoclass:: NullMetricHandler



Methods
------------

.. autofunction:: torch.distributed.elastic.metrics.configure

.. autofunction:: torch.distributed.elastic.metrics.prof

.. autofunction:: torch.distributed.elastic.metrics.put_metric
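
For illustration, a short sketch of the typical calls follows. The argument names and
the availability of ``ConsoleMetricHandler`` at the package top level are assumptions;
check the signatures documented above.

.. code-block:: python

    # sketch only; verify argument names against the function docs above
    import torch.distributed.elastic.metrics as metrics

    # send metrics to stdout instead of the default no-op sink
    metrics.configure(metrics.ConsoleMetricHandler())

    @metrics.prof
    def rendezvous_barrier():
        # duration and success/failure of this call are published as metrics
        ...

    # publish an ad-hoc counter value
    metrics.put_metric("checkpoint_size_mb", 42)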
24 changes: 24 additions & 0 deletions docs/source/elastic/multiprocessing.rst
@@ -0,0 +1,24 @@
:github_url: https://github.com/pytorch/elastic

Multiprocessing
================

.. automodule:: torch.distributed.elastic.multiprocessing

Starting Multiple Workers
---------------------------

.. autofunction:: torch.distributed.elastic.multiprocessing.start_processes

Process Context
----------------

.. currentmodule:: torch.distributed.elastic.multiprocessing.api

.. autoclass:: PContext

.. autoclass:: MultiprocessContext

.. autoclass:: SubprocessContext

.. autoclass:: RunProcsResult
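
As an illustration, a sketch of launching a function on two local processes follows.
The keyword arguments are assumptions and should be verified against the
``start_processes`` signature above.

.. code-block:: python

    # sketch only; verify keyword arguments against the start_processes docs above
    from torch.distributed.elastic.multiprocessing import start_processes

    def trainer(msg):
        print(msg)

    if __name__ == "__main__":
        ctx = start_processes(
            name="trainer",
            entrypoint=trainer,
            args={0: ("hello",), 1: ("world",)},      # positional args per local rank
            envs={0: {"LOCAL_RANK": "0"}, 1: {"LOCAL_RANK": "1"}},
            log_dir="/tmp/elastic_logs",
            start_method="spawn",
        )
        ctx.wait()  # block until all workers exit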
50 changes: 50 additions & 0 deletions docs/source/elastic/quickstart.rst
@@ -0,0 +1,50 @@
Quickstart
===========

.. code-block:: bash

    pip install torch

    # start a single-node etcd server on ONE host
    etcd --enable-v2 \
         --listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001 \
         --advertise-client-urls PUBLIC_HOSTNAME:2379

To launch a **fault-tolerant** job, run the following on all nodes.

.. code-block:: bash

    python -m torch.distributed.run \
        --nnodes=NUM_NODES \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=etcd \
        --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

To launch an **elastic** job, run the following on at least ``MIN_SIZE`` nodes
and at most ``MAX_SIZE`` nodes.

.. code-block:: bash

    python -m torch.distributed.run \
        --nnodes=MIN_SIZE:MAX_SIZE \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=etcd \
        --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

.. note:: The ``--standalone`` option can be passed to launch a single-node job with
   a sidecar rendezvous server. You do not have to pass ``--rdzv_id``, ``--rdzv_endpoint``,
   and ``--rdzv_backend`` when the ``--standalone`` option is used.


.. note:: Learn more about writing your distributed training script
`here <train_script.html>`_.

If ``torch.distributed.run`` does not meet your requirements,
you may use our APIs directly for more powerful customization. Start by
taking a look at the `elastic agent <agent.html>`_ API.