Commit
[1/n][torch/elastic] Move torchelastic docs *.rst (pytorch#148)
Summary:
Pull Request resolved: pytorch/elastic#148
Pull Request resolved: pytorch#56811

Moves docs sphinx `*.rst` files from the torchelastic repository to torch.

Note: this only moves the rst files; the next step is to link them into the main pytorch `index.rst` and write a new `examples.rst`.

Reviewed By: H-Huang
Differential Revision: D27974751
fbshipit-source-id: 8ff9f242aa32e0326c37da3916ea0633aa068fc5
Kiuk Chung authored and Kushashwa Shrimali committed on May 18, 2021
1 parent 111c439 · commit 152817c
Showing 21 changed files with 559 additions and 5 deletions.
@@ -0,0 +1,42 @@
Torch Distributed Elastic
============================

Makes distributed PyTorch fault-tolerant and elastic.

Get Started
---------------

.. toctree::
   :maxdepth: 1
   :caption: Usage

   elastic/quickstart
   elastic/train_script
   elastic/examples

Documentation
---------------

.. toctree::
   :maxdepth: 1
   :caption: API

   elastic/run
   elastic/agent
   elastic/multiprocessing
   elastic/errors
   elastic/rendezvous
   elastic/timer
   elastic/metrics
   elastic/events

.. toctree::
   :maxdepth: 1
   :caption: Advanced

   elastic/customization

.. toctree::
   :maxdepth: 1
   :caption: Plugins

   elastic/kubernetes
@@ -0,0 +1,61 @@
Elastic Agent
==============

.. automodule:: torch.distributed.elastic.agent
.. currentmodule:: torch.distributed.elastic.agent

Server
--------

.. automodule:: torch.distributed.elastic.agent.server

Below is a diagram of an agent that manages a local group of workers.

.. image:: agent_diagram.jpg

Concepts
--------

This section describes the high-level classes and concepts that
are relevant to understanding the role of the ``agent`` in torchelastic.

.. currentmodule:: torch.distributed.elastic.agent.server

.. autoclass:: ElasticAgent
   :members:

.. autoclass:: WorkerSpec
   :members:

.. autoclass:: WorkerState
   :members:

.. autoclass:: Worker
   :members:

.. autoclass:: WorkerGroup
   :members:

Implementations
-------------------

Below are the agent implementations provided by torchelastic.

.. currentmodule:: torch.distributed.elastic.agent.server.local_elastic_agent
.. autoclass:: LocalElasticAgent

Extending the Agent
---------------------

To extend the agent you can implement ``ElasticAgent`` directly; however,
we recommend you extend ``SimpleElasticAgent`` instead, which provides
most of the scaffolding and leaves you with a few specific abstract methods
to implement, as shown in the sketch below.
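A minimal, illustrative skeleton of such a subclass follows. The method names
come from the abstract interface documented on ``SimpleElasticAgent`` below;
``MyElasticAgent`` and all of the bodies are placeholders, not a real
implementation.

.. code-block:: python

    from typing import Any, Dict

    from torch.distributed.elastic.agent.server import (
        SimpleElasticAgent,
        WorkerGroup,
    )
    from torch.distributed.elastic.agent.server.api import RunResult


    class MyElasticAgent(SimpleElasticAgent):
        """Illustrative skeleton only; each body is a placeholder."""

        def _start_workers(self, worker_group: WorkerGroup) -> Dict[int, Any]:
            # launch one process per local worker and return {local_rank: worker_id}
            raise NotImplementedError

        def _stop_workers(self, worker_group: WorkerGroup) -> None:
            # terminate all workers in the given group
            raise NotImplementedError

        def _monitor_workers(self, worker_group: WorkerGroup) -> RunResult:
            # poll worker health and report the group's current run state
            raise NotImplementedError

        def _shutdown(self) -> None:
            # release any resources held by the agent itself
            raise NotImplementedError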
.. currentmodule:: torch.distributed.elastic.agent.server
.. autoclass:: SimpleElasticAgent
   :members:
   :private-members:

.. autoclass:: torch.distributed.elastic.agent.server.api.RunResult
(Binary file not shown.)
@@ -0,0 +1,118 @@
Customization
=============

This section describes how to customize TorchElastic to fit your needs.

Launcher
------------------------

The launcher program that ships with TorchElastic
should be sufficient for most use-cases (see :ref:`launcher-api`).
You can implement a custom launcher by
programmatically creating an agent and passing it specs for your workers as
shown below.

.. code-block:: python

    # my_launcher.py

    if __name__ == "__main__":
        args = parse_args(sys.argv[1:])
        rdzv_handler = RendezvousHandler(...)
        spec = WorkerSpec(
            local_world_size=args.nproc_per_node,
            fn=trainer_entrypoint_fn,
            args=args.fn_args,  # positional args forwarded to trainer_entrypoint_fn
            rdzv_handler=rdzv_handler,
            max_restarts=args.max_restarts,
            monitor_interval=args.monitor_interval,
        )

        agent = LocalElasticAgent(spec, start_method="spawn")
        try:
            run_result = agent.run()
            if run_result.is_failed():
                print(f"worker 0 failed with: {run_result.failures[0]}")
            else:
                print(f"worker 0 return value is: {run_result.return_values[0]}")
        except Exception as ex:
            # handle exception
            raise
Rendezvous Handler
------------------------

To implement your own rendezvous, extend ``torch.distributed.elastic.rendezvous.RendezvousHandler``
and implement its methods.

.. warning:: Rendezvous handlers are tricky to implement. Before you begin,
             make sure you completely understand the properties of rendezvous.
             Please refer to :ref:`rendezvous-api` for more information.
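For orientation, a skeleton handler might look like the sketch below. The
method set follows the ``RendezvousHandler`` interface (see
:ref:`rendezvous-api`); ``MyRendezvousHandler`` and all of the bodies are
placeholders.

.. code-block:: python

    from typing import Tuple

    from torch.distributed import Store
    from torch.distributed.elastic.rendezvous import RendezvousHandler


    class MyRendezvousHandler(RendezvousHandler):
        def get_backend(self) -> str:
            return "my_backend"

        def next_rendezvous(self) -> Tuple[Store, int, int]:
            # block until a new rendezvous completes, then return
            # (store, rank, world_size) for this node
            raise NotImplementedError

        def is_closed(self) -> bool:
            raise NotImplementedError

        def set_closed(self) -> None:
            raise NotImplementedError

        def num_nodes_waiting(self) -> int:
            raise NotImplementedError

        def get_run_id(self) -> str:
            raise NotImplementedError

        def shutdown(self) -> bool:
            # release rendezvous resources; return True on success
            raise NotImplementedError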
Once implemented, you can pass your custom rendezvous handler to the worker
spec when creating the agent.

.. code-block:: python

    spec = WorkerSpec(
        rdzv_handler=MyRendezvousHandler(params),
        ...
    )
    elastic_agent = LocalElasticAgent(spec, start_method=start_method)
    elastic_agent.run(spec.role)
Metric Handler
-----------------------------

TorchElastic emits platform-level metrics (see :ref:`metrics-api`).
By default metrics are emitted to ``/dev/null`` so you will not see them.
To have the metrics pushed to a metric handling service in your infrastructure,
implement a ``torch.distributed.elastic.metrics.MetricHandler`` and ``configure`` it in your
custom launcher.

.. code-block:: python

    # my_launcher.py

    import torch.distributed.elastic.metrics as metrics


    class MyMetricHandler(metrics.MetricHandler):
        def emit(self, metric_data: metrics.MetricData):
            # push metric_data to your metric sink
            ...


    def main():
        metrics.configure(MyMetricHandler())

        spec = WorkerSpec(...)
        agent = LocalElasticAgent(spec)
        agent.run()
Events Handler
-----------------------------

TorchElastic supports events recording (see :ref:`events-api`).
The events module defines an API that allows you to record events and
implement custom event handlers. An event handler is used for publishing events
produced during torchelastic execution to different destinations, e.g. AWS CloudWatch.
By default it uses ``torch.distributed.elastic.events.NullEventHandler``, which ignores
events. To configure a custom events handler, implement the
``torch.distributed.elastic.events.EventHandler`` interface and ``configure`` it
in your custom launcher.

.. code-block:: python

    # my_launcher.py

    import torch.distributed.elastic.events as events


    class MyEventHandler(events.EventHandler):
        def record(self, event: events.Event):
            # process event
            ...


    def main():
        events.configure(MyEventHandler())

        spec = WorkerSpec(...)
        agent = LocalElasticAgent(spec)
        agent.run()
@@ -0,0 +1,17 @@
Error Propagation
==================

.. automodule:: torch.distributed.elastic.multiprocessing.errors

Methods and Classes
---------------------

.. currentmodule:: torch.distributed.elastic.multiprocessing.errors

.. autofunction:: torch.distributed.elastic.multiprocessing.errors.record

.. autoclass:: ChildFailedError

.. autoclass:: ErrorHandler

.. autoclass:: ProcessFailure
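For orientation, ``record`` is typically used to decorate the entrypoint of
your training script so that uncaught exceptions are recorded and propagated
back to the agent. A minimal sketch:

.. code-block:: python

    from torch.distributed.elastic.multiprocessing.errors import record


    @record
    def main():
        # training entrypoint; uncaught exceptions raised here are
        # captured in an error file and surfaced to the agent
        ...


    if __name__ == "__main__":
        main()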
(Binary file not shown.)
@@ -0,0 +1,24 @@
.. _events-api:

Events
============================

.. automodule:: torch.distributed.elastic.events

API Methods
------------

.. autofunction:: torch.distributed.elastic.events.record

.. autofunction:: torch.distributed.elastic.events.get_logging_handler

Event Objects
-----------------

.. currentmodule:: torch.distributed.elastic.events.api

.. autoclass:: torch.distributed.elastic.events.api.Event

.. autoclass:: torch.distributed.elastic.events.api.EventSource

.. autoclass:: torch.distributed.elastic.events.api.EventMetadataValue
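As a quick usage illustration only: the sketch below records a custom event.
The event name and metadata are made up, and the field names are assumptions
based on the ``Event`` class documented above.

.. code-block:: python

    from torch.distributed.elastic.events import Event, EventSource, record

    # construct a custom event and hand it to the configured handler
    record(
        Event(
            name="my_app.checkpoint_saved",  # hypothetical event name
            source=EventSource.WORKER,
            metadata={"epoch": 10},
        )
    )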
@@ -0,0 +1,4 @@
Examples
==========================

Please refer to the `elastic/examples README <https://github.com/pytorch/elastic/tree/master/examples>`_.
@@ -0,0 +1,5 @@
TorchElastic Kubernetes
==========================

Please refer to our GitHub's `Kubernetes README <https://github.com/pytorch/elastic/tree/master/kubernetes>`_
for more information on the Elastic Job Controller and custom resource definition.
@@ -0,0 +1,31 @@
.. _metrics-api:

Metrics
=========

.. automodule:: torch.distributed.elastic.metrics

Metric Handlers
-----------------

.. currentmodule:: torch.distributed.elastic.metrics.api

Below are the metric handlers that come included with torchelastic.

.. autoclass:: MetricHandler

.. autoclass:: ConsoleMetricHandler

.. autoclass:: NullMetricHandler

Methods
------------

.. autofunction:: torch.distributed.elastic.metrics.configure

.. autofunction:: torch.distributed.elastic.metrics.prof

.. autofunction:: torch.distributed.elastic.metrics.put_metric
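For orientation, a sketch of how these methods are typically used in
application code. The function, metric name, and metric group below are made
up for illustration.

.. code-block:: python

    import torch.distributed.elastic.metrics as metrics


    # emit duration/success metrics for this function automatically
    @metrics.prof
    def save_checkpoint():
        ...


    # emit a one-off datapoint under a custom metric group
    metrics.put_metric("checkpoint.size_mb", 42, "my_app")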
@@ -0,0 +1,24 @@
:github_url: https://github.com/pytorch/elastic

Multiprocessing
================

.. automodule:: torch.distributed.elastic.multiprocessing

Starting Multiple Workers
---------------------------

.. autofunction:: torch.distributed.elastic.multiprocessing.start_processes
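A hedged usage sketch: it starts two copies of a trainer function, one per
local rank, and waits for them to finish. The trainer function and argument
values are illustrative; argument names follow the ``start_processes``
signature documented above.

.. code-block:: python

    import tempfile

    from torch.distributed.elastic.multiprocessing import Std, start_processes


    def trainer(a, b):
        return a + b


    # one entry per local rank in both args and envs
    ctx = start_processes(
        name="trainer",
        entrypoint=trainer,
        args={0: (1, 2), 1: (3, 4)},
        envs={0: {}, 1: {}},
        log_dir=tempfile.mkdtemp(),  # illustrative scratch dir for worker logs
        redirects=Std.ALL,           # capture stdout/stderr of each worker
    )
    result = ctx.wait()  # block until all processes exit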
Process Context
----------------

.. currentmodule:: torch.distributed.elastic.multiprocessing.api

.. autoclass:: PContext

.. autoclass:: MultiprocessContext

.. autoclass:: SubprocessContext

.. autoclass:: RunProcsResult
@@ -0,0 +1,50 @@
Quickstart
===========

.. code-block:: bash

    pip install torch

    # start a single-node etcd server on ONE host
    etcd --enable-v2 \
         --listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001 \
         --advertise-client-urls PUBLIC_HOSTNAME:2379

To launch a **fault-tolerant** job, run the following on all nodes.

.. code-block:: bash

    python -m torch.distributed.run \
           --nnodes=NUM_NODES \
           --nproc_per_node=TRAINERS_PER_NODE \
           --rdzv_id=JOB_ID \
           --rdzv_backend=etcd \
           --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
           YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

To launch an **elastic** job, run the following on at least ``MIN_SIZE`` nodes
and at most ``MAX_SIZE`` nodes.

.. code-block:: bash

    python -m torch.distributed.run \
           --nnodes=MIN_SIZE:MAX_SIZE \
           --nproc_per_node=TRAINERS_PER_NODE \
           --rdzv_id=JOB_ID \
           --rdzv_backend=etcd \
           --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
           YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

.. note:: The ``--standalone`` option can be passed to launch a single-node job
          with a sidecar rendezvous server. You don't have to pass
          ``--rdzv_id``, ``--rdzv_endpoint``, and ``--rdzv_backend`` when the
          ``--standalone`` option is used.

.. note:: Learn more about writing your distributed training script
          `here <train_script.html>`_.

If ``torch.distributed.run`` does not meet your requirements,
you may use our APIs directly for more powerful customization. Start by
taking a look at the `elastic agent <agent.html>`_ API.