Deploying to gh-pages from @ aa995a2 🚀

Lightning-Universe · Jun 1, 2024 · commit cb7062f

Showing 102 changed files with 8,650 additions and 0 deletions.
diff --git a/docs/.buildinfo b/docs/.buildinfo
@@ -0,0 +1,4 @@
+# Sphinx build info version 1
+# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: b1d4f130b437b34669c086e08878ee12
+tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/docs/.nojekyll b/docs/.nojekyll
diff --git a/docs/_sources/advanced.rst.txt b/docs/_sources/advanced.rst.txt
@@ -0,0 +1,97 @@
+.. _hivemind_intermediate:
+
+Training on unreliable mixed GPUs across the internet (Advanced)
+================================================================
+
+Reducing Communication By Overlapping Communication
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We can reduce the impact of communication across all machines by overlapping communication with our training iterations. In short, we enable communication to happen
+in the background of training.
+
+Overlap Gradient and State Averaging
+""""""""""""""""""""""""""""""""""""
+
+When the target batch size is reached, all processes participating in the step exchange gradients and model states. Enabling the flags below on
+the strategy moves this communication into the background. Training then continues with slightly outdated weights, which lets us
+overlap communication with computation.
+
+.. warning::
+    Enabling overlapping communication means convergence will be slightly affected.
+
+.. note::
+    Enabling these flags means that you must pass in a ``scheduler_fn`` to the ``HivemindStrategy`` instead of relying on a scheduler from ``configure_optimizers``.
+    The optimizer is re-created by Hivemind, and as a result, the scheduler has to be re-created.
+
+.. code-block:: python
+
+    import torch
+    from functools import partial
+    from pytorch_lightning import Trainer
+    from lightning_hivemind.strategy import HivemindStrategy
+
+    trainer = Trainer(
+        strategy=HivemindStrategy(
+            target_batch_size=8192,
+            delay_state_averaging=True,
+            delay_grad_averaging=True,
+            delay_optimizer_step=True,
+            offload_optimizer=True,  # required to delay averaging
+            scheduler_fn=partial(torch.optim.lr_scheduler.ExponentialLR, gamma=...),
+        ),
+        accelerator="gpu",
+        devices=1,
+    )
+
+
+Reducing GPU Memory requirements by re-using buffers & CPU offloading
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We can also offload the optimizer state to the CPU whilst re-using gradient buffers to reduce the memory requirement for machines.
+
+Offloading Optimizer State to the CPU
+"""""""""""""""""""""""""""""""""""""
+
+Offloading the optimizer state to the CPU works the same way as DeepSpeed ZeRO Stage 2 Offload: we save GPU memory by keeping all optimizer states on the CPU.
+
+.. note::
+    Enabling these flags means that you must pass in a ``scheduler_fn`` to the ``HivemindStrategy`` instead of relying on a scheduler from ``configure_optimizers``.
+    The optimizer is re-created by Hivemind, and as a result, the scheduler has to be re-created.
+
+    We suggest enabling offloading and overlapping communication to hide the additional overhead from having to communicate with the CPU.
+
+.. code-block:: python
+
+    import torch
+    from functools import partial
+    from pytorch_lightning import Trainer
+    from lightning_hivemind.strategy import HivemindStrategy
+
+    trainer = Trainer(
+        strategy=HivemindStrategy(
+            target_batch_size=8192,
+            offload_optimizer=True,
+            scheduler_fn=partial(torch.optim.lr_scheduler.ExponentialLR, gamma=...),
+        ),
+        accelerator="gpu",
+        devices=1,
+    )
+
+
+Re-using Gradient Buffers
+"""""""""""""""""""""""""
+
+By default, Hivemind accumulates gradients in a separate buffer. This means additional GPU memory is required to store gradients. You can enable re-using the model parameter gradient buffers by passing ``reuse_grad_buffers=True`` to the ``HivemindStrategy``.
+
+.. warning::
+    The ``HivemindStrategy`` will override ``zero_grad`` in your ``LightningModule`` to have no effect. This is because gradients are accumulated in the model
+    and Hivemind manages when they need to be cleared.
+
+.. code-block:: python
+
+    from pytorch_lightning import Trainer
+    from lightning_hivemind.strategy import HivemindStrategy
+
+    trainer = Trainer(
+        strategy=HivemindStrategy(target_batch_size=8192, reuse_grad_buffers=True), accelerator="gpu", devices=1
+    )
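
To get a feel for what ``reuse_grad_buffers=True`` in advanced.rst can save, the sketch below estimates the size of one extra gradient-sized buffer. It assumes the separate accumulator holds one element per model parameter in the same dtype as the gradients. The helper name ``extra_grad_buffer_gib`` and the parameter counts are hypothetical and not part of ``lightning_hivemind``.

```python
def extra_grad_buffer_gib(num_params: int, bytes_per_element: int = 4) -> float:
    """Estimate the size in GiB of one additional gradient-sized buffer.

    Assumption: one element per parameter, float32 (4 bytes) by default.
    """
    return num_params * bytes_per_element / 2**30


# Hypothetical model sizes, for illustration only.
for name, num_params in [("125M", 125_000_000), ("1.3B", 1_300_000_000)]:
    size = extra_grad_buffer_gib(num_params)
    print(f"{name} parameters: ~{size:.2f} GiB extra without buffer reuse")
```

The estimate only covers the accumulation buffer itself; activations, optimizer state, and framework overhead are outside its scope.
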
diff --git a/docs/_sources/expert.rst.txt b/docs/_sources/expert.rst.txt
@@ -0,0 +1,85 @@
+.. _hivemind_expert:
+
+Training on unreliable mixed GPUs across the internet (Expert)
+==============================================================
+
+Using Compression to Optimize Communications
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Below are some ways to reduce communication when training collaboratively. As the size of your model increases, bottlenecks in communication become more apparent.
+
+Compress Gradients & State
+""""""""""""""""""""""""""
+
+Hivemind allows you to compress gradients and states before sending them to other machines. This helps reduce the communication overhead substantially when training across the internet.
+
+Below, we enable Float16 compression, which compresses gradients and states to Float16 before sending them to other machines.
+
+.. note::
+    Compressing gradients can affect convergence if you're lowering the precision (e.g., training in Float32 but compressing gradients to Float16).
+
+.. code-block:: python
+
+    from hivemind import Float16Compression
+    from pytorch_lightning import Trainer
+    from lightning_hivemind.strategy import HivemindStrategy
+
+    trainer = Trainer(
+        strategy=HivemindStrategy(
+            target_batch_size=8192,
+            grad_compression=Float16Compression(),
+            state_averaging_compression=Float16Compression(),
+        ),
+        accelerator="gpu",
+        devices=1,
+    )
+
+A slightly more advanced scheme is dynamic compression based on tensor size. Below, we enable 8-bit quantization for large tensors and Float16 compression for small ones, reducing communication bottlenecks even further.
+
+Size-adaptive compression has been used successfully in a variety of Hivemind applications, but 8-bit quantization discards more precision than Float16 compression does.
+
+.. code-block:: python
+
+    from hivemind import Float16Compression, SizeAdaptiveCompression, Uniform8BitQuantization
+    from pytorch_lightning import Trainer
+    from lightning_hivemind.strategy import HivemindStrategy
+
+    # tensors with at least `threshold` elements use 8-bit quantization, smaller ones use Float16
+    compression = SizeAdaptiveCompression(
+        threshold=2 ** 16 + 1, less=Float16Compression(), greater_equal=Uniform8BitQuantization()
+    )
+    trainer = Trainer(
+        strategy=HivemindStrategy(
+            target_batch_size=8192,
+            grad_compression=compression,
+            state_averaging_compression=compression,
+        ),
+        accelerator="gpu",
+        devices=1,
+    )
+
+
+PowerSGD
+""""""""
+
+`PowerSGD <https://arxiv.org/abs/1905.13727>`_ is a technique to reduce distributed communication of gradients across processes.
+In short, PowerSGD uses a low-rank approximation to compress gradients before running an ``all-reduce`` step to sync gradients across all processes.
+
+.. note::
+    Though PowerSGD can impact convergence, it can also substantially reduce communication between processes.
+
+.. code-block:: python
+
+    from pytorch_lightning import Trainer
+    from lightning_hivemind.strategy import HivemindStrategy
+    from functools import partial
+    from hivemind.optim.power_sgd_averager import PowerSGDGradientAverager
+
+    trainer = Trainer(
+        strategy=HivemindStrategy(
+            target_batch_size=8192,
+            grad_averager_factory=partial(PowerSGDGradientAverager, averager_rank=32, min_compression_ratio=0.5),
+        ),
+        accelerator="gpu",
+        devices=1,
+    )
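
The saving from the low-rank approximation described in the PowerSGD section can be estimated with simple arithmetic: an ``m x n`` gradient matrix is approximated by two factors of shapes ``m x r`` and ``n x r``, so roughly ``r * (m + n)`` values are exchanged instead of ``m * n``. The sketch below computes that saving. The helper names and layer shapes are hypothetical, and treating the threshold as a minimum fraction of values saved is an assumption about how an option such as ``min_compression_ratio`` could be applied, not a description of Hivemind's internals.

```python
def low_rank_values(rows: int, cols: int, rank: int) -> int:
    """Number of values in the two rank-`rank` factors of a rows x cols matrix."""
    return rank * (rows + cols)


def fraction_saved(rows: int, cols: int, rank: int) -> float:
    """Fraction of values that no longer need to be sent, relative to the full matrix."""
    return 1.0 - low_rank_values(rows, cols, rank) / (rows * cols)


def worth_compressing(rows: int, cols: int, rank: int, min_fraction_saved: float = 0.5) -> bool:
    """Assumed rule: compress only if at least `min_fraction_saved` of the values are saved."""
    return fraction_saved(rows, cols, rank) >= min_fraction_saved


# Hypothetical layer shapes, for illustration only.
for rows, cols in [(4096, 4096), (64, 64)]:
    saved = fraction_saved(rows, cols, rank=32)
    decision = "compress" if worth_compressing(rows, cols, rank=32) else "send as-is"
    print(f"{rows}x{cols}: {saved:.1%} of values saved -> {decision}")
```

Large matrices benefit the most, while for small ones the two factors can be as large as the matrix itself, which is why a minimum-saving threshold is useful.
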
diff --git a/docs/_sources/index.rst.txt b/docs/_sources/index.rst.txt
@@ -0,0 +1,15 @@
+.. Lightning-AI-Sandbox documentation master file, created by
+   sphinx-quickstart on Wed Mar 25 21:34:07 2020.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+.. include:: readme.rst
+
+
+.. toctree::
+   :maxdepth: 1
+   :name: start
+   :caption: Start here
+
+   advanced
+   expert
diff --git a/docs/_sources/overview.rst.txt b/docs/_sources/overview.rst.txt
@@ -0,0 +1,62 @@
+:orphan:
+
+################################################################
+Hivemind - training on unreliable mixed GPUs across the internet
+################################################################
+
+Collaborative Training tries to solve the need for top-tier multi-GPU servers by allowing you to train across unreliable machines,
+such as local machines or even preemptible cloud compute across the internet.
+
+Under the hood, we use `Hivemind <https://github.com/learning-at-home/hivemind>`__, which provides decentralized training across the internet.
+
+.. warning::  This is an :ref:`experimental <versioning:Experimental API>` feature.
+
+
+To use Collaborative Training, you first need to install this extension.
+
+.. code-block:: bash
+
+    pip install lightning-hivemind
+
+This installs both the `Hivemind <https://pypi.org/project/hivemind/>`__ package and the ``HivemindStrategy`` for the Lightning Trainer.
+
+Reducing Communication By Overlapping Communication
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We can reduce the impact of communication across all machines by overlapping communication with our training iterations. In short, we enable communication to happen in the background of training.
+
+Overlap Gradient and State Averaging
+""""""""""""""""""""""""""""""""""""
+
+When the target batch size is reached, all processes participating in the step exchange gradients and model states. Enabling the flags below on
+the strategy moves this communication into the background. Training then continues with slightly outdated weights, which lets us
+overlap communication with computation.
+
+.. warning::
+    Enabling overlapping communication means convergence will be slightly affected.
+
+.. note::
+    Enabling these flags means that you must pass in a ``scheduler_fn`` to the ``HivemindStrategy`` instead of relying on a scheduler from ``configure_optimizers``.
+    The optimizer is re-created by Hivemind, and as a result, the scheduler has to be re-created.
+
+.. code-block:: python
+
+    import torch
+    from functools import partial
+    from lightning import Trainer
+    from lightning_hivemind.strategy import HivemindStrategy
+
+    trainer = Trainer(
+        strategy=HivemindStrategy(
+            target_batch_size=8192,
+            delay_state_averaging=True,
+            delay_grad_averaging=True,
+            delay_optimizer_step=True,
+            offload_optimizer=True,  # required to delay averaging
+            scheduler_fn=partial(torch.optim.lr_scheduler.ExponentialLR, gamma=...),
+        ),
+        accelerator="gpu",
+        devices=1,
+    )
+
+For more information on the strategy capabilities, see the `lightning-hivemind <https://github.com/Lightning-Universe/lightning-hivemind>`__ repo.
diff --git a/docs/_sources/readme.rst.txt b/docs/_sources/readme.rst.txt
@@ -0,0 +1,47 @@
+Lightning + Hivemind
+====================
+
+Collaborative Training tries to solve the need for top-tier multi-GPU
+servers by allowing you to train across unreliable machines, such as
+local machines or even preemptible cloud computing across the internet.
+
+Under the hood, we use
+`Hivemind <https://github.com/learning-at-home/hivemind>`__, which
+provides decentralized training across the internet.
+
+To use Collaborative Training, you first need to install this extension.
+
+.. code:: bash
+
+   pip install -U lightning-hivemind
+
+The ``HivemindStrategy`` accumulates gradients from all collaborating
+processes until they reach a ``target_batch_size``. By default, we use
+the batch size of the first batch to determine what each local machine
+batch contributes towards the ``target_batch_size``. Once the
+``target_batch_size`` is reached, an optimizer step is made on all
+processes.
+
+When using ``HivemindStrategy``, note that you cannot use gradient
+accumulation (``accumulate_grad_batches``). This is because Hivemind
+manages accumulation internally.
+
+.. code:: py
+
+   from lightning import Trainer
+   from lightning_hivemind.strategy import HivemindStrategy
+
+   trainer = Trainer(strategy=HivemindStrategy(target_batch_size=8192), accelerator="gpu", devices=1)
+
+Followed by:
+
+.. code:: bash
+
+   python train.py
+   # Other machines can connect by running the same command:
+   # INITIAL_PEERS=... python train.py
+   # or by passing the peers to the strategy:
+   # HivemindStrategy(initial_peers=...)
+
+A helper message is printed once your training begins, showing you how
+to train on other machines using the same code.
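
The ``target_batch_size`` accumulation described in readme.rst can be illustrated with a small, library-free simulation. It assumes every peer contributes a fixed local batch per round and that the global optimizer step fires as soon as the running total reaches ``target_batch_size``. The function name and the peer batch sizes are hypothetical.

```python
from typing import Sequence


def rounds_until_step(target_batch_size: int, local_batch_sizes: Sequence[int]) -> int:
    """Count how many rounds of local batches are needed before a global step.

    Assumption: in each round every peer contributes exactly one local batch.
    """
    if target_batch_size <= 0:
        raise ValueError("target_batch_size must be positive")
    if not local_batch_sizes or any(size <= 0 for size in local_batch_sizes):
        raise ValueError("every peer needs a positive local batch size")

    accumulated = 0
    rounds = 0
    while accumulated < target_batch_size:
        accumulated += sum(local_batch_sizes)  # all peers add their samples
        rounds += 1
    return rounds


# Four identical peers versus three mixed peers (hypothetical batch sizes).
print(rounds_until_step(8192, [32, 32, 32, 32]))
print(rounds_until_step(8192, [64, 32, 16]))
```

In a real run peers join, leave, and progress at different speeds, so the number of local batches per peer varies; the sketch only shows how contributions add up towards the target.
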
diff --git a/docs/_sources/versioning.rst.txt b/docs/_sources/versioning.rst.txt
@@ -0,0 +1,26 @@
+:orphan:
+
+.. warning::
+
+   This file is a placeholder for the real versioning policy page in the main
+   documentation.  It is here so we can cross link to it from the changelog.
+
+.. _versioning:
+
+Versioning Policy
+#################
+
+API Stability
+*************
+
+Stable API
+----------
+
+Experimental API
+----------------
+
+API Evolution
+*************
+
+Compatibility matrix
+********************