
docs: better docs for profiling in determined [MD-252] #9011

Merged
merged 5 commits
Mar 19, 2024
108 changes: 49 additions & 59 deletions docs/get-started/architecture/introduction.rst
@@ -183,65 +183,55 @@ practitioners might find inconvenient to implement. The Determined cohesive, end-to-end
platform provides best-in-class functionality for deep learning model training, including the
following benefits:

+------------------------------------------------+-----------------------------------------------------------+
| Implementation | Benefit |
+================================================+===========================================================+
| Automated model tuning | Optimize models by searching through conventional |
| | hyperparameters or macro- architectures, using a variety |
| | of search algorithms. Hyperparameter searches are |
| | automatically parallelized across the accelerators in the |
| | cluster. See :ref:`hyperparameter-tuning`. |
+------------------------------------------------+-----------------------------------------------------------+
| Cluster-backed notebooks, commands, and shells | Leverage your shared cluster computing devices in a more |
| | versatile environment. See :ref:`notebooks` and |
| | :ref:`commands-and-shells`. |
+------------------------------------------------+-----------------------------------------------------------+
| Cluster management | Automatically manage ML accelerators, such as GPUs, |
| | on-premise or in cloud VMs using your own environment, |
| | automatically scaling for your on-demand workloads. |
| | Determined runs in either AWS or GCP, so you can switch |
| | easily according to your requirements. See :ref:`Resource |
| | Pools <resource-pools>`, :ref:`Scheduling <scheduling>`, |
| | and :ref:`Elastic Infrastructure |
| | <elastic-infrastructure>`. |
+------------------------------------------------+-----------------------------------------------------------+
| Containerization | Develop and train models in customizable containers that |
| | enable simple, consistent dependency management |
| | throughout the model development lifecycle. See |
| | :ref:`custom-env`. |
+------------------------------------------------+-----------------------------------------------------------+
| Distributed training | Easily distribute a single training job across multiple |
| | accelerators to speed up model training and reduce model |
| | development iteration time. Determined uses synchronous, |
| | data-parallel distributed training, with key performance |
| | optimizations over other available options. See |
| | :ref:`multi-gpu-training-concept`. |
+------------------------------------------------+-----------------------------------------------------------+
| Experiment collaboration | Automatically track your experiment configuration and |
| | environment to facilitate reproducibility and |
| | collaboration among teams. See :ref:`experiments`. |
+------------------------------------------------+-----------------------------------------------------------+
| Fault tolerance | Models are checkpointed throughout the training process |
| | and can be restarted from the latest checkpoint, |
| | automatically. This enables training jobs to |
| | automatically tolerate transient hardware or system |
| | issues in the cluster. |
+------------------------------------------------+-----------------------------------------------------------+
| Framework support | Broad framework support leverages these capabilities |
| | using any of the leading machine learning frameworks |
| | without needing to manage a different cluster for each. |
| | Different frameworks for different models can be used |
| | without risking future lock-in. See |
| | :ref:`apis-howto-overview`. |
+------------------------------------------------+-----------------------------------------------------------+
| Profiling | Out-of-the-box system metrics (measurements of hardware |
| | usage) and timings (durations of actions taken during |
| | training, such as data loading). |
+------------------------------------------------+-----------------------------------------------------------+
| Visualization                                  | Visualize your model and training procedure by using the  |
| | built-in WebUI and by launching managed |
| | :ref:`tensorboards` instances. |
+------------------------------------------------+-----------------------------------------------------------+
+------------------------------------------------+------------------------------------------------------------------------------+
| Implementation | Benefit |
+================================================+==============================================================================+
| Automated model tuning | Optimize models by searching through conventional hyperparameters or macro- |
| | architectures, using a variety of search algorithms. Hyperparameter searches |
| | are automatically parallelized across the accelerators in the cluster. See |
| | :ref:`hyperparameter-tuning`. |
+------------------------------------------------+------------------------------------------------------------------------------+
| Cluster-backed notebooks, commands, and shells | Leverage your shared cluster computing devices in a more versatile |
| | environment. See :ref:`notebooks` and :ref:`commands-and-shells`. |
+------------------------------------------------+------------------------------------------------------------------------------+
| Cluster management | Automatically manage ML accelerators, such as GPUs, on-premise or in cloud |
| | VMs using your own environment, automatically scaling for your on-demand |
| | workloads. Determined runs in either AWS or GCP, so you can switch easily |
| | according to your requirements. See :ref:`Resource Pools <resource-pools>`, |
| | :ref:`Scheduling <scheduling>`, and :ref:`Elastic Infrastructure |
| | <elastic-infrastructure>`. |
+------------------------------------------------+------------------------------------------------------------------------------+
| Containerization | Develop and train models in customizable containers that enable simple, |
| | consistent dependency management throughout the model development lifecycle. |
| | See :ref:`custom-env`. |
+------------------------------------------------+------------------------------------------------------------------------------+
| Distributed training | Easily distribute a single training job across multiple accelerators to |
| | speed up model training and reduce model development iteration time. |
| | Determined uses synchronous, data-parallel distributed training, with key |
| | performance optimizations over other available options. See |
| | :ref:`multi-gpu-training-concept`. |
+------------------------------------------------+------------------------------------------------------------------------------+
| Experiment collaboration | Automatically track your experiment configuration and environment to |
| | facilitate reproducibility and collaboration among teams. See |
| | :ref:`experiments`. |
+------------------------------------------------+------------------------------------------------------------------------------+
| Fault tolerance | Models are checkpointed throughout the training process and can be restarted |
| | from the latest checkpoint, automatically. This enables training jobs to |
| | automatically tolerate transient hardware or system issues in the cluster. |
+------------------------------------------------+------------------------------------------------------------------------------+
| Framework support | Broad framework support leverages these capabilities using any of the |
| | leading machine learning frameworks without needing to manage a different |
| | cluster for each. Different frameworks for different models can be used |
| | without risking future lock-in. See :ref:`apis-howto-overview`. |
+------------------------------------------------+------------------------------------------------------------------------------+
| Profiling | Optimize your model's performance and diagnose bottlenecks with |
|                                                | comprehensive profiling support across different deployment layers, from     |
|                                                | out-of-the-box system metrics tracking to seamless integrations with native  |
|                                                | training profilers.                                                          |
+------------------------------------------------+------------------------------------------------------------------------------+
| Visualization                                  | Visualize your model and training procedure by using the built-in WebUI and  |
| | by launching managed :ref:`tensorboards` instances. |
+------------------------------------------------+------------------------------------------------------------------------------+

**********
Concepts
50 changes: 47 additions & 3 deletions docs/model-dev-guide/api-guides/apis-howto/api-core-ug.rst
@@ -508,13 +508,57 @@ In the Determined WebUI, go to the **Cluster** pane.
You should be able to see multiple slots active, corresponding to the value you set for
``slots_per_trial`` in ``distributed.yaml``, as well as logs appearing from multiple ranks.

.. _core-profiling:

***********
Profiling
***********

Profiling with native profilers can be configured as usual. If running on a Determined cluster, the
profiling log output path can be configured for automatic upload to the Determined TensorBoard UI.
There are two ways to profile the performance of your training job:

#. Core API's built-in system metrics profiler

#. Integration with profilers native to your training framework, such as the TensorFlow and PyTorch
   profilers
.. _core-profiler:

Core API Profiler
=================

The Core API includes a profiling feature that monitors and records system metrics during the
training run. These metrics are recorded at specified intervals and sent to the master, allowing you
to view them in the "Profiling" tab of your experiment in the WebUI.
Use ``core_context.profiler`` to interact with the Core API profiler. It can be toggled on or off by
calling ``core_context.profiler.on()`` and ``core_context.profiler.off()``.

The following code snippet demonstrates how to enable profiling for only a portion of your training
code, but the profiler can be turned on and off at any point within the ``core.Context``.

.. code:: python

   import determined as det


   with det.core.init() as core_context:
       ...
       for batch_idx in range(1, 10):
           # In this example we only profile batches 1 through 4.
           if batch_idx == 1:
               core_context.profiler.on()
           if batch_idx == 5:
               core_context.profiler.off()
           train_batch(...)

.. _core-native-profilers:

Native Profilers
================

Profiling with native profilers such as the PyTorch profiler and TensorFlow profiler can be
configured as usual. If running on a Determined cluster, the profiling log output path can be
configured for automatic upload to the Determined TensorBoard UI.
The following snippet initializes the PyTorch Profiler. It will profile GPU and CPU activities,
skipping batch 1, warming up on batch 2, profiling batches 3 and 4, then repeating the cycle. Result
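The PyTorch Profiler snippet itself is collapsed in this diff view. As an illustration of the
schedule the paragraph describes (``wait=1, warmup=1, active=2`` in ``torch.profiler`` terms), the
following standalone sketch computes which phase each batch falls into; the function name and
defaults are hypothetical, not taken from the PR:

```python
def profiler_phase(batch_idx: int, wait: int = 1, warmup: int = 1, active: int = 2) -> str:
    """Phase of a 1-indexed batch under a repeating wait/warmup/active
    schedule, mirroring torch.profiler.schedule(wait=1, warmup=1, active=2)."""
    cycle = wait + warmup + active      # 4 batches per cycle with these defaults
    pos = (batch_idx - 1) % cycle       # 0-based position within the current cycle
    if pos < wait:
        return "wait"                   # batch 1: skipped
    if pos < wait + warmup:
        return "warmup"                 # batch 2: warming up
    return "active"                     # batches 3 and 4: profiled

# Batches 1-8 cover two full cycles of the schedule.
phases = [profiler_phase(i) for i in range(1, 9)]
print(phases)
# → ['wait', 'warmup', 'active', 'active', 'wait', 'warmup', 'active', 'active']
```

This mirrors how the real profiler skips batch 1, warms up on batch 2, records batches 3 and 4,
then repeats.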
2 changes: 2 additions & 0 deletions docs/model-dev-guide/api-guides/apis-howto/api-keras-ug.rst
@@ -95,6 +95,8 @@ implement the :class:`determined.keras.callbacks.Callback` interface (an extensi
:class:`~determined.keras.TFKerasTrial` by implementing
:meth:`~determined.keras.TFKerasTrial.keras_callbacks`.

.. _keras-profiler:

***********
Profiling
***********
2 changes: 2 additions & 0 deletions docs/model-dev-guide/api-guides/apis-howto/api-pytorch-ug.rst
@@ -515,6 +515,8 @@ you find that the built-in context.DataLoader() does not support your use case.

See the :mod:`determined.pytorch.samplers` for details.

.. _pytorch_profiler:

Profiling
---------

@@ -303,6 +303,8 @@ interleaving micro batches:
loss = self.model_engine.eval_batch()
return {"loss": loss}

.. _deepspeed-profiler:

***********
Profiling
***********
3 changes: 0 additions & 3 deletions docs/model-dev-guide/dtrain/_index.rst
@@ -33,8 +33,6 @@ Additional Resources:
- Learn how :ref:`Configuration Templates <config-template>` can help reduce redundancy.
- Discover how Determined aims to support reproducible machine learning experiments in
:ref:`Reproducibility <reproducibility>`.
- In :ref:`Optimizing Training <optimizing-training>`, you'll learn about out-of-the box tools you
can use for instrumenting training.

.. toctree::
:caption: Distributed Training
@@ -44,4 +42,3 @@
Implementing Distributed Training <dtrain-implement>
Configuration Templates <config-templates>
Reproducibility <reproducibility>
Optimizing Training <optimize-training>