feat: profiling v2 [MD-27] (#9032)
- Persist Determined profiler's system metrics in generic metrics
  - ``metrics`` schema changes
  - shim existing trial profiler APIs to fetch from new schema
- Drop support for timing metrics
- Profiler in Core API
  - Rewrite `ProfilerAgent` into `core.profiler`
  - Use `core.profiler` in Trial APIs
- Better docs for profiling
azhou-determined authored Mar 26, 2024
1 parent 133d127 commit 74fe16b
Showing 72 changed files with 2,257 additions and 2,230 deletions.
3 changes: 2 additions & 1 deletion docs/.redirects/redirects.json
@@ -1,6 +1,7 @@
{
"reference/python-sdk": "python-sdk/python-sdk.html",
"reference/training/experiment-config-reference": "../experiment-config-reference.html",
"model-dev-guide/dtrain/optimize-training": "../profiling.html",
"model-dev-guide/submit-experiment": "create-experiment.html",
"setup-cluster/internet-access": "checklists/adv-requirements.html",
"setup-cluster/deploy-cluster/internet-access": "../checklists/adv-requirements.html",
@@ -463,4 +464,4 @@
"tutorials/porting-tutorial": "pytorch-mnist-tutorial.html",
"tutorials/quick-start": "quickstart-mdldev.html",
"tutorials/pytorch-mnist-local-qs": "../get-started/webui-qs.html"
}
}
9 changes: 6 additions & 3 deletions docs/get-started/architecture/introduction.rst
@@ -234,9 +234,12 @@ following benefits:
| | without risking future lock-in. See |
| | :ref:`apis-howto-overview`. |
+------------------------------------------------+-----------------------------------------------------------+
| Profiling | Out-of-the-box system metrics (measurements of hardware |
| | usage) and timings (durations of actions taken during |
| | training, such as data loading). |
| Profiling | Optimize your model's performance and diagnose |
| | bottlenecks with comprehensive profiling support across |
| | different layers of your deployment, from out-of-the-box |
| | system metrics tracking and seamless integrations with |
| | native training profilers to Prometheus/Grafana support. |
| | See :ref:`profiling`. |
+------------------------------------------------+-----------------------------------------------------------+
| Visualization | Visualize your model and training procedure by using The |
| | built-in WebUI and by launching managed |
57 changes: 53 additions & 4 deletions docs/model-dev-guide/api-guides/apis-howto/api-core-ug.rst
@@ -508,18 +508,67 @@ In the Determined WebUI, go to the **Cluster** pane.
You should be able to see multiple active slots corresponding to the value you set for
``slots_per_trial`` in ``distributed.yaml``, as well as logs appearing from multiple ranks.

.. _core-profiling:

***********
Profiling
***********

Profiling with native profilers such as can be configured as usual. If running on a Determined
cluster, the profiling log output path can be configured for automatic upload to the Determined
Tensorboard UI.
There are two ways to profile the performance of your training job:

#. Core API's built-in system metrics profiler

#. Integration with profilers native to your training framework, such as the TensorFlow and PyTorch
profilers

.. _core-profiler:

Core API Profiler
=================

The Core API includes a profiling feature that monitors and records system metrics during the
training run. These metrics are recorded at specified intervals and sent to the master, allowing you
to view them in the "Profiling" tab of your experiment in the WebUI.

Use :class:`~determined.core.ProfilerContext` to interact with the Core API profiler. It can be
toggled on or off by calling :meth:`~determined.core.ProfilerContext.on` and
:meth:`~determined.core.ProfilerContext.off`. :meth:`~determined.core.ProfilerContext.on` accepts
optional parameters that configure the rate (in seconds) at which system metrics are sampled
(``sampling_interval``) and the number of samples to average before reporting
(``samples_per_report``). By default, the profiler samples every 1 second and reports the aggregate
of every 10 samples.

The following code snippet demonstrates how to enable profiling for only a portion of your training
code, but the profiler can be turned on and off at any point within the ``core.Context``.

.. code:: python

   import determined as det

   with det.core.init() as core_context:
       ...
       for batch_idx in range(1, 10):
           # In this example we only want to profile the first few batches.
           if batch_idx == 1:
               core_context.profiler.on()
           if batch_idx == 5:
               core_context.profiler.off()
           train_batch(...)
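To adjust how metrics are collected, pass the settings described above when turning the profiler on
(the values below are illustrative, not the defaults):

.. code:: python

   # Sample system metrics every 5 seconds and report the average of
   # every 2 samples.
   core_context.profiler.on(sampling_interval=5, samples_per_report=2)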

.. _core-native-profilers:

Native Profilers
================

Profiling with native profilers such as PyTorch profiler and TensorFlow profiler can be configured
as usual. If running on a Determined cluster, the profiling log output path can be configured for
automatic upload to the Determined TensorBoard UI.

The following snippet initializes the PyTorch Profiler. It will profile GPU and CPU activities,
skipping batch 1, warming up on batch 2, profiling batches 3 and 4, then repeating the cycle. Result
files will be uploaded to the experiment's TensorBoard path and can be viewed under the "PyTorch
Profiler" tab in the Determined Tensorboard UI.
Profiler" tab in the Determined TensorBoard UI.

See `PyTorch Profiler <https://github.com/pytorch/kineto/tree/main/tb_plugin>`_ documentation for
details.
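A minimal sketch of that schedule using the standard ``torch.profiler`` API (the output directory
and the ``batches``/``train_batch`` names below are placeholders, not the exact code from the
snippet referenced above):

.. code:: python

   import torch

   # Skip batch 1 (wait), warm up on batch 2, profile batches 3 and 4,
   # then repeat the cycle (repeat=0 keeps cycling until training ends).
   profiler = torch.profiler.profile(
       activities=[
           torch.profiler.ProfilerActivity.CPU,
           torch.profiler.ProfilerActivity.CUDA,
       ],
       schedule=torch.profiler.schedule(wait=1, warmup=1, active=2, repeat=0),
       # Placeholder path; on a Determined cluster, point this at the
       # experiment's TensorBoard log directory so results are uploaded.
       on_trace_ready=torch.profiler.tensorboard_trace_handler("/tmp/tb_profiler"),
   )

   with profiler:
       for batch in batches:  # placeholder training loop
           train_batch(batch)  # placeholder training step
           profiler.step()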
2 changes: 2 additions & 0 deletions docs/model-dev-guide/api-guides/apis-howto/api-keras-ug.rst
@@ -95,6 +95,8 @@ implement the :class:`determined.keras.callbacks.Callback` interface (an extensi
:class:`~determined.keras.TFKerasTrial` by implementing
:meth:`~determined.keras.TFKerasTrial.keras_callbacks`.

.. _keras-profiler:

***********
Profiling
***********
2 changes: 2 additions & 0 deletions docs/model-dev-guide/api-guides/apis-howto/api-pytorch-ug.rst
@@ -515,6 +515,8 @@ you find that the built-in context.DataLoader() does not support your use case.
See the :mod:`determined.pytorch.samplers` for details.

.. _pytorch_profiler:

Profiling
---------

@@ -303,6 +303,8 @@ interleaving micro batches:
loss = self.model_engine.eval_batch()
return {"loss": loss}
.. _deepspeed-profiler:

***********
Profiling
***********
3 changes: 0 additions & 3 deletions docs/model-dev-guide/dtrain/_index.rst
@@ -33,8 +33,6 @@ Additional Resources:
- Learn how :ref:`Configuration Templates <config-template>` can help reduce redundancy.
- Discover how Determined aims to support reproducible machine learning experiments in
:ref:`Reproducibility <reproducibility>`.
- In :ref:`Optimizing Training <optimizing-training>`, you'll learn about out-of-the box tools you
can use for instrumenting training.

.. toctree::
:caption: Distributed Training
@@ -44,4 +42,3 @@ Additional Resources:
Implementing Distributed Training <dtrain-implement>
Configuration Templates <config-templates>
Reproducibility <reproducibility>
Optimizing Training <optimize-training>
86 changes: 0 additions & 86 deletions docs/model-dev-guide/dtrain/optimize-training.rst

This file was deleted.

107 changes: 107 additions & 0 deletions docs/model-dev-guide/profiling.rst
@@ -0,0 +1,107 @@
.. _profiling:

###########
Profiling
###########

Optimizing a model's performance is often a trade-off between accuracy, time, and resource
requirements. Training deep learning models is a time- and resource-intensive process, where each
iteration can take several hours and accumulate heavy hardware costs. Though sometimes this cost is
inherent to the task, unnecessary resource consumption can be caused by suboptimal code or bugs.
Thus, achieving optimal model performance requires an understanding of how your model interacts with
the system's computational resources.

Profiling collects metrics on how computational resources like CPU, GPU, and memory are being
utilized during a training job. It can reveal patterns in resource utilization that indicate
performance bottlenecks and pinpoint areas of the code or pipeline that are causing slowdowns or
inefficiencies.

A training job can be profiled at many different layers, from generic system-level metrics to
individual model operators and GPU kernels. Determined provides a few options for profiling, each
targeting a different layer of the training job at a different level of detail:

- :ref:`Determined system metrics profiler <how-to-profiling-det-profiler>` collects general
  system-level metrics and provides an overview of hardware usage during an experiment.
- :ref:`Native profiler integration <how-to-profiling-native-profilers>` enables model profiling in
  the training APIs, providing fine-grained metrics specific to your model.
- :ref:`Prometheus/Grafana integration <how-to-profiling-prom-grafana>` can be set up to track
  detailed hardware metrics and monitor overall cluster health.

.. _how-to-profiling:

.. _how-to-profiling-det-profiler:

*********************
Determined Profiler
*********************

Determined comes with a built-in profiler that provides out-of-the-box tracking for system-level
metrics. System metrics are statistics around hardware usage, such as GPU utilization, disk usage,
and network throughput.

These metrics provide a general overview of resource usage during a training run and can be useful
for quickly spotting inefficient use of computational resources. When the system metrics reported
for an experiment do not match hardware expectations, that is a sign that the software could likely
be optimized to make better use of the hardware.

The Determined profiler collects a set of system metrics throughout an experiment; these metrics
can be visualized in the WebUI under the experiment's "Profiler" tab. Profiling is supported for
all training APIs but is not enabled by default.

Visit :ref:`core-profiler` to find out how to enable and configure the Determined profiler for your
experiment.

The following system metrics are tracked:

- *GPU utilization (percent)*: utilization of a GPU device
- *GPU free memory (bytes)*: amount of free memory available on a GPU device
- *Network throughput - sent (bytes/s)*: bytes sent system-wide
- *Network throughput - received (bytes/s)*: bytes received system-wide
- *Disk IOPS (operations/s)*: number of reads + writes system-wide
- *Disk throughput - reads (bytes/s)*: bytes read system-wide
- *Disk throughput - writes (bytes/s)*: bytes written system-wide
- *Host available memory (bytes)*: amount of memory available (not including swap) system-wide
- *CPU utilization (percent)*: utilization of CPU cores, averaged across all cores in the system

For distributed training, these metrics are collected for every agent. The data is broken down by
agent, and GPU metrics can be further broken down by GPU.

.. note::

   System metrics are recorded at the agent level, so when multiple experiments run on the same
   agent, it is difficult to attribute usage to any one experiment.

.. _how-to-profiling-native-profilers:

***************************
Native Training Profilers
***************************

Sometimes system-level profiling doesn't capture enough data to help debug bottlenecks in model
training code. Identifying inefficiencies in individual training operations or steps requires a more
fine-grained context than generic system metrics can provide. For this level of profiling,
Determined supports integration with the profilers native to each training framework:

- PyTorch Profiler (:ref:`PyTorch API <pytorch_profiler>`)
- DeepSpeed Profiler (:ref:`DeepSpeed API <deepspeed-profiler>`)
- TensorFlow Keras Profiler (:ref:`Keras API <keras-profiler>`)

Please see your framework's profiler documentation and the Determined Training API guide for usage
details.

.. _how-to-profiling-prom-grafana:

************************************
Prometheus and Grafana Integration
************************************

For a more resource-centric view of Determined jobs, Determined provides a Prometheus endpoint along
with a pre-configured Grafana dashboard. These can be set up to track detailed hardware usage
metrics for a Determined cluster, and can be helpful for alerting and monitoring cluster health.

The Prometheus endpoint aggregates system metrics and associates them with Determined concepts such
as experiments, tags, and resource pools, which can be viewed in Grafana. Determined provides a
Grafana dashboard that shows real-time resource metrics across an entire cluster, as well as
metrics broken down by experiment, container, and resource pool.

Visit :ref:`configure-prometheus-grafana` to find out how to enable this functionality.
33 changes: 5 additions & 28 deletions docs/reference/experiment-config-reference.rst
@@ -1483,37 +1483,14 @@ explicitly specified, the master will automatically generate an experiment seed.
Profiling
***********

The ``profiling`` section specifies configuration options related to profiling experiments. See
:ref:`how-to-profiling` for a more detailed walkthrough.

``profiling``
=============

Optional. Profiling is supported for all frameworks, though timings are only collected for
``PyTorchTrial``. Profiles are collected for a maximum of 5 minutes, regardless of the settings
below.
The ``profiling`` section specifies configuration options for the Determined system metrics
profiler. See :ref:`how-to-profiling` for a more detailed walkthrough.

``enabled``
-----------

Optional. Defines whether profiles should be collected or not. Defaults to false.

``begin_on_batch``
------------------

Optional. Specifies the batch on which profiling should begin.

``end_after_batch``
-------------------

Optional. Specifies the batch after which profiling should end.

``sync_timings``
----------------
===========

Optional. Specifies whether Determined should wait for all GPU kernel streams before considering a
timing as ended. Defaults to 'true'. Applies only for frameworks that collect timing metrics
(currently just PyTorch).
Optional. Enables system metrics profiling for the experiment; the collected metrics can be viewed
in the WebUI. Defaults to false.
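
For example, a minimal sketch of the relevant lines of an experiment configuration with the
profiler turned on (all other options left at their defaults):

.. code:: yaml

   profiling:
     enabled: true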

.. _experiment-configuration_training_units:

8 changes: 8 additions & 0 deletions docs/reference/training/api-core-reference.rst
@@ -68,6 +68,14 @@
:members:
:member-order: bysource

*************************************
``determined.core.ProfilerContext``
*************************************

.. autoclass:: determined.core.ProfilerContext
:members:
:member-order: bysource

***************************************
``determined.core.SearcherOperation``
***************************************