feat: profiling v2 [MD-27] #9032

Merged
merged 23 commits into from
Mar 26, 2024
Commits
23 commits
42da795
Backend support for persisting profiling/system metrics to (#8884)
azhou-determined Mar 5, 2024
b7f6445
feat: implement profiler in Core API [MD-10] [MD-302] (#8964)
azhou-determined Mar 13, 2024
980483c
feat: remove throughput and timing metrics views on web UI (#8992)
azhou-determined Mar 13, 2024
05d4811
chore: aggregate system profiling metrics (#8995)
azhou-determined Mar 14, 2024
781c9f6
feat: migrate existing profiler metrics to generic metrics [MD-300] […
azhou-determined Mar 15, 2024
2925c7a
docs: better docs for profiling in determined [MD-252] (#9011)
azhou-determined Mar 19, 2024
2425d9d
chore: add unit tests for profiler avg method
azhou-determined Mar 19, 2024
2715646
fix tests and optimize query
azhou-determined Mar 19, 2024
161aff9
re-gen proto & bindings
azhou-determined Mar 20, 2024
3d23901
docs: docs PR suggestions
azhou-determined Mar 22, 2024
ada5fe8
various PR comments and updates
azhou-determined Mar 22, 2024
a27e462
release notes
azhou-determined Mar 25, 2024
db8f6d9
docs update
azhou-determined Mar 25, 2024
3988bda
move migrations to top
azhou-determined Mar 25, 2024
7b729b7
chore: remove profiling not enabled check in web UI (#9054)
azhou-determined Mar 26, 2024
55bc58d
update core api docs
azhou-determined Mar 26, 2024
f5276c3
move historical data migration to new optional_migrations and update …
azhou-determined Mar 26, 2024
bbb930b
force re-gen proto image
azhou-determined Mar 26, 2024
0262862
force re-gen proto image
azhou-determined Mar 26, 2024
b41796f
change iops unit
azhou-determined Mar 26, 2024
d35fce6
re-gen proto
azhou-determined Mar 26, 2024
43d080f
fmt
azhou-determined Mar 26, 2024
9806379
docs tweaks
azhou-determined Mar 26, 2024
3 changes: 2 additions & 1 deletion docs/.redirects/redirects.json
@@ -1,6 +1,7 @@
{
"reference/python-sdk": "python-sdk/python-sdk.html",
"reference/training/experiment-config-reference": "../experiment-config-reference.html",
"model-dev-guide/dtrain/optimize-training": "../profiling.html",
"model-dev-guide/submit-experiment": "create-experiment.html",
"setup-cluster/internet-access": "checklists/adv-requirements.html",
"setup-cluster/deploy-cluster/internet-access": "../checklists/adv-requirements.html",
@@ -463,4 +464,4 @@
"tutorials/porting-tutorial": "pytorch-mnist-tutorial.html",
"tutorials/quick-start": "quickstart-mdldev.html",
"tutorials/pytorch-mnist-local-qs": "../get-started/webui-qs.html"
}
}
9 changes: 6 additions & 3 deletions docs/get-started/architecture/introduction.rst
@@ -234,9 +234,12 @@ following benefits:
| | without risking future lock-in. See |
| | :ref:`apis-howto-overview`. |
+------------------------------------------------+-----------------------------------------------------------+
| Profiling | Out-of-the-box system metrics (measurements of hardware |
| | usage) and timings (durations of actions taken during |
| | training, such as data loading). |
| Profiling | Optimize your model's performance and diagnose |
| | bottlenecks with comprehensive profiling support across |
| | different layers of your deployment, from out-of-the-box |
| | system metrics tracking and seamless integrations with |
| | native training profilers to Prometheus/Grafana support. |
| | See :ref:`profiling`. |
+------------------------------------------------+-----------------------------------------------------------+
| Visualization | Visualize your model and training procedure by using The |
| | built-in WebUI and by launching managed |
57 changes: 53 additions & 4 deletions docs/model-dev-guide/api-guides/apis-howto/api-core-ug.rst
@@ -508,18 +508,67 @@ In the Determined WebUI, go to the **Cluster** pane.
You should be able to see multiple slots active corresponding to the value you set for
``slots_per_trial`` in ``distributed.yaml``, as well as logs appearing from multiple ranks.
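For reference, a minimal ``distributed.yaml`` fragment requesting multiple slots might look like the following (the value ``4`` is an illustrative choice, not one taken from this PR):

```yaml
resources:
  slots_per_trial: 4
```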

.. _core-profiling:

***********
Profiling
***********

Profiling with native profilers such as the PyTorch and TensorFlow profilers can be configured as
usual. If running on a Determined cluster, the profiling log output path can be configured for
automatic upload to the Determined TensorBoard UI.
There are two ways to profile the performance of your training job:

#. Core API's built-in system metrics profiler

#. Integration with profilers native to your training framework, such as the TensorFlow and PyTorch
profilers

.. _core-profiler:

Core API Profiler
=================

The Core API includes a profiling feature that monitors and records system metrics during the
training run. These metrics are recorded at specified intervals and sent to the master, allowing you
to view them in the "Profiling" tab of your experiment in the WebUI.

Use :class:`~determined.core.ProfilerContext` to interact with the Core API profiler. It can be
toggled on or off by calling :meth:`~determined.core.ProfilerContext.on` and
:meth:`~determined.core.ProfilerContext.off`. :meth:`~determined.core.ProfilerContext.on` accepts
optional parameters that configure the rate (in seconds) at which system metrics are sampled
(``sampling_interval``) and the number of samples to average before reporting
(``samples_per_report``). By default, the profiler samples every 1 second and reports the aggregate
of every 10 samples.

The following code snippet demonstrates how to enable profiling for only a portion of your training
code, but the profiler can be turned on and off at any point within the ``core.Context``.

.. code:: python

   import determined as det


   with det.core.init() as core_context:
       ...
       for batch_idx in range(1, 10):
           # In this example we profile only the first 4 batches.
           if batch_idx == 1:
               core_context.profiler.on()
           if batch_idx == 5:
               core_context.profiler.off()
           train_batch(...)
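The default reporting behavior described above (sample every second, report the average of every 10 samples) can be illustrated with a standalone sketch. ``aggregate_samples`` is a hypothetical helper written for illustration, not Determined's internal implementation:

```python
from statistics import mean


def aggregate_samples(samples, samples_per_report=10):
    """Average each full group of ``samples_per_report`` raw samples into one report.

    Hypothetical illustration of the sample-then-aggregate behavior described
    above; not Determined's internal code.
    """
    reports = []
    for start in range(0, len(samples) - samples_per_report + 1, samples_per_report):
        reports.append(mean(samples[start:start + samples_per_report]))
    return reports


# 20 one-second GPU-utilization samples become 2 reported values.
print(aggregate_samples(list(range(20))))  # [4.5, 14.5]
```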

.. _core-native-profilers:

Native Profilers
================

Profiling with native profilers such as PyTorch profiler and TensorFlow profiler can be configured
as usual. If running on a Determined cluster, the profiling log output path can be configured for
automatic upload to the Determined TensorBoard UI.

The following snippet initializes the PyTorch Profiler. It will profile GPU and CPU activities,
skipping batch 1, warming up on batch 2, profiling batches 3 and 4, then repeating the cycle. Result
files will be uploaded to the experiment's TensorBoard path and can be viewed under the "PyTorch
Profiler" tab in the Determined Tensorboard UI.
Profiler" tab in the Determined TensorBoard UI.

See `PyTorch Profiler <https://github.com/pytorch/kineto/tree/main/tb_plugin>`_ documentation for
details.
2 changes: 2 additions & 0 deletions docs/model-dev-guide/api-guides/apis-howto/api-keras-ug.rst
@@ -95,6 +95,8 @@ implement the :class:`determined.keras.callbacks.Callback` interface (an extensi
:class:`~determined.keras.TFKerasTrial` by implementing
:meth:`~determined.keras.TFKerasTrial.keras_callbacks`.

.. _keras-profiler:

***********
Profiling
***********
2 changes: 2 additions & 0 deletions docs/model-dev-guide/api-guides/apis-howto/api-pytorch-ug.rst
@@ -515,6 +515,8 @@ you find that the built-in context.DataLoader() does not support your use case.

See the :mod:`determined.pytorch.samplers` for details.

.. _pytorch_profiler:

Profiling
---------

@@ -303,6 +303,8 @@ interleaving micro batches:
loss = self.model_engine.eval_batch()
return {"loss": loss}

.. _deepspeed-profiler:

***********
Profiling
***********
3 changes: 0 additions & 3 deletions docs/model-dev-guide/dtrain/_index.rst
@@ -33,8 +33,6 @@ Additional Resources:
- Learn how :ref:`Configuration Templates <config-template>` can help reduce redundancy.
- Discover how Determined aims to support reproducible machine learning experiments in
:ref:`Reproducibility <reproducibility>`.
- In :ref:`Optimizing Training <optimizing-training>`, you'll learn about out-of-the box tools you
can use for instrumenting training.

.. toctree::
:caption: Distributed Training
@@ -44,4 +42,3 @@ Additional Resources:
Implementing Distributed Training <dtrain-implement>
Configuration Templates <config-templates>
Reproducibility <reproducibility>
Optimizing Training <optimize-training>
86 changes: 0 additions & 86 deletions docs/model-dev-guide/dtrain/optimize-training.rst

This file was deleted.

107 changes: 107 additions & 0 deletions docs/model-dev-guide/profiling.rst
@@ -0,0 +1,107 @@
.. _profiling:

###########
Profiling
###########

Optimizing a model's performance is often a trade-off between accuracy, time, and resource
requirements. Training deep learning models is a time- and resource-intensive process, where each
iteration can take several hours and accumulate heavy hardware costs. Though sometimes this cost is
inherent to the task, unnecessary resource consumption can also be caused by suboptimal code or
bugs. Achieving optimal model performance thus requires an understanding of how your model
interacts with the system's computational resources.

Profiling collects metrics on how computational resources like CPU, GPU, and memory are being
utilized during a training job. It can reveal patterns in resource utilization that indicate
performance bottlenecks and pinpoint areas of the code or pipeline that are causing slowdowns or
inefficiencies.

A training job can be profiled at many different layers, from generic system-level metrics to
individual model operators and GPU kernels. Determined provides a few options for profiling, each
targeting a different layer in a training job at various levels of detail:

- :ref:`Determined system metrics profiler <how-to-profiling-det-profiler>` collects general
system-level metrics and provides an overview of hardware usage during an experiment.
- :ref:`Native profiler integration <how-to-profiling-native-profilers>` enables model profiling in
  training APIs, providing fine-grained metrics specific to your model.
- :ref:`Prometheus/Grafana integration <how-to-profiling-prom-grafana>` can be set up to track
detailed hardware metrics and monitor overall cluster health.

.. _how-to-profiling:

.. _how-to-profiling-det-profiler:

*********************
Determined Profiler
*********************

Determined comes with a built-in profiler that provides out-of-the-box tracking for system-level
metrics. System metrics are statistics around hardware usage, such as GPU utilization, disk usage,
and network throughput.

These metrics provide a general overview of resource usage during a training run and can be useful
for quickly identifying inefficient use of computational resources. When the system metrics
reported for an experiment do not match hardware expectations, that is a sign that the software can
likely be optimized to make better use of the hardware resources.

The Determined profiler collects a set of system metrics throughout an experiment, which can be
visualized in the WebUI under the experiment's "Profiler" tab. It is supported for all training
APIs but is not enabled by default.

Visit :ref:`core-profiler` to find out how to enable and configure the Determined profiler for your
experiment.

The following system metrics are tracked:

- *GPU utilization (percent)*: utilization of a GPU device
- *GPU free memory (bytes)*: amount of free memory available on a GPU device
- *Network throughput - sent (bytes/s)*: bytes sent system-wide
- *Network throughput - received (bytes/s)*: bytes received system-wide
- *Disk IOPS (operations/s)*: number of reads + writes system-wide
- *Disk throughput - reads (bytes/s)*: bytes read system-wide
- *Disk throughput - writes (bytes/s)*: bytes written system-wide
- *Host available memory (bytes)*: amount of memory available (not including swap) system-wide
- *CPU utilization (percent)*: utilization of CPU cores, averaged across all cores in the system

For distributed training, these metrics are collected for every agent. The data is broken down by
agent, and GPU metrics can be further broken down by GPU.

.. note::

   System metrics are recorded at the agent level, so when multiple experiments run on the same
   agent, their metrics are mixed together and difficult to attribute to a single experiment.

.. _how-to-profiling-native-profilers:

***************************
Native Training Profilers
***************************

Sometimes system-level profiling doesn't capture enough data to help debug bottlenecks in model
training code. Identifying inefficiencies in individual training operations or steps requires a more
fine-grained context than generic system metrics can provide. For this level of profiling,
Determined supports integration with training profilers that are native to their frameworks:

- PyTorch Profiler (:ref:`PyTorch API <pytorch_profiler>`)
- DeepSpeed Profiler (:ref:`DeepSpeed API <deepspeed-profiler>`)
- TensorFlow Keras Profiler (:ref:`Keras API <keras-profiler>`)

Please see your framework's profiler documentation and the Determined Training API guide for usage
details.

.. _how-to-profiling-prom-grafana:

************************************
Prometheus and Grafana Integration
************************************

For a more resource-centric view of Determined jobs, Determined provides a Prometheus endpoint along
with a pre-configured Grafana dashboard. These can be set up to track detailed hardware usage
metrics for a Determined cluster, and can be helpful for alerting and monitoring cluster health.

The Prometheus endpoint aggregates system metrics and associates them with Determined concepts such
as experiments, tags, and resource pools, which can be viewed in Grafana. Determined provides a
Grafana dashboard that shows real-time resource metrics across an entire cluster as well as
experiments, containers, and resource pools.

Visit :ref:`configure-prometheus-grafana` to find out how to enable this functionality.
33 changes: 5 additions & 28 deletions docs/reference/experiment-config-reference.rst
@@ -1483,37 +1483,14 @@ explicitly specified, the master will automatically generate an experiment seed.
Profiling
***********

The ``profiling`` section specifies configuration options related to profiling experiments. See
:ref:`how-to-profiling` for a more detailed walkthrough.

``profiling``
=============

Optional. Profiling is supported for all frameworks, though timings are only collected for
``PyTorchTrial``. Profiles are collected for a maximum of 5 minutes, regardless of the settings
below.
The ``profiling`` section specifies configuration options for the Determined system metrics
profiler. See :ref:`how-to-profiling` for a more detailed walkthrough.

``enabled``
-----------

Optional. Defines whether profiles should be collected or not. Defaults to false.

``begin_on_batch``
------------------

Optional. Specifies the batch on which profiling should begin.

``end_after_batch``
-------------------

Optional. Specifies the batch after which profiling should end.

``sync_timings``
----------------
===========

Optional. Specifies whether Determined should wait for all GPU kernel streams before considering a
timing as ended. Defaults to 'true'. Applies only for frameworks that collect timing metrics
(currently just PyTorch).
Optional. Enables system metrics profiling on the experiment, which can be viewed in the WebUI.
Defaults to false.
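A minimal experiment configuration fragment enabling the profiler, using only the field documented above, might look like:

```yaml
profiling:
  enabled: true
```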

.. _experiment-configuration_training_units:

8 changes: 8 additions & 0 deletions docs/reference/training/api-core-reference.rst
@@ -68,6 +68,14 @@
:members:
:member-order: bysource

*************************************
``determined.core.ProfilerContext``
*************************************

.. autoclass:: determined.core.ProfilerContext
:members:
:member-order: bysource

***************************************
``determined.core.SearcherOperation``
***************************************