feat: profiling v2 [MD-27] #9032

Merged
merged 23 commits into from
Mar 26, 2024
Commits
23 commits
42da795
Backend support for persisting profiling/system metrics to (#8884)
azhou-determined Mar 5, 2024
b7f6445
feat: implement profiler in Core API [MD-10] [MD-302] (#8964)
azhou-determined Mar 13, 2024
980483c
feat: remove throughput and timing metrics views on web UI (#8992)
azhou-determined Mar 13, 2024
05d4811
chore: aggregate system profiling metrics (#8995)
azhou-determined Mar 14, 2024
781c9f6
feat: migrate existing profiler metrics to generic metrics [MD-300] […
azhou-determined Mar 15, 2024
2925c7a
docs: better docs for profiling in determined [MD-252] (#9011)
azhou-determined Mar 19, 2024
2425d9d
chore: add unit tests for profiler avg method
azhou-determined Mar 19, 2024
2715646
fix tests and optimize query
azhou-determined Mar 19, 2024
161aff9
re-gen proto & bindings
azhou-determined Mar 20, 2024
3d23901
docs: docs PR suggestions
azhou-determined Mar 22, 2024
ada5fe8
various PR comments and updates
azhou-determined Mar 22, 2024
a27e462
release notes
azhou-determined Mar 25, 2024
db8f6d9
docs update
azhou-determined Mar 25, 2024
3988bda
move migrations to top
azhou-determined Mar 25, 2024
7b729b7
chore: remove profiling not enabled check in web UI (#9054)
azhou-determined Mar 26, 2024
55bc58d
update core api docs
azhou-determined Mar 26, 2024
f5276c3
move historical data migration to new optional_migrations and update …
azhou-determined Mar 26, 2024
bbb930b
force re-gen proto image
azhou-determined Mar 26, 2024
0262862
force re-gen proto image
azhou-determined Mar 26, 2024
b41796f
change iops unit
azhou-determined Mar 26, 2024
d35fce6
re-gen proto
azhou-determined Mar 26, 2024
43d080f
fmt
azhou-determined Mar 26, 2024
9806379
docs tweaks
azhou-determined Mar 26, 2024
3 changes: 2 additions & 1 deletion docs/.redirects/redirects.json
@@ -1,6 +1,7 @@
{
"reference/python-sdk": "python-sdk/python-sdk.html",
"reference/training/experiment-config-reference": "../experiment-config-reference.html",
"model-dev-guide/dtrain/optimize-training": "../profiling.html",
"model-dev-guide/submit-experiment": "create-experiment.html",
"setup-cluster/internet-access": "checklists/adv-requirements.html",
"setup-cluster/deploy-cluster/internet-access": "../checklists/adv-requirements.html",
@@ -463,4 +464,4 @@
"tutorials/porting-tutorial": "pytorch-mnist-tutorial.html",
"tutorials/quick-start": "quickstart-mdldev.html",
"tutorials/pytorch-mnist-local-qs": "../get-started/webui-qs.html"
}
}
9 changes: 6 additions & 3 deletions docs/get-started/architecture/introduction.rst
@@ -234,9 +234,12 @@ following benefits:
| | without risking future lock-in. See |
| | :ref:`apis-howto-overview`. |
+------------------------------------------------+-----------------------------------------------------------+
| Profiling | Out-of-the-box system metrics (measurements of hardware |
| | usage) and timings (durations of actions taken during |
| | training, such as data loading). |
| Profiling | Optimize your model's performance and diagnose |
| | bottlenecks with comprehensive profiling support across |
| | different layers of your deployment, from out-of-the-box |
| | system metrics tracking and seamless integrations with |
| | native training profilers to Prometheus/Grafana support. |
| | See :ref:`profiling`. |
+------------------------------------------------+-----------------------------------------------------------+
| Visualization | Visualize your model and training procedure by using The |
| | built-in WebUI and by launching managed |
57 changes: 53 additions & 4 deletions docs/model-dev-guide/api-guides/apis-howto/api-core-ug.rst
@@ -508,18 +508,67 @@ In the Determined WebUI, go to the **Cluster** pane.
You should be able to see multiple slots active corresponding to the value you set for
``slots_per_trial`` in ``distributed.yaml``, as well as logs appearing from multiple ranks.
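For reference, a minimal ``distributed.yaml`` fragment requesting multiple slots might look like the following (the value ``4`` is an illustrative choice, not one taken from this PR):

```yaml
resources:
  slots_per_trial: 4
```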

.. _core-profiling:

***********
Profiling
***********

Profiling with native profilers such as the PyTorch and TensorFlow profilers can be configured as
usual. If running on a Determined cluster, the profiling log output path can be configured for
automatic upload to the Determined TensorBoard UI.
There are two ways to profile the performance of your training job:

#. Core API's built-in system metrics profiler

#. Integration with profilers native to your training framework, such as the TensorFlow and PyTorch
profilers

.. _core-profiler:

Core API Profiler
=================

The Core API includes a profiling feature that monitors and records system metrics during the
training run. These metrics are recorded at specified intervals and sent to the master, allowing you
to view them in the "Profiling" tab of your experiment in the WebUI.

Use :class:`~determined.core.ProfilerContext` to interact with the Core API profiler. It can be
toggled on or off by calling :meth:`~determined.core.ProfilerContext.on` and
:meth:`~determined.core.ProfilerContext.off`. :meth:`~determined.core.ProfilerContext.on` accepts
optional parameters that configure the rate (in seconds) at which system metrics are sampled
(``sampling_interval``) and the number of samples to average before reporting
(``samples_per_report``). By default, the profiler samples every 1 second and reports the aggregate
of every 10 samples.

The following code snippet demonstrates how to enable profiling for only a portion of your training
code, but the profiler can be turned on and off at any point within the ``core.Context``.

.. code:: python

   import determined as det


   with det.core.init() as core_context:
       ...
       for batch_idx in range(1, 10):
           # In this example we profile only the first 4 batches.
           if batch_idx == 1:
               core_context.profiler.on()
           if batch_idx == 5:
               core_context.profiler.off()
           train_batch(...)
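The default reporting behavior described above (sample every second, report the average of every 10 samples) can be illustrated with a standalone sketch. ``aggregate_samples`` is a hypothetical helper written for illustration, not Determined's internal implementation:

```python
from statistics import mean


def aggregate_samples(samples, samples_per_report=10):
    """Average each full group of ``samples_per_report`` raw samples into one report.

    Hypothetical illustration of the sample-then-aggregate behavior described
    above; not Determined's internal code.
    """
    reports = []
    for start in range(0, len(samples) - samples_per_report + 1, samples_per_report):
        reports.append(mean(samples[start:start + samples_per_report]))
    return reports


# 20 one-second GPU-utilization samples become 2 reported values.
print(aggregate_samples(list(range(20))))  # [4.5, 14.5]
```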

.. _core-native-profilers:

Native Profilers
================

Profiling with native profilers such as PyTorch profiler and TensorFlow profiler can be configured
as usual. If running on a Determined cluster, the profiling log output path can be configured for
automatic upload to the Determined TensorBoard UI.

The following snippet initializes the PyTorch Profiler. It will profile GPU and CPU activities,
skipping batch 1, warming up on batch 2, profiling batches 3 and 4, then repeating the cycle. Result
files will be uploaded to the experiment's TensorBoard path and can be viewed under the "PyTorch
Profiler" tab in the Determined Tensorboard UI.
Profiler" tab in the Determined TensorBoard UI.

See `PyTorch Profiler <https://github.com/pytorch/kineto/tree/main/tb_plugin>`_ documentation for
details.
2 changes: 2 additions & 0 deletions docs/model-dev-guide/api-guides/apis-howto/api-keras-ug.rst
@@ -95,6 +95,8 @@ implement the :class:`determined.keras.callbacks.Callback` interface (an extensi
:class:`~determined.keras.TFKerasTrial` by implementing
:meth:`~determined.keras.TFKerasTrial.keras_callbacks`.

.. _keras-profiler:

***********
Profiling
***********
2 changes: 2 additions & 0 deletions docs/model-dev-guide/api-guides/apis-howto/api-pytorch-ug.rst
@@ -515,6 +515,8 @@ you find that the built-in context.DataLoader() does not support your use case.

See the :mod:`determined.pytorch.samplers` for details.

.. _pytorch_profiler:

Profiling
---------

@@ -303,6 +303,8 @@ interleaving micro batches:
loss = self.model_engine.eval_batch()
return {"loss": loss}

.. _deepspeed-profiler:

***********
Profiling
***********
3 changes: 0 additions & 3 deletions docs/model-dev-guide/dtrain/_index.rst
@@ -33,8 +33,6 @@ Additional Resources:
- Learn how :ref:`Configuration Templates <config-template>` can help reduce redundancy.
- Discover how Determined aims to support reproducible machine learning experiments in
:ref:`Reproducibility <reproducibility>`.
- In :ref:`Optimizing Training <optimizing-training>`, you'll learn about out-of-the box tools you
can use for instrumenting training.

.. toctree::
:caption: Distributed Training
@@ -44,4 +42,3 @@ Additional Resources:
Implementing Distributed Training <dtrain-implement>
Configuration Templates <config-templates>
Reproducibility <reproducibility>
Optimizing Training <optimize-training>
86 changes: 0 additions & 86 deletions docs/model-dev-guide/dtrain/optimize-training.rst

This file was deleted.

107 changes: 107 additions & 0 deletions docs/model-dev-guide/profiling.rst
@@ -0,0 +1,107 @@
.. _profiling:

###########
Profiling
###########

Optimizing a model's performance is often a trade-off between accuracy, time, and resource
requirements. Training deep learning models is a time- and resource-intensive process, where each
iteration can take several hours and accumulate heavy hardware costs. Though sometimes this cost is
inherent to the task, unnecessary resource consumption can also be caused by suboptimal code or
bugs. Achieving optimal model performance thus requires an understanding of how your model
interacts with the system's computational resources.

Profiling collects metrics on how computational resources like CPU, GPU, and memory are being
utilized during a training job. It can reveal patterns in resource utilization that indicate
performance bottlenecks and pinpoint areas of the code or pipeline that are causing slowdowns or
inefficiencies.

A training job can be profiled at many different layers, from generic system-level metrics to
individual model operators and GPU kernels. Determined provides a few options for profiling, each
targeting a different layer in a training job at various levels of detail:

- :ref:`Determined system metrics profiler <how-to-profiling-det-profiler>` collects general
system-level metrics and provides an overview of hardware usage during an experiment.
- :ref:`Native profiler integration <how-to-profiling-native-profilers>` enables model profiling in
  training APIs, providing fine-grained metrics specific to your model.
- :ref:`Prometheus/Grafana integration <how-to-profiling-prom-grafana>` can be set up to track
detailed hardware metrics and monitor overall cluster health.

.. _how-to-profiling:

.. _how-to-profiling-det-profiler:

*********************
Determined Profiler
*********************

Determined comes with a built-in profiler that provides out-of-the-box tracking for system-level
metrics. System metrics are statistics around hardware usage, such as GPU utilization, disk usage,
and network throughput.

These metrics provide a general overview of resource usage during a training run and can be useful
for quickly identifying inefficient use of computational resources. When the system metrics
reported for an experiment do not match hardware expectations, that is a sign that the software can
likely be optimized to make better use of the hardware resources.

The Determined profiler collects a set of system metrics throughout an experiment, which can be
visualized in the WebUI under the experiment's "Profiler" tab. It is supported for all training
APIs but is not enabled by default.

Visit :ref:`core-profiler` to find out how to enable and configure the Determined profiler for your
experiment.

The following system metrics are tracked:

- *GPU utilization (percent)*: utilization of a GPU device
- *GPU free memory (bytes)*: amount of free memory available on a GPU device
- *Network throughput - sent (bytes/s)*: bytes sent system-wide
- *Network throughput - received (bytes/s)*: bytes received system-wide
- *Disk IOPS (operations/s)*: number of reads + writes system-wide
- *Disk throughput - reads (bytes/s)*: bytes read system-wide
- *Disk throughput - writes (bytes/s)*: bytes written system-wide
- *Host available memory (bytes)*: amount of memory available (not including swap) system-wide
- *CPU utilization (percent)*: utilization of CPU cores, averaged across all cores in the system

For distributed training, these metrics are collected for every agent. The data is broken down by
agent, and GPU metrics can be further broken down by GPU.

.. note::

   System metrics are recorded at the agent level, so when multiple experiments run on the same
   agent, their metrics are mixed together and difficult to attribute to a single experiment.

.. _how-to-profiling-native-profilers:

***************************
Native Training Profilers
***************************

Sometimes system-level profiling doesn't capture enough data to help debug bottlenecks in model
training code. Identifying inefficiencies in individual training operations or steps requires a more
fine-grained context than generic system metrics can provide. For this level of profiling,
Determined supports integration with training profilers that are native to their frameworks:

- PyTorch Profiler (:ref:`PyTorch API <pytorch_profiler>`)
- DeepSpeed Profiler (:ref:`DeepSpeed API <deepspeed-profiler>`)
- TensorFlow Keras Profiler (:ref:`Keras API <keras-profiler>`)

Please see your framework's profiler documentation and the Determined Training API guide for usage
details.

.. _how-to-profiling-prom-grafana:

************************************
Prometheus and Grafana Integration
************************************

For a more resource-centric view of Determined jobs, Determined provides a Prometheus endpoint along
with a pre-configured Grafana dashboard. These can be set up to track detailed hardware usage
metrics for a Determined cluster, and can be helpful for alerting and monitoring cluster health.

The Prometheus endpoint aggregates system metrics and associates them with Determined concepts such
as experiments, tags, and resource pools, which can be viewed in Grafana. Determined provides a
Grafana dashboard that shows real-time resource metrics across an entire cluster as well as
experiments, containers, and resource pools.

Visit :ref:`configure-prometheus-grafana` to find out how to enable this functionality.
33 changes: 5 additions & 28 deletions docs/reference/experiment-config-reference.rst
@@ -1483,37 +1483,14 @@ explicitly specified, the master will automatically generate an experiment seed.
Profiling
***********

The ``profiling`` section specifies configuration options related to profiling experiments. See
:ref:`how-to-profiling` for a more detailed walkthrough.

``profiling``
=============

Optional. Profiling is supported for all frameworks, though timings are only collected for
``PyTorchTrial``. Profiles are collected for a maximum of 5 minutes, regardless of the settings
below.
The ``profiling`` section specifies configuration options for the Determined system metrics
profiler. See :ref:`how-to-profiling` for a more detailed walkthrough.

``enabled``
-----------

Optional. Defines whether profiles should be collected or not. Defaults to false.

``begin_on_batch``
------------------

Optional. Specifies the batch on which profiling should begin.

``end_after_batch``
-------------------

Optional. Specifies the batch after which profiling should end.

``sync_timings``
----------------
===========

Optional. Specifies whether Determined should wait for all GPU kernel streams before considering a
timing as ended. Defaults to 'true'. Applies only for frameworks that collect timing metrics
(currently just PyTorch).
Optional. Enables system metrics profiling on the experiment, which can be viewed in the WebUI.
Defaults to false.
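A minimal experiment configuration fragment enabling the profiler, using only the field documented above, might look like:

```yaml
profiling:
  enabled: true
```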

.. _experiment-configuration_training_units:

8 changes: 8 additions & 0 deletions docs/reference/training/api-core-reference.rst
@@ -68,6 +68,14 @@
:members:
:member-order: bysource

*************************************
``determined.core.ProfilerContext``
*************************************

.. autoclass:: determined.core.ProfilerContext
:members:
:member-order: bysource

***************************************
``determined.core.SearcherOperation``
***************************************