
docs: better docs for profiling in determined [MD-252] #9011

Merged
merged 5 commits
Mar 19, 2024
108 changes: 49 additions & 59 deletions docs/get-started/architecture/introduction.rst
@@ -183,65 +183,55 @@ practitioners might find inconvenient to implement. The Determined cohesive, end-to-end
platform provides best-in-class functionality for deep learning model training, including the
following benefits:

+------------------------------------------------+-----------------------------------------------------------+
| Implementation | Benefit |
+================================================+===========================================================+
| Automated model tuning | Optimize models by searching through conventional |
| | hyperparameters or macro- architectures, using a variety |
| | of search algorithms. Hyperparameter searches are |
| | automatically parallelized across the accelerators in the |
| | cluster. See :ref:`hyperparameter-tuning`. |
+------------------------------------------------+-----------------------------------------------------------+
| Cluster-backed notebooks, commands, and shells | Leverage your shared cluster computing devices in a more |
| | versatile environment. See :ref:`notebooks` and |
| | :ref:`commands-and-shells`. |
+------------------------------------------------+-----------------------------------------------------------+
| Cluster management | Automatically manage ML accelerators, such as GPUs, |
| | on-premise or in cloud VMs using your own environment, |
| | automatically scaling for your on-demand workloads. |
| | Determined runs in either AWS or GCP, so you can switch |
| | easily according to your requirements. See :ref:`Resource |
| | Pools <resource-pools>`, :ref:`Scheduling <scheduling>`, |
| | and :ref:`Elastic Infrastructure |
| | <elastic-infrastructure>`. |
+------------------------------------------------+-----------------------------------------------------------+
| Containerization | Develop and train models in customizable containers that |
| | enable simple, consistent dependency management |
| | throughout the model development lifecycle. See |
| | :ref:`custom-env`. |
+------------------------------------------------+-----------------------------------------------------------+
| Distributed training | Easily distribute a single training job across multiple |
| | accelerators to speed up model training and reduce model |
| | development iteration time. Determined uses synchronous, |
| | data-parallel distributed training, with key performance |
| | optimizations over other available options. See |
| | :ref:`multi-gpu-training-concept`. |
+------------------------------------------------+-----------------------------------------------------------+
| Experiment collaboration | Automatically track your experiment configuration and |
| | environment to facilitate reproducibility and |
| | collaboration among teams. See :ref:`experiments`. |
+------------------------------------------------+-----------------------------------------------------------+
| Fault tolerance | Models are checkpointed throughout the training process |
| | and can be restarted from the latest checkpoint, |
| | automatically. This enables training jobs to |
| | automatically tolerate transient hardware or system |
| | issues in the cluster. |
+------------------------------------------------+-----------------------------------------------------------+
| Framework support | Broad framework support leverages these capabilities |
| | using any of the leading machine learning frameworks |
| | without needing to manage a different cluster for each. |
| | Different frameworks for different models can be used |
| | without risking future lock-in. See |
| | :ref:`apis-howto-overview`. |
+------------------------------------------------+-----------------------------------------------------------+
| Profiling | Out-of-the-box system metrics (measurements of hardware |
| | usage) and timings (durations of actions taken during |
| | training, such as data loading). |
+------------------------------------------------+-----------------------------------------------------------+
| Visualization                                  | Visualize your model and training procedure by using the  |
| | built-in WebUI and by launching managed |
| | :ref:`tensorboards` instances. |
+------------------------------------------------+-----------------------------------------------------------+
+------------------------------------------------+------------------------------------------------------------------------------+
| Implementation | Benefit |
+================================================+==============================================================================+
| Automated model tuning | Optimize models by searching through conventional hyperparameters or macro- |
| | architectures, using a variety of search algorithms. Hyperparameter searches |
| | are automatically parallelized across the accelerators in the cluster. See |
| | :ref:`hyperparameter-tuning`. |
+------------------------------------------------+------------------------------------------------------------------------------+
| Cluster-backed notebooks, commands, and shells | Leverage your shared cluster computing devices in a more versatile |
| | environment. See :ref:`notebooks` and :ref:`commands-and-shells`. |
+------------------------------------------------+------------------------------------------------------------------------------+
| Cluster management | Automatically manage ML accelerators, such as GPUs, on-premise or in cloud |
| | VMs using your own environment, automatically scaling for your on-demand |
| | workloads. Determined runs in either AWS or GCP, so you can switch easily |
| | according to your requirements. See :ref:`Resource Pools <resource-pools>`, |
| | :ref:`Scheduling <scheduling>`, and :ref:`Elastic Infrastructure |
| | <elastic-infrastructure>`. |
+------------------------------------------------+------------------------------------------------------------------------------+
| Containerization | Develop and train models in customizable containers that enable simple, |
| | consistent dependency management throughout the model development lifecycle. |
| | See :ref:`custom-env`. |
+------------------------------------------------+------------------------------------------------------------------------------+
| Distributed training | Easily distribute a single training job across multiple accelerators to |
| | speed up model training and reduce model development iteration time. |
| | Determined uses synchronous, data-parallel distributed training, with key |
| | performance optimizations over other available options. See |
| | :ref:`multi-gpu-training-concept`. |
+------------------------------------------------+------------------------------------------------------------------------------+
| Experiment collaboration | Automatically track your experiment configuration and environment to |
| | facilitate reproducibility and collaboration among teams. See |
| | :ref:`experiments`. |
+------------------------------------------------+------------------------------------------------------------------------------+
| Fault tolerance | Models are checkpointed throughout the training process and can be restarted |
| | from the latest checkpoint, automatically. This enables training jobs to |
| | automatically tolerate transient hardware or system issues in the cluster. |
+------------------------------------------------+------------------------------------------------------------------------------+
| Framework support | Broad framework support leverages these capabilities using any of the |
| | leading machine learning frameworks without needing to manage a different |
| | cluster for each. Different frameworks for different models can be used |
| | without risking future lock-in. See :ref:`apis-howto-overview`. |
+------------------------------------------------+------------------------------------------------------------------------------+
| Profiling | Optimize your model's performance and diagnose bottlenecks with |
|                                                | comprehensive profiling support across different deployment layers, from     |
|                                                | out-of-the-box system metrics tracking to seamless integrations with native  |
|                                                | training profilers.                                                          |
+------------------------------------------------+------------------------------------------------------------------------------+
| Visualization                                  | Visualize your model and training procedure by using the built-in WebUI and  |
| | by launching managed :ref:`tensorboards` instances. |
+------------------------------------------------+------------------------------------------------------------------------------+

**********
Concepts
50 changes: 47 additions & 3 deletions docs/model-dev-guide/api-guides/apis-howto/api-core-ug.rst
@@ -508,13 +508,57 @@ In the Determined WebUI, go to the **Cluster** pane.
You should be able to see multiple slots active, corresponding to the value you set for
``slots_per_trial`` in ``distributed.yaml``, as well as logs appearing from multiple ranks.

.. _core-profiling:

***********
Profiling
***********

Profiling with native profilers can be configured as usual. If running on a Determined cluster, the
profiling log output path can be configured for automatic upload to the Determined TensorBoard UI.
There are two ways to profile the performance of your training job:

#. Core API's built-in system metrics profiler

#. Integration with profilers native to your training framework, such as the TensorFlow and PyTorch
   profilers
.. _core-profiler:

Core API Profiler
=================

The Core API includes a profiling feature that monitors and records system metrics during the
training run. These metrics are recorded at specified intervals and sent to the master, allowing you
to view them in the "Profiling" tab of your experiment in the WebUI.
Use ``core_context.profiler`` to interact with the Core API profiler. It can be toggled on or off by
calling ``core_context.profiler.on()`` and ``core_context.profiler.off()``.

The following code snippet demonstrates how to enable profiling for only a portion of your training
code, but the profiler can be turned on and off at any point within the ``core.Context``.

.. code:: python

   import determined as det


   with det.core.init() as core_context:
       ...
       for batch_idx in range(1, 10):
           # In this example we only profile batches 1 through 4.
           if batch_idx == 1:
               core_context.profiler.on()
           if batch_idx == 5:
               core_context.profiler.off()
           train_batch(...)

.. _core-native-profilers:

Native Profilers
================

Profiling with native profilers such as the PyTorch profiler and TensorFlow profiler can be
configured as usual. If running on a Determined cluster, the profiling log output path can be
configured for automatic upload to the Determined TensorBoard UI.
The following snippet initializes the PyTorch Profiler. It will profile GPU and CPU activities,
skipping batch 1, warming up on batch 2, profiling batches 3 and 4, then repeating the cycle. Result
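The PyTorch Profiler snippet itself is collapsed in this diff view. As an illustration of the
schedule the paragraph describes (``wait=1, warmup=1, active=2`` in ``torch.profiler`` terms), the
following standalone sketch computes which phase each batch falls into; the function name and
defaults are hypothetical, not taken from the PR:

```python
def profiler_phase(batch_idx: int, wait: int = 1, warmup: int = 1, active: int = 2) -> str:
    """Phase of a 1-indexed batch under a repeating wait/warmup/active
    schedule, mirroring torch.profiler.schedule(wait=1, warmup=1, active=2)."""
    cycle = wait + warmup + active      # 4 batches per cycle with these defaults
    pos = (batch_idx - 1) % cycle       # 0-based position within the current cycle
    if pos < wait:
        return "wait"                   # batch 1: skipped
    if pos < wait + warmup:
        return "warmup"                 # batch 2: warming up
    return "active"                     # batches 3 and 4: profiled

# Batches 1-8 cover two full cycles of the schedule.
phases = [profiler_phase(i) for i in range(1, 9)]
print(phases)
# → ['wait', 'warmup', 'active', 'active', 'wait', 'warmup', 'active', 'active']
```

This mirrors how the real profiler skips batch 1, warms up on batch 2, records batches 3 and 4,
then repeats.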
2 changes: 2 additions & 0 deletions docs/model-dev-guide/api-guides/apis-howto/api-keras-ug.rst
@@ -95,6 +95,8 @@ implement the :class:`determined.keras.callbacks.Callback` interface (an extensi
:class:`~determined.keras.TFKerasTrial` by implementing
:meth:`~determined.keras.TFKerasTrial.keras_callbacks`.

.. _keras-profiler:

***********
Profiling
***********
2 changes: 2 additions & 0 deletions docs/model-dev-guide/api-guides/apis-howto/api-pytorch-ug.rst
@@ -515,6 +515,8 @@ you find that the built-in context.DataLoader() does not support your use case.

See the :mod:`determined.pytorch.samplers` for details.

.. _pytorch_profiler:

Profiling
---------

@@ -303,6 +303,8 @@ interleaving micro batches:
loss = self.model_engine.eval_batch()
return {"loss": loss}

.. _deepspeed-profiler:

***********
Profiling
***********
3 changes: 0 additions & 3 deletions docs/model-dev-guide/dtrain/_index.rst
@@ -33,8 +33,6 @@ Additional Resources:
- Learn how :ref:`Configuration Templates <config-template>` can help reduce redundancy.
- Discover how Determined aims to support reproducible machine learning experiments in
:ref:`Reproducibility <reproducibility>`.
- In :ref:`Optimizing Training <optimizing-training>`, you'll learn about out-of-the box tools you
can use for instrumenting training.

.. toctree::
:caption: Distributed Training
@@ -44,4 +42,3 @@
Implementing Distributed Training <dtrain-implement>
Configuration Templates <config-templates>
Reproducibility <reproducibility>
Optimizing Training <optimize-training>