diff --git a/docs/.redirects/redirects.json b/docs/.redirects/redirects.json index e35c3d97d5e..51390a604a8 100644 --- a/docs/.redirects/redirects.json +++ b/docs/.redirects/redirects.json @@ -1,6 +1,7 @@ { "reference/python-sdk": "python-sdk/python-sdk.html", "reference/training/experiment-config-reference": "../experiment-config-reference.html", + "model-dev-guide/dtrain/optimize-training": "../profiling.html", "model-dev-guide/submit-experiment": "create-experiment.html", "setup-cluster/internet-access": "checklists/adv-requirements.html", "setup-cluster/deploy-cluster/internet-access": "../checklists/adv-requirements.html", @@ -463,4 +464,4 @@ "tutorials/porting-tutorial": "pytorch-mnist-tutorial.html", "tutorials/quick-start": "quickstart-mdldev.html", "tutorials/pytorch-mnist-local-qs": "../get-started/webui-qs.html" -} +} \ No newline at end of file diff --git a/docs/get-started/architecture/introduction.rst b/docs/get-started/architecture/introduction.rst index 5988f0f06b3..fcbaed15bfc 100644 --- a/docs/get-started/architecture/introduction.rst +++ b/docs/get-started/architecture/introduction.rst @@ -234,9 +234,12 @@ following benefits: | | without risking future lock-in. See | | | :ref:`apis-howto-overview`. | +------------------------------------------------+-----------------------------------------------------------+ -| Profiling | Out-of-the-box system metrics (measurements of hardware | -| | usage) and timings (durations of actions taken during | -| | training, such as data loading). | +| Profiling | Optimize your model's performance and diagnose | +| | bottlenecks with comprehensive profiling support across | +| | different layers of your deployment, from out-of-the-box | +| | system metrics tracking and seamless integrations with | +| | native training profilers to Prometheus/Grafana support. | +| | See :ref:`profiling`. | +------------------------------------------------+-----------------------------------------------------------+ | Visualization | Visualize your model and training procedure by using The | | | built-in WebUI and by launching managed | diff --git a/docs/model-dev-guide/api-guides/apis-howto/api-core-ug.rst b/docs/model-dev-guide/api-guides/apis-howto/api-core-ug.rst index f7b9a3473c0..8451e229996 100644 --- a/docs/model-dev-guide/api-guides/apis-howto/api-core-ug.rst +++ b/docs/model-dev-guide/api-guides/apis-howto/api-core-ug.rst @@ -508,18 +508,62 @@ In the Determined WebUI, go to the **Cluster** pane. You should be able to see multiple slots active corresponding to the value you set for ``slots_per_trial`` you set in ``distributed.yaml``, as well as logs appearing from multiple ranks. +.. _core-profiling: + *********** Profiling *********** -Profiling with native profilers such as can be configured as usual. If running on a Determined -cluster, the profiling log output path can be configured for automatic upload to the Determined -Tensorboard UI. +There are two ways to profile the performance of your training job: + +#. Core API's built-in system metrics profiler + +#. Integration with profilers native to your training framework, such as the TensorFlow and PyTorch +profilers + +.. _core-profiler: + +Core API Profiler +================= + +The Core API includes a profiling feature that monitors and records system metrics during the +training run. These metrics are recorded at specified intervals and sent to the master, allowing you +to view them in the "Profiling" tab of your experiment in the WebUI. 
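+
+In the simplest case you can leave the profiler on for the entire run. The snippet below is only a
+sketch: ``run_training`` stands in for whatever training routine you already have, and the toggling
+calls it uses are described in detail next.
+
+.. code:: python
+
+   import determined as det
+
+
+   with det.core.init() as core_context:
+       # Report system metrics to the master for the whole run.
+       core_context.profiler.on()
+       run_training(...)  # hypothetical training routine
+       core_context.profiler.off()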
+ +Use ``core_context.profiler`` to interact with the Core API profiler. It can be toggled on or off by +calling ``core_context.profiler.on()`` and ``core_context.profiler.off()``. + +The following code snippet demonstrates how to enable profiling for only a portion of your training +code, but the profiler can be turned on and off at any point within the ``core.Context``. + +.. code:: python + + import determined as det + + + with det.core.init() as core_context: + ... + for batch_idx in range(1, 10): + # In this example we just want to profile the first 5 batches. + if batch_idx == 1: + core_context.profiler.on() + if batch_idx == 5: + core_context.profiler.off() + train_batch(...) + +.. _core-native-profilers: + +Native Profilers +================ + +Profiling with native profilers such as PyTorch profiler and TensorFlow profiler can be configured +as usual. If running on a Determined cluster, the profiling log output path can be configured for +automatic upload to the Determined TensorBoard UI. The following snippet initializes the PyTorch Profiler. It will profile GPU and CPU activities, skipping batch 1, warming up on batch 2, profiling batches 3 and 4, then repeating the cycle. Result files will be uploaded to the experiment's TensorBoard path and can be viewed under the "PyTorch -Profiler" tab in the Determined Tensorboard UI. +Profiler" tab in the Determined TensorBoard UI. See `PyTorch Profiler `_ documentation for details. diff --git a/docs/model-dev-guide/api-guides/apis-howto/api-keras-ug.rst b/docs/model-dev-guide/api-guides/apis-howto/api-keras-ug.rst index ec63dc7461b..f871c9172ec 100644 --- a/docs/model-dev-guide/api-guides/apis-howto/api-keras-ug.rst +++ b/docs/model-dev-guide/api-guides/apis-howto/api-keras-ug.rst @@ -95,6 +95,8 @@ implement the :class:`determined.keras.callbacks.Callback` interface (an extensi :class:`~determined.keras.TFKerasTrial` by implementing :meth:`~determined.keras.TFKerasTrial.keras_callbacks`. +.. _keras-profiler: + *********** Profiling *********** diff --git a/docs/model-dev-guide/api-guides/apis-howto/api-pytorch-ug.rst b/docs/model-dev-guide/api-guides/apis-howto/api-pytorch-ug.rst index 168e6519101..c68d9767d4f 100644 --- a/docs/model-dev-guide/api-guides/apis-howto/api-pytorch-ug.rst +++ b/docs/model-dev-guide/api-guides/apis-howto/api-pytorch-ug.rst @@ -515,6 +515,8 @@ you find that the built-in context.DataLoader() does not support your use case. See the :mod:`determined.pytorch.samplers` for details. +.. _pytorch_profiler: + Profiling --------- diff --git a/docs/model-dev-guide/api-guides/apis-howto/deepspeed/deepspeed.rst b/docs/model-dev-guide/api-guides/apis-howto/deepspeed/deepspeed.rst index 4de1d42c123..9be581ae08e 100644 --- a/docs/model-dev-guide/api-guides/apis-howto/deepspeed/deepspeed.rst +++ b/docs/model-dev-guide/api-guides/apis-howto/deepspeed/deepspeed.rst @@ -303,6 +303,8 @@ interleaving micro batches: loss = self.model_engine.eval_batch() return {"loss": loss} +.. _deepspeed-profiler: + *********** Profiling *********** diff --git a/docs/model-dev-guide/dtrain/_index.rst b/docs/model-dev-guide/dtrain/_index.rst index 4d2adb49da0..f2812582437 100644 --- a/docs/model-dev-guide/dtrain/_index.rst +++ b/docs/model-dev-guide/dtrain/_index.rst @@ -33,8 +33,6 @@ Additional Resources: - Learn how :ref:`Configuration Templates ` can help reduce redundancy. - Discover how Determined aims to support reproducible machine learning experiments in :ref:`Reproducibility `. 
-- In :ref:`Optimizing Training `, you'll learn about out-of-the box tools you - can use for instrumenting training. .. toctree:: :caption: Distributed Training @@ -44,4 +42,3 @@ Additional Resources: Implementing Distributed Training Configuration Templates Reproducibility - Optimizing Training diff --git a/docs/model-dev-guide/dtrain/optimize-training.rst b/docs/model-dev-guide/dtrain/optimize-training.rst deleted file mode 100644 index 5d478a0eabb..00000000000 --- a/docs/model-dev-guide/dtrain/optimize-training.rst +++ /dev/null @@ -1,86 +0,0 @@ -.. _optimizing-training: - -##################### - Optimizing Training -##################### - -When optimizing the training speed of a model, the first step is to understand where and why -training is slow. Once the bottlenecks have been identified, the next step is to do further -investigation and experimentation to alleviate those bottlenecks. - -To understand the performance profile of a training job, the training code and infrastructure need -to be instrumented. Many different layers can be instrumented, from raw throughput all the way down -to GPU kernels. - -Determined provides two tools out-of-the-box for instrumenting training: - -- :ref:`System Metrics `: measurements of hardware usage -- :ref:`Timings `: durations of actions taken during training, such as - data loading - -System Metrics are useful to see if the software is taking full advantage of the available hardware, -particularly around GPU usage, data loading, and network communication during distributed training. -Timings are useful for identifying the section of code to focus on for optimizations. Most commonly, -Timings help answer the question of whether the dataloader is the main bottleneck in training. - -.. _how-to-profiling: - -.. _how-to-profiling-system-metrics: - -**************** - System Metrics -**************** - -System Metrics are statistics around hardware usage, such as GPU utilization and network throughput. -These metrics are useful for seeing whether training is using the hardware effectively. When the -System Metrics reported for an experiment are below what is expected from the hardware, that is a -sign that the software may be able to be optimized to make better use of the hardware resources. - -Specifically, Determined tracks: - -- GPU utilization -- GPU free memory -- Network throughput (sent) -- Network throughput (received) -- Disk IOPS -- Disk throughput (read) -- Disk throughput (write) -- Host available memory -- CPU utilization averaged across cores - -For distributed training, these metrics are collected for every agent. The data are broken down by -agent, and GPU metrics can be further broken down by GPU. - -.. note:: - - System Metrics record agent-level metrics, so when there are multiple experiments on the same - agent, it is difficult to analyze. We suggest that profiling is done with only a single - experiment per agent. - -.. _how-to-profiling-timings: - -********* - Timings -********* - -The other type of profiling metric that Determined tracks is Timings. Timings are measurements of -how long specific training events take. Examples of training events include retrieving data from the -dataloader, moving data between host and device, running the forward/backward pass, and executing -callbacks. - -.. note:: - - Timings are currently only supported for ``PyTorchTrial``. - -These measurements provide a high-level picture of where to focus optimization efforts. 
-Specifically, Determined tracks the following Timings: - -- ``dataloader_next``: time to retrieve the next item from the dataloader -- ``to_device``: time to transfer input from host to device -- ``train_batch``: how long the user-defined ``train_batch`` function takes to execute\* -- ``step_lr_schedulers``: amount of time to update the LR schedules -- ``from_device``: time to transfer output from device to host -- ``reduce_metrics``: time taken to calculate global metrics in distributed training - -\* ``train_batch`` is typically the forward pass and the backward pass, but it is a user-defined -function so it could include other steps. diff --git a/docs/model-dev-guide/profiling.rst b/docs/model-dev-guide/profiling.rst new file mode 100644 index 00000000000..ee16c5345cd --- /dev/null +++ b/docs/model-dev-guide/profiling.rst @@ -0,0 +1,110 @@ +.. _profiling: + +########### + Profiling +########### + +Optimizing a model's performance is often a trade-off between accuracy, time, and resource +requirements. Training deep learning models is a time and resource intensive process, where each +iteration can take several hours and accumulate heavy hardware costs. Though sometimes this cost is +inherent to the task, unnecessary resource consumption can be caused by suboptimal code or bugs. +Thus, achieving optimal model performance requires an understanding of how your model interacts with +the system's computational resources. + +Profiling collects metrics on how computational resources like CPU, GPU, and memory are being +utilized during a training job. It can reveal patterns in resource utilization that indicate +performance bottlenecks and pinpoint areas of the code or pipeline that are causing slowdowns or +inefficiencies. + +Profiling a training job can be instrumented at many different layers, from generic system-level +metrics to individual model operators and GPU kernels. Determined provides a few options for +profiling, each targeting a different layer in a training job at various levels of detail: + +- :ref:`Determined system metrics profiler ` collects general + system-level metrics and provides an overview of hardware usage during an experiment. +- :ref:`Native profiler integration ` enables model profiling in + training APIs that provides fine-grained metrics specific to your model. +- :ref:`Prometheus/Grafana integration ` can be set up to track + detailed hardware metrics and monitor overall cluster health. + +.. _how-to-profiling: + +.. _how-to-profiling-det-profiler: + +********************* + Determined Profiler +********************* + +Determined comes with a built-in profiler that provides out-of-the-box tracking for system-level +metrics. System metrics are statistics around hardware usage, such as GPU utilization, disk usage, +and network throughput. + +These metrics provide a general overview of resource usage during a training run, and can be useful +for quickly identifying ineffective usages of computational resources. When the system metrics +reported for an experiment do not match hardware expectations, that is a sign that the software may +be able to be optimized to make better use of the hardware resources. + +The Determined profiler collects a set of system metrics throughout an experiment which can be +visualized in the WebUI under the experiment's "Profiler" tab. It is supported for all training +APIs, but is not enabled by default. + +Visit :ref:`core-profiler` to find out how to enable and configure the Determined profiler for your +experiment. 
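+
+If you launch experiments from an experiment configuration file, the profiler can also be enabled
+there. The excerpt below is a minimal sketch; the ``profiling`` section of the experiment
+configuration reference lists the authoritative fields.
+
+.. code:: yaml
+
+   # Excerpt from an experiment configuration file.
+   profiling:
+     enabled: true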
+ +The following system metrics are tracked: + +- *GPU utilization (percent)*: current utilization of a GPU device +- *GPU free memory (bytes)*: current amount of free memory available on a GPU device +- *Network throughput - sent (bytes/s)*: change rate of number of bytes sent system-wide +- *Network throughput (received)*: change rate of number of bytes received system-wide +- *Disk IOPS (bytes/s)*: change rate of number of read + writes system-wide +- *Disk throughput - reads (bytes/s)*: change rate of bytes read system-wide +- *Disk throughput - writes (bytes/s)*: change rate of bytes written system-wide +- *Host available memory (gigabytes)*: current amount of memory available (not including swap) + system-wide +- *CPU utilization (percent)*: current utilization of CPU cores, averaged across all cores in the + system + +For distributed training, these metrics are collected for every agent. The data is broken down by +agent, and GPU metrics can be further broken down by GPU. + +.. note:: + + System Metrics record agent-level metrics, so when there are multiple experiments on the same + agent, it is difficult to analyze. It is recommended that profiling is done with only a single + experiment per agent. + +.. _how-to-profiling-native-profilers: + +*************************** + Native Training Profilers +*************************** + +Sometimes system-level profiling doesn't capture enough data to help debug bottlenecks in model +training code. Identifying inefficiencies in individual training operations or steps requires a more +fine-grained context than generic system metrics can provide. For this level of profiling, +Determined supports integration with training profilers that are native to their frameworks: + +- PyTorch Profiler (:ref:`PyTorch API `) +- DeepSpeed Profiler (:ref:`DeepSpeed API `) +- TensorFlow Keras Profiler (:ref:`Keras API `) + +Please see your framework's profiler documentation and the Determined Training API guide for usage +details. + +.. _how-to-profiling-prom-grafana: + +************************************ + Prometheus and Grafana Integration +************************************ + +For a more resource-centric view of Determined jobs, Determined provides a Prometheus endpoint along +with a pre-configured Grafana dashboard. These can be set up to track detailed hardware usage +metrics for a Determined cluster, and can be helpful for alerting and monitoring cluster health. + +The Prometheus endpoint aggregates system metrics and associates them with Determined concepts such +as experiments, tags, and resource pools, which can be viewed in Grafana. Determined provides a +Grafana dashboard that shows real-time resource metrics across an entire cluster as well as +experiments, containers, and resource pools. + +Visit :ref:`configure-prometheus-grafana` to find out how to enable this functionality. diff --git a/docs/reference/experiment-config-reference.rst b/docs/reference/experiment-config-reference.rst index f537ae6caae..a851271f472 100644 --- a/docs/reference/experiment-config-reference.rst +++ b/docs/reference/experiment-config-reference.rst @@ -1483,38 +1483,19 @@ explicitly specified, the master will automatically generate an experiment seed. Profiling *********** -The ``profiling`` section specifies configuration options related to profiling experiments. See -:ref:`how-to-profiling` for a more detailed walkthrough. +The ``profiling`` section specifies configuration options for the Determined system metrics +profiler. 
See :ref:`how-to-profiling` for a more detailed walkthrough. ``profiling`` ============= -Optional. Profiling is supported for all frameworks, though timings are only collected for -``PyTorchTrial``. Profiles are collected for a maximum of 5 minutes, regardless of the settings -below. +Optional. Defaults to false. ``enabled`` ----------- Optional. Defines whether profiles should be collected or not. Defaults to false. -``begin_on_batch`` ------------------- - -Optional. Specifies the batch on which profiling should begin. - -``end_after_batch`` -------------------- - -Optional. Specifies the batch after which profiling should end. - -``sync_timings`` ----------------- - -Optional. Specifies whether Determined should wait for all GPU kernel streams before considering a -timing as ended. Defaults to 'true'. Applies only for frameworks that collect timing metrics -(currently just PyTorch). - .. _experiment-configuration_training_units: .. _slurm-config: