docs: better docs for profiling in determined [MD-252] #9011
@@ -508,13 +508,57 @@ In the Determined WebUI, go to the **Cluster** pane.

You should be able to see multiple slots active, corresponding to the value you set for
``slots_per_trial`` in ``distributed.yaml``, as well as logs appearing from multiple ranks.

.. _core-profiling:

***********
 Profiling
***********

There are two ways to profile the performance of your training job:

#. The Core API's built-in system metrics profiler

#. Integration with profilers native to your training framework, such as the TensorFlow and
   PyTorch profilers

.. _core-profiler:

Core API Profiler
=================

The Core API includes a profiling feature that monitors and records system metrics during the
training run. These metrics are recorded at specified intervals and sent to the master, allowing
you to view them in the "Profiling" tab of your experiment in the WebUI.

Use ``core_context.profiler`` to interact with the Core API profiler. It can be toggled on or off
by calling ``core_context.profiler.on()`` and ``core_context.profiler.off()``.

The following code snippet demonstrates how to enable profiling for only a portion of your
training code, but the profiler can be turned on and off at any point within the ``core.Context``.

.. code:: python

   import determined as det


   with det.core.init() as core_context:
       ...
       for batch_idx in range(1, 10):
           # In this example we just want to profile the first 5 batches.
           if batch_idx == 1:
               core_context.profiler.on()
           if batch_idx == 5:
               core_context.profiler.off()
           train_batch(...)

.. _core-native-profilers:

Native Profilers
================

Profiling with native profilers such as the PyTorch profiler and the TensorFlow profiler can be
configured as usual. If running on a Determined cluster, the profiling log output path can be
configured for automatic upload to the Determined TensorBoard UI.

The following snippet initializes the PyTorch Profiler. It will profile GPU and CPU activities,
skipping batch 1, warming up on batch 2, profiling batches 3 and 4, then repeating the cycle.
Result
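
A sketch of such an initialization, assuming the standard ``torch.profiler`` API; the
``./tb_logs`` output directory is a placeholder assumption, not from the original (on a
Determined cluster it would be the path configured for automatic TensorBoard upload):

```python
import torch.profiler

# Hedged configuration sketch, not the exact snippet from the docs.
profiler = torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    # wait=1: skip batch 1; warmup=1: warm up on batch 2;
    # active=2: record batches 3 and 4; the cycle then repeats.
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=2),
    # Placeholder path (an assumption) for the TensorBoard trace output.
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./tb_logs"),
)
```

Inside the training loop, ``profiler.step()`` would be called once per batch to advance the
schedule.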
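
To make the skip/warmup/record cycle concrete, here is a small illustrative pure-Python helper
(our own sketch, not part of Determined or PyTorch) that computes which phase each batch falls
into under a repeating wait=1 / warmup=1 / active=2 schedule:

```python
def profiler_phase(step: int, wait: int = 1, warmup: int = 1, active: int = 2) -> str:
    """Phase of a zero-indexed step under a repeating wait/warmup/active cycle."""
    pos = step % (wait + warmup + active)
    if pos < wait:
        return "skip"
    if pos < wait + warmup:
        return "warmup"
    return "record"

# Batches are 1-indexed in the prose: batch 1 is skipped, batch 2 warms up,
# batches 3 and 4 are recorded, then the cycle repeats from batch 5.
phases = [profiler_phase(b - 1) for b in range(1, 9)]
print(phases)
# ['skip', 'warmup', 'record', 'record', 'skip', 'warmup', 'record', 'record']
```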