docs: better docs for profiling in determined [MD-252] #9011

Merged
azhou-determined merged 5 commits into profiling-v2 from profiling-docs
Mar 19, 2024

@azhou-determined azhou-determined commented Mar 15, 2024

Description

Better docs for profiling in determined

Rewrote most of the existing profiler documentation in https://docs.determined.ai/latest/model-dev-guide/dtrain/optimize-training.html. My intention is for that doc to be the centralized location for "profiling within Determined", so I renamed it to "Profiling" and moved it out from under Distributed Training (profiling is not exclusive to dtrain).

Also some Trial API docs were changed as part of the ongoing Profiling V2 project, which is why this is getting merged to a feature branch.

Test Plan

Commentary (optional)

Related PR: #9034

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

@cla-bot cla-bot bot added the cla-signed label Mar 15, 2024
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Mar 15, 2024
@determined-ci determined-ci requested a review from a team March 15, 2024 18:30
+------------------------------------------------+------------------------------------------------------------------------------+
| Profiling | Optimize your model's performance and diagnose bottlenecks with |
| | comprehensive profiling support across different layers deployment, from |
| | out-of-the-box system metrics tracking to seamless integrations with native |

Should this be "across different deployment layers"?

#. Core API's built-in system metrics profiler

#. Integration with profilers native to your training framework, such as the Tensorflow and PyTorch
profilers

TensorFlow


Profiling with native profilers such as PyTorch profiler and TensorFlow profiler can be configured
as usual. If running on a Determined cluster, the profiling log output path can be configured for
automatic upload to the Determined Tensorboard UI.

TensorBoard


The Determined profiler collects a set of system metrics throughout an experiment which can be
visualized in the Web UI under the experiment's "Profiler" tab. It is supported for all training
APIs, but is not enabled by default.

WebUI


Please see :ref:`core-profiler` for details on enabling and configuring the Determined profiler for
your experiment.


Visit :ref:`core-profiler` to find out how to enable and configure the Determined profiler for
your experiment.

agent, it is difficult to analyze. We suggest that profiling is done with only a single
experiment per agent.

.. _how-to-profiling-native-profilers:

.. note::

   System Metrics record agent-level metrics, so when there are multiple experiments on the same
   agent, it is difficult to analyze. It is recommended that profiling is done with only a single
   experiment per agent.

as experiments, tags, and resource pools, which can be viewed in Grafana. We provide a Grafana
dashboard that shows real-time resource metrics across an entire cluster as well as experiments,
containers, and resource pools.


The Prometheus endpoint aggregates system metrics and associates them with Determined concepts such
as experiments, tags, and resource pools, which can be viewed in Grafana. Determined provides a Grafana
dashboard that shows real-time resource metrics across an entire cluster as well as experiments,
containers, and resource pools.

containers, and resource pools.

Please follow :ref:`configure-prometheus-grafana` for instructions on how to enable this
functionality.

Visit :ref:`configure-prometheus-grafana` to find out how to enable this
functionality.


Core API comes with profiling functionality that tracks system metrics throughout the training run.
These metrics are recorded at specified intervals to the master and can be viewed in the Web UI
under your experiment's "Profiling" tab.

The Core API includes a profiling feature that monitors and records system metrics during the training run. These metrics are recorded at specified intervals and sent to the master, allowing you to view them in the "Profiling" tab of your experiment in the WebUI.
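
As a rough illustration of what "recorded at specified intervals and sent to the master" can look like, here is a minimal, framework-independent sketch. It is not Determined's implementation or API: `IntervalSampler`, `read_metrics`, `report`, `sampling_interval`, and `samples_per_report` are all hypothetical names chosen for this example, and the "master" is stood in for by a plain callback.

```python
import time
from typing import Callable, Dict, List

Sample = Dict[str, float]


class IntervalSampler:
    """Collects metric samples at a fixed interval and reports them in batches.

    Hypothetical helper for illustration only; not part of any real library.
    """

    def __init__(
        self,
        read_metrics: Callable[[], Sample],
        report: Callable[[List[Sample]], None],
        sampling_interval: float = 1.0,
        samples_per_report: int = 10,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._read_metrics = read_metrics
        self._report = report
        self._sampling_interval = sampling_interval
        self._samples_per_report = samples_per_report
        self._sleep = sleep  # injectable so tests do not have to wait
        self._buffer: List[Sample] = []

    def run(self, num_samples: int) -> None:
        """Take num_samples samples, reporting whenever a batch is full."""
        for _ in range(num_samples):
            self._buffer.append(self._read_metrics())
            if len(self._buffer) >= self._samples_per_report:
                self._flush()
            self._sleep(self._sampling_interval)
        # Send whatever is left so no samples are dropped at shutdown.
        self._flush()

    def _flush(self) -> None:
        if self._buffer:
            self._report(list(self._buffer))
            self._buffer.clear()


def read_process_metrics() -> Sample:
    """Stand-in metric source: wall-clock time and CPU seconds of this process."""
    return {"timestamp": time.time(), "cpu_seconds": time.process_time()}


if __name__ == "__main__":
    # Print each batch instead of sending it to a server.
    sampler = IntervalSampler(
        read_process_metrics,
        report=lambda batch: print(f"reporting {len(batch)} samples"),
        sampling_interval=0.01,
        samples_per_report=2,
    )
    sampler.run(5)
```

Batching several samples per report trades a little latency in the UI for fewer requests to the server, which is the usual reason a profiler exposes both an interval and a batch size.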


@tara-hpe tara-hpe left a comment


suggested edits

@determined-ci determined-ci requested a review from a team March 15, 2024 20:10
@azhou-determined azhou-determined merged commit dc51fec into profiling-v2 Mar 19, 2024
59 of 75 checks passed
@azhou-determined azhou-determined deleted the profiling-docs branch March 19, 2024 15:28
azhou-determined added a commit that referenced this pull request Mar 20, 2024
azhou-determined added a commit that referenced this pull request Mar 26, 2024
azhou-determined added a commit that referenced this pull request Mar 26, 2024
azhou-determined added a commit that referenced this pull request Mar 26, 2024
Labels
cla-signed documentation Improvements or additions to documentation
3 participants