docs: better docs for profiling in determined [MD-252] #9011
Conversation
+------------------------------------------------+------------------------------------------------------------------------------+
| Profiling | Optimize your model's performance and diagnose bottlenecks with |
| | comprehensive profiling support across different layers deployment, from |
| | out-of-the-box system metrics tracking to seamless integrations with native |
Should this be "across different deployment layers"?
#. Core API's built-in system metrics profiler

#. Integration with profilers native to your training framework, such as the Tensorflow and PyTorch
   profilers
TensorFlow
Profiling with native profilers such as PyTorch profiler and TensorFlow profiler can be configured
as usual. If running on a Determined cluster, the profiling log output path can be configured for
automatic upload to the Determined Tensorboard UI.
TensorBoard
The Determined profiler collects a set of system metrics throughout an experiment which can be
visualized in the Web UI under the experiment's "Profiler" tab. It is supported for all training
APIs, but is not enabled by default.
WebUI
Please see :ref:`core-profiler` for details on enabling and configuring the Determined profiler for
your experiment.
Visit :ref:`core-profiler` to find out how to enable and configure the Determined profiler for
your experiment.
agent, it is difficult to analyze. We suggest that profiling is done with only a single
experiment per agent.

.. _how-to-profiling-native-profilers:
.. note::

   System Metrics record agent-level metrics, so when there are multiple experiments on the same
   agent, it is difficult to analyze. It is recommended that profiling is done with only a single
   experiment per agent.
as experiments, tags, and resource pools, which can be viewed in Grafana. We provide a Grafana
dashboard that shows real-time resource metrics across an entire cluster as well as experiments,
containers, and resource pools.
The Prometheus endpoint aggregates system metrics and associates them with Determined concepts such
as experiments, tags, and resource pools, which can be viewed in Grafana. Determined provides a Grafana
dashboard that shows real-time resource metrics across an entire cluster as well as experiments,
containers, and resource pools.
containers, and resource pools.

Please follow :ref:`configure-prometheus-grafana` for instructions on how to enable this
functionality.
Visit :ref:`configure-prometheus-grafana` to find out how to enable this functionality.
Core API comes with profiling functionality that tracks system metrics throughout the training run.
These metrics are recorded at specified intervals to the master and can be viewed in the Web UI
under your experiment's "Profiling" tab.
The Core API includes a profiling feature that monitors and records system metrics during the training run. These metrics are recorded at specified intervals and sent to the master, allowing you to view them in the "Profiling" tab of your experiment in the WebUI.
suggested edits
* docs: better docs for profiling
* docs: better docs for profiling
* docs: better docs for profiling
* docs: better docs for profiling
Description
Better docs for profiling in Determined
Rewrote most of the existing profiler documentation in https://docs.determined.ai/latest/model-dev-guide/dtrain/optimize-training.html. My intention is for that doc to be the centralized location for "profiling within Determined", so I renamed it to "Profiling" and moved it out from under Distributed Training (because it's not exclusive to dtrain).
Also some Trial API docs were changed as part of the ongoing Profiling V2 project, which is why this is getting merged to a feature branch.
Test Plan
Commentary (optional)
Related PR: #9034
Checklist
docs/release-notes/. See Release Note for details.
Ticket