feat: profiling v2 [MD-27] #9032

Merged · 23 commits merged into main on Mar 26, 2024

Conversation

@azhou-determined (Contributor) commented Mar 20, 2024

Description

Profiling V2 (Project Doc)

Individual commits have already been reviewed; this is the final feature-branch-to-main PR.

Major changes:

  • Functionality for timing metrics in the Determined profiler was dropped; the Determined profiler is now only responsible for system metrics.
  • The Determined profiler now lives in Core API (see the sketch below).
  • The Determined profiler's system metrics now use the backend's generic metrics framework.

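Since the profiler now lives in Core API, it is reached through the core context from any Trial API. A minimal sketch of the new surface (argument defaults are assumed here; the Core API example in the test plan below passes sampling_interval and samples_per_report explicitly):

```python
from determined import core

with core.init() as core_context:
    core_context.profiler.on()   # start collecting system metrics
    # ... training loop ...
    core_context.profiler.off()  # stop collecting
```
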
Test Plan

This feature should be tested across a few different Trial APIs:

PyTorch

For mnist_pytorch:

  • Change determined/examples/tutorials/mnist_pytorch/train.py to add profiling_enabled (a context sketch follows this list):

    ```python
    trainer.fit(max_length=max_length, latest_checkpoint=latest_checkpoint, profiling_enabled=True)
    ```
  • Submit the experiment. Go to the "Profiler" tab for that experiment in the Web UI. Verify that "System Metrics" is rendered with metric values.
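
For context, a sketch of where that call sits (the surrounding structure is assumed from the mnist_pytorch tutorial; only profiling_enabled=True is new here):

```python
import determined.pytorch as pytorch

def run(max_length, latest_checkpoint, hparams):
    with pytorch.init() as train_context:
        # MNistTrial is the PyTorchTrial subclass defined in the tutorial's train.py.
        trial = MNistTrial(train_context, hparams=hparams)
        trainer = pytorch.Trainer(trial, train_context)
        trainer.fit(
            max_length=max_length,
            latest_checkpoint=latest_checkpoint,
            profiling_enabled=True,  # the flag under test
        )
```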

TF Keras

  • In determined/examples/computer_vision/iris_tf_keras/distributed.yaml, enable profiling:

    ```yaml
    profiling:
      enabled: true
    ```
  • Submit the experiment. Go to the "Profiler" tab for that experiment in the Web UI. Verify that "System Metrics" is rendered with metric values.

Core API

Create a Core API script and expconf.

Experiment Config:

```yaml
name: profiling
entrypoint: python3 profiling.py

searcher:
  name: single
  metric: x
  max_length: 1

max_restarts: 0
```

profiling.py:

```python
import logging
import time

import determined as det
from determined import core


def main(core_context):
    # Start system-metrics profiling: sample every 0.1s, report every 10 samples.
    core_context.profiler.on(sampling_interval=0.1, samples_per_report=10)
    for batch in range(60 * 5):  # ~5 minutes of work at 1 second per step
        steps_completed = batch + 1
        if steps_completed % 5 == 0:
            core_context.train.report_training_metrics(
                steps_completed=steps_completed, metrics={"x": batch}
            )
        if steps_completed % 10 == 0:
            core_context.train.report_validation_metrics(
                steps_completed=steps_completed, metrics={"x": batch}
            )
        time.sleep(1)
    core_context.profiler.off()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format=det.LOG_FORMAT)
    with core.init() as core_context:
        main(core_context=core_context)
```
  • Submit the experiment. Go to the "Profiler" tab for that experiment in the Web UI. Verify that "System Metrics" is rendered with metric values.
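
Assuming the config and script above are saved as profiling.yaml and profiling.py in the same directory, the experiment can be submitted with the standard CLI invocation `det experiment create profiling.yaml .`.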

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

@azhou-determined azhou-determined requested review from a team as code owners March 20, 2024 21:22
@cla-bot cla-bot bot added the cla-signed label Mar 20, 2024
@determined-ci determined-ci requested a review from a team March 20, 2024 21:22
@determined-ci determined-ci added the documentation label Mar 20, 2024
netlify bot commented Mar 20, 2024

Deploy Preview for determined-ui canceled.

🔨 Latest commit: 9806379
🔍 Latest deploy log: https://app.netlify.com/sites/determined-ui/deploys/66033597fe907b00080a46b8

@gt2345 (Contributor) left a comment:

WebUI stamp


```
System Metrics record agent-level metrics, so when there are multiple experiments on the same
agent, it is difficult to analyze. It is recommended that profiling is done with only a single
experiment per agent.
```
Contributor:

This could be a place to reference how this can be configured.

Contributor (author):

This can't really be configured. I left this warning from the previous doc because it's still relevant; it's about being aware that other experiments could be using the same agent as your experiment.

Contributor:

I think it would be frustrating to me to read this note and then be unable to figure out how to do the thing it recommends. Let's chat about it.

Contributor (author):

removed this recommendation

```
Optional. Profiling is supported for all frameworks, though timings are only collected for
``PyTorchTrial``. Profiles are collected for a maximum of 5 minutes, regardless of the settings
below.

Optional. Defaults to false.
```
Contributor:

What does it do?

Contributor:

Not being purposefully dense, I can't differentiate between:

```yaml
profiling:
  profiler: <val>
  enabled: <val>
```

Contributor (author):

removed this section


```
Supports up to 1 level of nesting. Returns a single merged dictionary where the values are
averaged across all dictionaries in the given list by key.
# TODO (anda): find a cleaner way to do this.
```
Contributor:

If this is important to you, could you please create a ticket and reference the ticket in the TODO instead of yourself? And if it's not important enough for that, then please remove the TODO.

Contributor (author):

I've created the ticket https://hpe-aiatscale.atlassian.net/browse/MD-338. I will add it to the comment, but I don't want to commit right now because I'm waiting for a long-running CI to finish.

```python
for sample in metric_samples:
    for k, v in sample.items():
        if isinstance(v, dict):
            aggregated_metrics[k] = aggregated_metrics.get(k, {})
```
Contributor:

nit: little clearer with a defaultdict, I think.

Contributor:

```python
aggregated_metrics = defaultdict(int)
# or, for the nested case (note the lambda factory):
aggregated_metrics = defaultdict(lambda: defaultdict(int))
```

Contributor (author):

Eh, don't love the way this reads. `.get(k, {})` is easily readable IMO. I do think defaultdict is the best-practices way of doing this, but since I have a ticket to refactor this method anyway, I don't think it's worth it.
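
For illustration, a hypothetical sketch of the defaultdict variant under discussion (function name and metric shapes are assumptions, not the actual implementation; note the nested case needs a lambda factory):

```python
from collections import defaultdict
from typing import Any, Dict, List


def average_metric_samples(metric_samples: List[Dict[str, Any]]) -> Dict[str, Any]:
    # Merge a list of metric dicts (up to one level of nesting),
    # averaging values across samples by key.
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    nested_sums: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    nested_counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

    for sample in metric_samples:
        for k, v in sample.items():
            if isinstance(v, dict):
                for ik, iv in v.items():
                    nested_sums[k][ik] += iv
                    nested_counts[k][ik] += 1
            else:
                sums[k] += v
                counts[k] += 1

    merged: Dict[str, Any] = {k: sums[k] / counts[k] for k in sums}
    for k in nested_sums:
        merged[k] = {ik: nested_sums[k][ik] / nested_counts[k][ik] for ik in nested_sums[k]}
    return merged


assert average_metric_samples([{"gpu": 0.5, "net": {"in": 1.0}}, {"gpu": 1.5}]) == {
    "gpu": 1.0,
    "net": {"in": 1.0},
}
```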



```python
class _Network(_MetricGroupCollector):
    group = "network"
```
Contributor:

I'm a little surprised this works. I'd have expected that it had to look like

```python
@property
def group(self) -> str:
    return "network"
```

My typechecker doesn't like it, either.

Contributor:

I'm a little unsettled by this. I don't think the implemented code works, either, and the fact that it hasn't resulted in failed tests makes me wonder if something is wrong with the tests.

Currently implemented:

```python
    def group(self) -> str:
        return "network"
```

Contributor (author):

fixed
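
For anyone following along, a self-contained toy spelling out the three variants from this thread (the real class derives from _MetricGroupCollector):

```python
class WithAttribute:
    # Plain class attribute: .group is the string "network".
    group = "network"


class WithBareMethod:
    # Method without @property (the bug spotted above): .group is a
    # bound method object, not a string, so callers comparing it to
    # "network" silently get the wrong answer.
    def group(self) -> str:
        return "network"


class WithProperty:
    # @property: attribute access invokes the getter and returns the string.
    @property
    def group(self) -> str:
        return "network"


assert WithAttribute().group == "network"
assert callable(WithBareMethod().group)  # a method, not a str
assert WithProperty().group == "network"
```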


codecov bot commented Mar 22, 2024

Codecov Report

Attention: Patch coverage is 48.89241%, with 323 lines in your changes missing coverage. Please review.

Project coverage is 47.79%. Comparing base (1992c97) to head (9806379).
Report is 1 commit behind head on main.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #9032      +/-   ##
==========================================
+ Coverage   47.70%   47.79%   +0.08%
==========================================
  Files        1166     1166
  Lines      143876   143603     -273
  Branches     2379     2377       -2
==========================================
- Hits        68636    68630       -6
+ Misses      75081    74814     -267
  Partials      159      159
```
| Flag    | Coverage Δ |
|---------|------------|
| backend | 42.83% <10.42%> (-0.14%) ⬇️ |
| harness | 64.34% <61.87%> (+0.59%) ⬆️ |
| web     | 40.74% <80.00%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|-------|------------|
| harness/determined/_trial_controller.py | 83.33% <100.00%> (+5.15%) ⬆️ |
| harness/determined/core/__init__.py | 100.00% <100.00%> (ø) |
| harness/determined/keras/callbacks.py | 91.50% <100.00%> (+0.35%) ⬆️ |
| ...determined/pytorch/deepspeed/_deepspeed_context.py | 80.62% <100.00%> (-0.19%) ⬇️ |
| harness/tests/experiment/pytorch_utils.py | 95.58% <ø> (ø) |
| master/pkg/model/metrics.go | 66.66% <ø> (ø) |
| ...es/ExperimentDetails/ExperimentSingleTrialTabs.tsx | 85.55% <100.00%> (-0.08%) ⬇️ |
| ...react/src/pages/TrialDetails/Profiles/Profiler.tsx | 70.31% <ø> (+1.19%) ⬆️ |
| ...bui/react/src/pages/TrialDetails/Profiles/utils.ts | 62.85% <100.00%> (-1.04%) ⬇️ |
| harness/determined/keras/_tf_keras_trial.py | 82.83% <85.71%> (+1.46%) ⬆️ |

... and 16 more files, and 10 files with indirect coverage changes.

```python
[
    (conf.fixtures_path("mnist_pytorch"), True),
],
"model_def",
```
Contributor:

nit: I don't understand the parameterization. I think it's cleaner without it.

Contributor (author):

done

@azhou-determined azhou-determined force-pushed the profiling-v2 branch 2 times, most recently from 961dbf5 to f591b31, on March 26, 2024 16:59
azhou-determined and others added 16 commits March 26, 2024 10:32
generic_metrics:
- DB schema changes
- Changes to backend ReportTrialMetrics APIs
remove throughput and timing metric views on web UI for profiler tab
* chore: aggregate profiling metrics before reporting
…MD-301] (#8970)

* Migrate existing profiler metrics:
- historical data migration `trial_profiler_metrics` -> `metrics`
- shim existing trial profiler metrics APIs to fetch from `metrics`
* chore: remove profiling not enabled check in web UI
@azhou-determined azhou-determined merged commit 74fe16b into main Mar 26, 2024
92 of 97 checks passed
@azhou-determined azhou-determined deleted the profiling-v2 branch March 26, 2024 22:07
azhou-determined added a commit that referenced this pull request Mar 27, 2024
optimize migrations on metrics table (landed as part of #9032)
dai-release bot pushed a commit that referenced this pull request Mar 27, 2024
optimize migrations on metrics table (landed as part of #9032)

(cherry picked from commit a07f0fb)
Labels: cla-signed, documentation