-
Notifications
You must be signed in to change notification settings - Fork 357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: profiling v2 [MD-27] #9032
Conversation
✅ Deploy Preview for determined-ui canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WebUI stamp
docs/model-dev-guide/profiling.rst
Outdated
|
||
System Metrics record agent-level metrics, so when there are multiple experiments on the same | ||
agent, it is difficult to analyze. It is recommended that profiling is done with only a single | ||
experiment per agent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a place to reference how this can be configured.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can't really be configured. i left this warning from the previous doc because it's still relevant, but it has to do with being aware that there could be other experiments using the same agent as your experiment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be frustrating to me to read this note and then be unable to figure out how to do the thing it recommends. Let's chat about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed this recommendation
Optional. Profiling is supported for all frameworks, though timings are only collected for | ||
``PyTorchTrial``. Profiles are collected for a maximum of 5 minutes, regardless of the settings | ||
below. | ||
Optional. Defaults to false. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not being purposefully dense, I can't differentiate between:
profiling:
profiler: <val>
enabled: <val>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed this section
harness/determined/core/_profiler.py
Outdated
|
||
Supports up to 1 level of nesting. Returns a single merged dictionary where the values are | ||
averaged across all dictionaries in the given list by key. | ||
# TODO (anda): find a cleaner way to do this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is important to you, could you please create a ticket and reference the ticket in the TODO instead of yourself? And if it's not important enough for that, then please remove the TODO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've created the ticket https://hpe-aiatscale.atlassian.net/browse/MD-338. i will add it to the comment but don't want to commit right now because i'm waiting for a longrunning CI to finish.
for sample in metric_samples: | ||
for k, v in sample.items(): | ||
if isinstance(v, dict): | ||
aggregated_metrics[k] = aggregated_metrics.get(k, {}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: little clearer with a defaultdict
, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aggregated_metrics = defaultdict(int)
aggregated_metrics = defaultdict(defaultdict(int))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eh, don't love the way this reads. .get(k, {})
is easily readable IMO. i do think defaultdict
is best-practices way of doing this, but since i have a ticket to refactor this method anyway, i don't think it's worth it.
harness/determined/core/_profiler.py
Outdated
|
||
|
||
class _Network(_MetricGroupCollector): | ||
group = "network" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little surprised this works. I'd have expected that it had to look like
@property
def group:
return "network"
My typechecker doesn't like it, either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little unsettled by this. I don't think the implemented code works, either, and that it hasn't resulted in failed tests makes me wonder if something is wrong with tests.
currently implemented:
def group(self) -> str:
return "network"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #9032 +/- ##
==========================================
+ Coverage 47.70% 47.79% +0.08%
==========================================
Files 1166 1166
Lines 143876 143603 -273
Branches 2379 2377 -2
==========================================
- Hits 68636 68630 -6
+ Misses 75081 74814 -267
Partials 159 159
Flags with carried forward coverage won't be shown. Click here to find out more.
|
[ | ||
(conf.fixtures_path("mnist_pytorch"), True), | ||
], | ||
"model_def", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I don't understand the parameterization. I think it's cleaner without it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
961dbf5
to
f591b31
Compare
generic_metrics: - DB schema changes - Changes to backend ReportTrialMetrics APIs
* Refactor ProfilerAgent into Core API
remove throughput and timing metric views on web UI for profiler tab
* chore: aggregate profiling metrics before reporting
…MD-301] (#8970) * Migrate existing profiler metrics: - historical data migration `trial_profiler_metrics` -> `metrics` - shim existing trial profiler metrics APIs to fetch from `metrics`
* docs: better docs for profiling
* chore: remove profiling not enabled check in web UI
ae2f2ad
to
55bc58d
Compare
optimize migrations on metrics table (landed as part of #9032)
Description
Profiling V2 (Project Doc)
Individual commits should have already been reviewed, this is the final feature branch to main PR.
Major changes:
Test Plan
This feature should be tested across a few different Trial APIs
PyTorch
For
mnist_pytorch
:determined/examples/tutorials/mnist_pytorch/train.py
to addprofiling_enabled
:TF Keras
determined/examples/computer_vision/iris_tf_keras/distributed.yaml
, enabled profiling:Core API
Create a Core API script and expconf.
Experiment Config:
profiling.py
Commentary (optional)
Checklist
docs/release-notes/
.See Release Note for details.
Ticket