adding hardware usage and software packages tracker #2195

Merged on Jul 15, 2022 (25 commits)

abidwael
Contributor

This provides a new Tracker class to track hardware usage and software packages while a block of code is being executed.

Usage:

with Tracker(
    tag='train',
    output_dir=model.config['backend']['cache_dir'],
    num_batches=model.config[TRAINER]["batch_size"],
    num_examples=len(training_set),
) as tracker:
    # code block to track
    ...

On exit, this saves the hardware and software usage metrics to f"{output_dir}/{tag}_metrics.json".
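To make the context-manager pattern concrete, here is a minimal sketch using only the standard library. `MiniTracker` is a hypothetical stand-in, not the Tracker class from this PR: it records only wall time, CPU time, and the Python version, and it assumes nothing about the real class beyond the `tag`/`output_dir` arguments and the output path described above.

```python
import json
import os
import platform
import sys
import tempfile
import time


class MiniTracker:
    """Hypothetical stand-in for the PR's Tracker (stdlib only)."""

    def __init__(self, tag, output_dir):
        self.tag = tag
        self.output_dir = output_dir
        self.metrics = {}

    def __enter__(self):
        # Record starting points so __exit__ can compute deltas.
        self._wall_start = time.perf_counter()
        self._cpu_start = time.process_time()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.metrics = {
            "tag": self.tag,
            "wall_seconds": time.perf_counter() - self._wall_start,
            "cpu_seconds": time.process_time() - self._cpu_start,
            "python_version": platform.python_version(),
            "platform": sys.platform,
        }
        os.makedirs(self.output_dir, exist_ok=True)
        path = os.path.join(self.output_dir, f"{self.tag}_metrics.json")
        with open(path, "w") as f:
            json.dump(self.metrics, f, indent=2)
        return False  # never swallow exceptions from the tracked block


# Usage: track a placeholder workload and read the metrics back.
output_dir = tempfile.mkdtemp()
with MiniTracker(tag="train", output_dir=output_dir) as tracker:
    total = sum(i * i for i in range(100_000))  # placeholder workload

metrics_path = os.path.join(output_dir, "train_metrics.json")
with open(metrics_path) as f:
    saved = json.load(f)
```

Writing the file in `__exit__` means the metrics are saved even when the tracked block raises, which is usually what you want for benchmarking artifacts.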

@abidwael abidwael requested a review from ShreyaR June 27, 2022 07:19
github-actions bot commented Jun 27, 2022

Unit Test Results

6 files (+1), 6 suites (+1), duration 2h 37m 9s ⏱️ (+36m 43s)
2 913 tests (-16): 2 868 ✔️ passed (-15), 45 💤 skipped (-1), 0 failed (±0)
8 739 runs (+123): 8 600 ✔️ passed (+103), 139 💤 skipped (+20), 0 failed (±0)

Results for commit c18b6bd. ± Comparison against base commit 9c58d5e.

♻️ This comment has been updated with latest results.

Contributor

@ShreyaR ShreyaR left a comment

Thanks for putting up this PR! Left some comments.

One general question I have is what the overhead of using Tracker is. If it isn't too expensive or slow to run Tracker, then it may make sense to use it by default for any Ludwig training process. I can see it being quite useful if we have a json with all the benchmarking stats generated in the model artifacts folder anytime we do training/evaluation.

ludwig/utils/tracker.py Outdated Show resolved Hide resolved
ludwig/utils/misc_utils.py Outdated Show resolved Hide resolved
@@ -0,0 +1,3 @@
experiment_impact_tracker
gpustat
Contributor Author

Should this be a separate requirements_tracker.txt file or do I need to add it to the main requirements.txt file?

Contributor

I'd be ok with adding this to the main requirements.txt file, especially if hardware resource usage tracking adds marginal overhead.

Curious about other people's opinions on this: @dantreiman @w4nderlust @tgaddair

@abidwael
Contributor Author

@ShreyaR I ran model.experiment with and without Tracker for ames_housing and mercedes_benz_greener and collected the total CPU and RAM usage for all processes running on the machine at the time of execution. Here are the results:

[Screenshot: total CPU and RAM usage with and without Tracker for ames_housing and mercedes_benz_greener]

CPU usage seems to be more or less unaffected, but there is some RAM overhead. In my opinion, it's worth adding an optional Tracker context manager, per @justinxzhao's suggestion.

@abidwael abidwael requested a review from justinxzhao July 12, 2022 10:38
ludwig/benchmarking/tracker.py Outdated Show resolved Hide resolved
@abidwael abidwael marked this pull request as ready for review July 12, 2022 18:23
@abidwael abidwael requested a review from ShreyaR July 12, 2022 18:25
time.sleep(logging_interval)


class ResourceUsageTracker:
Contributor

Discussed offline: Add a basic unit test that shows how this class can/should be used.
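As an illustration of what such a test could look like, here is a sketch built around a polling loop like the `time.sleep(logging_interval)` line in the diff context. `PollingTracker` is a hypothetical stand-in written for this example; it is not the PR's ResourceUsageTracker, and its attributes (`samples`, `logging_interval`) are assumptions.

```python
import threading
import time
import unittest


class PollingTracker:
    """Hypothetical stand-in, not the PR's ResourceUsageTracker."""

    def __init__(self, logging_interval=0.01):
        self.logging_interval = logging_interval
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._monitor, daemon=True)

    def _monitor(self):
        while True:
            # CPU time consumed by this process so far, in seconds.
            self.samples.append(time.process_time())
            # Event.wait acts as an interruptible sleep between samples.
            if self._stop.wait(self.logging_interval):
                break

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._stop.set()
        self._thread.join()
        return False


class TestPollingTracker(unittest.TestCase):
    def test_collects_at_least_one_sample(self):
        with PollingTracker(logging_interval=0.01) as tracker:
            sum(i * i for i in range(50_000))  # placeholder workload
        self.assertGreaterEqual(len(tracker.samples), 1)

    def test_monitor_thread_stops_on_exit(self):
        with PollingTracker() as tracker:
            pass
        self.assertFalse(tracker._thread.is_alive())
```

Sampling before waiting guarantees at least one sample even for an empty block, which keeps the test free of timing assumptions.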

@abidwael
Contributor Author

The pre-commit.ci check will not pass because of the following block

# disabling print because the following imports are verbose
f = open(os.devnull, "w")
sys.stdout = f
from experiment_impact_tracker.cpu.common import get_my_cpu_info
from experiment_impact_tracker.gpu.nvidia import get_gpu_info
from experiment_impact_tracker.py_environment.common import get_python_packages_and_versions

f.close()
sys.stdout = sys.__stdout__

I'm temporarily redirecting stdout because the import statement is verbose.

Made a PR in the original repo: Breakend/experiment-impact-tracker#74
Will follow up with the maintainer.
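For comparison, the standard library's `contextlib.redirect_stdout` offers a scoped way to silence output. The sketch below uses `noisy_setup`, a hypothetical function standing in for the verbose imports, so that it runs without experiment_impact_tracker installed. Whether this approach also avoids the linter findings on the real module is untested here.

```python
import contextlib
import io
import sys


def noisy_setup():
    """Hypothetical stand-in for an import that prints on load."""
    print("verbose banner that we do not want in the logs")
    return "ready"


original_stdout = sys.stdout
sink = io.StringIO()

# Output printed inside the block goes to `sink` instead of the console.
with contextlib.redirect_stdout(sink):
    status = noisy_setup()

# On exit the previous sys.stdout is restored, even if the block raised.
stdout_restored = sys.stdout is original_stdout
```

Unlike assigning `sys.stdout = sys.__stdout__`, the context manager restores whatever stream was active before the block, which matters when stdout is already being captured (for example by a test runner). It only affects Python-level writes to `sys.stdout`.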

Contributor

@justinxzhao justinxzhao left a comment

Changes LGTM, looks like there's a few last errors to resolve:

From pre-commit:

ludwig/benchmarking/resource_usage_tracker.py:24: [E402] module level import not at top of file
ludwig/benchmarking/resource_usage_tracker.py:25: [E402] module level import not at top of file
ludwig/benchmarking/resource_usage_tracker.py:26: [E402] module level import not at top of file

Finally, could you check the unit test you added? It looks like it's failing on one of the builds.

@abidwael
Contributor Author

The pre-commit errors are due to the stdout-redirect block described in my previous comment.

Contributor

@justinxzhao justinxzhao left a comment

LGTM after the pre-commit error is fixed, and the tests are all green.

@abidwael abidwael merged commit ae8de10 into master Jul 15, 2022
@abidwael abidwael deleted the monitoring-utils branch July 15, 2022 14:55