Feature: Profiling for a training run #753
Thanks to GitHub's suggestion, I found this similar issue: #251
Great suggestion. I still prefer the logger approach instead of the decorator.
- API: I imagine it like the sketch below.
- Trainer: in the trainer, we'd call it at key places.
- Usability: we also allow users to call it whenever they want.
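The snippet from this comment wasn't preserved; here is a minimal usage-level sketch of the idea, with all names assumed:

```python
class StubProfiler:
    """Stand-in for the real profiler so the call sites below run; a fuller
    sketch of the profiler object itself appears with the issue body further down."""

    def profile(self, action_name):
        print(f"profile: {action_name}")  # a real profiler would record timings here


profiler = StubProfiler()

# Trainer: we'd call it at key places in the training loop.
profiler.profile("get_train_batch")
profiler.profile("training_step")
profiler.profile("backward")

# Usability: users can also call it whenever they want.
profiler.profile("my_custom_block")
```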
For the record, more discussion is here:
What if we did the logger approach but made the scope more explicit? E.g.:
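(The original example here wasn't preserved; a sketch with explicit start/stop scopes, names assumed:)

```python
import time
from collections import defaultdict


class Profiler:
    """Sketch: every action is opened with `start` and closed with a
    matching `stop`, so its scope is explicit."""

    def __init__(self):
        self.recorded_durations = defaultdict(list)
        self._start_times = {}

    def start(self, action_name):
        self._start_times[action_name] = time.monotonic()

    def stop(self, action_name):
        start_time = self._start_times.pop(action_name, None)
        if start_time is None:
            raise ValueError(f"stop({action_name!r}) called before start()")
        self.recorded_durations[action_name].append(time.monotonic() - start_time)


profiler = Profiler()
profiler.start("training_step")
# ... the training step would run here ...
profiler.stop("training_step")
```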
And then we could expose an alternative interface using context handlers if the user prefers.
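For instance (a sketch building on the `Profiler` above; `contextlib` makes this straightforward):

```python
from contextlib import contextmanager


@contextmanager
def profile(profiler, action_name):
    """Context-handler interface over the explicit start/stop API."""
    profiler.start(action_name)
    try:
        yield
    finally:
        profiler.stop(action_name)


with profile(profiler, "validation_step"):
    pass  # ... the validation step would run here ...
```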
Internally, we can store the values as a dictionary mapping each action name to its recorded durations.
I'm also curious what design @jeffling decided on for their profiling tool.
@jeremyjordan I like the context handler suggestion. That way, I can decide on my own contexts. Want to take a stab at implementing this?
We use several, but the most useful one was really just using a lot of @timeits. We have a small edit that forces GPU sync so we can get GPU time. Unfortunately, it's still clunky to use, but it works. We expose both a decorator and a logger API, and the logger one is definitely more popular with the team.
I took an initial stab at implementing this yesterday evening (see branch here). I wrote a simple profiler which just records the durations of actions:
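(The branch itself isn't reproduced here; a sketch of the general shape of such a simple profiler:)

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SimpleProfiler:
    """Sketch: records wall-clock durations of named actions."""

    def __init__(self):
        self.recorded_durations = defaultdict(list)

    @contextmanager
    def profile(self, action_name):
        start_time = time.monotonic()
        try:
            yield
        finally:
            self.recorded_durations[action_name].append(time.monotonic() - start_time)

    def describe(self):
        for action_name, durations in self.recorded_durations.items():
            mean = sum(durations) / len(durations)
            print(f"{action_name}: {len(durations)} calls, mean {mean:.4f}s")
```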
And a more advanced profiler which leverages Python's cProfile to record a more detailed view:
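(Again as a sketch, not the branch contents; the standard-library `cProfile` and `pstats` modules do the heavy lifting:)

```python
import cProfile
import io
import pstats
from contextlib import contextmanager


class AdvancedProfiler:
    """Sketch: keeps one cProfile.Profile per action for a function-level view."""

    def __init__(self):
        self.profiled_actions = {}

    @contextmanager
    def profile(self, action_name):
        pr = self.profiled_actions.setdefault(action_name, cProfile.Profile())
        pr.enable()
        try:
            yield
        finally:
            pr.disable()

    def describe(self):
        for action_name, pr in self.profiled_actions.items():
            stream = io.StringIO()
            pstats.Stats(pr, stream=stream).sort_stats("cumulative").print_stats(10)
            print(f"Profile for {action_name}:\n{stream.getvalue()}")
```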
I still need to validate that everything works properly, but I figured I'd share the branch in case anyone wants to follow along with the progress. A couple of things I might need help on:
Also, at what point do we want to print the results from the training profile?
@jeremyjordan could you please open it as a PR and add "wip" to its name so it is easier to check the changes...
🚀 Feature
It'd be nice if the PyTorch Lightning Trainer had a way to profile a training run so that I could easily identify where bottlenecks are occurring.
Motivation
I have been developing a model and had been using a small toy dataset to verify that everything is working properly. When I started a training run on a larger dataset, the estimated epoch duration was much longer than I anticipated. I have many ideas about where the bottleneck might be occurring, but I don't have an easy way to validate with data.
Pitch
We should include a flag when constructing the Trainer which signifies that we would like to profile this training run. The default will be False since the profiling output may be quite verbose and unnecessary in most cases.
I was initially thinking that we could write a simple decorator and apply it to all the actions (i.e. functions) which we wish to profile. The benefit of writing a decorator is that the scope of an action is clearly defined (so we know when to start and stop the timing).
Something along the lines of (non-functional example):
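(The original snippet wasn't preserved; a sketch of the decorator idea, with all names hypothetical:)

```python
import functools
import time
from collections import defaultdict

recorded_durations = defaultdict(list)  # module-level store, just for the sketch


def profile_action(func):
    """Hypothetical decorator: the wrapped function defines the action's scope."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            recorded_durations[func.__name__].append(time.monotonic() - start_time)
    return wrapper


@profile_action
def run_training_batch(batch):
    ...  # the decorated trainer method would go here
```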
Some issues with this approach:
@williamFalcon suggested (and please correct me if I misstate something) that we could construct a profiler object if `profile=True` in the Trainer initialization. We could then inject `self.profiler.profile()` in the training loop code for every major event (get train data, get val data, training step, backward, optimizer step, val step, etc.). As I understand it, the `profiler` would be something along the lines of:
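(The snippet didn't survive in this copy; a minimal sketch consistent with the description, where each call closes the previous action:)

```python
import time
from collections import defaultdict


class Profiler:
    """Sketch: each call to `profile` stops timing the previous action and
    starts timing the named one, so an action's scope implicitly runs until
    the next invocation."""

    def __init__(self):
        self.recorded_durations = defaultdict(list)
        self._current_action = None
        self._start_time = None

    def profile(self, action_name):
        now = time.monotonic()
        if self._current_action is not None:
            self.recorded_durations[self._current_action].append(now - self._start_time)
        self._current_action = action_name
        self._start_time = now
```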
My concern with this approach (as I currently understand it) is that we define the "scope" of an action as simply waiting until the next invocation of `profile`. Especially given that functions call other functions during training (e.g. `run_training_epoch` -> `run_training_batch`), it can be very hard to keep track of the scope of an action.
Alternatives
One way we can address this is by invoking the profile within a context handler, but this might get messy by effectively introducing another level of indentation for most of the code.