
Feature: Profiling for a training run #753

Closed
jeremyjordan opened this issue Jan 26, 2020 · 12 comments · Fixed by #782

@jeremyjordan
Contributor

🚀 Feature

It'd be nice if the PyTorch Lightning Trainer had a way to profile a training run so that I could easily identify where bottlenecks occur.

Motivation

I have been developing a model and had been using a small toy dataset to verify that everything is working properly. When I started a training run on a larger dataset, the estimated epoch duration was much longer than I anticipated. I have many ideas about where the bottleneck might be, but I don't have an easy way to validate them with data.

Pitch

We should include a flag when constructing the Trainer which signifies that we would like to profile this training run. The default will be False since the profiling output may be quite verbose and unnecessary in most cases.

trainer = Trainer(args, profile=True)
trainer.fit(model)

I was initially thinking that we could write a simple decorator and apply it to all the actions (i.e. functions) we wish to profile. The benefit of writing a decorator is that the scope of an action is clearly defined (so we know when to start and stop the timing).

Something along these lines (non-functional example):

from functools import wraps
import time


def profile(enabled=False):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if not enabled:
                return func(*args, **kwargs)
            start = time.time()
            output = func(*args, **kwargs)
            end = time.time()
            print(f'{func.__name__} took {end - start}s')
            return output
        return wrapper
    return decorator


class Model:
    def __init__(self):
        pass

    @profile(enabled=trainer.profile)
    def training_step(self, batch, batch_idx):
        pass

    @profile(enabled=trainer.profile)
    def validation_step(self, batch, batch_idx):
        pass

Some issues with this approach:

  • We want to define whether or not a profile is enabled when constructing the trainer, so it's not easy to access this when decorating functions within the model.
  • Users are expected to override these methods, so we won't actually be able to reliably decorate them anyway.

@williamFalcon suggested (and please correct me if I misstate something) that we could construct a profiler object if profile=True in the Trainer initialization. We could then inject self.profiler.profile() in the training loop code for every major event (get train data, get val data, training step, backward, optimizer step, val step, etc).

As I understand it, the profiler would be something along the lines of:

import time


class Profiler:
    def __init__(self):
        self.last_recorded_time = None
        self.last_recorded_action = None

    def profile(self, action=''):
        current_time = time.time()
        if self.last_recorded_time is not None and self.last_recorded_action is not None:
            print(f'{self.last_recorded_action} took {current_time - self.last_recorded_time}s')
        self.last_recorded_time = current_time
        self.last_recorded_action = action

My concern with this approach (as I currently understand it) is that we define the "scope" of an action as simply everything until the next invocation of profile. Especially given that functions call other functions during training (e.g. run_training_epoch -> run_training_batch), it can be very hard to keep track of the scope of an action.

Alternatives

One way we can address this is by invoking the profile within a context handler, but this might get messy by effectively introducing another level of indentation for most of the code.

with profile():
    # do action here
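
For illustration, a minimal sketch of what such a context-handler based profiler could look like (the class and method names here are assumptions, not a settled API):

import time
from contextlib import contextmanager


class Profiler:
    def __init__(self):
        self.durations = {}

    @contextmanager
    def profile(self, action):
        # time the body of the `with` block and record it under the action name
        start = time.time()
        try:
            yield
        finally:
            self.durations[action] = time.time() - start

Each timed region then picks up exactly one extra level of indentation, which is the messiness mentioned above.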

@jeremyjordan added the feature and help wanted labels on Jan 26, 2020
@jeremyjordan
Contributor Author

jeremyjordan commented Jan 26, 2020

Thanks to GitHub's suggestions, I found this similar issue: #251

@williamFalcon
Contributor

Great suggestion. I still prefer the logger approach over the decorator.

API

I imagine it like:

Trainer(profile=True|False|Profiler)
  • If True, we use the default Profiler: self.profiler = Profiler(...)
  • If False, no profiler
  • Otherwise, use the given profiler (a rough sketch of this resolution follows the list)
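
A rough sketch of how that resolution could look inside the Trainer constructor (PassThroughProfiler is a hypothetical no-op profiler used only for illustration):

class Trainer:
    def __init__(self, profile=False, **kwargs):
        if profile is True:
            # default profiler
            self.profiler = Profiler()
        elif not profile:
            # hypothetical no-op profiler so call sites don't need to check a flag
            self.profiler = PassThroughProfiler()
        else:
            # the user passed in their own profiler instance
            self.profiler = profile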

Trainer

In the trainer, we'd call it at key places

tng_data = get_train_data()
self.profile('train data')

for epoch in epochs:
  for batch in dataset:
      training_step()
      self.profile('training step')

      backward()
      self.profile('backward')

      #...  etc

Usability

We also allow users to call it whenever they want

def training_step(...):
    x, y = batch
    x = self.forward(x)
    self.profile('fwd')

    x = self.second_nn(x)
    self.profile('projection a')
    
    x = self.loss(x)
    self.profile('loss')

@Borda
Member

Borda commented Jan 26, 2020

for the record, more discussion is here:
https://pytorch-lightning.slack.com/archives/CRBLFHY79/p1579997800007700

@jeremyjordan
Contributor Author

What if we did the logger approach but made the scope more explicit?

Eg.

self.profile.start('fetch train data')
tng_data = get_train_data()
self.profile.stop('fetch train data')

for epoch in epochs:
  for batch in dataset:
      self.profile.start('training step')
      training_step()
      self.profile.stop('training step')

      self.profile.start('backward')
      backward()
      self.profile.stop('backward')

      #...  etc

And then we could expose an alternative interface using context handlers if the user prefers.

with self.profile('fetch train data'):
  tng_data = get_train_data()

for epoch in epochs:
  for batch in dataset:
      with self.profile('training step'):
          training_step()

      with self.profile('backward'):
          backward()

      #...  etc

Internally, we can store the values as a defaultdict(list) and append each (stop-start) duration, so that at the end of the run we can report mean times for each action as well as standard deviations.
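
A minimal sketch of that bookkeeping, assuming one list of durations per action name:

import time
from collections import defaultdict
from statistics import mean, pstdev


class Profiler:
    def __init__(self):
        self.start_times = {}
        self.recorded_durations = defaultdict(list)

    def start(self, action):
        self.start_times[action] = time.time()

    def stop(self, action):
        # one duration appended per start/stop pair
        start = self.start_times.pop(action)
        self.recorded_durations[action].append(time.time() - start)

    def describe(self):
        # report mean and standard deviation for each action at the end of the run
        for action, durations in self.recorded_durations.items():
            print(f'{action}: mean {mean(durations):.4f}s, std dev {pstdev(durations):.4f}s')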

@jeremyjordan
Contributor Author

I'm also curious what design @jeffling decided on for their profiling tool

@williamFalcon
Contributor

@jeremyjordan I like the context handler suggestion. That way, I can decide on my own contexts.

Want to take a stab at implementing this?

@jeffling
Contributor

@jeremyjordan

We use several, but the most useful one was really just using a lot of @timeits. We have a small edit that forces GPU sync so we can get GPU time. Unfortunately, it's still clunky to use, but it works. We expose both a decorator and a logger API, and the logger one is definitely more popular with the team.
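
For anyone following along: CUDA kernels launch asynchronously, so a plain wall-clock timer can stop before the GPU has actually finished its work; synchronizing before reading the clock makes the measurement include the GPU time. A hedged sketch of such a decorator (not the exact edit described above):

import time
from functools import wraps

import torch


def timeit(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # drain pending GPU work before starting the clock
        start = time.time()
        output = func(*args, **kwargs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # make sure GPU work launched by func is included in the timing
        print(f'{func.__name__} took {time.time() - start:.4f}s')
        return output
    return wrapper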

@jeremyjordan
Contributor Author

Took an initial stab at implementing this yesterday evening (see branch here).

I wrote a simple profiler which just records durations of actions:

>>> p = Profiler()
>>> 
>>> with p.profile("context handler"):
...     time.sleep(5)
...     a = np.random.randn(3000, 2)
...     b = a + 2
...     c = b / 3
... 
>>> p.start('manual')
>>> time.sleep(5)
>>> a = np.random.randn(3000, 2)
>>> b = a + 2
>>> c = b / 3
>>> p.stop('manual')
>>> 
>>> p.describe()
Action                  |  Mean duration (s)    |  Std dev.       
------------------------------------------------------------
context handler         |  5.0022               |  0.0            
manual                  |  5.0036               |  0.0 

And a more advanced profiler which leverages Python's cProfile to record a more detailed view:

>>> ap = AdvancedProfiler()
>>> with ap.profile("test"):
...     time.sleep(5)
...     a = 5*3
...     b = a + 2
...     c = b / 3
... 
>>> ap.start('ok')
>>> a = 3*2
>>> ap.stop('ok')
>>> ap.describe()
Profile stats for: test
         7 function calls in 5.000 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    5.000    5.000    5.000    5.000 {built-in method time.sleep}
        1    0.000    0.000    0.000    0.000 contextlib.py:116(__exit__)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.next}
        1    0.000    0.000    0.000    0.000 profiler.py:37(profile)
        1    0.000    0.000    0.000    0.000 profiler.py:109(stop)
        1    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}



Profile stats for: ok
         4 function calls in 0.000 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 profiler.py:109(stop)
        1    0.000    0.000    0.000    0.000 <stdin>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
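
Roughly, the advanced version can be built by keeping one cProfile.Profile per action and dumping pstats output in describe(); a sketch under that assumption (not the exact code on the branch):

import cProfile
import io
import pstats
from contextlib import contextmanager


class AdvancedProfiler:
    def __init__(self):
        self.profilers = {}

    @contextmanager
    def profile(self, action):
        # one cProfile.Profile per action so stats accumulate across repeated calls
        profiler = self.profilers.setdefault(action, cProfile.Profile())
        profiler.enable()
        try:
            yield
        finally:
            profiler.disable()

    def describe(self):
        for action, profiler in self.profilers.items():
            stream = io.StringIO()
            pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats()
            print(f'Profile stats for: {action}')
            print(stream.getvalue())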

Still need to validate that everything works properly but figured I'd share the branch in case anyone wants to follow along with the progress.

A couple things I might need help on:

  • I haven't done much distributed training in PyTorch so I might want to get some review and make sure this will work properly with that.
  • I'm not sure I understand the motivation behind a GPU sync to get GPU time

@jeremyjordan
Contributor Author

Also, at what point do we want to print the results from the training profile?

@Borda
Member

Borda commented Feb 2, 2020

@jeremyjordan could you please make it a PR and add "wip" to its name so it is easier to check the changes...

@jeremyjordan
Contributor Author

jeremyjordan commented Feb 2, 2020

@Borda sure thing! I wasn't sure if this repo preferred WIP PRs or not. Is there a contributors guide? I'm also not sure about things like code formatting (I usually run Black to format).

Edit: never mind, I found it. The style guide is pretty light though :)

@Borda
Member

Borda commented Feb 2, 2020

@Borda removed the help wanted label on Feb 3, 2020
@Borda added this to the 0.6.1 milestone on Feb 7, 2020