
Is this a bug? - unnecessary CPU/GPU copying in supporters.py just for aggregating loss #12408

Closed · isvogor-foi opened this issue Mar 22, 2022 · 4 comments · Fixed by #12430
Labels: performance, question

Comments

isvogor-foi (Contributor) commented Mar 22, 2022:

I've noticed that while training, this chunk of code eats up a lot of time:

https://github.com/PyTorchLightning/pytorch-lightning/blob/0b682b807a76b732492a3d43e3208a80c80b6c2f/pytorch_lightning/trainer/supporters.py#L76-L80

I'm wondering whether this is a bug, and why we don't simply allocate the buffer on the input tensor's device:

    self.memory = torch.zeros(self.window_length, *x.shape, device=x.device)

It works, and it's much faster.
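
A minimal sketch of the pattern under discussion, using a simplified stand-in for the accumulator in supporters.py (names and structure simplified, not the exact code). The commented-out lines show the current allocate-on-CPU-then-copy behavior; the live code shows the proposed allocation on x.device:

```python
import torch

class RunningAccum:
    """Simplified windowed tensor accumulator (sketch, not the real TensorRunningAccum)."""

    def __init__(self, window_length: int):
        self.window_length = window_length
        self.memory = None
        self.current_idx = 0

    def append(self, x: torch.Tensor) -> None:
        # Current pattern (roughly): the buffer is created on the default
        # device (CPU), so a GPU loss tensor is moved across on every call:
        #   if self.memory is None:
        #       self.memory = torch.zeros(self.window_length, *x.shape)
        #   if self.memory.device != x.device:
        #       x = x.to(self.memory)  # device-to-host copy each step
        #
        # Proposed change: allocate the buffer on the incoming tensor's
        # device (and dtype), so no per-step transfer is needed.
        if self.memory is None:
            self.memory = torch.zeros(
                self.window_length, *x.shape, device=x.device, dtype=x.dtype
            )
        self.memory[self.current_idx] = x.detach()
        self.current_idx = (self.current_idx + 1) % self.window_length

    def mean(self) -> torch.Tensor:
        # Average over the window (assumes it has been filled at least once).
        return self.memory.mean(dim=0)
```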

cc @Borda @akihironitta

isvogor-foi changed the title from "Is this a bug?" to "Is this a bug? - unnecessary CPU/GPU copying in supporters.py just for aggregating loss" on Mar 22, 2022
rohitgr7 (Contributor) commented:

This will probably be removed at some point: #9372

carmocca (Contributor) commented:

This is a very old piece of code. I guess the choice was to prefer lower GPU memory usage over speed.

carmocca added the performance and question labels on Mar 22, 2022
isvogor-foi (Contributor, Author) commented:

Well, I'm not sure how much impact this has on memory, but it's an easy fix. Should I submit a PR?

I've made some measurements: in the advance function I measured how long each training step takes, then summed that up across all epochs for a simple MNIST example.

Never mind the other bars in the plot; focus on the red highlighted one. The total run time is around 105 s, of which the green bar (10 s) is the actual training and the remaining red portion is the loss update.

[Plot: runtime breakdown for the MNIST run; the green bar is the training step (~10 s), the red bar is the loss aggregation (the rest of the ~105 s total).]

Just adding device=x.device fixes this. I see no reason why this shouldn't be fixed...
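
For reference, timings like these can be collected with a small wrapper around the step being measured; this is a hedged sketch of that kind of measurement, not the instrumentation actually used (timed_step is a hypothetical helper):

```python
import time
import torch

def timed_step(step_fn, *args, **kwargs):
    """Wall-clock timing of one training step. CUDA launches are asynchronous,
    so we synchronize before and after to capture the real GPU time."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = step_fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start
```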

carmocca (Contributor) commented:

Yes, go ahead!

isvogor-foi added a commit to isvogor-foi/pytorch-lightning that referenced this issue Mar 23, 2022