Refactor grid default boxes with torch meshgrid #3799
Conversation
y_f_k, x_f_k = f_k
shifts_x = (torch.arange(0, f_k[1], dtype=dtype) + 0.5) / x_f_k
shifts_y = (torch.arange(0, f_k[0], dtype=dtype) + 0.5) / y_f_k
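For context, a minimal self-contained sketch of the meshgrid-based shift generation being discussed (the helper name `grid_default_box_shifts` is hypothetical; `f_k` is assumed to be the (height, width) of one feature map, as in the snippet above):

```python
import torch

def grid_default_box_shifts(f_k, dtype=torch.float32):
    # Hypothetical helper illustrating the meshgrid approach.
    # f_k is the (height, width) of a feature map; the shifts are the
    # normalized center coordinates of each cell.
    y_f_k, x_f_k = float(f_k[0]), float(f_k[1])
    shifts_x = (torch.arange(0, f_k[1], dtype=dtype) + 0.5) / x_f_k
    shifts_y = (torch.arange(0, f_k[0], dtype=dtype) + 0.5) / y_f_k
    # "ij" indexing: shift_y varies along rows, shift_x along columns
    shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
    return shift_x.reshape(-1), shift_y.reshape(-1)

cx, cy = grid_default_box_shifts((2, 3))
# For a 2x3 feature map: cx = [1/6, 1/2, 5/6, 1/6, 1/2, 5/6],
#                        cy = [1/4, 1/4, 1/4, 3/4, 3/4, 3/4]
```

This replaces a Python for-loop over all (i, j) cells with two vectorized `arange` calls and one `meshgrid`, which is the core of the refactor.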
Actually the default_boxes are generated on the CPU device and then moved to the CUDA device. I've tried the following method to generate the default_boxes directly on the CUDA device, but it takes longer than the for-loop method.
shifts_x = (torch.arange(0, f_k[1], device=torch.device('cuda'), dtype=dtype) + 0.5) / x_f_k
shifts_y = (torch.arange(0, f_k[0], device=torch.device('cuda'), dtype=dtype) + 0.5) / y_f_k
I've hit cases in the past where micro-benchmarks on exactly this part of the code were faster on the CPU, but showed significant slowdowns when training on multiple GPUs. Even if this is slower in micro-benchmarks on a single GPU, it might still be faster on multiple GPUs.
@fmassa So you recommend passing the target device to this method and creating the boxes directly on it, right?
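The suggestion could look something like the following sketch (the function name `make_shifts` and its signature are assumptions for illustration, not the PR's final API): the caller passes the target device and the tensors are allocated there from the start.

```python
import torch

def make_shifts(f_k, dtype=torch.float32, device=None):
    # Hypothetical signature: accept a target device so the shift tensors
    # are created where they will be used, instead of CPU-then-copy.
    y_f_k, x_f_k = float(f_k[0]), float(f_k[1])
    shifts_x = (torch.arange(0, f_k[1], dtype=dtype, device=device) + 0.5) / x_f_k
    shifts_y = (torch.arange(0, f_k[0], dtype=dtype, device=device) + 0.5) / y_f_k
    return shifts_x, shifts_y

# device=None falls back to the default device (CPU)
sx, sy = make_shifts((2, 3), device=torch.device("cpu"))
```

Whether this is a net win depends on the hardware setup, which is exactly what the multi-GPU benchmarking discussed here would determine.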
We can leave it as is for now, but I would create a follow-up issue to benchmark this and the other configuration on multiple GPUs to verify.
FYI, I've tested the inference time over the full COCO eval dataset for these two default-box generation methods on different devices; the times are very similar.
Validated with (using one card):
CUDA_VISIBLE_DEVICES=0 python train.py --dataset coco --model ssd300_vgg16 \
--batch-size 16 --pretrained --test-only
@zhiqwang Thanks a lot for the PR!
I confirmed that using your changes, the statistics of the model remain the same. I left a couple of comments for your consideration but overall it looks good to me. Let me know what you think.
LGTM! Thanks for the improvement @zhiqwang!
Thanks a ton @zhiqwang!
@zhiqwang I'm afraid the PR might have broken a code path that is not currently exercised on master but is needed for follow-up work. Could you please have a look? Edit: To avoid the back-and-forth, I tried the following fix directly on the other PR. Please have a look and let me know what you think: ea1e2c4
Summary:
* Refactor grid default boxes with torch.meshgrid
* Fix torch jit tracing
* Only do the list multiplication once
* Make grid_default_box private as suggested
* Replace list multiplication with torch.repeat
* Move the clipping into _grid_default_boxes to accelerate

Reviewed By: datumbox
Differential Revision: D28473335
fbshipit-source-id: 0e6705ecb9a80fc78213ac1e6da1b02eb3381652
Co-authored-by: Francisco Massa <[email protected]>
Co-authored-by: Vasilis Vryniotis <[email protected]>
As mentioned in #3403 (comment), I refactored the for-loop default boxes generation with torch.meshgrid as follows.

vision/torchvision/models/detection/anchor_utils.py, lines 209 to 223 in 5339e63

I tested the time consumption of the torch.meshgrid and for-loop methods locally using torch.utils.benchmark.Timer (testing only the default_boxes generation part).

On CUDA device:

On CPU device:

EDIT: The parameters I am using are below:
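A sketch of how the two generation methods could be compared with torch.utils.benchmark.Timer (the feature-map size and iteration count here are illustrative placeholders, not the values used in the PR's benchmark):

```python
import torch
from torch.utils.benchmark import Timer

f_k, dtype = (38, 38), torch.float32  # assumed SSD-like feature map size

def meshgrid_boxes():
    # Vectorized generation of normalized cell centers.
    shifts_x = (torch.arange(0, f_k[1], dtype=dtype) + 0.5) / f_k[1]
    shifts_y = (torch.arange(0, f_k[0], dtype=dtype) + 0.5) / f_k[0]
    shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
    return shift_x.reshape(-1), shift_y.reshape(-1)

def forloop_boxes():
    # Baseline: build the same centers cell by cell in Python.
    cxcy = [((j + 0.5) / f_k[1], (i + 0.5) / f_k[0])
            for i in range(f_k[0]) for j in range(f_k[1])]
    return torch.tensor(cxcy, dtype=dtype)

t_mesh = Timer(stmt="meshgrid_boxes()", globals={"meshgrid_boxes": meshgrid_boxes})
t_loop = Timer(stmt="forloop_boxes()", globals={"forloop_boxes": forloop_boxes})
print("meshgrid:", t_mesh.timeit(100).mean)
print("for-loop:", t_loop.timeit(100).mean)
```

Timer.timeit returns a Measurement whose .mean is the average per-iteration time in seconds; for a CUDA comparison the tensors would additionally need device="cuda" and Timer's built-in synchronization handling.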
cc @fmassa @datumbox