RetinaNet with MobileNetV3 FPN backbone #3223
@@ -104,7 +104,7 @@ def _test_detection_model(self, name, dev):
     kwargs = {}
     if "retinanet" in name:
         # Reduce the default threshold to ensure the returned boxes are not empty.
-        kwargs["score_thresh"] = 0.01
+        kwargs["score_thresh"] = 0.0099999
A slight adjustment was necessary to get non-zero results on MobileNetV3.
        num_classes=2, min_size=100, max_size=100)
    for name in ["retinanet_resnet50_fpn", "retinanet_mobilenet_v3_large_fpn"]:
        model = torchvision.models.detection.__dict__[name](
            num_classes=2, min_size=100, max_size=100, pretrained_backbone=False)
The previous version seemed to download the backbone weights unnecessarily. I fixed this in place by adding `pretrained_backbone=False`.
# Gather the indices of blocks which are strided. These are the locations of the C1, ..., Cn-1 blocks.
# The first and last blocks are always included because they are C0 (conv1) and Cn.
stage_indeces = [0] + [i for i, b in enumerate(backbone) if getattr(b, "is_strided", False)] + [len(backbone) - 1]
Instead of adding metadata with the locations of the downsampling blocks, I detect them by checking a new attribute called `is_strided`. This attribute is added to both the MobileNetV2 and V3 residual blocks and indicates whether the specific block downsamples. These are typically the locations of the C1...Cn-1 blocks. Note that the blocks at the first and last positions of the features module are always included.
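The gathering step above can be sketched without torchvision, using plain stand-in objects that expose the new `is_strided` attribute (the `Block` class below is purely illustrative, not the real MobileNet block):

```python
# Minimal sketch of the stage-index gathering, with toy objects in place
# of real MobileNet blocks. `is_strided` mirrors the attribute the PR
# adds to the V2/V3 residual blocks.
class Block:
    def __init__(self, is_strided=False):
        self.is_strided = is_strided

# Toy "backbone": conv1, then bottlenecks, two of which downsample.
backbone = [Block(), Block(True), Block(), Block(True), Block(), Block()]

# First block, every strided block, and the last block.
stage_indices = ([0]
                 + [i for i, b in enumerate(backbone) if getattr(b, "is_strided", False)]
                 + [len(backbone) - 1])
print(stage_indices)  # [0, 1, 3, 5]
```

The `getattr` with a default keeps the scan safe for blocks (e.g. the stem conv) that never received the attribute.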
freeze_before = num_stages if trainable_layers == 0 else stage_indeces[num_stages - trainable_layers]

# freeze layers only if pretrained backbone is used
for b in backbone[:freeze_before]:
Unlike the ResNet implementation, here we need to find the location of the first block that we fine-tune and freeze everything before it.
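The freezing logic can be sketched end-to-end with a toy `nn.Sequential` standing in for the MobileNet backbone (the stage indices and layer sizes below are made up for illustration):

```python
import torch.nn as nn

# Toy backbone of 6 "blocks"; in the real code these are MobileNet blocks.
backbone = nn.Sequential(*[nn.Linear(4, 4) for _ in range(6)])

stage_indices = [0, 1, 3, 5]  # hypothetical, as gathered via is_strided
num_stages = len(stage_indices)
trainable_layers = 2          # fine-tune only the last two stages

# Find the first block we fine-tune; freeze everything before it.
freeze_before = (num_stages if trainable_layers == 0
                 else stage_indices[num_stages - trainable_layers])

for b in backbone[:freeze_before]:
    for parameter in b.parameters():
        parameter.requires_grad_(False)
```

With `trainable_layers = 2`, `freeze_before` becomes 3, so blocks 0-2 are frozen and blocks 3-5 keep their gradients.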
""" | ||
if pretrained: | ||
pretrained_backbone = False | ||
backbone = mobilenet_fpn_backbone("mobilenet_v3_large", pretrained_backbone, returned_layers=[4, 5]) |
We use the outputs of blocks C4 and C5 in our feature pyramid.
For C5, we follow the paper and use the layer just before pooling.
For C4, we deviate from the original paper, which suggests using "the expansion layer of the 13th bottleneck block". We use the output of the 13th bottleneck instead, because it's very hard to get the output of the expansion layer without completely refactoring the entire MobileNetV2 and V3 architectures. As a result, our C4 feature output is 160x7x7 instead of 672x14x14.
This could lead to a faster model but might reduce the accuracy metrics; we'll run experiments to assess the difference. Perhaps instead of completely refactoring the implementation, we could have a workaround using hooks?
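The hook workaround mentioned above could look roughly like the sketch below. This is not the PR's implementation; the tiny three-layer block merely stands in for an inverted-residual block whose inner expansion output we want to capture:

```python
import torch
import torch.nn as nn

# Stand-in for a MobileNet inverted-residual block: a 1x1 "expansion"
# conv followed by an activation and a 1x1 "projection" conv.
block = nn.Sequential(
    nn.Conv2d(3, 8, 1),   # expansion stand-in
    nn.ReLU(),
    nn.Conv2d(8, 4, 1),   # projection stand-in
)

captured = {}

def save_expansion(module, inputs, output):
    # Forward hooks receive (module, inputs, output); we stash the output.
    captured["expansion"] = output

# Hook the layer whose activation we want, without refactoring the model.
handle = block[1].register_forward_hook(save_expansion)

x = torch.randn(1, 3, 14, 14)
out = block(x)

print(tuple(captured["expansion"].shape))  # (1, 8, 14, 14)
handle.remove()
```

A hook like this would let the FPN consume the wider expansion activation while the block itself stays untouched, at the cost of some indirection in the backbone builder.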
    pretrained_backbone = False
    backbone = mobilenet_fpn_backbone("mobilenet_v3_large", pretrained_backbone, returned_layers=[4, 5])

    anchor_sizes = ((128,), (256,), (512,))
Anchor sizes for C4, C5 and the extra pooled level. It's important to note that C4 and C5 have the same output stride of 32.
@@ -90,6 +91,8 @@ def __init__(
         norm_layer(oup),
     ])
     self.conv = nn.Sequential(*layers)
+    self.output_channels = oup
+    self.is_strided = stride > 1
Metadata is added to the blocks to make it easier to detect the C1...Cn blocks and the output channels in the detection models. We do this in both MobileNetV2 and MobileNetV3.
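Reading that metadata back in a backbone builder might look like the following sketch, with a stub class in place of the real inverted-residual block (the class and channel numbers are illustrative only):

```python
# Sketch: blocks carry their own metadata, so the FPN builder can read
# output channels and stride information without hard-coded tables.
class InvertedResidualStub:
    def __init__(self, output_channels, stride):
        self.output_channels = output_channels
        self.is_strided = stride > 1

blocks = [
    InvertedResidualStub(16, 1),
    InvertedResidualStub(24, 2),
    InvertedResidualStub(160, 2),
]

# Channels of the strided (downsampling) blocks, read straight off the blocks.
in_channels = [b.output_channels for b in blocks if b.is_strided]
print(in_channels)  # [24, 160]
```

Keeping the channel counts on the blocks themselves avoids a separate per-architecture lookup table in the detection code.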
Partially fixes #1999
RetinaNet + MobileNetV3 large + FPN
Trained using the code committed at 7af35c3.
The current temporary pre-trained model was trained:
Submitted batch job 34643976
We then took the last 2 checkpoints that improved the AP (epochs 22 and 18) and averaged their parameters using the following script:
Accuracy metrics:
Validated with:
Submitted batch job 34643680
Speed benchmark:
0.74 sec per image on CPU
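The averaging script referenced above is not reproduced here; in general, averaging the parameters of two checkpoints can be sketched like this (a generic sketch, not the script used in this PR):

```python
import torch

def average_state_dicts(state_dicts):
    """Average matching float tensors across checkpoints."""
    avg = {}
    for key in state_dicts[0]:
        tensors = [sd[key] for sd in state_dicts]
        if tensors[0].is_floating_point():
            avg[key] = torch.stack(tensors).mean(dim=0)
        else:
            # Integer buffers (e.g. num_batches_tracked) are copied as-is.
            avg[key] = tensors[0]
    return avg

# Toy checkpoints standing in for the epoch-18 and epoch-22 state dicts.
sd_a = {"w": torch.tensor([1.0, 3.0]), "steps": torch.tensor(10)}
sd_b = {"w": torch.tensor([3.0, 5.0]), "steps": torch.tensor(12)}
merged = average_state_dicts([sd_a, sd_b])
print(merged["w"])  # tensor([2., 4.])
```

Non-float buffers are deliberately not averaged, since integer division would silently truncate them.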