
The detection accuracy of the R-50-FPN Faster R-CNN is lower than your report, confusing... #672

Closed
chenjoya opened this issue Apr 14, 2019 · 9 comments

Comments

@chenjoya
Contributor

chenjoya commented Apr 14, 2019

❓ Questions and Help

Hi @fmassa , thanks for your elegant implementation.
But it is confusing that the detection AP is only 32.8 when I re-train the R-50-FPN Faster R-CNN, while it should be 36.8 according to your report: https://github.com/facebookresearch/maskrcnn-benchmark/blob/master/MODEL_ZOO.md

2019-04-14 07:12:12,977 maskrcnn_benchmark.inference INFO: Start evaluation on coco_2017_val dataset(5000 images).
2019-04-14 07:15:06,105 maskrcnn_benchmark.inference INFO: Total run time: 0:02:53.127008 (0.06925080318450928 s / img per device, on 2 devices)
2019-04-14 07:15:06,105 maskrcnn_benchmark.inference INFO: Model inference time: 0:02:32.530358 (0.061012143325805665 s / img per device, on 2 devices)
2019-04-14 07:15:07,906 maskrcnn_benchmark.inference INFO: Preparing results for COCO format
2019-04-14 07:15:07,906 maskrcnn_benchmark.inference INFO: Preparing bbox results
2019-04-14 07:15:09,584 maskrcnn_benchmark.inference INFO: Evaluating predictions
2019-04-14 07:16:17,912 maskrcnn_benchmark.inference INFO: OrderedDict([('bbox', OrderedDict([('AP', 0.3275950734831557), ('AP50', 0.5054028517973591), ('AP75', 0.36449119818971715), ('APs', 0.1492328236066365), ('APm', 0.3439931485309256), ('APl', 0.48224050452315087)]))])

The config is unchanged, but I only have 2 V100 GPUs, so there are 8 images on each device.
Other information:

OS: Ubuntu 18.04.1 LTS
GCC version: (GCC) 5.5.0
CMake version: version 3.10.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration:
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB
GPU 2: Tesla P100-PCIE-16GB
GPU 3: Tesla V100-PCIE-16GB
GPU 4: Tesla V100-PCIE-16GB

Nvidia driver version: 418.43
cuDNN version: Probably one of the following:
/usr/local/cuda-9.0/lib64/libcudnn.so.7.2.1
/usr/local/cuda-9.0/lib64/libcudnn_static.a
/usr/local/cuda-9.2/lib64/libcudnn.so.7.2.1
/usr/local/cuda-9.2/lib64/libcudnn_static.a

Versions of relevant libraries:
[pip] Could not collect
[conda] pytorch                   1.0.1           py3.7_cuda9.0.176_cudnn7.4.2_2    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] torchvision               0.2.2                      py_3    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
        Pillow (5.4.1)
2019-04-13 08:24:36,398 maskrcnn_benchmark INFO: Loaded configuration file configs/e2e_faster_rcnn_R_50_FPN_1x.yaml
2019-04-13 08:24:36,398 maskrcnn_benchmark INFO:

Thanks for your attention! ^^

@chenjoya chenjoya changed the title The detection accuracy of the R-50-FPN Faster is lower than your report, confusing... The detection accuracy of the R-50-FPN Faster R-CNN is lower than your report, confusing... Apr 14, 2019
@chenjoya chenjoya reopened this Apr 14, 2019
@fmassa
Contributor

fmassa commented Apr 14, 2019

@chenjoya this is probably due to this part

# different behavior during training and during testing:
# during training, post_nms_top_n is over *all* the proposals combined, while
# during testing, it is over the proposals for each image
# TODO resolve this difference and make it consistent. It should be per image,
# and not per batch
if self.training:
    objectness = torch.cat(
        [boxlist.get_field("objectness") for boxlist in boxlists], dim=0
    )
    box_sizes = [len(boxlist) for boxlist in boxlists]
    post_nms_top_n = min(self.fpn_post_nms_top_n, len(objectness))
    _, inds_sorted = torch.topk(objectness, post_nms_top_n, dim=0, sorted=True)
    inds_mask = torch.zeros_like(objectness, dtype=torch.uint8)
    inds_mask[inds_sorted] = 1
    inds_mask = inds_mask.split(box_sizes)
    for i in range(num_images):
        boxlists[i] = boxlists[i][inds_mask[i]]

In fact, the behavior is not exactly the same if you have a batch size of 2 per GPU versus a batch size of 8 per GPU. This is a bug in the behavior of Detectron that has been kept in maskrcnn-benchmark for consistency.

In order to obtain the same (or similar) results as if you were running on 8 GPUs with a batch size of 2 on each GPU, I believe you should increase RPN.FPN_POST_NMS_TOP_N_TRAIN by a factor of 4, since it is consumed here:

post_nms_top_n = min(self.fpn_post_nms_top_n, len(objectness))

What is probably happening is that the output of your RPN, which is fed to the classification head afterwards, sees 4x fewer examples because of that.

Can you try changing

_C.MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN = 2000
to 8000 and report back?

If this indeed works (which I expect will be the case), could you maybe send a PR improving the documentation in this part a bit?

Thanks!
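For concreteness: with the default FPN_POST_NMS_TOP_N_TRAIN = 2000 and 2 images per GPU, each image contributes on the order of 1000 proposals to the heads during training, while with 8 images per GPU that drops to roughly 250. A minimal sketch of applying the scaled override programmatically, assuming the standard yacs cfg object from maskrcnn_benchmark.config (the numbers are illustrative for an 8-images-per-GPU setup):

from maskrcnn_benchmark.config import cfg

# Start from the reference config used in this thread.
cfg.merge_from_file("configs/e2e_faster_rcnn_R_50_FPN_1x.yaml")

# The defaults were tuned for 2 images per GPU; with 8 images per GPU,
# scale the per-batch proposal budget by 8 / 2 = 4.
images_per_gpu = 8           # assumption for this setup
reference_images_per_gpu = 2 # what the reference config was tuned for
scale = images_per_gpu // reference_images_per_gpu
cfg.merge_from_list(["MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN", 2000 * scale])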

@chenjoya
Contributor Author

Thanks for your reply. Following your advice, I changed the number of proposals after NMS to 8k:

...
_C.MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN = 8000
_C.MODEL.RPN.FPN_POST_NMS_TOP_N_TEST = 2000
# Custom rpn head, empty to use default conv or separable conv
_C.MODEL.RPN.RPN_HEAD = "SingleConvRPNHead"
...

The training will last about 24 hours. I will reply here and report the results after training.
Thank you 👍

@chenjoya
Contributor Author

chenjoya commented Apr 15, 2019

Hi @fmassa, you are so great!!!
After changing the proposals to 8k (_C.MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN = 8000), the R-50-FPN Faster R-CNN model achieves 36.8 AP:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.368
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.586
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.397
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.209
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.400
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.481
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.303
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.480
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.504
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.313
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.540
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.635

Moreover, I also implemented the select_over_all_levels function so that it works per single image rather than over the whole mini-batch.
The original version:

# different behavior during training and during testing:
# during training, post_nms_top_n is over *all* the proposals combined, while
# during testing, it is over the proposals for each image
# TODO resolve this difference and make it consistent. It should be per image,
# and not per batch
if self.training:
    objectness = torch.cat(
        [boxlist.get_field("objectness") for boxlist in boxlists], dim=0
    )
    box_sizes = [len(boxlist) for boxlist in boxlists]
    post_nms_top_n = min(self.fpn_post_nms_top_n, len(objectness))
    _, inds_sorted = torch.topk(objectness, post_nms_top_n, dim=0, sorted=True)
    inds_mask = torch.zeros_like(objectness, dtype=torch.uint8)
    inds_mask[inds_sorted] = 1
    inds_mask = inds_mask.split(box_sizes)
    for i in range(num_images):
        boxlists[i] = boxlists[i][inds_mask[i]]

New version:

        num_images = len(boxlists)
        if self.training:
            for i in range(num_images):
                boxlist = boxlists[i]
                box_size = len(boxlist)
                objectness = boxlist.get_field("objectness")
                inds_mask = torch.zeros_like(objectness, dtype=torch.uint8)
                post_nms_top_n = min(self.fpn_post_nms_top_n, box_size)
                _, inds_sorted = torch.topk(objectness, post_nms_top_n, dim=0, sorted=True)
                inds_mask[inds_sorted] = 1
                boxlists[i] = boxlists[i][inds_mask]

It also achieves 36.8 AP:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.368
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.586
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.396
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.211
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.398
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.481
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.307
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.482
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.506
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.321
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.542
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.634

Please help me check whether this implementation is correct and efficient. Thank you, fmassa! ^ ^
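As a note on the version above: indexing with the uint8 mask keeps the surviving proposals in their original order within each image, matching what the per-batch path produced. If that ordering did not matter, a slightly shorter per-image variant could look like the sketch below (a sketch only, assuming BoxList supports indexing with an integer index tensor, as it does elsewhere in this repo):

if self.training:
    for i, boxlist in enumerate(boxlists):
        objectness = boxlist.get_field("objectness")
        # keep at most fpn_post_nms_top_n proposals for this image,
        # ordered by objectness score
        post_nms_top_n = min(self.fpn_post_nms_top_n, len(boxlist))
        _, inds_sorted = torch.topk(objectness, post_nms_top_n, dim=0, sorted=True)
        boxlists[i] = boxlist[inds_sorted]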

@fmassa
Contributor

fmassa commented Apr 15, 2019

Yes, this looks right. Basically, there should not be any difference in behaviour between training and testing.

Can you send a PR improving the README for the single-GPU case?

@my-hello-world

my-hello-world commented Apr 16, 2019

@chenjoya @fmassa
Hi, how do I run on 2 GPUs with a batch size of 2 on each GPU?
Printing num_gpus = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1 gives 1 for me.
my nvidia-smi information:
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K4000 Off | 00000000:01:00.0 On | N/A |
| 30% 35C P8 10W / 87W | 215MiB / 3016MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K40c Off | 00000000:82:00.0 Off | 0 |
| 35% 73C P0 126W / 235W | 2619MiB / 11439MiB | 79% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K40c Off | 00000000:83:00.0 Off | 0 |
| 23% 33C P8 23W / 235W | 11MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

thanks

@rxqy

rxqy commented Apr 22, 2019

Hi @fmassa, @chenjoya, so this behavior is only related to FPN, right?
And we can solve the problem by setting these configs either to
PRE_NMS_TOP_N_TRAIN: NumImgsPerGPU*1000 with FPN_POST_NMS_PER_BATCH: True
or to
PRE_NMS_TOP_N_TRAIN: 1000 with FPN_POST_NMS_PER_BATCH: False
Right?
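For concreteness, a defaults.py-style sketch of the two alternatives, using the FPN_POST_NMS_TOP_N_TRAIN key from earlier in this thread together with the FPN_POST_NMS_PER_BATCH switch from #695 (the values are illustrative and assume 8 images per GPU versus the reference 2):

# Alternative A: keep the Detectron-style per-batch selection and scale the
# per-batch budget with the number of images per GPU.
_C.MODEL.RPN.FPN_POST_NMS_PER_BATCH = True
_C.MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN = 8 * 1000  # assuming 8 images per GPU

# Alternative B: select the top-N proposals per image, so the value is
# independent of the per-GPU batch size.
_C.MODEL.RPN.FPN_POST_NMS_PER_BATCH = False
_C.MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN = 1000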

@fmassa
Contributor

fmassa commented Apr 22, 2019

@rxqy exactly.
Given that this issue has already been addressed in #695, I'm closing this

@buaaMars

@fmassa
If I make sure that I have a larger batch size during testing than during training, would it be better to use the over-batch strategy rather than the over-image strategy in both training and testing, since it is more robust to the variance of the number of instances per image?
What do you think?

@CoinCheung
Contributor

Hi, do I still need to consider these settings if I use a plain Faster R-CNN without FPN?
