INCREASING NMS SPEED #679
The test results show that torchvision.ops.boxes.nms() is the fastest method but does not give the highest mAP. The Ultralytics MERGE method increases AP by +0.5, so I will leave it for testing (when calling test.py directly using Lines 513 to 517 in 1e9ddc5
|
I will look more into this during the weekend. |
great works! |
torchvision.ops implements operators that are specific to Computer Vision. Those operators currently do not support TorchScript. nms performs non-maximum suppression (NMS) on the boxes according to their intersection-over-union (IoU). I get this error: AttributeError: module 'torchvision' has no attribute 'ops'. What should I do? |
@omizonly what is your use case for TorchScript? |
tensorflow= 1.3.1 |
@omizonly I don't understand, can you elaborate? This repo only runs PyTorch, and exports to ONNX for onward use in other formats, however we clearly can not support you with problems in those other formats. I suggest you raise an issue on the PyTorch or TF repos. |
I'll close this issue for now as the original issue appears to have been resolved, and/or no activity has been seen for some time. Feel free to comment if this is not the case. |
Quick update with latest code on one T4 GPU. Second line is current default. python3 test.py --weights yolov3-spp-ultralytics.pt --cfg yolov3-spp.cfg --img 608
|
Is there a way to make the model print the JSON file if it detects an object regardless of classification? |
Hi, I saw a Fast NMS proposed by YOLACT. How is it? https://arxiv.org/abs/1912.06218 |
@Zzh-tju yes that seems an interesting approach. They apply NMS as a matrix operation to remove the Depending on the conf-thres used, NMS may or may not be a very expensive operation in this repo. For most actual use applications with conf-thres around 0.1-0.9, NMS is not a speed concern, taking <10% of the total processing time for an image, but when calculating mAP near conf-thres = 0.0001 for example, NMS may take up 90% of the processing time. If you can try to implement a fast NMS experiment here that would be very useful. The NMS function is here. In the meantime I will update this thread with the latest speeds on a T4 colab instance. Lines 504 to 512 in dce753e
UPDATE: I've posted an issue on yolact repo for this dbolya/yolact#366 (comment) |
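For readers following along, here is a minimal plain-Python sketch of the Fast NMS idea discussed above. It is not the repo's or YOLACT's code: the helper names box_iou and fast_nms are hypothetical, boxes are assumed to be (x1, y1, x2, y2) tuples, and a real implementation would run this as a single tensor operation on the GPU.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fast_nms(boxes, scores, iou_thres=0.5):
    """Return original indices of the boxes kept by Fast NMS (hypothetical helper)."""
    # Sort by descending score, as in ordinary NMS
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    for j in range(len(order)):
        # Column j of the upper-triangular IoU matrix: overlap of box j with
        # every higher-scoring box i < j, whether or not box i itself survives
        max_iou = max(
            (box_iou(boxes[order[i]], boxes[order[j]]) for i in range(j)),
            default=0.0,
        )
        if max_iou <= iou_thres:
            keep.append(order[j])
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (2, 2, 12, 12), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.6]
kept = fast_nms(boxes, scores, iou_thres=0.5)
print(kept)
```

In this example the third box overlaps the first with IoU of about 0.47 and the second with about 0.68. Fast NMS drops it because of the second box, even though the second box is itself suppressed, so Fast NMS can return fewer boxes than greedy NMS at the same threshold.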
Update: I discovered a majority of time in test.py was spent building pycocotools JSON files for official mAPs. If I turn off this functionality (compute mAP only with repo code) I get the following times for the 5k COCO2014 val images. Machine is a 12-vCPU V100 instance. python3 test.py --weights yolov3-spp-ultralytics.pt --cfg yolov3-spp --img 608
|
@Zzh-tju FastNMS updates have been committed and pushed now after testing. Lines 564 to 571 in f915bf1
|
@Zzh-tju to clear up the timing a bit more, I added profiling code to test.py that specifically tracks inference and NMS times in e482392. This can be accessed with the python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile I ran with both default torchvision NMS and the yolact FastNMS, and actually saw a slight speed decrease with FastNMS: Default: So perhaps the slight speed increase from FastNMS observed in the total test time is due simply to a reduced box count produced by this NMS method, which results in less postprocessing work during testing (mAP calculation etc.). The other surprise was the great amount of total time spent on NMS vs inference. Even under the default settings 6.9/8.1 = 85% of the total time is spent on NMS! |
CORRECTION: My previous analysis was incorrect, it lacked the python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile Default: Conclusion is that inference uses most (80%) of the runtime in both cases, and that FastNMS appears to run slightly slower than default |
Inference can be sped up with larger batch sizes, but NMS is run per image in all cases, so the only ways to affect its speed currently are here. Note that the 1.6 ms profile time uses all default settings though (none of these speedups are applied).
|
Running a few tests to document effects on speed. These are with a V100 from a docker container, which is slightly slower than running natively. python3 test.py --cfg yolov3-spp.cfg --weights yolov3-spp-ultralytics.pt --img 608 rect=False rect=True Running default natively: |
V100: 2080Ti: CPU: |
batch_size=32 means testing 32 images simultaneously including NMS? |
@Zzh-tju batch-size 32 means for example a 32x3x608x608 tensor is passed to the model for inference. The inference outputs are passed to NMS, which operates sequentially over the images: Line 508 in 4089735
|
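As a rough illustration of the point above (batched inference, then NMS image by image), here is a plain-Python sketch. The function names greedy_nms and nms_per_image are hypothetical and the repo's actual function works on tensors; this only shows the control flow, assuming (x1, y1, x2, y2) boxes.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_thres=0.5):
    """Standard greedy NMS: a box is dropped only if it overlaps a kept box."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    for idx in order:
        if all(box_iou(boxes[idx], boxes[k]) <= iou_thres for k in keep):
            keep.append(idx)
    return keep


def nms_per_image(batch_boxes, batch_scores, iou_thres=0.5):
    """Run NMS sequentially over the images of one batch (hypothetical helper)."""
    output = [None] * len(batch_boxes)  # None marks an image with no detections
    for image_i, (boxes, scores) in enumerate(zip(batch_boxes, batch_scores)):
        if not boxes:
            continue
        output[image_i] = greedy_nms(boxes, scores, iou_thres)
    return output


batch_boxes = [
    [(0, 0, 10, 10), (1, 1, 11, 11), (2, 2, 12, 12), (50, 50, 60, 60)],
    [],
    [(5, 5, 20, 20)],
]
batch_scores = [[0.9, 0.8, 0.7, 0.6], [], [0.5]]
output = nms_per_image(batch_boxes, batch_scores)
print(output)
```

Each image contributes a different number of boxes, which is one reason the loop is the simplest formulation: the per-image work cannot be stacked into one fixed-size tensor without padding or offsets.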
Test-time augmentation study #931: Default + 0 ops: 11.8/1.5/13.3 ms inference/NMS/total per 608x608 image at batch-size 1 |
Updated V100 speeds with fused inference: Default + 0 ops: 11.1/1.7/12.8 ms inference/NMS/total per 608x608 image at batch-size 1 |
SOLOv2 Table 7: Matrix NMS: UPDATE: Unable to reproduce using this code:
elif method == 'matrix_batch':  # Matrix NMS from https://arxiv.org/abs/2003.10152
    iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper triangular iou matrix
    m = iou.max(0)[0].view(-1, 1)  # max values
    decay = torch.exp(-(iou ** 2 - m ** 2) / 0.5).min(0)[0]  # gauss with sigma=0.5
    scores *= decay
    i = torch.full((boxes.shape[0],), fill_value=1).bool() |
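To make the decay step in the snippet above easier to follow, here is a plain-Python sketch of the Gaussian Matrix NMS score decay as I understand it from the SOLOv2 paper. It is not a fix for the reproduction problem: the name matrix_nms_decay is hypothetical, sigma=0.5 is copied from the snippet, and boxes are assumed to be (x1, y1, x2, y2) tuples.

```python
import math


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matrix_nms_decay(boxes, scores, sigma=0.5):
    """Return (order, decayed_scores) with boxes sorted by descending score."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    n = len(order)
    # Upper-triangular IoU matrix: iou[i][j] is non-zero only for i < j
    iou = [
        [box_iou(boxes[order[i]], boxes[order[j]]) if i < j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # comp[i]: largest overlap of box i with any higher-scoring box
    comp = [max((iou[k][i] for k in range(i)), default=0.0) for i in range(n)]
    decayed = []
    for j in range(n):
        # Gaussian decay from each higher-scoring box i, compensated by how
        # strongly box i is itself overlapped; the smallest factor wins
        decay = min(
            (math.exp(-(iou[i][j] ** 2 - comp[i] ** 2) / sigma) for i in range(j)),
            default=1.0,
        )
        decayed.append(scores[order[j]] * decay)
    return order, decayed


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (2, 2, 12, 12), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.6]
order, decayed = matrix_nms_decay(boxes, scores)
print(order, decayed)
```

Unlike hard NMS, nothing is removed here. Scores are only scaled down, so a final score threshold is still needed to decide which boxes to keep.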
Have you solved it? I met the same problems |
This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days. |
@glenn-jocher Hi, could you tell me why we cannot do NMS across a batch? Currently, NMS is done on images one by one, even though batch testing is turned on. The number of detections differs from image to image; is that the reason why we cannot perform real batch NMS? |
@Zzh-tju feel free to play around with the NMS code and try your idea out. If you see performance improvements please submit a PR! Thank you. |
@glenn-jocher I just figured out a speed improvement and will send you a PR later. You can try it and optimize it further. Torchvision NMS cannot run across images (if we add an image-related offset to the boxes, the size of the IoU matrix grows quadratically), so I had to try Cluster-NMS. I keep the preprocessing of NMS unchanged and just replace the core part of your merge NMS with Cluster-Weighted NMS.
Now I want to ask you: why does NMS time decrease as batch size increases (for torchvision NMS)? I think maybe the best way is to integrate the preprocessing of NMS into batch mode as well, even if it brings a slight performance drop. Preprocessing currently takes about 1.3~1.5 ms, and your torchvision merge NMS just 0.8 ms. There is still room for acceleration. |
@Zzh-tju ah! Thanks for the interesting study. We've actually discovered that in yolov5 the regression is improved enough that we can stop using merge, and simply use the default pytorch NMS to get the same results. So the current NMS strategy in yolov5 is not to use merge anymore. It is an interesting idea to do a batched NMS approach instead of calling the nms function once per image. Your results show a significant improvement, 2.3 / 3.0 is about 25% faster (!). This would make a huge improvement on yolov5s for example, which has an inference time of 2.1ms per image at batch-size 32 FP16, about half of which is used up by NMS. See speeds here. NMS is about 1 ms per image in these numbers, so a 25% speedup there would be noticeable in the table. |
Right now the boxes are offset by (class * max_image_size) to get batched per image (so different classes never overlap). I suppose to run once per batch we would offset boxes by (class * max_image_size * image_index)? Are you using torchvision.ops.nms() or torchvision.ops._batched_nms()? |
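The offset trick described here can be sketched in plain Python as below. This is an assumption-laden illustration, not the repo's code: offset_boxes, num_classes and max_wh are hypothetical names, and max_wh is assumed to be larger than any image side. One deliberate difference from the formula floated in the comment: the sketch uses a combined group id (image_index * num_classes + class) rather than a product with image_index, because a product would give every box of image 0 an offset of zero.

```python
def offset_boxes(boxes, classes, image_indices, num_classes, max_wh=4096):
    """Shift each (x1, y1, x2, y2) box so that different (image, class) groups never overlap."""
    shifted = []
    for (x1, y1, x2, y2), c, b in zip(boxes, classes, image_indices):
        group = b * num_classes + c  # unique id per (image, class) pair
        off = group * max_wh         # max_wh assumed > any image side
        shifted.append((x1 + off, y1 + off, x2 + off, y2 + off))
    return shifted


# Three identical boxes: same image/different class, and different image/same class
boxes = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
classes = [0, 1, 0]
image_indices = [0, 0, 1]
shifted = offset_boxes(boxes, classes, image_indices, num_classes=80)
print(shifted)
```

After shifting, a single class-agnostic NMS call over all boxes cannot suppress across groups, since boxes from different groups have zero IoU. The trade-off raised earlier in the thread still applies: one call over a whole batch works on all box pairs at once.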
@glenn-jocher no, you misunderstand me. My question is: why does NMS speed also increase as batch size increases? |
@Zzh-tju in my experiments with yolov5, NMS speed is the same no matter the batch size. For example from the notebook:
Output:
So 1.8ms, 2.2ms, 2.1ms at batch sizes 1, 8, 32. Basically NMS speed per image is not correlated to batch size. |
got it @glenn-jocher , I will do more test with batchsize. |
@glenn-jocher Hi, I have just finished a marginal work about Batch Mode Weighted Cluster-NMS for speeding up NMS. You can check https://github.com/Zzh-tju/yolov5 for details. My conclusion is Batch mode Weighted Cluster-NMS will benefit us when TTA is used. |
@Zzh-tju ah, very interesting! I'll check out the forked repo. |
@Zzh-tju I looked things over. You've clearly done a lot of work and experimentation! I see it's hard to provide substantial gains over the basic NMS unfortunately. I think this is because box regression is improving over past works, so perhaps the gains from merging two 0.90 IoU boxes are smaller than, for example, from merging two 0.5 IoU boxes. It's unfortunate, because one of the yolov5 changes is actually increased grid sensitivity. In yolov3, only one cell per output layer could trigger on an object. In yolov5, >=3 cells per output layer always trigger per object (the nearest 3), so I'd expect many more boxes to be proposed by yolov5 than by yolov3. It's frustrating that there isn't a better way to exploit all these extra statistics. One very interesting thing I discovered during the TTA and Ensembling work is that merging output grids always produced better results than appending output boxes together. If you look at the YOLOv5 ensembling module you will see that there are 3 options:
If there was a way to mean() TTA output grids the way that mean ensemble works, this might produce the best results, but it is very complicated due to the varying output shapes unfortunately, so abandoned this effort. |
@glenn-jocher wait a second, why do TTA output grids have different shape of outputs? |
@glenn-jocher And I did see an improvement when merging two 0.8 IoU boxes rather than two 0.65 boxes. |
@Zzh-tju ensemble output grids will have the same shape, for example if you run both YOLOv5s and YOLOv5m at the same image size, the 3 output grids from YOLOv5s are the same size as those from YOLOv5m. TTA uses different inference sizes as part of its augmentation, so naturally the output grids will change in size, and can no longer be directly averaged. Hmm, interesting, 0.8 IoU is higher than I've ever tried. I think the more accurate the box regressions, the higher you can raise the IoU threshold. What was the improvement you saw using 0.8 IoU? |
@glenn-jocher see the results in https://github.com/Zzh-tju/yolov5. weighted threshold is the merging threshold |
@glenn-jocher |
@Zzh-tju yes. YOLOv5 strides are 8, 16, 32 on the small, medium and large object output layers. So a 640x640 image will have 3 output grids of size 80x80, 40x40, 20x20. The same output grids for a 320x320 image are 40x40, 20x20, 10x10. |
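The grid arithmetic above is just image size divided by stride. A tiny sketch, where grid_sizes is a hypothetical helper and square inputs divisible by the strides are assumed:

```python
def grid_sizes(img_size, strides=(8, 16, 32)):
    """Side length of each output grid for a square input, in stride order."""
    return [img_size // s for s in strides]


g640 = grid_sizes(640)  # grids for a 640x640 image
g320 = grid_sizes(320)  # grids for a 320x320 image
same_shape = g640 == g320
print(g640, g320, same_shape)
```

Since the grids at the two sizes do not match layer for layer, TTA outputs taken at different scales cannot simply be averaged element-wise the way same-size ensemble outputs can.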
Non-maximum suppression (NMS) of bounding boxes is a significant speed constraint during testing. I am opening this issue to try to determine options for speeding up this operation. I am going to compare the default NMS method 'MERGE' with two newly available PyTorch methods. If anyone has any additional methods we could test, please post here.
yolov3/utils/utils.py
Line 456 in cadd2f7
The test code is below. Hardware is a 2080Ti.
UPDATE: THESE ARE OLD RESULTS, SEE BOTTOM OF THREAD FOR IMPROVED RESULTS
Results table (values not preserved). Columns: mm:ss, @0.5...0.95, @0.5. Rows: 'OR', 'AND', 'SOFT', 'MERGE'.