
Ops to convert masks to boxes #3960

Closed
oke-aditya opened this issue Jun 3, 2021 · 20 comments · Fixed by #4290

Comments

@oke-aditya
Contributor

oke-aditya commented Jun 3, 2021

🚀 Feature

A simple torchvision.ops utility to convert segmentation masks to bounding boxes.

Motivation

This has a few use-cases.

  1. This makes it easier to use semantic segmentation datasets for object detection, simplifying the pipeline.
    Since torchvision.ops represents bounding boxes in xyxy format by convention, the op should probably convert masks to xyxy as well.

  2. The other use case is making it easier to compare the performance of a segmentation model against a detection model.
    Say a detection model performs well on a segmentation dataset; it would then be better to go ahead with the detection model, which is faster in real-time use cases, rather than train a segmentation model.

New Pipeline

from torch.utils.data import Dataset
from torchvision.ops import masks_to_boxes, box_convert

class SegmentationToDetectionDataset(Dataset):
    def __getitem__(self, idx):
        segmentation_masks = self.load_masks(idx)  # hypothetical dataset-specific loader
        boxes_xyxy = masks_to_boxes(segmentation_masks)

        # Convert the boxes to COCO (xywh) format if needed.
        boxes_xywh = box_convert(boxes_xyxy, in_fmt="xyxy", out_fmt="xywh")
        return boxes_xywh

Pitch

Port the masks_to_boxes function from MDETR.

masks_to_boxes was also used in DETR.
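
A minimal sketch of what the ported op could look like, using a simple per-mask min/max over nonzero pixels rather than DETR's exact vectorized implementation (an illustration, not the final API):

import torch

def masks_to_boxes(masks: torch.Tensor) -> torch.Tensor:
    # Sketch: (N, H, W) masks -> (N, 4) boxes in xyxy format.
    # Assumes every mask has at least one nonzero pixel.
    boxes = torch.zeros((masks.shape[0], 4), dtype=torch.float, device=masks.device)
    for i, mask in enumerate(masks):
        ys, xs = torch.where(mask != 0)
        boxes[i] = torch.stack([xs.min(), ys.min(), xs.max(), ys.max()])
    return boxes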

Alternatives

The above function assumes masks of shape (N, H, W), i.e. (num_masks, height, width), as a floating-point tensor.
IIRC, we used a boolean tensor in draw_segmentation_masks (after Nicolas refactored it). So perhaps we should use a boolean tensor here too? Though I see no particular reason for this util to be valid only for instance segmentation.

Additional context

I can port this; we perhaps need a few tests to ensure it works fine.
Especially test for float16 overflow.
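
For illustration, a sketch of the kind of test that could catch such an overflow; it assumes the ported op accepts float16 masks and computes coordinates exactly (the test name and setup are hypothetical):

import torch
from torchvision.ops import masks_to_boxes  # assumes the op exists after the port

def test_masks_to_boxes_float16():
    # float16 cannot represent integers above 2048 exactly, so a pixel at
    # x = 4097 would silently shift if coordinates were computed in fp16.
    masks = torch.zeros((1, 1, 5000), dtype=torch.float16)
    masks[0, 0, 4097] = 1.0
    boxes = masks_to_boxes(masks)
    assert boxes.tolist() == [[4097.0, 0.0, 4097.0, 0.0]]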

cc @datumbox @NicolasHug

@NicolasHug
Member

I think this could be useful, at least as a plotting util. I'm not sure I fully understand the use of such a util within a model though?

So perhaps we should be using boolean tensor?

Yes, the input should be boolean: models have different representations for floating masks, as illustrated in the new example, so we can't have a unified approach for floating masks.

@oke-aditya
Contributor Author

I think it should be under ops and not utils, as utils contains utilities that do not manipulate masks / boxes at all.

They return the passed image tensors back.

This is an operation (non-differentiable, of course), hence it fits torchvision.ops better.

The only models with a different representation are instance segmentation models. Is this utility essential for instance segmentation? (I'm not sure.)

@datumbox
Contributor

datumbox commented Jun 4, 2021

Thanks for bringing this up @oke-aditya.

This is a textbook example of a utility we want to upstream from DETR to TorchVision. I think it should be OK to place it in torchvision.ops.boxes along with the other box utilities already ported.

Especially test for float16 overflow.

Yes please, good call. We need to be careful about numeric overflows. See #3383, which addressed a similar issue on other box ops.

Edit: Also, since we are porting things from DETR, let's make sure we give credit by adding a reference to the original code. You can see examples of that already in our code.

@oke-aditya
Contributor Author

@datumbox, what do you think about the segmentation masks? Should they be kept as float (num_masks, H, W) tensors, or should we change to adopt bool masks?

I would prefer to keep float masks.

@datumbox
Contributor

datumbox commented Jun 4, 2021

This makes it easier to use semantic segmentation datasets for object detection.

I think, as per your original point, you would have to support (num_masks, H, W) if you were to convert the masks of a specific dataset to boxes for object detection. See how DETR uses the method on the CocoPanoptic dataset.

I also see value in supporting bool masks, as this will make it work seamlessly with Nicolas' draw_segmentation_masks, but it's unclear to me whether that should happen in the same method or in a separate one. I would personally start by porting DETR's implementation, adding tests, handling overflows, and take it from there.

Adding @fmassa to the discussion as he usually has good intuition around segmentation/detection.

@0x00b1
Contributor

0x00b1 commented Aug 17, 2021

Has anyone started to work on this? If not, I'm more than happy to port it over.

@datumbox
Contributor

I think supporting boolean masks probably requires additional discussion, but porting DETR's masks_to_boxes method can be done now. @0x00b1 happy to review a PR about it. Just tag me. :)

@NicolasHug
Member

I would really rather not start implementing a method that is only specific to a small subset of models.
We should try to avoid the same scenario where we had to re-write the entire draw_segmentation_masks util: #3824

@datumbox
Contributor

This is a general-purpose utility unrelated to drawing. As part of the Batteries Included initiative, we are upstreaming some utils that exist in ecosystem projects such as DETR. I think masks_to_boxes is a meaningful addition as it can be used on specific datasets such as the CocoPanoptic dataset, which I mentioned earlier.

@NicolasHug
Member

I understand it's unrelated to drawing. My point is that we should aim for this to be as generic as possible so that we can support all possible torchvision use-cases, instead of a subset of them. I might be missing something but the current proposed API seems specifically targeted towards a subset of models / datasets.

@datumbox
Contributor

Could you please elaborate on why this specific operator is not generic enough to be included?

IMO a quick search shows that the operator is generic and useful across different use-cases, and as a result it has ended up being copy-pasted over and over in multiple projects across FAIR. This makes it an excellent candidate for upstreaming, which is the key goal of Batteries Included.

@NicolasHug
Member

From our example https://pytorch.org/vision/stable/auto_examples/plot_visualization_utils.html, instance segmentation masks and semantic segmentation masks both rely on float values, but they are encoded very differently and those float values don't mean the same thing. The "greatest common factor" of these two distinct representations is the boolean representation.
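
For concreteness, the two representations from that example reduce to boolean masks roughly as follows (a sketch; the shapes follow the visualization gallery, while the tensors, class index, and threshold are illustrative stand-ins):

import torch

# Hypothetical model outputs (random stand-ins for real predictions).
sem_output = torch.rand(21, 64, 64)        # semantic: (num_classes, H, W) scores
inst_output = torch.rand(3, 1, 64, 64)     # instance: (num_instances, 1, H, W) probabilities

# Semantic masks: a pixel belongs to the class with the highest score.
sem_classes = sem_output.argmax(dim=0)     # (H, W) class indices
class_bool_mask = sem_classes == 12        # (H, W) boolean mask for one class

# Instance masks: a pixel belongs to an instance if its probability clears a threshold.
inst_bool_masks = inst_output[:, 0] > 0.5  # (num_instances, H, W) boolean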

@fmassa
Member

fmassa commented Aug 18, 2021

Hi all,

I think that having a function to convert from segmentation masks to bounding boxes would be very useful.
As @oke-aditya pointed out, some datasets come only with segmentation masks by default, but the Faster R-CNN family of models actually requires the bounding boxes as well, so this would simplify things.

In fact, in our object detection finetuning tutorial we re-implement a masks_to_boxes function to be able to train Faster R-CNN on a custom dataset.

But I understand @NicolasHug's point that there can be some confusion about what a mask is, so we should be very clear about the input representation. I think enforcing that it is a boolean mask of shape (num_masks, H, W) is an OK trade-off.
The confusion arises from the fact that segmentation is an overloaded term, with semantic segmentation, instance segmentation and panoptic segmentation all meaning slightly different things.
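
The tutorial's approach boils down to splitting an instance-label image into per-object boolean masks and boxing each one; roughly (the toy label image below is made up for illustration, and the masks_to_boxes import assumes the op after the port):

import torch
from torchvision.ops import masks_to_boxes

# Toy instance-label image standing in for e.g. PennFudan masks:
# 0 = background, positive integers are instance ids.
mask = torch.zeros(64, 64, dtype=torch.int64)
mask[5:15, 5:20] = 1
mask[30:50, 30:40] = 2

obj_ids = torch.unique(mask)[1:]          # drop the background id
masks = mask == obj_ids[:, None, None]    # (num_objs, H, W) boolean masks
boxes = masks_to_boxes(masks)             # (num_objs, 4) xyxy boxes for Faster R-CNN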

@datumbox
Contributor

@oke-aditya @0x00b1 please coordinate between the two of you on who will send the PR, to avoid duplication of effort. It's worth including tests for different dtypes to ensure that everything works properly (see here for examples).

@oke-aditya
Contributor Author

oke-aditya commented Aug 18, 2021

Hey @0x00b1, feel free to send a PR!

Edit:
It would be nice to document a small example (probably in the gallery) demonstrating how this could be used with existing datasets. Since a lot of downstream libraries depend on torchvision, they could make use of this.

@0x00b1
Contributor

0x00b1 commented Aug 18, 2021

@oke-aditya Great! I'll send one this afternoon. I'll include a gallery example.

@RylanSchaeffer

Quick question: this PR looks applicable only to (batched) 2D images. What about 3D images? I would also like to convert 3D segmentation masks to 3D bounding boxes.
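
The proposed op is 2D-only, but the same min/max reduction generalizes to volumes; a hypothetical 3D variant (masks_to_boxes_3d is not part of the proposal) could look like:

import torch

def masks_to_boxes_3d(masks: torch.Tensor) -> torch.Tensor:
    # Sketch: (N, D, H, W) boolean masks -> (N, 6) boxes as (x1, y1, z1, x2, y2, z2).
    # Assumes every mask has at least one True voxel.
    boxes = torch.zeros((masks.shape[0], 6), dtype=torch.float, device=masks.device)
    for i, mask in enumerate(masks):
        zs, ys, xs = torch.where(mask)
        boxes[i] = torch.stack([xs.min(), ys.min(), zs.min(), xs.max(), ys.max(), zs.max()])
    return boxes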

@addisonklinke

Could this be generalized to allow for the possibility of multiple, discrete boxes per mask, or would that be better suited for a separate function? The following example demonstrates the current vs. desired behavior:

import torch
from torchvision.ops import masks_to_boxes

# Generate dummy [1, 5, 5] mask
masks = torch.tensor([
    [0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]).unsqueeze(0)

# Current (assumed single box) : [[0, 1, 3, 4]]
# Desired (allow multiple)     : [[[0, 1, 1, 2], [2, 3, 3, 4]]]
boxes = masks_to_boxes(masks)
print(boxes)

Note in this case the return tensor should have rank 3 instead of 2 to represent [num_masks, num_boxes, 4]
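
One way to get multiple boxes per mask is to label connected components first and box each component separately; a sketch using scipy (not part of the torchvision op):

import torch
from scipy import ndimage

def masks_to_multi_boxes(masks):
    # Sketch: (N, H, W) masks -> list of per-mask box lists, one xyxy box per connected component.
    all_boxes = []
    for mask in masks:
        labeled, num_components = ndimage.label(mask.numpy())
        boxes = []
        for component in range(1, num_components + 1):
            ys, xs = (labeled == component).nonzero()
            boxes.append([xs.min(), ys.min(), xs.max(), ys.max()])
        all_boxes.append(boxes)
    return all_boxes

With the 5x5 example above, the default 4-connectivity keeps the two diagonally touching blobs separate, yielding [[0, 1, 1, 2], [2, 3, 3, 4]].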

@syed-javed

Could this be generalized to allow for the possibility of multiple, discrete boxes per mask, or would that be better suited for a separate function? [...]
Hello @addisonklinke, do you have a solution that gives the desired output you mentioned?

@addisonklinke

addisonklinke commented Jan 27, 2022

@syed-javed Yes, I've got one working now. The strategy is to iterate through each (x, y) location where there's a positive (i.e. confidence > threshold) prediction. From those locations, iteratively expand outwards as long as each boundary edge has an average confidence greater than the threshold. Points that overlap with a previously created box are ignored to speed up the iteration.

With the function below, you can reproduce my desired output. Please note my input tensor is slightly different, specifically torch.FloatTensor[H, W] instead of torch.BoolTensor[N, H, W]. Also, the return is a tuple of (boxes, scores), where scores is the average confidence of each region:

boxes, scores = heatmap_to_bboxes(masks.squeeze().float())
# boxes: [[0, 1, 1, 2], [2, 3, 3, 4]]
# scores: [1, 1]

The function:

from copy import deepcopy
import torch
from torchvision.ops import batched_nms


def heatmap_to_bboxes(heatmap, pos_thres=0.5, nms_thres=0.5, score_thres=0.5):
    """Cluster heatmap into discrete bounding boxes

    :param torch.Tensor[H, W] heatmap: Predicted probabilities
    :param float pos_thres: Threshold for assigning probability to positive class
    :param Optional[float] nms_thres: Threshold for non-max suppression (or ``None`` to skip)
    :param Optional[float] score_thres: Threshold for final bbox scores (or ``None`` to skip)
    :return Tuple[torch.Tensor]: Containing
        * bboxes[N, C=4]: bounding box coordinates in ltrb format
        * scores[N]: confidence scores (averaged across all pixels in the box)
    """

    def get_roi(data, bounds):
        """Extract region of interest from a tensor

        :param torch.Tensor[H, W] data: Original data
        :param dict bounds: With keys for left, right, top, and bottom
        :return torch.Tensor[H', W']: Subset of the original data
        """
        compound_slice = (
            slice(bounds['top'], bounds['bottom']),
            slice(bounds['left'], bounds['right']))
        return data[compound_slice]

    def is_covered(x, y, bbox):
        """Determine whether a point is covered/inside a bounding box

        :param int x: Point x-coordinate
        :param int y: Point y-coordinate
        :param torch.Tensor[int(4)] bbox: In ltrb format
        :return bool: Whether all boundaries are satisfied
        """
        left, top, right, bottom = bbox
        bounds = [
            x >= left,
            x <= right,
            y >= top,
            y <= bottom]
        return all(bounds)

    # Determine indices of each positive pixel
    heatmap_bin = torch.where(heatmap > pos_thres, 1, 0)
    mask = torch.ones(heatmap.size()).type_as(heatmap)
    idxs = torch.flip(torch.nonzero(heatmap_bin*mask), [1])
    heatmap_height, heatmap_width = heatmap.shape

    # Limit potential expansion to the heatmap boundaries
    edge_names = ['left', 'top', 'right', 'bottom']
    limits = {
        'left': 0,
        'top': 0,
        'right': heatmap_width,
        'bottom': heatmap_height}
    bboxes = []
    scores = []

    # Iterate over positive pixels
    for x, y in idxs:

        # Skip if an existing bbox already covers this point
        already_covered = False
        for bbox in bboxes:
            if is_covered(x, y, bbox):
                already_covered = True
                break
        if already_covered:
            continue

        # Start by looking 1 row/column in every direction and iteratively expand the ROI from there
        incrementers = {k: 1 for k in edge_names}
        max_bounds = {
            'left': deepcopy(x),
            'top': deepcopy(y),
            'right': deepcopy(x),
            'bottom': deepcopy(y)}
        while True:

            # Extract the new, expanded ROI around the current (x, y) point
            bounds = {
                'left': max(limits['left'], x - incrementers['left']),
                'top': max(limits['top'], y - incrementers['top']),
                'right': min(limits['right'], x + incrementers['right'] + 1),
                'bottom': min(limits['bottom'], y + incrementers['bottom'] + 1)}
            roi = get_roi(heatmap_bin, bounds)

            # Get the vectors along each edge
            edges = {
                'left': roi[:, 0],
                'top': roi[0, :],
                'right': roi[:, -1],
                'bottom': roi[-1, :]}

            # Continue if at least one new edge has more than ``pos_thres`` percent positive elements
            # Also check whether ROI has reached the heatmap boundary
            keep_going = False
            for k, v in edges.items():
                if v.sum()/v.numel() > pos_thres and limits[k] != max_bounds[k]:
                    keep_going = True
                    max_bounds[k] = bounds[k]
                    incrementers[k] += 1

            # If none of the newly expanded edges were useful
            # Then convert the maximum ROI to bbox and calculate its confidence
            # Single pixel islands are ignored since they have zero width/height
            if not keep_going:
                final_roi = get_roi(heatmap, max_bounds)
                if final_roi.numel() > 0:
                    bboxes.append([max_bounds[k] - 1 if i > 1 else max_bounds[k] 
                                   for i, k in enumerate(edge_names)])
                    scores.append(final_roi.mean())
                break

    # Type conversions and optional NMS + score filtering
    bboxes = torch.tensor(bboxes).type_as(heatmap)
    scores = torch.tensor(scores).type_as(heatmap)
    if nms_thres is not None:
        class_idxs = torch.zeros(bboxes.shape[0])
        keep_idxs = batched_nms(bboxes.float(), scores, class_idxs, iou_threshold=nms_thres)
        bboxes = bboxes[keep_idxs]
        scores = scores[keep_idxs]
    if score_thres is not None:
        high_confid = scores > score_thres
        bboxes = bboxes[high_confid]
        scores = scores[high_confid]
    return bboxes, scores
