<!DOCTYPE html>
<html>
<head>
<title>Ch. 6 - Object detection and
segmentation</title>
<meta name="Ch. 6 - Object detection and
segmentation" content="text/html; charset=utf-8;" />
<link rel="canonical" href="http://manipulation.csail.mit.edu/segmentation.html" />
<script src="https://hypothes.is/embed.js" async></script>
<script type="text/javascript" src="htmlbook/book.js"></script>
<script src="htmlbook/mathjax-config.js" defer></script>
<script type="text/javascript" id="MathJax-script" defer
src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js">
</script>
<script>window.MathJax || document.write('<script type="text/javascript" src="htmlbook/MathJax/es5/tex-chtml.js" defer><\/script>')</script>
<link rel="stylesheet" href="htmlbook/highlight/styles/default.css">
<script src="htmlbook/highlight/highlight.pack.js"></script> <!-- http://highlightjs.readthedocs.io/en/latest/css-classes-reference.html#language-names-and-aliases -->
<script>hljs.initHighlightingOnLoad();</script>
<link rel="stylesheet" type="text/css" href="htmlbook/book.css" />
</head>
<body onload="loadChapter('manipulation');">
<div data-type="titlepage" pdf="no">
<header>
<h1><a href="index.html" style="text-decoration:none;">Robotic Manipulation</a></h1>
<p data-type="subtitle">Perception, Planning, and Control</p>
<p style="font-size: 18px;"><a href="http://people.csail.mit.edu/russt/">Russ Tedrake</a></p>
<p style="font-size: 14px; text-align: right;">
© Russ Tedrake, 2020-2021<br/>
Last modified <span id="last_modified"></span>.<br/>
<script>
var d = new Date(document.lastModified);
document.getElementById("last_modified").innerHTML = d.getFullYear() + "-" + (d.getMonth()+1) + "-" + d.getDate();</script>
<a href="misc.html">How to cite these notes, use annotations, and give feedback.</a><br/>
</p>
</header>
</div>
<p pdf="no"><b>Note:</b> These are working notes used for <a
href="http://manipulation.csail.mit.edu/Fall2021/">a course being taught
at MIT</a>. They will be updated throughout the Fall 2021 semester. <!-- <a
href="https://www.youtube.com/channel/UChfUOAhz7ynELF-s_1LPpWg">Lecture videos are available on YouTube</a>.--></p>
<table style="width:100%;" pdf="no"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=clutter.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=force.html>Next Chapter</a></td>
</tr></table>
<!-- EVERYTHING ABOVE THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<chapter style="counter-reset: chapter 5"><h1>Object detection and
segmentation</h1>
<p>Our study of geometric perception gave us good tools for estimating the
pose of a known object. These algorithms can produce highly accurate
estimates, but are still subject to local minima. When the scenes get more
cluttered/complicated, or if we are dealing with many different object types,
they really don't offer an adequate solution by themselves.</p>
<p>Deep learning has given us data-driven solutions that complement our
geometric approaches beautifully. Finding correlations in massive datasets
has proven to be a fantastic way to provide practical solutions to these more
"global" problems like detecting whether the mustard bottle is even in the
scene, segmenting out the portion of the image / point cloud that is relevant
to the object, and even in providing a rough estimate of the pose that could
be refined with a geometric method.</p>
<p>There are many sources of information about deep learning on the internet,
and I have no ambition of replicating nor replacing them here. But this
chapter does begin our exploration of deep perception in manipulation, and I
feel that I need to give just a little context.</p>
<section><h1>Getting to big data</h1>
<subsection><h1>Crowd-sourced annotation datasets</h1>
<p>The modern revolution in computer vision was unquestionably fueled by
the availability of massive annotated datasets. The most famous of all is
ImageNet, which eclipsed previous datasets with the number of images and
the accuracy and usefulness of the labels<elib>Russakovsky15</elib>.
Fei-Fei Li, who led the creation of ImageNet, has been giving talks that
give some nice historical perspective on how ImageNet came to be.
<a href="https://youtu.be/0bb9EcaQupI">Here is one</a> (slightly) tailored
to robotics and even manipulation; you might start <a
href="https://youtu.be/0bb9EcaQupI?t=857">here</a>. </p>
<p><elib>Russakovsky15</elib> describes the annotations available in
ImageNet: <blockquote>... annotations fall into one of two categories: (1)
image-level annotation of a binary label for the presence or absence of an
object class in the image, e.g., "there are cars in this image" but "there
are no tigers," and (2) object-level annotation of a tight bounding box
and class label around an object instance in the image, e.g., "there is a
screwdriver centered at position (20,25) with width of 50 pixels and
height of 30 pixels".</blockquote></p>
<figure><img width="80%"
src="data/coco_instance_segmentation.jpeg"/><figcaption>A sample annotated
image from the COCO dataset, illustrating the difference between
image-level annotations, object-level annotations, and segmentations at
the class/semantic- or instance-level.</figcaption></figure>
<p>The COCO dataset similarly enabled pixel-wise <i>instance-level</i>
segmentation <elib>Lin14a</elib>, where distinct instances of a class are
given a unique label (and also associated with the class label). COCO has
fewer object categories than ImageNet, but more instances per category.
It's still shocking to me that they were able to get 2.5 million object instances
labeled at the pixel level. I remember some of the early projects at MIT
when crowd-sourced image labeling was just beginning (projects like
LabelMe <elib>Russell08</elib>); Antonio Torralba used to joke about how
surprised he was about the accuracy of the (nearly) pixel-wise annotations
that he was able to crowd-source (and that his mother was a particularly
prolific and accurate labeler)!</p>
<p>Instance segmentation turns out to be a very good match for the
perception needs we have in manipulation. In the last chapter we found
ourselves with a bin full of YCB objects. If we want to pick out only the
mustard bottles, and pick them out one at a time, then we can use a deep
network to perform an initial instance-level segmentation of the scene,
and then use our grasping strategies on only the segmented point cloud.
Or if we do need to estimate the pose of an object (e.g. in order to place
it in a desired pose), then segmenting the point cloud can also
dramatically improve the chances of success with our geometric pose
estimation algorithms.</p>
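<p>To make this concrete, here is a minimal sketch (my own, not from the
course code) of how an instance mask can crop a depth image down to an
object-specific point cloud before we run our grasping or pose estimation
machinery. It assumes a simple pinhole camera model with intrinsics
<code>K</code>:</p>
<pre><code class="language-python">
import numpy as np

def masked_point_cloud(depth, mask, K):
    """Back-project only the masked pixels into a point cloud.

    depth: (H, W) depth image in meters, with 0 marking invalid returns.
    mask:  (H, W) boolean instance mask from the segmentation network.
    K:     3x3 pinhole camera intrinsics.
    Returns an (N, 3) array of points in the camera frame.
    """
    v, u = np.nonzero(np.logical_and(mask, depth > 0))
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)
</code></pre>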
</subsection>
<subsection><h1>Segmenting new classes via fine tuning</h1>
<p>The ImageNet and COCO datasets contain labels for <a
href="https://tech.amikelive.com/node-718/what-object-categories-labels-are-in-coco-dataset/">a
variety of interesting classes</a>, including cows, elephants, bears,
zebras and giraffes. They have a few classes that are more relevant to
manipulation (e.g., plates, forks, knives, and spoons), but they don't
have a mustard bottle nor a can of potted meat like we have in the YCB
dataset. So what are we to do? Must we produce the same image annotation
tools and pay for people to label thousands of images for us?</p>
<p>One of the most amazing and magical properties of the deep
architectures that have been working so well for instance-level
segmentation is their ability to transfer to new tasks. A network that
was pre-trained on a large dataset like ImageNet or COCO can be
fine-tuned with a relatively much smaller amount of labeled data to a new
set of classes that are relevant to our particular application. In fact,
the architectures are often referred to as having a "backbone" and a
"head" -- in order to train a new set of classes, it is often possible to
just pop off the existing head and replace it with a new head for the new
labels. A relatively small amount of training with a relatively small
dataset can still achieve surprisingly robust performance. Moreover, it
seems that training initially on the diverse dataset (ImageNet or COCO)
is actually important to learn the robust perceptual representations that
work for a broad class of perception tasks. Incredible!</p>
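<p>To make the "pop off the head" recipe concrete, here is roughly what it
looks like with the <code>torchvision</code> Mask R-CNN implementation that
we'll use later in this chapter. Consider it a sketch: the number of classes
and the hidden-layer width are placeholder values.</p>
<pre><code class="language-python">
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 1 + 6  # background + (for example) six YCB object classes

# Backbone and heads pre-trained on COCO.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# Pop off the box-prediction head; replace it with one sized for our labels.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Do the same for the mask-prediction head.
in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256,
                                                   num_classes)
</code></pre>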
<p>This is great news! But we still need some amount of labeled data for
our objects of interest. The last few years have seen a number of
start-ups based purely on the business model of helping you get your
dataset labeled. But thankfully, this isn't our only option.
</p>
</subsection>
<subsection><h1>Annotation tools for manipulation</h1>
<p>Just as projects like LabelMe helped to streamline the process of
providing pixel-wise annotations for images downloaded from the web, there
are a number of tools that have been developed to streamline the
annotation process for robotics. Our group introduced a tool called
<a href="http://labelfusion.csail.mit.edu/">LabelFusion</a>, which combines geometric perception of point clouds
with a simple user interface to very rapidly label a large number of
images<elib>Marion17</elib>.</p>
<figure><img id="labelfusion" style="width:75%" src="data/labelfusion_multi_object_scenes.png" onmouseenter="this.src = 'https://github.com/RobotLocomotion/LabelFusion/raw/master/docs/fusion_short.gif'" onmouseleave="this.src = 'data/labelfusion_multi_object_scenes.png'"/>
<figcaption>A multi-object scene from <a
href="http://labelfusion.csail.mit.edu/">LabelFusion</a>
<elib>Marion17</elib>. (Mouse over for animation)</figcaption>
</figure>
<p>In LabelFusion, we provide multiple RGB-D images of a static scene
containing some objects of interest, and the CAD models for those
objects. First we use a dense reconstruction algorithm,
ElasticFusion<elib>Whelan16</elib>, to merge the point clouds from the
individual images into a single dense reconstruction; this is just
another instance of the point cloud registration problem. The dense
reconstruction algorithm also localizes the camera relative to the point
cloud. Then we provide a simple GUI that asks the user to click on three
points on the model and three points in the scene to establish the
"global" correspondence, and then we run ICP to refine the pose
estimate. Finally, we render the pixels from the CAD model on top of all
of the images in the established pose, giving beautiful pixel-wise
labels.</p>
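<p>The "three clicks" step is just a tiny instance of the correspondence-based
registration problem from the previous chapters. As a reminder of how little
is needed once correspondences are known, here is a minimal sketch (my
implementation, not LabelFusion's) of the least-squares rigid transform from
a few paired points:</p>
<pre><code class="language-python">
import numpy as np

def pose_from_correspondences(p_model, p_scene):
    """Least-squares rigid transform from N >= 3 paired points (Kabsch).

    p_model, p_scene: (N, 3) arrays of corresponding points.
    Returns R, t such that p_scene is approximately R @ p_model + t.
    """
    cm, cs = p_model.mean(axis=0), p_scene.mean(axis=0)
    H = (p_model - cm).T @ (p_scene - cs)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cs - R @ cm
    return R, t
</code></pre>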
<p>Tools like LabelFusion can be used to label large numbers of images very
quickly (three clicks from a user produces ground truth labels in many
images).</p>
</subsection>
<subsection><h1>Synthetic datasets</h1>
<p>All of this real world data is incredibly valuable. But we have
another super powerful tool at our disposal: simulation! Computer vision
researchers have traditionally been very skeptical of training perception
systems on synthetic images, but as game-engine quality physics-based
rendering has become a commodity technology, roboticists have been using
it aggressively to supplement or even replace their real-world datasets.
The annual robotics conferences now feature regular <a
href="https://sim2real.github.io/">workshops and/or debates</a> on the
topic of "sim2real". For any specific scene or narrow class of objects,
we can typically generate accurate enough art assets (with material
properties that are often still tuned by an artist) and environment maps
/ lighting conditions that rendered images can be highly effective in a
training dataset. The bigger question is whether we can generate a
<i>diverse</i> enough set of data with distributions representative of
the real world to train robust feature detectors in the way that we've
managed to train with ImageNet. But for many serious robotics groups,
synthetic data generation pipelines have significantly augmented or even
replaced real-world labeled data.</p>
<p>For the purposes of this chapter, I aim to train an instance-level
segmentation system that will work well on our simulated images. For
this use case, there is (almost) no debate! Leveraging the pre-trained
backbone from COCO, I will use only synthetic data for fine tuning.</p>
<p>You may have noticed it already, but the <code>RgbdSensor</code> that we've been using in Drake actually has a "label image" output port that we haven't used yet.</p>
<div>
<script type="text/javascript">
var rgbdsensor = { "name": "RgbdSensor",
"input_ports" : ["geometry_query"],
"output_ports" : ["color_image", "depth_image_32f", "depth_image_16u", "<b>label_image</b>", "X_WB"]
};
document.write(system_html(rgbdsensor, "https://drake.mit.edu/doxygen_cxx/classdrake_1_1systems_1_1sensors_1_1_rgbd_sensor.html"));
</script>
</div>
<p>This output port exists precisely to support the perception training
use case we have here. It outputs an image that is identical to the RGB
image, except that every pixel is "colored" with a unique instance-level
identifier.</p>
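<p>For reference, evaluating that port from pydrake looks roughly like the
sketch below. I'm assuming you already have a built <code>diagram</code>
containing an <code>RgbdSensor</code> bound to the variable
<code>sensor</code>, so treat the details as approximate:</p>
<pre><code class="language-python">
# Sketch: `diagram` is a built Diagram containing the RgbdSensor `sensor`.
context = diagram.CreateDefaultContext()
sensor_context = sensor.GetMyContextFromRoot(context)

label_image = sensor.label_image_output_port().Eval(sensor_context)
# label_image.data is a (height, width, 1) int16 array; each pixel holds an
# integer render label for the geometry it sees instead of a color.
</code></pre>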
<figure><img style="width:100%" src="data/clutter_maskrcnn_labels.png"/>
<figcaption>Pixelwise instance segmentation labels provided by the "label
image" output port from <code>RgbdSensor</code>. I've remapped the colors
to be more visually distinct.</figcaption> </figure>
<example><h1>Generating training data for instance segmentation</h1>
<p>I've provided a simple script that runs our "clutter generator" from
our bin picking example that drops random YCB objects into the bin.
After a short simulation, I render the RGB image and the label image,
and save them (along with some metadata with the instance and class
identifiers) to disk.</p>
<p>I've verified that this code <i>can</i> run on Colab, but to make a
dataset of 10k images using this un-optimized process takes about an
hour on my big development desktop. And curating the files is just
easier if you run it locally. So I've provided this one as a Python
script instead.</p>
<pysrc no_colab=True>manipulation/clutter_maskrcnn_data.py</pysrc>
<todo>Provide a colab version?</todo>
<p>You can also feel free to skip this step! I've uploaded the 10k
images that I generated <a
href="https://groups.csail.mit.edu/locomotion/clutter_maskrcnn_data.zip">here</a>.
We'll download that directly in our training notebook.</p>
</example>
</subsection>
<subsection id="self-supervised"><h1>Self-supervised learning</h1>
<!-- simclr https://arxiv.org/abs/2002.05709 -->
<!-- tcc https://arxiv.org/abs/1904.07846 -->
<!-- mention dense descriptors? -->
<!-- GPT-3 and DALL-E https://openai.com/blog/dall-e/ -->
<!-- clip https://openai.com/blog/clip/ out of the box detectors -->
</subsection>
</section>
<section><h1>Object detection and segmentation</h1>
<p>There is a lot to know about modern object detection and segmentation
pipelines. I'll stick to the very basics.</p>
<!-- nice overview slides at
http://cs231n.stanford.edu/slides/2020/lecture_12.pdf . Videos at
https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv&index=1
-->
<p>For image recognition (see Figure 1), one can imagine training a standard
convolutional network that takes the entire image as an input, and outputs a
probability of the image containing a sheep, a dog, etc. In fact, these
architectures can even work well for semantic segmentation, where the input
is an image and the output is another image; a famous architecture for this
is the Fully Convolutional Network (FCN) <elib>Long15</elib>. But for
object detection and instance segmentation, even the number of outputs of
the network can change. How do we train a network to output a variable
number of detections?</p>
<p>The mainstream approach to this is to first break the input image up into
many (let's say on the order of 1000) overlapping regions that might
represent interesting sub-images. Then we can run our favorite image
recognition and/or segmentation network on each subimage individually, and
output a detection for each region that is scored as having a high
probability. In order to output a tight bounding box, the detection
networks are also trained to output a "bounding box refinement" that selects
a subset of the final region for the bounding box. Originally, these region
proposals were done with more traditional image preprocessing algorithms, as
in R-CNN (Regions with CNN Features)<elib>Girshick14</elib>. But the "Fast"
and "Faster" versions of R-CNN replaced even these preprocessing with
learned "region proposal networks"<elib>Girshick15+Ren15</elib>.</p>
<p>For instance segmentation, we will use the very popular Mask R-CNN
network, which puts all of these ideas together, using region proposal
networks and fully convolutional networks for both the object detection and
the masks
<elib>He17</elib>. In Mask R-CNN, the masks are evaluated in parallel with
the object detections, and only the masks corresponding to the most likely
detections are actually returned. At the time of this writing, the latest
and most performant implementation of Mask R-CNN is available in the <a
href="https://github.com/facebookresearch/detectron2">Detectron2</a> project
from Facebook AI Research. But that version is not quite as user-friendly
and clean as the original version that was released in the PyTorch <a
href="https://pytorch.org/docs/stable/torchvision/index.html"><code>torchvision</code></a>
package; we'll stick to the <code>torchvision</code> version for our
experiments here.</p>
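<p>The inference-time interface of the <code>torchvision</code> version is
pleasantly small. A quick sketch (the random tensor is only a stand-in for a
real RGB image scaled to [0, 1]):</p>
<pre><code class="language-python">
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)  # stand-in for a real [0, 1] RGB tensor
with torch.no_grad():
    predictions = model([image])  # one dict per input image

p = predictions[0]
# p["boxes"]: (N, 4) bounding boxes, p["labels"]: (N,) class ids,
# p["scores"]: (N,) confidences, p["masks"]: (N, 1, H, W) soft masks
# (threshold, e.g. at 0.5, to get binary instance masks).
</code></pre>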
<example><h1>Fine-tuning Mask R-CNN for bin picking</h1>
<p>The following notebook loads our 10k image dataset and a Mask R-CNN
network pre-trained on the COCO dataset. It then replaces the head of the
pre-trained network with a new head with the right number of outputs for
our YCB recognition task, and then runs just 10 epochs of training with
my new dataset.</p>
<p><a target="segmentation_train"
href="https://colab.research.google.com/github/RussTedrake/manipulation/blob/master/segmentation_train.ipynb"><img
src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open
In Colab"/></a> (Training Notebook)
</p>
<p>Training a network this big (it will take about 150MB on disk) is not
fast. I strongly recommend hitting play on the cell immediately after the
training cell while you are watching it train so that the weights are
saved and downloaded even if your Colab session closes. But when you're
done, you should have a shiny new network for instance segmentation of the
YCB objects in the bin!</p>
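<p>Concretely, that save cell is just a checkpoint of the weights; the
filename here is only illustrative:</p>
<pre><code class="language-python">
import torch

# After training: persist the fine-tuned weights.
torch.save(model.state_dict(), "clutter_maskrcnn_model.pt")

# In the inference notebook: rebuild the same architecture, then load.
model.load_state_dict(torch.load("clutter_maskrcnn_model.pt"))
model.eval()
</code></pre>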
<p>I've provided a second notebook that you can use to load and evaluate
the trained model. If you don't want to wait for your own to train, you
can examine the one that I've trained!</p>
<p><a target="segmentation_inference"
href="https://colab.research.google.com/github/RussTedrake/manipulation/blob/master/segmentation_inference.ipynb"><img
src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open
In Colab"/></a> (Inference Notebook)
</p>
<figure><img style="width:50%" src="data/segmentation_detections.png"><img
style="width:50%" src="data/segmentation_mask.png"><figcaption>Outputs
from the Mask R-CNN inference. (Left) Object detections. (Right) One of
the instance masks.</figcaption></figure>
</example>
</section>
<section><h1>Putting it all together</h1>
<p>We can use our Mask R-CNN inference in a manipulation pipeline to do selective
picking from the bin...</p>
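<p>A rough sketch of how those pieces might compose, with hypothetical names
throughout: <code>masked_point_cloud</code> is the helper sketched earlier in
this chapter, and <code>select_antipodal_grasp</code> stands in for the grasp
sampler from the last chapter.</p>
<pre><code class="language-python">
import torch

MUSTARD = 3  # illustrative class id from our fine-tuned label set

p = predictions[0]  # Mask R-CNN output for the current RGB image
keep = torch.logical_and(p["labels"] == MUSTARD, p["scores"] > 0.9)
for soft_mask in p["masks"][keep]:
    binary_mask = (soft_mask[0] > 0.5).numpy()
    cloud = masked_point_cloud(depth_image, binary_mask, K)
    grasp = select_antipodal_grasp(cloud)
    if grasp is not None:
        break  # execute the first viable grasp
</code></pre>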
</section>
<section><h1>Exercises</h1>
<exercise><h1>Label Generation</h1>
<p> For this exercise, you will look into a simple trick to automatically generate training data for Mask R-CNN. You will work exclusively in <a href="https://deepnote.com/project/61-Label-Generation-yRBDseApSxaZJJ96ynWgEA/%2Flabel_generation.ipynb" target="_blank">this notebook</a>. You will be asked to complete the following steps: </p>
<ol type="a">
<li> Automatically generate mask labels from pre-processed point clouds.
</li>
<li> Analyze the applicability of the method for more complex scenes.</li>
<li> Apply data augmentation techniques to generate more training data.</li>
</ol>
</exercise>
<exercise><h1>Segmentation + Antipodal Grasping</h1>
<p> For this exercise, you will use Mask R-CNN and our previously
developed antipodal grasp strategy to select a grasp given a point cloud.
You will work exclusively in <a
href="https://deepnote.com/project/62-Segmentation-Antipodal-Grasping-fYNgzQrKTW63BKOn2bd2FQ/%2Fsegmentation_and_grasp.ipynb"
target="_blank">this notebook</a>. You will be asked to complete the
following steps: </p>
<ol type="a">
<li> Automatically filter the point cloud for points that correspond to
our intended grasped object.</li>
<li> Analyze the impact of a multi-camera setup.</li>
<li> Consider why filtering the point clouds is a useful step in this
grasping pipeline.</li>
<li> Discuss how we could improve this grasping pipeline. </li>
</ol>
</exercise>
</section>
</chapter>
<!-- EVERYTHING BELOW THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<div id="references"><section><h1>References</h1>
<ol>
<li id=Russakovsky15>
<span class="author">Olga Russakovsky and Jia Deng and Hao Su and Jonathan Krause and Sanjeev Satheesh and Sean Ma and Zhiheng Huang and Andrej Karpathy and Aditya Khosla and Michael Bernstein and others</span>,
<span class="title">"Imagenet large scale visual recognition challenge"</span>,
<span class="publisher">International journal of computer vision</span>, vol. 115, no. 3, pp. 211--252, <span class="year">2015</span>.
</li><br>
<li id=Lin14a>
<span class="author">Tsung-Yi Lin and Michael Maire and Serge Belongie and James Hays and Pietro Perona and Deva Ramanan and Piotr Doll{\'a}r and C Lawrence Zitnick</span>,
<span class="title">"Microsoft coco: Common objects in context"</span>,
<span class="publisher">European conference on computer vision</span> , pp. 740--755, <span class="year">2014</span>.
</li><br>
<li id=Russell08>
<span class="author">Bryan C Russell and Antonio Torralba and Kevin P Murphy and William T Freeman</span>,
<span class="title">"LabelMe: a database and web-based tool for image annotation"</span>,
<span class="publisher">International journal of computer vision</span>, vol. 77, no. 1-3, pp. 157--173, <span class="year">2008</span>.
</li><br>
<li id=Marion17>
<span class="author">Pat Marion and Peter R. Florence and Lucas Manuelli and Russ Tedrake</span>,
<span class="title">"A Pipeline for Generating Ground Truth Labels for Real {RGBD} Data of Cluttered Scenes"</span>,
<span class="publisher">International Conference on Robotics and Automation (ICRA), Brisbane, Australia</span>, May, <span class="year">2018</span>.
[ <a href="https://arxiv.org/abs/1707.04796">link</a> ]
</li><br>
<li id=Whelan16>
<span class="author">Thomas Whelan and Renato F Salas-Moreno and Ben Glocker and Andrew J Davison and Stefan Leutenegger</span>,
<span class="title">"{ElasticFusion}: Real-time dense {SLAM} and light source estimation"</span>,
<span class="publisher">The International Journal of Robotics Research</span>, vol. 35, no. 14, pp. 1697--1716, <span class="year">2016</span>.
</li><br>
<li id=Long15>
<span class="author">Jonathan Long and Evan Shelhamer and Trevor Darrell</span>,
<span class="title">"Fully convolutional networks for semantic segmentation"</span>,
<span class="publisher">Proceedings of the IEEE conference on computer vision and pattern recognition</span> , pp. 3431--3440, <span class="year">2015</span>.
</li><br>
<li id=Girshick14>
<span class="author">Ross Girshick and Jeff Donahue and Trevor Darrell and Jitendra Malik</span>,
<span class="title">"Rich feature hierarchies for accurate object detection and semantic segmentation"</span>,
<span class="publisher">Proceedings of the IEEE conference on computer vision and pattern recognition</span> , pp. 580--587, <span class="year">2014</span>.
</li><br>
<li id=Girshick15>
<span class="author">Ross Girshick</span>,
<span class="title">"Fast r-cnn"</span>,
<span class="publisher">Proceedings of the IEEE international conference on computer vision</span> , pp. 1440--1448, <span class="year">2015</span>.
</li><br>
<li id=Ren15>
<span class="author">Shaoqing Ren and Kaiming He and Ross Girshick and Jian Sun</span>,
<span class="title">"Faster r-cnn: Towards real-time object detection with region proposal networks"</span>,
<span class="publisher">Advances in neural information processing systems</span> , pp. 91--99, <span class="year">2015</span>.
</li><br>
<li id=He17>
<span class="author">Kaiming He and Georgia Gkioxari and Piotr Doll{\'a}r and Ross Girshick</span>,
<span class="title">"Mask {R-CNN}"</span>,
<span class="publisher">Proceedings of the IEEE international conference on computer vision</span> , pp. 2961--2969, <span class="year">2017</span>.
</li><br>
</ol>
</section><p/>
</div>
<table style="width:100%;" pdf="no"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=clutter.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=force.html>Next Chapter</a></td>
</tr></table>
<div id="footer" pdf="no">
<hr>
<table style="width:100%;">
<tr><td><a href="https://accessibility.mit.edu/">Accessibility</a></td><td style="text-align:right">© Russ
Tedrake, 2021</td></tr>
</table>
</div>
</body>
</html>