<!DOCTYPE html>
<html>
<head>
<title>Ch. 6 - Object detection and
segmentation</title>
<meta name="Ch. 6 - Object detection and
segmentation" content="text/html; charset=utf-8;" />
<link rel="canonical" href="http://manipulation.csail.mit.edu/segmentation.html" />
<script src="https://hypothes.is/embed.js" async></script>
<script type="text/javascript" src="htmlbook/book.js"></script>
<script src="htmlbook/mathjax-config.js" defer></script>
<script type="text/javascript" id="MathJax-script" defer
src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js">
</script>
<script>window.MathJax || document.write('<script type="text/javascript" src="htmlbook/MathJax/es5/tex-chtml.js" defer><\/script>')</script>
<link rel="stylesheet" href="htmlbook/highlight/styles/default.css">
<script src="htmlbook/highlight/highlight.pack.js"></script> <!-- http://highlightjs.readthedocs.io/en/latest/css-classes-reference.html#language-names-and-aliases -->
<script>hljs.initHighlightingOnLoad();</script>
<link rel="stylesheet" type="text/css" href="htmlbook/book.css" />
</head>
<body onload="loadChapter('manipulation');">
<div data-type="titlepage" pdf="no">
<header>
<h1><a href="index.html" style="text-decoration:none;">Robotic Manipulation</a></h1>
<p data-type="subtitle">Perception, Planning, and Control</p>
<p style="font-size: 18px;"><a href="http://people.csail.mit.edu/russt/">Russ Tedrake</a></p>
<p style="font-size: 14px; text-align: right;">
© Russ Tedrake, 2020-2021<br/>
Last modified <span id="last_modified"></span>.<br/>
<script>
var d = new Date(document.lastModified);
document.getElementById("last_modified").innerHTML = d.getFullYear() + "-" + (d.getMonth()+1) + "-" + d.getDate();</script>
<a href="misc.html">How to cite these notes, use annotations, and give feedback.</a><br/>
</p>
</header>
</div>
<p pdf="no"><b>Note:</b> These are working notes used for <a
href="http://manipulation.csail.mit.edu/Fall2021/">a course being taught
at MIT</a>. They will be updated throughout the Fall 2021 semester. <!-- <a
href="https://www.youtube.com/channel/UChfUOAhz7ynELF-s_1LPpWg">Lecture videos are available on YouTube</a>.--></p>
<table style="width:100%;" pdf="no"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=clutter.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=force.html>Next Chapter</a></td>
</tr></table>
<!-- EVERYTHING ABOVE THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<chapter style="counter-reset: chapter 5"><h1>Object detection and
segmentation</h1>
<p>Our study of geometric perception gave us good tools for estimating the
pose of a known object. These algorithms can produce highly accurate
estimates, but are still subject to local minima. When the scenes get more
cluttered/complicated, or if we are dealing with many different object types,
they really don't offer an adequate solution by themselves.</p>
<p>Deep learning has given us data-driven solutions that complement our
geometric approaches beautifully. Finding correlations in massive datasets
has proven to be a fantastic way to provide practical solutions to these more
"global" problems like detecting whether the mustard bottle is even in the
scene, segmenting out the portion of the image / point cloud that is relevant
to the object, and even in providing a rough estimate of the pose that could
be refined with a geometric method.</p>
<p>There are many sources of information about deep learning on the internet,
and I have no ambition of replicating nor replacing them here. But this
chapter does begin our exploration of deep perception in manipulation, and I
feel that I need to give just a little context.</p>
<section><h1>Getting to big data</h1>
<subsection><h1>Crowd-sourced annotation datasets</h1>
<p>The modern revolution in computer vision was unquestionably fueled by
the availability of massive annotated datasets. The most famous of all is
ImageNet, which eclipsed previous datasets with the number of images and
the accuracy and usefulness of the labels<elib>Russakovsky15</elib>.
Fei-Fei Li, who led the creation of ImageNet, has been giving talks that
give some nice historical perspective on how ImageNet came to be.
<a href="https://youtu.be/0bb9EcaQupI">Here is one</a> (slightly) tailored
to robotics and even manipulation; you might start <a
href="https://youtu.be/0bb9EcaQupI?t=857">here</a>. </p>
<p><elib>Russakovsky15</elib> describes the annotations available in
ImageNet: <blockquote>... annotations fall into one of two categories: (1)
image-level annotation of a binary label for the presence or absence of an
object class in the image, e.g., "there are cars in this image" but "there
are no tigers," and (2) object-level annotation of a tight bounding box
and class label around an object instance in the image, e.g., "there is a
screwdriver centered at position (20,25) with width of 50 pixels and
height of 30 pixels".</blockquote></p>
<figure><img width="80%"
src="data/coco_instance_segmentation.jpeg"/><figcaption>A sample annotated
image from the COCO dataset, illustrating the difference between
image-level annotations, object-level annotations, and segmentations at
the class/semantic- or instance-level.</figcaption></figure>
<p>The COCO dataset similarly enabled pixel-wise <i>instance-level</i>
segmentation <elib>Lin14a</elib>, where distinct instances of a class are
given a unique label (and also associated with the class label). COCO has
fewer object categories than ImageNet, but more instances per category.
It's still shocking to me that they were able to get 2.5 million object instances
labeled at the pixel level. I remember some of the early projects at MIT
when crowd-sourced image labeling was just beginning (projects like
LabelMe <elib>Russell08</elib>); Antonio Torralba used to joke about how
surprised he was about the accuracy of the (nearly) pixel-wise annotations
that he was able to crowd-source (and that his mother was a particularly
prolific and accurate labeler)!</p>
<p>Instance segmentation turns out to be a very good match for the
perception needs we have in manipulation. In the last chapter we found
ourselves with a bin full of YCB objects. If we want to pick out only the
mustard bottles, and pick them out one at a time, then we can use a deep
network to perform an initial instance-level segmentation of the scene,
and then use our grasping strategies on only the segmented point cloud.
Or if we do need to estimate the pose of an object (e.g. in order to place
it in a desired pose), then segmenting the point cloud can also
dramatically improve the chances of success with our geometric pose
estimation algorithms.</p>
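<p>To make this concrete, here is a minimal sketch (my own, not from the
course code) of how an instance mask can crop a depth image down to an
object-specific point cloud before we run our grasping or pose estimation
machinery. It assumes a simple pinhole camera model with intrinsics
<code>K</code>:</p>
<pre><code class="language-python">
import numpy as np

def masked_point_cloud(depth, mask, K):
    """Back-project only the masked pixels into a point cloud.

    depth: (H, W) depth image in meters, with 0 marking invalid returns.
    mask:  (H, W) boolean instance mask from the segmentation network.
    K:     3x3 pinhole camera intrinsics.
    Returns an (N, 3) array of points in the camera frame.
    """
    v, u = np.nonzero(np.logical_and(mask, depth > 0))
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)
</code></pre>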
</subsection>
<subsection><h1>Segmenting new classes via fine tuning</h1>
<p>The ImageNet and COCO datasets contain labels for <a
href="https://tech.amikelive.com/node-718/what-object-categories-labels-are-in-coco-dataset/">a
variety of interesting classes</a>, including cows, elephants, bears,
zebras and giraffes. They have a few classes that are more relevant to
manipulation (e.g., plates, forks, knives, and spoons), but they don't
have a mustard bottle nor a can of potted meat like we have in the YCB
dataset. So what are we to do? Must we produce the same image annotation
tools and pay for people to label thousands of images for us?</p>
<p>One of the most amazing and magical properties of the deep
architectures that have been working so well for instance-level
segmentation is their ability to transfer to new tasks. A network that
was pre-trained on a large dataset like ImageNet or COCO can be
fine-tuned with a relatively much smaller amount of labeled data to a new
set of classes that are relevant to our particular application. In fact,
the architectures are often referred to as having a "backbone" and a
"head" -- in order to train a new set of classes, it is often possible to
just pop off the existing head and replace it with a new head for the new
labels. A relatively small amount of training with a relatively small
dataset can still achieve surprisingly robust performance. Moreover, it
seems that training initially on the diverse dataset (ImageNet or COCO)
is actually important to learn the robust perceptual representations that
work for a broad class of perception tasks. Incredible!</p>
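<p>To make the "pop off the head" recipe concrete, here is roughly what it
looks like with the <code>torchvision</code> Mask R-CNN implementation that
we'll use later in this chapter. Consider it a sketch: the number of classes
and the hidden-layer width are placeholder values.</p>
<pre><code class="language-python">
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 1 + 6  # background + (for example) six YCB object classes

# Backbone and heads pre-trained on COCO.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# Pop off the box-prediction head; replace it with one sized for our labels.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Do the same for the mask-prediction head.
in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256,
                                                   num_classes)
</code></pre>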
<p>This is great news! But we still need some amount of labeled data for
our objects of interest. The last few years have seen a number of
start-ups based purely on the business model of helping you get your
dataset labeled. But thankfully, this isn't our only option.
</p>
</subsection>
<subsection><h1>Annotation tools for manipulation</h1>
<p>Just as projects like LabelMe helped to streamline the process of
providing pixel-wise annotations for images downloaded from the web, there
are a number of tools that have been developed to streamline the
annotation process for robotics. Our group introduced a tool called
<a href="http://labelfusion.csail.mit.edu/">LabelFusion</a>, which combines geometric perception of point clouds
with a simple user interface to very rapidly label a large number of
images<elib>Marion17</elib>.</p>
<figure><img id="labelfusion" style="width:75%" src="data/labelfusion_multi_object_scenes.png" onmouseenter="this.src = 'https://github.com/RobotLocomotion/LabelFusion/raw/master/docs/fusion_short.gif'" onmouseleave="this.src = 'data/labelfusion_multi_object_scenes.png'"/>
<figcaption>A multi-object scene from <a
href="http://labelfusion.csail.mit.edu/">LabelFusion</a>
<elib>Marion17</elib>. (Mouse over for animation)</figcaption>
</figure>
<p>In LabelFusion, we provide multiple RGB-D images of a static scene
containing some objects of interest, and the CAD models for those
objects. First we use a dense reconstruction algorithm,
ElasticFusion<elib>Whelan16</elib>, to merge the point clouds from the
individual images into a single dense reconstruction; this is just
another instance of the point cloud registration problem. The dense
reconstruction algorithm also localizes the camera relative to the point
cloud. Then we provide a simple GUI that asks the user to click on three
points on the model and three points in the scene to establish the
"global" correspondence, and then we run ICP to refine the pose
estimate. Finally, we render the pixels from the CAD model on top of all
of the images in the established pose, giving beautiful pixel-wise
labels.</p>
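<p>The "three clicks" step is just a tiny instance of the correspondence-based
registration problem from the previous chapters. As a reminder of how little
is needed once correspondences are known, here is a minimal sketch (my
implementation, not LabelFusion's) of the least-squares rigid transform from
a few paired points:</p>
<pre><code class="language-python">
import numpy as np

def pose_from_correspondences(p_model, p_scene):
    """Least-squares rigid transform from N >= 3 paired points (Kabsch).

    p_model, p_scene: (N, 3) arrays of corresponding points.
    Returns R, t such that p_scene is approximately R @ p_model + t.
    """
    cm, cs = p_model.mean(axis=0), p_scene.mean(axis=0)
    H = (p_model - cm).T @ (p_scene - cs)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cs - R @ cm
    return R, t
</code></pre>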
<p>Tools like LabelFusion can be used to label large numbers of images very
quickly (three clicks from a user produces ground truth labels in many
images).</p>
</subsection>
<subsection><h1>Synthetic datasets</h1>
<p>All of this real world data is incredibly valuable. But we have
another super powerful tool at our disposal: simulation! Computer vision
researchers have traditionally been very skeptical of training perception
systems on synthetic images, but as game-engine quality physics-based
rendering has become a commodity technology, roboticists have been using
it aggressively to supplement or even replace their real-world datasets.
The annual robotics conferences now feature regular <a
href="https://sim2real.github.io/">workshops and/or debates</a> on the
topic of "sim2real". For any specific scene or narrow class of objects,
we can typically generate accurate enough art assets (with material
properties that are often still tuned by an artist) and environment maps
/ lighting conditions that rendered images can be highly effective in a
training dataset. The bigger question is whether we can generate a
<i>diverse</i> enough set of data with distributions representative of
the real world to train robust feature detectors in the way that we've
managed to train with ImageNet. But for many serious robotics groups,
synthetic data generation pipelines have significantly augmented or even
replaced real-world labeled data.</p>
<p>For the purposes of this chapter, I aim to train an instance-level
segmentation system that will work well on our simulated images. For
this use case, there is (almost) no debate! Leveraging the pre-trained
backbone from COCO, I will use only synthetic data for fine tuning.</p>
<p>You may have noticed it already, but the <code>RgbdSensor</code> that we've been using in Drake actually has a "label image" output port that we haven't used yet.</p>
<div>
<script type="text/javascript">
var rgbdsensor = { "name": "RgbdSensor",
"input_ports" : ["geometry_query"],
"output_ports" : ["color_image", "depth_image_32f", "depth_image_16u", "<b>label_image</b>", "X_WB"]
};
document.write(system_html(rgbdsensor, "https://drake.mit.edu/doxygen_cxx/classdrake_1_1systems_1_1sensors_1_1_rgbd_sensor.html"));
</script>
</div>
<p>This output port exists precisely to support the perception training
use case we have here. It outputs an image that is identical to the RGB
image, except that every pixel is "colored" with a unique instance-level
identifier.</p>
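<p>For reference, evaluating that port from pydrake looks roughly like the
sketch below. I'm assuming you already have a built <code>diagram</code>
containing an <code>RgbdSensor</code> bound to the variable
<code>sensor</code>, so treat the details as approximate:</p>
<pre><code class="language-python">
# Sketch: `diagram` is a built Diagram containing the RgbdSensor `sensor`.
context = diagram.CreateDefaultContext()
sensor_context = sensor.GetMyContextFromRoot(context)

label_image = sensor.label_image_output_port().Eval(sensor_context)
# label_image.data is a (height, width, 1) int16 array; each pixel holds an
# integer render label for the geometry it sees instead of a color.
</code></pre>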
<figure><img style="width:100%" src="data/clutter_maskrcnn_labels.png"/>
<figcaption>Pixelwise instance segmentation labels provided by the "label
image" output port from <code>RgbdSensor</code>. I've remapped the colors
to be more visually distinct.</figcaption> </figure>
<example><h1>Generating training data for instance segmentation</h1>
<p>I've provided a simple script that runs our "clutter generator" from
our bin picking example that drops random YCB objects into the bin.
After a short simulation, I render the RGB image and the label image,
and save them (along with some metadata with the instance and class
identifiers) to disk.</p>
<p>I've verified that this code <i>can</i> run on Colab, but to make a
dataset of 10k images using this un-optimized process takes about an
hour on my big development desktop. And curating the files is just
easier if you run it locally. So I've provided this one as a Python
script instead.</p>
<pysrc no_colab=True>manipulation/clutter_maskrcnn_data.py</pysrc>
<todo>Provide a colab version?</todo>
<p>You can also feel free to skip this step! I've uploaded the 10k
images that I generated <a
href="https://groups.csail.mit.edu/locomotion/clutter_maskrcnn_data.zip">here</a>.
We'll download that directly in our training notebook.</p>
</example>
</subsection>
<subsection id="self-supervised"><h1>Self-supervised learning</h1>
<!-- simclr https://arxiv.org/abs/2002.05709 -->
<!-- tcc https://arxiv.org/abs/1904.07846 -->
<!-- mention dense descriptors? -->
<!-- GPT-3 and DALL-E https://openai.com/blog/dall-e/ -->
<!-- clip https://openai.com/blog/clip/ out of the box detectors -->
</subsection>
</section>
<section><h1>Object detection and segmentation</h1>
<p>There is a lot to know about modern object detection and segmentation
pipelines. I'll stick to the very basics.</p>
<!-- nice overview slides at
http://cs231n.stanford.edu/slides/2020/lecture_12.pdf . Videos at
https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv&index=1
-->
<p>For image recognition (see Figure 1), one can imagine training a standard
convolutional network that takes the entire image as an input, and outputs a
probability of the image containing a sheep, a dog, etc. In fact, these
architectures can even work well for semantic segmentation, where the input
is an image and the output is another image; a famous architecture for this
is the Fully Convolutional Network (FCN) <elib>Long15</elib>. But for
object detection and instance segmentation, even the number of outputs of
the network can change. How do we train a network to output a variable
number of detections?</p>
<p>The mainstream approach to this is to first break the input image up into
many (let's say on the order of 1000) overlapping regions that might
represent interesting sub-images. Then we can run our favorite image
recognition and/or segmentation network on each subimage individually, and
output a detection for each region that is scored as having a high
probability. In order to output a tight bounding box, the detection
networks are also trained to output a "bounding box refinement" that selects
a subset of the final region for the bounding box. Originally, these region
proposals were done with more traditional image preprocessing algorithms, as
in R-CNN (Regions with CNN Features)<elib>Girshick14</elib>. But the "Fast"
and "Faster" versions of R-CNN replaced even these preprocessing with
learned "region proposal networks"<elib>Girshick15+Ren15</elib>.</p>
<p>For instance segmentation, we will use the very popular Mask R-CNN
network, which puts all of these ideas together, using region proposal
networks and fully convolutional networks for both the object detection and
the masks
<elib>He17</elib>. In Mask R-CNN, the masks are evaluated in parallel with
the object detections, and only the masks corresponding to the most likely
detections are actually returned. At the time of this writing, the latest
and most performant implementation of Mask R-CNN is available in the <a
href="https://github.com/facebookresearch/detectron2">Detectron2</a> project
from Facebook AI Research. But that version is not quite as user-friendly
and clean as the original version that was released in the PyTorch <a
href="https://pytorch.org/docs/stable/torchvision/index.html"><code>torchvision</code></a>
package; we'll stick to the <code>torchvision</code> version for our
experiments here.</p>
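<p>The inference-time interface of the <code>torchvision</code> version is
pleasantly small. A quick sketch (the random tensor is only a stand-in for a
real RGB image scaled to [0, 1]):</p>
<pre><code class="language-python">
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)  # stand-in for a real [0, 1] RGB tensor
with torch.no_grad():
    predictions = model([image])  # one dict per input image

p = predictions[0]
# p["boxes"]: (N, 4) bounding boxes, p["labels"]: (N,) class ids,
# p["scores"]: (N,) confidences, p["masks"]: (N, 1, H, W) soft masks
# (threshold, e.g. at 0.5, to get binary instance masks).
</code></pre>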
<example><h1>Fine-tuning Mask R-CNN for bin picking</h1>
<p>The following notebook loads our 10k image dataset and a Mask R-CNN
network pre-trained on the COCO dataset. It then replaces the head of the
pre-trained network with a new head with the right number of outputs for
our YCB recognition task, and then runs just 10 epochs of training with
my new dataset.</p>
<p><a target="segmentation_train"
href="https://colab.research.google.com/github/RussTedrake/manipulation/blob/master/segmentation_train.ipynb"><img
src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open
In Colab"/></a> (Training Notebook)
</p>
<p>Training a network this big (it will take about 150MB on disk) is not
fast. I strongly recommend hitting play on the cell immediately after the
training cell while you are watching it train so that the weights are
saved and downloaded even if your Colab session closes. But when you're
done, you should have a shiny new network for instance segmentation of the
YCB objects in the bin!</p>
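<p>Concretely, that save cell is just a checkpoint of the weights; the
filename here is only illustrative:</p>
<pre><code class="language-python">
import torch

# After training: persist the fine-tuned weights.
torch.save(model.state_dict(), "clutter_maskrcnn_model.pt")

# In the inference notebook: rebuild the same architecture, then load.
model.load_state_dict(torch.load("clutter_maskrcnn_model.pt"))
model.eval()
</code></pre>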
<p>I've provided a second notebook that you can use to load and evaluate
the trained model. If you don't want to wait for your own to train, you
can examine the one that I've trained!</p>
<p><a target="segmentation_inference"
href="https://colab.research.google.com/github/RussTedrake/manipulation/blob/master/segmentation_inference.ipynb"><img
src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open
In Colab"/></a> (Inference Notebook)
</p>
<figure><img style="width:50%" src="data/segmentation_detections.png"><img
style="width:50%" src="data/segmentation_mask.png"><figcaption>Outputs
from the Mask R-CNN inference. (Left) Object detections. (Right) One of
the instance masks.</figcaption></figure>
</example>
</section>
<section><h1>Putting it all together</h1>
<p>We can use our Mask R-CNN inference in a manipulation pipeline to do selective
picking from the bin...</p>
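<p>A rough sketch of how those pieces might compose, with hypothetical names
throughout: <code>masked_point_cloud</code> is the helper sketched earlier in
this chapter, and <code>select_antipodal_grasp</code> stands in for the grasp
sampler from the last chapter.</p>
<pre><code class="language-python">
import torch

MUSTARD = 3  # illustrative class id from our fine-tuned label set

p = predictions[0]  # Mask R-CNN output for the current RGB image
keep = torch.logical_and(p["labels"] == MUSTARD, p["scores"] > 0.9)
for soft_mask in p["masks"][keep]:
    binary_mask = (soft_mask[0] > 0.5).numpy()
    cloud = masked_point_cloud(depth_image, binary_mask, K)
    grasp = select_antipodal_grasp(cloud)
    if grasp is not None:
        break  # execute the first viable grasp
</code></pre>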
</section>
<section><h1>Exercises</h1>
<exercise><h1>Label Generation</h1>
<p> For this exercise, you will look into a simple trick to automatically generate training data for Mask R-CNN. You will work exclusively in <a href="https://deepnote.com/project/61-Label-Generation-yRBDseApSxaZJJ96ynWgEA/%2Flabel_generation.ipynb" target="_blank">this notebook</a>. You will be asked to complete the following steps: </p>
<ol type="a">
<li> Automatically generate mask labels from pre-processed point clouds.
</li>
<li> Analyze the applicability of the method for more complex scenes.</li>
<li> Apply data augmentation techniques to generate more training data.</li>
</ol>
</exercise>
<exercise><h1>Segmentation + Antipodal Grasping</h1>
<p> For this exercise, you will use Mask R-CNN and our previously
developed antipodal grasp strategy to select a grasp given a point cloud.
You will work exclusively in <a
href="https://deepnote.com/project/62-Segmentation-Antipodal-Grasping-fYNgzQrKTW63BKOn2bd2FQ/%2Fsegmentation_and_grasp.ipynb"
target="_blank">this notebook</a>. You will be asked to complete the
following steps: </p>
<ol type="a">
<li> Automatically filter the point cloud for points that correspond to
our intended grasped object.</li>
<li> Analyze the impact of a multi-camera setup.</li>
<li> Consider why filtering the point clouds is a useful step in this
grasping pipeline.</li>
<li> Discuss how we could improve this grasping pipeline. </li>
</ol>
</exercise>
</section>
</chapter>
<!-- EVERYTHING BELOW THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<div id="references"><section><h1>References</h1>
<ol>
<li id=Russakovsky15>
<span class="author">Olga Russakovsky and Jia Deng and Hao Su and Jonathan Krause and Sanjeev Satheesh and Sean Ma and Zhiheng Huang and Andrej Karpathy and Aditya Khosla and Michael Bernstein and others</span>,
<span class="title">"Imagenet large scale visual recognition challenge"</span>,
<span class="publisher">International journal of computer vision</span>, vol. 115, no. 3, pp. 211--252, <span class="year">2015</span>.
</li><br>
<li id=Lin14a>
<span class="author">Tsung-Yi Lin and Michael Maire and Serge Belongie and James Hays and Pietro Perona and Deva Ramanan and Piotr Doll{\'a}r and C Lawrence Zitnick</span>,
<span class="title">"Microsoft coco: Common objects in context"</span>,
<span class="publisher">European conference on computer vision</span> , pp. 740--755, <span class="year">2014</span>.
</li><br>
<li id=Russell08>
<span class="author">Bryan C Russell and Antonio Torralba and Kevin P Murphy and William T Freeman</span>,
<span class="title">"LabelMe: a database and web-based tool for image annotation"</span>,
<span class="publisher">International journal of computer vision</span>, vol. 77, no. 1-3, pp. 157--173, <span class="year">2008</span>.
</li><br>
<li id=Marion17>
<span class="author">Pat Marion and Peter R. Florence and Lucas Manuelli and Russ Tedrake</span>,
<span class="title">"A Pipeline for Generating Ground Truth Labels for Real {RGBD} Data of Cluttered Scenes"</span>,
<span class="publisher">International Conference on Robotics and Automation (ICRA), Brisbane, Australia</span>, May, <span class="year">2018</span>.
[ <a href="https://arxiv.org/abs/1707.04796">link</a> ]
</li><br>
<li id=Whelan16>
<span class="author">Thomas Whelan and Renato F Salas-Moreno and Ben Glocker and Andrew J Davison and Stefan Leutenegger</span>,
<span class="title">"{ElasticFusion}: Real-time dense {SLAM} and light source estimation"</span>,
<span class="publisher">The International Journal of Robotics Research</span>, vol. 35, no. 14, pp. 1697--1716, <span class="year">2016</span>.
</li><br>
<li id=Long15>
<span class="author">Jonathan Long and Evan Shelhamer and Trevor Darrell</span>,
<span class="title">"Fully convolutional networks for semantic segmentation"</span>,
<span class="publisher">Proceedings of the IEEE conference on computer vision and pattern recognition</span> , pp. 3431--3440, <span class="year">2015</span>.
</li><br>
<li id=Girshick14>
<span class="author">Ross Girshick and Jeff Donahue and Trevor Darrell and Jitendra Malik</span>,
<span class="title">"Rich feature hierarchies for accurate object detection and semantic segmentation"</span>,
<span class="publisher">Proceedings of the IEEE conference on computer vision and pattern recognition</span> , pp. 580--587, <span class="year">2014</span>.
</li><br>
<li id=Girshick15>
<span class="author">Ross Girshick</span>,
<span class="title">"Fast r-cnn"</span>,
<span class="publisher">Proceedings of the IEEE international conference on computer vision</span> , pp. 1440--1448, <span class="year">2015</span>.
</li><br>
<li id=Ren15>
<span class="author">Shaoqing Ren and Kaiming He and Ross Girshick and Jian Sun</span>,
<span class="title">"Faster r-cnn: Towards real-time object detection with region proposal networks"</span>,
<span class="publisher">Advances in neural information processing systems</span> , pp. 91--99, <span class="year">2015</span>.
</li><br>
<li id=He17>
<span class="author">Kaiming He and Georgia Gkioxari and Piotr Doll{\'a}r and Ross Girshick</span>,
<span class="title">"Mask {R-CNN}"</span>,
<span class="publisher">Proceedings of the IEEE international conference on computer vision</span> , pp. 2961--2969, <span class="year">2017</span>.
</li><br>
</ol>
</section><p/>
</div>
<table style="width:100%;" pdf="no"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=clutter.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=force.html>Next Chapter</a></td>
</tr></table>
<div id="footer" pdf="no">
<hr>
<table style="width:100%;">
<tr><td><a href="https://accessibility.mit.edu/">Accessibility</a></td><td style="text-align:right">© Russ
Tedrake, 2021</td></tr>
</table>
</div>
</body>
</html>