Past, present, and future of SLAM

Past, Present, and Future of Simultaneous Localization And Mapping: Towards the Robust-Perception Age

1 Introduction
    Our main focus is on metric and semantic SLAM, and we refer the reader to the recent survey by Lowry et al. [160], which provides a comprehensive review of vision-based place recognition and topological SLAM.
    The keyword here is “loop closure”: if we sacrifice loop closures, SLAM reduces to odometry. However, more recent odometry algorithms are based on visual and inertial information, and have very small drift (< 0.5% of the trajectory length [82]).

[Do we really need SLAM?]

    Visual-Inertial Navigation (VIN) is SLAM: VIN can be considered a reduced SLAM system, in which the loop closure (or place recognition) module is disabled.
    A robot performing odometry and neglecting loop closures interprets the world as an “infinite corridor” (Fig. 1-left) in which the robot keeps exploring new areas indefinitely. A loop closure event informs the robot that this “corridor” keeps intersecting itself (Fig. 1-right). The advantage of loop closure now becomes clear: by finding loop closures, the robot understands the real topology of the environment, and is able to find shortcuts between locations (e.g., point B and C in the map).
    Therefore, if getting the right topology of the environment is one of the merits of SLAM, why not simply drop the metric information and just do place recognition? The answer is simple: the metric information makes place recognition much simpler and more robust; the metric reconstruction informs the robot about loop closure opportunities and allows discarding spurious loop closures [150]. Therefore, while SLAM might be redundant in principle, SLAM offers a natural defense against wrong data association and perceptual aliasing, where similarly looking scenes, corresponding to distinct locations in the environment, would deceive place recognition. In this sense, the SLAM map provides a way to predict and validate future measurements: we believe that this mechanism is key to robust operation.

[Is SLAM solved?]

    Other robot/environment/performance combinations still deserve a large amount of fundamental research. Current SLAM algorithms can be easily induced to fail when either the motion of the robot or the environment are too challenging (e.g., fast robot dynamics, highly dynamic environments); similarly, SLAM algorithms are often unable to face strict performance requirements, e.g., high rate estimation for fast closed-loop control. This survey will provide a comprehensive overview of these open problems, among others.

    In this paper, we argue that we are entering in a third era for SLAM, the robust-perception age, which is characterized by the following key requirements:
1) robust performance: the SLAM system operates with low failure rate for an extended period of time in a broad set of environments; the system includes self-tuning capabilities in that it can adapt the selection of the system parameters to the scenario.
2) high-level understanding: the SLAM system goes beyond basic geometry reconstruction to obtain a high-level understanding of the environment (e.g., high-level geometry, semantics, physics, affordances).
3) resource awareness: the SLAM system is tailored to the available sensing and computational resources, and provides means to adjust the computation load depending on the available resources;
4) task-driven perception: the SLAM system is able to select relevant perceptual information and filter out irrelevant sensor data, in order to support the task the robot has to perform; moreover, the SLAM system produces adaptive map representations, whose complexity may vary depending on the task at hand.


3. Long-term autonomy 1: Robustness
    Loop closure detection 想要做成实时一定要用bag-of-words。然而，BoW对于光照环境恶劣的情况还是不能工作，this has led to deveolp new methods that explicitly account for such variations by matching sequences[173], gathering different visual appearances into a unified representation[48], or using spatial as well as appearance information[106].
    In vision-based applications, RANSAC is commonly used for geometric verification and outlier rejection, see [218] and the references therein.
    In dynamic environments, the challenge is twofold. First（短期变化，主要指环境中成分的变化，如行人、车辆）, SLAM has to detect, discard, or track changes. While mainstream approaches attempt to discard the dynamic portion of the scene [180](2001年), some works include dynamic elements as part of the model [11, 253]. Second（长期变化，主要指环境本身的变化，如冬天变成夏天）, SLAM has to model permanent or semi-permanent changes, and understand how and when to update the map accordingly. Current SLAM systems that deal with dynamics either maintain multiple (time-dependent) maps of the same location [60], or have a single representation parameterized by some time-varying parameter [140].
    Automatic parameter tuning: SLAM systems (in particular, the data association modules) require extensive parameter tuning in order to work correctly for a given scenario. These parameters include thresholds that control feature matching, RANSAC parameters, and criteria to decide when to add new factors to the graph or when to trigger a loop closing algorithm to search for matches. If SLAM has to work “out of the box” in arbitrary scenarios, methods for automatic tuning of the involved parameters need to be considered.

关于有动态物体的SLAM建模，可看[91, 96, 180, 182]. Newcombe et al. [182] have address the non-rigid case for small-scale reconstruction. However, addressing the problem of non-rigid maps at a large scale is still largely unexplored.


4. Long-term autonomy 2: Scalability
    We focus on two ways to reduce the complexity of factor graph optimization: (i) sparsification methods, which trade off information loss for memory and computational efficiency, and (ii) out-of-core and multi-robot methods, which split the computation among many robots/processors.

------------------------------------------------------------------------
SVO是直接法和特征点法的一个折中，虽然没有ORB-SLAM2好，但还是很有吸引力的，暂时不要放弃对它的关注。

疑问：ORB-SLAM在回环检测后执行全局优化时，只优化所有关键帧组成的位姿图，而不优化点云。那点云的位置怎么优化？
------------------------------------------------------------------------


5. Representation 1: Metric Map Models
    This section discusses how to model geometry in SLAM. More formally, a metric representation (or metric map) is a symbolic structure that encodes the geometry of the environment.

[High-level object-based representations]
    While point clouds and boundary representations are currently dominating the landscape of dense mapping, we envision that higher-level representations, including objects and solid shapes, will play a key role in the future of SLAM. Early techniques to include object-based reasoning in SLAM are “SLAM++” from Salas-Moreno et al. [217], the work from Civera et al. [50], and Dame et al. [56].
    The following problems regarding metric representation for SLAM deserve a large amount of fundamental research, and are still vastly unexplored.

1) High-level, expressive and compact representations in SLAM. The point clouds or TSDF to model 3D geometry is wasteful of memory, yet the representations do not provide any high-level understanding of the 3D geometry. No SLAM can currently build higher-level representations. Recent efforts in this direction include [17, 51, 231].

2) Optimal representations: Few works have focused on understanding which criteria should guide the choise of a specific representation. Intuitively, in simple indoor environments one should prefer parametrized primitives since few parameters can sufficiently describe the 3D geometry; on the other hand, in complex outdoor environments, one might prefer mesh models. How should we compare different representations and how should we choose the "optimal" representation?

3) Automatic, adaptive representations: 很不好做


6. Representation 2: Semantic Map Models
    Semantic mapping consists in associating semantic concepts to geometric entities in a robot's surroundings. Recently, the limitations of purely geometric maps have been recognized and this has spawned a significant and ongoing body of work in semantic mapping of environments, in order to enhance robot's autonomy and robustness, facilitate more complex tasks(e.g. avoid muddy-road while driving), move from path-planning to task-planning, and enable advanced human-robot interaction [9, 26, 217]. These observations have led to different approaches for semantic mapping which vary in the numbers and types of semantic concepts and means of associating them with different parts of the environments. A semantic representation is build by defining the following aspects:

1) Level/Detail of semantic concepts: For a given robotic task, e.g. “going from room A to room B”, coarse categories (rooms, corridor, doors) would suffice for a successful performance, while for other tasks, e.g. “pick up a tea cup”, finer categories (table, tea cup, glass) are needed.

2) Organization of semantic concepts: The semantic concepts are not exclusive. Even more, a single entity can have an unlimited number of properties or concepts. A chair can be “movable” and “sittable”; a dinner table can be “movable” and “unsittable”. While the chair and the table are pieces of furniture, they share the movable property but with different usability. Flat or hierarchical organizations, sharing or not some properties, have to be designed to handle this multiplicity of concepts.

There are three main ways to attack semantic mapping, and assign semantic concepts to data:

1) SLAM helps Semantics:
An online semantic mapping system was later proposed by Pronobis et al. [206], who combine three layers of reasoning (sensory, categorical, and place) to build a semantic map of the environment using laser and camera sensors. More recently, Cadena et al. [26] use motion estimation, and interconnect a coarse semantic segmentation with different object detectors to outperform the individual systems. Pillai and Leonard [201] use a monocular SLAM system to boost the performance in the task of object recognition in videos.

2) Semantics helps SLAM: The idea is that if we can recognize objects or other elements in a map then we can use our prior knowledge about their geometry to improve the estimation of that map. First attempts were done in small scale by Castle et al. [44] and by Civera et al. [50] with a monocular SLAM with sparse features, and by Dame et al. [56] with a dense map representation. Taking advantage of RGB-D sensors, Salas-Moreno et al. [217] propose a SLAM system based on the detection of known objects in the environment.

3) Joint SLAM and Semantics inference: We could perform monocular SLAM and map segmentation within a joint formulation. Recently, a promising online system was proposed by Vineet et al. [251] using stereo cameras and a dense map representation.

Open problems: The problem of including semantic information in SLAM is in its infancy, and, contrary to metric SLAM, it still lacks a cohesive formulation.