A brief overview of the architecture of Ingress.
Ingress is composed of two main components, the Self-Referential module and the Relational module, which are connected through ROS actionlib. The Self-Referential module uses a pretrained DenseCap model to generate object proposals (bounding boxes), self-referential captions, and grounding losses for a given self-referential expression. The Relational module selects a cluster of relevant objects using the computed grounding losses and the METEOR scores between the captions and the input expression. Finally, the Relational module uses the feature vectors and bounding boxes to ground a pair of objects. For ambiguous scenarios, the generated self-referential and relational captions are used to ask questions.
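As a rough illustration of the object-selection step in the Relational module, the sketch below combines per-object grounding losses with METEOR scores and keeps the higher-scoring cluster. It is not the Ingress implementation: the function names (`two_means_1d`, `select_relevant`), the min-max normalization, the `weight` parameter, and the use of 1-D 2-means are all assumptions made for this example, and the input numbers are made up.

```python
from typing import List, Sequence


def two_means_1d(values: Sequence[float], iters: int = 20) -> List[int]:
    """Label each value 0 (low cluster) or 1 (high cluster) using 1-D 2-means."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [1] * len(values)  # all values identical: keep everything
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to the nearer centroid (ties go to the low cluster).
        labels = [1 if abs(v - hi) < abs(v - lo) else 0 for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return labels


def select_relevant(
    grounding_losses: Sequence[float],
    meteor_scores: Sequence[float],
    weight: float = 0.5,  # hypothetical mixing weight, not from Ingress
) -> List[int]:
    """Return indices of object proposals in the more relevant cluster."""
    if len(grounding_losses) != len(meteor_scores):
        raise ValueError("one loss and one METEOR score per object expected")
    if not grounding_losses:
        return []
    lo, hi = min(grounding_losses), max(grounding_losses)
    span = (hi - lo) or 1.0
    # A lower grounding loss means a better match, so invert it into a score.
    loss_scores = [1.0 - (loss - lo) / span for loss in grounding_losses]
    combined = [
        weight * s + (1.0 - weight) * m
        for s, m in zip(loss_scores, meteor_scores)
    ]
    labels = two_means_1d(combined)
    return [i for i, lab in enumerate(labels) if lab == 1]


if __name__ == "__main__":
    # Four made-up proposals: two match the expression well, two do not.
    losses = [0.2, 0.3, 2.5, 3.0]
    meteor = [0.8, 0.7, 0.1, 0.05]
    print(select_relevant(losses, meteor))
```

In this toy setup a lower grounding loss and a higher METEOR score both push a proposal toward the relevant cluster; how the two signals are actually weighted and clustered in Ingress is described in the paper.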
See the paper for more details.