
MeaningMiner


The study of Ravenet et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6046454/) focuses on representational gestures, i.e., gestures used to accompany and illustrate the content of speech. In particular, it presents an approach to automatically produce metaphoric gestures that are aligned with the agent's speech in terms of timing and meaning. The work addresses the following challenges:

  • identifying a common representation between speech and gestures that can be computationally manipulated
  • proposing a mechanism to extract semantic elements of this representation from the agent's speech
  • associating these elements with gesture characteristics
  • finally, combining these gesture characteristics and aligning them with the agent's speech.

In Wagner et al. (2014), the authors give an extensive review of work on communicative gestures, from psychology studies to computer systems. The results highlight how closely speech and gesture are tied together (in terms of meaning and timing). According to some theoretical models, such as McNeill's Growth Point Theory (McNeill, 1985), this can be explained by the fact that gestures and speech are produced by the same mental process. In particular, many studies have investigated the effect of embodied cognition on speech and gesture production (Hostetter and Alibali, 2008) and hypothesized the existence of a common mental imagery behind the two communicative channels (Kendon, 1980).

A gesture can be characterized by its physical constituents. The form of a gesture is described in terms of the shape of the hand and the orientation of the wrist and palm. A gesture can be made with one or two hands, symmetrically or in opposition. The movement of a gesture can be defined by its direction, its path and its dynamism. As mentioned by Kendon (1980), gestures exhibit different structures. At the level of a single gesture, there are different phases (e.g., preparation, stroke, hold and relaxation). Consecutive gestures can be co-articulated, meaning that the final phase of one gesture flows into the initial phase of the next. There is also a higher-level structure that corresponds to discourse segments in which consecutive gestures share some of their constituents and are kinetically segmented. It corresponds to the ideational structure introduced by Calbris (2011). In her theory, Calbris argues that discourse is composed of units of meaning and rhythm that she calls Ideational Units. Within an Ideational Unit, there is consistency between the gestures of the person, as they show similar properties.
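To make these constituents concrete, here is a minimal Java sketch of how a gesture and its phases could be represented as a data structure. The class and field names are illustrative assumptions, not Greta's actual classes.

```java
// Hypothetical sketch of gesture constituents and phases (not Greta's real API).
public class GestureSketch {

    // Phases of a single gesture, as described by Kendon (1980).
    enum Phase { PREPARATION, STROKE, HOLD, RELAXATION }

    // Form features of one hand configuration.
    record HandConfiguration(String handShape,       // e.g. "open-flat", "fist"
                             String palmOrientation,  // e.g. "palm-up"
                             String wristOrientation) {}

    // Movement features: direction, path and dynamism.
    record Movement(String direction, String path, double dynamism) {}

    // A gesture may involve one or two hands, symmetrically or in opposition.
    record Gesture(HandConfiguration rightHand,
                   HandConfiguration leftHand,   // null for one-handed gestures
                   boolean symmetric,
                   Movement movement,
                   java.util.List<Phase> phases) {}
}
```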

Within the literature on embodied cognition, the conceptualization hypothesis states that the way we mentally represent our world is constrained by our body (Wilson and Golonka, 2013). In other words, our embodied interactions with the world shape the conceptual representations we manipulate in our mind to ground abstract and concrete concepts. Johnson suggested that humans use recurring patterns of reasoning, called Image Schemas, to map these conceptual metaphors from one entity to another (Johnson, 1987). Image Schemas have also been studied by Grady in an attempt to explain how our perceptual mechanisms are at the origin of our metaphorical reasoning (Grady, 2005).

The authors therefore use Image Schemas as the basis for their representation, to bridge the speech of an ECA and its gestures.

While Image Schemas are good candidates for predicting gesture shapes, additional information is required to identify the most appropriate meaning to be aligned with the speech by the gesture production: not every Image Schema is turned into a gesture, a selection happens. Even if each word in isolation carried an embodied meaning represented by an Image Schema, people do not produce a gesture on every word. Prosodic and linguistic features of the speech appear to be the contextual markers that correlate with the Image Schema selection process (Wagner et al., 2014). Several works have shown that gesture and speech timings are close to each other but not exactly simultaneous.
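As an illustration of this selection idea, the following hypothetical Java sketch keeps only the Image Schema candidates whose carrier word is prosodically stressed. The types and the pitch-accent criterion are assumptions made for the example, not the exact selection mechanism of the paper.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: selecting which Image Schema candidates become gestures
// by keeping those aligned with prosodic stress. Class and field names are hypothetical.
public class ImageSchemaSelection {

    record Word(String text, double startTime, double endTime, boolean pitchAccented) {}
    record ImageSchemaCandidate(String schema, Word word) {}

    // Keep only candidates whose carrier word is prosodically stressed:
    // not every embodied word is gestured, selection is driven by contextual markers.
    static List<ImageSchemaCandidate> select(List<ImageSchemaCandidate> candidates) {
        return candidates.stream()
                .filter(c -> c.word().pitchAccented())
                .collect(Collectors.toList());
    }
}
```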

The authors propose an architecture for automatically computing communicative gestures, inspired by the different aspects of these challenges investigated by previous researchers. The model takes into account the linguistic structure, the prosodic information and a representation of the meaning conveyed by the agent's speech to derive gesture characteristics, which are combined into coherent gesture phrases thanks to an Ideational Unit mechanism. The model is designed to integrate a richer representation of Image Schemas and to be integrated into an agent system that computes in real time the multimodal behaviors linked to additional communicative functions (such as showing emotional states and attitudes).

The model proposed is organized around the concept of Image Schemas as the intermediate language between the verbal and nonverbal channels.

[Figure: framework architecture]

The image above shows the framework architecture: the Image Schemas are retrieved from the text and combined with prosodic markers to generate gestures. Reproduced with the permission of the copyright holder IFAAMAS.

The architecture is composed of three levels:

  1. the Image Schema extractor,
  2. the gesture modeler, and
  3. the behavior realizer supporting Ideational Units.

The Image Schema extraction component has the task of identifying the Image Schemas from the surface text of the agent's speech and of aligning them properly with the spoken utterance (for later gesture alignment).
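A minimal sketch of what such an extraction step could look like, assuming a simple lemma-to-schema lookup table; the table entries, schema names and class names are illustrative and do not reflect the actual MeaningMiner resources.

```java
import java.util.*;

// Minimal illustrative sketch of Image Schema extraction: each word of the surface
// text is looked up in a lemma-to-schema table, and a match is kept together with
// its word index so the schema stays aligned with the spoken utterance.
public class ImageSchemaExtractorSketch {

    // Hypothetical lookup table; the real extractor uses richer linguistic resources.
    private static final Map<String, String> LEMMA_TO_SCHEMA = Map.of(
            "grow", "SCALE",
            "inside", "CONTAINMENT",
            "together", "LINK",
            "up", "VERTICALITY");

    record AlignedSchema(String schema, int wordIndex) {}

    static List<AlignedSchema> extract(List<String> lemmas) {
        List<AlignedSchema> result = new ArrayList<>();
        for (int i = 0; i < lemmas.size(); i++) {
            String schema = LEMMA_TO_SCHEMA.get(lemmas.get(i).toLowerCase(Locale.ROOT));
            if (schema != null) {
                result.add(new AlignedSchema(schema, i)); // keep word index for gesture timing
            }
        }
        return result;
    }
}
```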

After obtaining a list of aligned Image Schemas for a sequence of spoken text, the gesture modeler builds the corresponding gestures. The first step is to retrieve the gesture invariants used to build the final gestures. According to the literature, the typical features of a gesture are: hand shape, orientation, movement and position in gesture space (Bressem, 2013).

In the current work, for each Image Schema the authors determine which features are needed to express its meaning and how it is expressed. For this task, they propose a dictionary that maps each Image Schema to its corresponding invariants (the features that must not be altered to properly express the meaning). This dictionary is depicted in the following image.

[Figure: dictionary mapping each Image Schema to its gesture invariants]
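The idea of the dictionary can be sketched as a simple mapping from each Image Schema to the set of features that must be preserved, the remaining features staying free to vary. The schema names and feature assignments below are made-up examples, not the published table.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

// Illustrative sketch of the invariant dictionary: each Image Schema is mapped to the
// gesture features that must be preserved to convey its meaning; all other features
// remain free and can be adapted later by the Ideational Unit mechanism.
public class InvariantDictionarySketch {

    enum Feature { HAND_SHAPE, ORIENTATION, MOVEMENT, POSITION }

    enum ImageSchema { CONTAINER, UP, DOWN, BIG, SMALL }   // example schemas only

    static final Map<ImageSchema, EnumSet<Feature>> INVARIANTS = new EnumMap<>(ImageSchema.class);

    static {
        INVARIANTS.put(ImageSchema.CONTAINER, EnumSet.of(Feature.HAND_SHAPE));                 // e.g. a cupped hand
        INVARIANTS.put(ImageSchema.UP,        EnumSet.of(Feature.MOVEMENT, Feature.POSITION)); // upward movement, upper space
        INVARIANTS.put(ImageSchema.DOWN,      EnumSet.of(Feature.MOVEMENT, Feature.POSITION));
        INVARIANTS.put(ImageSchema.BIG,       EnumSet.of(Feature.POSITION));                   // wide hand position
        INVARIANTS.put(ImageSchema.SMALL,     EnumSet.of(Feature.POSITION));
    }
}
```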

The final layer of the framework has the role of combining the gestures composed by the previous components to produce the final animation of the virtual agent.

They define a system that follows the Ideational Unit model proposed by Calbris (2011) and the computational model of Xu et al. (2014). The system performs the following main functions (illustrated in the sketch after the list):

  1. it co-articulates gestures within an Ideational Unit by computing either a hold or an intermediate relaxed pose between successive gestures (instead of returning to a rest pose),
  2. it transfers properties of the main gesture onto the variant properties of the other gestures of the same Ideational Unit,
  3. it ensures that a meaning expressed through an invariant is carried on the same hand throughout an Ideational Unit, and
  4. it dynamically raises the speed and amplitude of repeated gestures.
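A minimal Java sketch of these four coordination steps, assuming hypothetical Gesture and Pose placeholder types; none of these names come from Greta's code.

```java
import java.util.List;

// Illustrative sketch of the Ideational Unit coordination steps listed above.
// The Gesture type and its methods are hypothetical placeholders, not Greta classes.
public class IdeationalUnitSketch {

    static void realize(List<Gesture> unit, Gesture mainGesture) {
        for (int i = 0; i < unit.size(); i++) {
            Gesture g = unit.get(i);

            // 1. Co-articulate: between successive gestures, hold or relax instead of resting.
            if (i < unit.size() - 1) {
                g.setEndPose(g.overlapsNext(unit.get(i + 1)) ? Pose.HOLD : Pose.RELAXED);
            }

            // 2. Transfer the main gesture's properties onto the variant (non-invariant) features.
            g.copyVariantFeaturesFrom(mainGesture);

            // 3. Keep each invariant meaning on the same hand throughout the unit.
            g.assignHand(mainGesture.handFor(g.invariant()));

            // 4. Amplify repetitions: raise speed and amplitude of repeated gestures.
            if (g.isRepetitionOf(mainGesture)) {
                g.scaleSpeed(1.2);
                g.scaleAmplitude(1.2);
            }
        }
    }

    // Placeholder types so the sketch is self-contained.
    enum Pose { HOLD, RELAXED }

    interface Gesture {
        boolean overlapsNext(Gesture next);
        void setEndPose(Pose p);
        void copyVariantFeaturesFrom(Gesture main);
        String invariant();
        String handFor(String invariant);
        void assignHand(String hand);
        boolean isRepetitionOf(Gesture main);
        void scaleSpeed(double factor);
        void scaleAmplitude(double factor);
    }
}
```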

The system reads an XML-based text file (a Behavior Markup Language (BML) document, as described in Vilhjálmsson et al., 2007) that describes the agent's speech marked up with prosodic and Ideational Unit information, and produces the complete animation together with the audio using a Text-To-Speech component.

[Figure]

To know more about this study, see the paper: B. Ravenet, C. Pelachaud, C. Clavel, S. Marsella: Automating the production of communicative gestures in embodied characters, Frontiers in Psychology, vol. 9, 2018.

Greta platform and ImageSchemaExtractor

To use the model implemented in the study above with the Greta platform, two modules can be used: **ImageSchemaExtractor** (Add/Input/MeaningMiner/) or **FMLReceiver_MeaningMiner** (Add/NetworkConnections/ActiveMQ/Receivers/).

ImageSchemaExtractor

The figure below shows how to connect the module to the basic configuration.

[Figure: connecting the ImageSchemaExtractor module to the basic configuration]

The module allows you to open a BML file, apply the model to create the automatic gestures from the speech, and send the intentions to the BehaviorPlanner to be processed and performed by the agent.

[Figure]

Note that the BML file opened by the ImageSchemaExtractor has to be in a specific format. An example can be found here: https://github.com/isir/greta/blob/multiCharacters/bin/Examples/DemoEN/MeaningMiner_BML_Example.xml

FMLReceiver_MeaningMiner

This module receives an FML message via ActiveMQ and translates the speech into the right format to be processed by the ImageSchemaExtractor model. The automatic gestures are computed and added to the FML file. The file is then sent to the BehaviorPlanner and processed like a generic FML.

[Figure]

[Figure]

Note that if an ActiveMQ server is not already running, you need to add the ActiveMQBroker module (Add/NetworkConnections/ActiveMQ/Broker) to the configuration.
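For reference, here is a hedged sketch of how an FML document could be sent to the receiver over ActiveMQ using the standard JMS API (assuming the activemq-client library is on the classpath). The broker URL, topic name and FML payload are placeholders and must match your Greta configuration.

```java
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

// Illustrative JMS sketch of sending an FML document to FMLReceiver_MeaningMiner
// over ActiveMQ. Broker URL, topic name and FML payload are placeholders: use the
// values configured in your Greta ActiveMQ receiver.
public class SendFmlSketch {
    public static void main(String[] args) throws JMSException {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Destination topic = session.createTopic("greta.fml.input"); // placeholder topic name
            MessageProducer producer = session.createProducer(topic);

            // Placeholder FML content; a real message must follow Greta's FML-APML format.
            String fml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><fml-apml>...</fml-apml>";
            producer.send(session.createTextMessage(fml));
        } finally {
            connection.close();
        }
    }
}
```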
