DMD distraction-related action annotation criteria

The DMD dataset contains events of different kinds: distraction, drowsiness, hands and gaze. In this section, we present the criteria used to produce the currently available DMD annotations. This covers only the temporal distraction-related annotations.

DMD video streams

The DMD dataset is composed of synchronized video streams from 3 different cameras. Each camera was placed to capture the activity in a certain region of the vehicle's cabin, focusing on a particular part of the driver. Namely, one stream captures the body activity, one the face and head, and another the hands' activity. We therefore call these streams the body, face and hands cameras, respectively.

To annotate the recording sessions we created a mosaic video which synchronously merges the body, face and hands camera streams. This mosaic video should be passed to the temporal annotation tool (TaTo) to start annotating the sequence or correct a previously annotated session.

Annotation Levels

The defined levels describe temporal actions or events that occur while the driver performs distraction-related actions. To annotate these temporal actions, we have defined 7 levels of annotation. That is, there are 7 types of annotation that can be present simultaneously to describe each frame. Each level has its own set of labels. Within a level the labels are mutually exclusive: each frame can carry at most one label per level.

The distraction-related annotation levels are:

[Figure: distraction-levels table]

Some annotation levels require a label for every frame in the video (shown as a full cell in the table above), while others may have intervals without any label (shown as a shorter filled cell in the table above).

The Annotation Levels that DON’T need annotations for all frames are:

0: Occlusion in cameras
2: Driver is Talking
4: Hand on Gear
5: Objects in the scene

The Annotation Levels that MUST have annotations for all frames are:

1: Gaze on Road
3: Hands using Wheel
6: Driver actions
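
The coverage requirement above can be checked mechanically. The following is a minimal sketch, assuming the annotations are held in memory as one Python list per level, with `None` marking a frame without a label. This layout and the name `find_unlabelled_frames` are illustrative assumptions, not the format used by TaTo.

```python
from typing import Dict, List, Optional

# Level ids and names as listed above.
MANDATORY_LEVELS = {1: "Gaze on Road", 3: "Hands using Wheel", 6: "Driver actions"}
OPTIONAL_LEVELS = {0: "Occlusion in cameras", 2: "Driver is Talking",
                   4: "Hand on Gear", 5: "Objects in the scene"}

# Hypothetical layout: level id -> one label (or None) per frame.
Annotations = Dict[int, List[Optional[str]]]


def find_unlabelled_frames(annotations: Annotations) -> Dict[int, List[int]]:
    """Return, for each mandatory level, the frame indices that lack a label."""
    missing: Dict[int, List[int]] = {}
    for level in MANDATORY_LEVELS:
        labels = annotations.get(level, [])
        gaps = [i for i, label in enumerate(labels) if label is None]
        if gaps:
            missing[level] = gaps
    return missing


example = {
    1: ["Looking at the road", "Looking at the road", None],
    2: [None, "Talking", None],  # optional level: gaps are allowed
    3: ["Both", "Both", "Only Left"],
    6: ["Safe Driving", "Safe Driving", "Drinking"],
}
print(find_unlabelled_frames(example))  # {1: [2]}
```

Optional levels are deliberately ignored by the check, since unlabelled frames are valid there.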

Apply automatic annotation interpolation ⚠️

Some levels contain labels that are closely tied to labels from other levels, although they are not semantically identical. For instance, Drinking from the level Driver actions normally requires the driver to have the right hand occupied by the bottle, so during this action the Hands using Wheel level can be annotated as Only Left. The TaTo tool was developed to apply such rules when annotating distraction-related actions, which speeds up the whole annotation process.

This interpolation is domain-specific and depends on the semantic characteristics of the labels. For the distraction-related levels defined in the DMD dataset, annotating the level Driver actions is the starting point, since the implemented rules are then applied to the rest of the levels.

Applying automatic annotation interpolation is optional and can help speed up the annotation process. However, be aware that this option can overwrite previous manual annotations, so proceed as follows to apply it successfully:

1. Manually annotate the level Driver actions.
2. When step 1 is finished, press X to apply the rules and generate pre-annotations in the other levels.
3. Continue annotating any of the other levels.
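
The idea behind these rules can be sketched as a lookup from a Driver actions label to a Hands using Wheel label. This is only an illustration under stated assumptions: the rule table contains just the Drinking example given above, and the function `pre_annotate_hands` and its `overwrite` flag are hypothetical, not TaTo's actual implementation or behaviour.

```python
from typing import List, Optional

# Rule table: Driver actions label -> Hands using Wheel label.
# Only the Drinking rule is stated in this document; the real rule set
# applied by TaTo is not reproduced here.
ACTION_TO_HANDS = {"Drinking": "Only Left"}


def pre_annotate_hands(actions: List[Optional[str]],
                       hands: List[Optional[str]],
                       overwrite: bool = False) -> List[Optional[str]]:
    """Derive Hands using Wheel labels from Driver actions labels.

    Both lists hold one label (or None) per frame and must have equal length.
    With overwrite=False, frames that already carry a label are kept.
    """
    result = list(hands)
    for i, action in enumerate(actions):
        derived = ACTION_TO_HANDS.get(action)
        if derived is None:
            continue  # no rule for this action: leave the frame untouched
        if result[i] is None or overwrite:
            result[i] = derived
    return result


actions = ["Safe Driving", "Drinking", "Drinking"]
hands = ["Both", None, "Both"]  # frame 2 was labelled manually
print(pre_annotate_hands(actions, hands))
# ['Both', 'Only Left', 'Both']
print(pre_annotate_hands(actions, hands, overwrite=True))
# ['Both', 'Only Left', 'Only Left']
```

The second call shows why the order of the steps matters: applying rules after manual work on other levels can replace labels that were set by hand.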

Annotation instructions

The following sections describe the criteria to be taken when annotating distraction-related actions in the DMD dataset.

Level 0: Occlusion in cameras

An occlusion is an event that happens when more than roughly 50-60% of the camera view is covered by the driver's own body or any other object and the scene is not recognizable. Since the dataset contains streams from 3 different cameras and each camera focuses on a specific part of the driver (i.e. face, body and hands), special attention should be given to the part of the driver that the camera targets. For instance, if the hands and wheel cannot be recognized in the hands video, then there is an occlusion.

Streams used for annotation

To annotate this level, all three streams (face camera, body camera and hands camera) should be considered equally to assign the corresponding labels.

Labels

If a frame shows an occlusion in one of the cameras, label it with one of the following labels:

Key	Label	Description
`0`	Face occlusion	The face camera stream is occluded and the action the driver is performing cannot be recognized.
`1`	Body occlusion	The body camera stream is occluded and the action the driver is performing cannot be recognized.
`2`	Hands occlusion	The hands camera stream is occluded and the action the driver is performing cannot be recognized.

Examples

Occlusion	No Occlusion

Special remarks

✔️ There may be some ambiguity when deciding whether there is an occlusion, especially in the hands camera, since some actions such as talking on the phone or hair and makeup can occlude part of the scene. However, if the driver's actions can still be recognized with certainty, it should not be considered an occlusion.

✔️ In this level, only one camera can be annotated as occluded at a time. We have not observed any case of simultaneous occlusion in two or three video streams.

⚠️ If the current frame has no occlusion in any of the cameras, leave it without a label. To clear a label press .

Level 1: Gaze on road

In this level, the goal is to determine whether the driver is devoting all his/her visual attention to driving. This is decided by identifying whether the driver is looking at the road or at driving-related zones (rear mirrors, left/right windows to check for other cars) or elsewhere (mobile phone, radio, lap, wheel, behind).

Streams used for annotation

To annotate this level, the face camera is primarily used, although the body camera can help resolve doubtful cases.

Labels

Key	Label	Description
`0`	Looking at the road	The driver is looking at the road ahead, rear mirrors, or left/right windows.
Examples

`1`	Not Looking at the road	The driver is not looking at the road elements.
Examples

Special remarks

✔️ If the driver is continuously looking at the road and brief eye closures occur (i.e. fast blinks), these short events should still be considered as looking at the road.

⚠️ Every frame must have a label for this level.
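
When gaze labels come from a pre-annotation or a previous pass, the blink remark above can be applied as a small post-processing step. The sketch below relabels short runs of other labels that are surrounded by "Looking at the road" frames. The function name `absorb_short_gaps` and the `max_gap` parameter are assumptions made for illustration; this document does not define how many frames count as a fast blink, so the threshold must be chosen by the annotator.

```python
from typing import List

LOOKING = "Looking at the road"
NOT_LOOKING = "Not Looking at the road"


def absorb_short_gaps(labels: List[str], looking: str = LOOKING,
                      max_gap: int = 5) -> List[str]:
    """Relabel runs of non-`looking` frames of length <= max_gap as `looking`,
    but only when the run has `looking` frames on both sides."""
    result = list(labels)
    n = len(result)
    i = 0
    while i < n:
        if result[i] == looking:
            i += 1
            continue
        j = i
        while j < n and result[j] != looking:
            j += 1
        # The run of non-`looking` frames is [i, j).
        bounded = i > 0 and j < n
        if bounded and (j - i) <= max_gap:
            for k in range(i, j):
                result[k] = looking
        i = j
    return result


labels = [LOOKING, NOT_LOOKING, NOT_LOOKING, LOOKING,
          NOT_LOOKING, NOT_LOOKING, NOT_LOOKING, NOT_LOOKING, LOOKING]
smoothed = absorb_short_gaps(labels, max_gap=2)
print("".join("L" if x == LOOKING else "N" for x in smoothed))  # LLLLNNNNL
```

Runs at the very start or end of the sequence are left unchanged, because there is no evidence on one side that the driver was looking at the road.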

Level 2: Driver is talking

In this level, it is necessary to know whether the driver is talking.

Streams used for annotation

To annotate this level, the face camera is primarily used, although the body camera can help resolve doubtful cases.

Labels

Key	Label	Description
`0`	Talking	The driver is moving his/her lips in an action that clearly corresponds to talking.
Examples

Special remarks

✔️ Put a label only when the driver is talking.

⚠️ If the driver is not talking in the current frame, leave it without a label. To clear a label press .

Level 3: Hands using wheel

In this level, it is necessary to annotate how the driver's hands are involved in the driving task. This can be ambiguous; it helps to consider how much each hand interacts with the steering wheel.

Streams used for annotation

To annotate this level, the hands camera is primarily used, although the body camera can be useful to validate the annotation.

Labels

Key	Label	Description
`0`	Both	Both hands participate in the driving task.
Examples

`1`	Only Right	Only right hand participates in the driving task.
Examples

`2`	Only Left	Only left hand participates in the driving task.
Examples

`3`	None	No hand participates in the driving task.
Examples

Special remarks

✔️ If the driver releases the steering wheel with one or both hands momentarily but their hands are positioned to perform the driving task, then it should be considered that his/her hand or hands are using the wheel.

⚠️ Every frame must have a label for this level.

Level 4: Hand on Gear

In this level, it is necessary to know whether the driver has his/her hand on the gear.

Streams used for annotation

To annotate this level, both the hands camera and the body camera can be useful.

Labels

Key	Label	Description
`0`	Hand on gear	The driver has his/her hand on the gear lever.
Examples

Special remarks

✔️ Put the label when the driver is either holding the gear lever or is changing the gear.

⚠️ If the driver is not doing either of these actions, leave the frame without annotation. To clear a label press .

Level 5: Objects in the scene

In this level, we want to know if there is an object visible in the scene (captured by any of the cameras).

Streams used for annotation

To annotate this level, all three cameras should be observed carefully to check for the presence of the target objects.

Labels

Key	Label	Description
`0`	Cellphone	A cellphone appears in the scene
Examples

`1`	Hair Comb	A hair comb appears in the scene
Examples

`2`	Bottle	A bottle appears in the scene
Examples

Special remarks

✔️ Put any of these labels only when the object is fully or partially visible in any of the three video streams.

⚠️ If no object from the list is visible in the current frame, leave the frame without a label. To clear a label press .

Level 6: Driver Actions

In this level, we want to know which of the listed activities the driver is doing. There are 13 defined activities, each with specified characteristics. An additional label "unclassified" is available to annotate any other activity the driver might be doing that doesn't correspond to the defined actions. To annotate this level you may start from previously annotated data, in which case the goal is to correct any mistakes in the previous annotations.

Streams used for annotation

To annotate this level, the body camera is the most useful stream to identify actions. However, the face camera and hands camera can support the annotation process.

Labels

Key	Label	Description
`0`	Safe Driving	When driving without doing any other activities
Examples

`1`	Texting (Right)	Starts when the driver has just grabbed the phone and holds it at the wheel level. Finishes when the person intends to put the phone aside (starting the "reach side" action) and his/her hand is halfway between the wheel and the side space.
Examples

`2`	Phone call (Right)	Starts when the driver has just grabbed the phone and starts to raise it from the wheel level to his/her head. Finishes when the person puts down the phone and intends to start the “reach side” action and his/her hand is halfway between the wheel and the side space. If the driver holds the phone at the wheel level for more than 15 frames, then it should be labeled as Texting (Left or Right, depending on which hand the driver is holding the cell phone with).
Examples

`3`	Texting (Left)	Starts when the driver has just grabbed the phone and holds it at the wheel level. Finishes when the person intends to put the phone aside (starting the "reach side" action) and his/her hand is halfway between the wheel and the side space.
Examples

`4`	Phone call (Left)	Starts when the driver has just grabbed the phone and starts to raise it from the wheel level to his/her head. Finishes when the person puts down the phone and intends to start the “reach side” action and his/her hand is halfway between the wheel and the side space. If the driver holds the phone at the wheel level for more than 15 frames, then it should be labeled as Texting (Left or Right, depending on which hand the driver is holding the cell phone with).
Examples

`5`	Radio	Starts when the driver’s hand is halfway between the wheel and the radio. Finishes when the driver’s hand is halfway between the radio and the wheel. During the whole action the driver should be interacting with the infotainment system.
Examples

`6`	Drinking	Starts when the driver has just grabbed the bottle and starts to raise it to his/her head. Finishes when the driver puts down the bottle and intends to start the “reach side” action and his/her hand is halfway between the wheel and the side space. If the driver holds the bottle at the wheel level (in the lap) for more than 15 frames, then it should be labeled as unclassified.
Examples

`7`	Reach Side	Starts when the driver intends to turn right and grab/leave something from/to the side (cellphone, bottle or hair comb), and his/her hand is halfway between the wheel and the side space. Finishes when the driver returns to his/her position. This action should be annotated before and after the actions that involve interactions with objects.
Examples

`8`	Hair and Makeup	Starts when the driver has just grabbed the hair comb and starts moving the hand towards his/her head. Finishes when the driver has put down the hair comb and intends to start the “reach side” action to put the hair comb aside and his/her hand is halfway between the wheel and the side space. If the driver holds the hair comb at the wheel level for more than 15 frames, then it should be labeled as unclassified.
Examples

`9`	Talking to passenger	Starts when the driver starts turning his/her head to look at and talk to the passenger on the right. Finishes when the driver turns his/her head back to look at the road (if the person keeps talking while looking at the road, it should be considered “Safe Driving”). While driving, the driver may turn his/her head and talk to the passenger several times; each time should be considered a separate action. If the driver turns his/her head without talking and the movement lasts less than 15 frames, it should be considered “Safe Driving”.
Examples

`/`	Reach Backseat	Starts when the driver starts turning his/her body to reach the backseat. Finishes when the driver returns to his/her normal position.
Examples

`*`	Change Gear	Starts when the driver intends to change gear and his/her hand is halfway between the wheel and the gear. Finishes when the driver returns to his/her position and his/her hand is halfway between the gear and the wheel. If the driver leaves his/her hand on the gear, the action finishes when the arm movement has stopped.
Examples

`-`	Stand Still-Waiting	Starts when the driver performs none of the above activities, has no hands on the wheel and is quiet or calm. Finishes when the driver starts performing any activity.
Examples

`+`	Unclassified	When the driver is performing any other activity that doesn’t correspond to those defined above (e.g. clapping at the beginning, holding an object).
Examples
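
Several of the criteria above depend on a 15-frame duration. A per-frame label sequence can be collapsed into intervals to check such durations. The following is a minimal sketch, assuming the labels of this level are available as a plain Python list with one label per frame; the helper names `to_intervals` and `short_runs` are hypothetical. It only flags candidates for manual review, since it cannot tell, for example, whether the driver was talking during a head turn.

```python
from itertools import groupby
from typing import List, Tuple


def to_intervals(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Collapse per-frame labels into (start, end, label) runs; end is inclusive."""
    intervals = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        intervals.append((start, start + length - 1, label))
        start += length
    return intervals


def short_runs(labels: List[str], target: str,
               min_frames: int = 15) -> List[Tuple[int, int]]:
    """Return (start, end) of runs of `target` shorter than min_frames."""
    return [(s, e) for s, e, label in to_intervals(labels)
            if label == target and e - s + 1 < min_frames]


labels = (["Safe Driving"] * 20 + ["Talking to passenger"] * 10
          + ["Safe Driving"] * 5 + ["Talking to passenger"] * 30)
print(short_runs(labels, "Talking to passenger"))  # [(20, 29)]
```

Here the 10-frame run is reported for review, while the 30-frame run is long enough and is left alone.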