DMD distraction related action annotation criteria
The DMD dataset contains events of different natures: distraction, drowsiness, hands and gaze. This section presents the criteria used to produce the currently available DMD annotations, which cover only temporal distraction-related annotations.
The DMD dataset is composed of synchronized video streams from 3 different cameras. Each camera was placed to capture the activity in a certain region of the vehicle's cabin, focusing on one part of the driver: one stream captures the body activity, one the face and head, and one the hands. We therefore call these streams the body, face and hands cameras, respectively.
To annotate the recording sessions, we created a mosaic video that synchronously merges the body, face and hands camera streams. This mosaic video should be passed to the temporal annotation tool (TaTo) to start annotating the sequence or to correct a previously annotated session.
To annotate the temporal actions or events that occur while the driver performs distraction-related actions, we have defined 7 levels of annotation. That is, up to 7 annotations, one per level, can be present simultaneously to describe each frame. Each level of annotation has its own set of labels. Within a level the labels are mutually exclusive: each frame can carry at most one label per level.
The distraction-related annotation levels are:
Some annotation levels require a label for every frame in the video (represented by a full cell in the previous table), while other levels may contain intervals without labels (represented by a shorter filled cell in the previous table).
The Annotation Levels that DON’T need annotations for all frames are:
- 0: Occlusion in cameras
- 2: Driver is Talking
- 4: Hand on Gear
- 5: Objects in the scene
The Annotation Levels that MUST have annotations for all frames are:
- 1: Gaze on Road
- 3: Hands using Wheel
- 6: Driver actions
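The split between mandatory and optional levels can be checked automatically. The sketch below is only an illustration: the in-memory layout (one list of labels per level, `None` for an unlabelled frame) and the helper name `find_missing_frames` are assumptions made here, not TaTo's actual data format or API.

```python
from typing import Dict, List, Optional

# Level ids and names as listed above; True marks levels that need a label on every frame.
LEVELS = {
    0: ("Occlusion in cameras", False),
    1: ("Gaze on Road", True),
    2: ("Driver is Talking", False),
    3: ("Hands using Wheel", True),
    4: ("Hand on Gear", False),
    5: ("Objects in the scene", False),
    6: ("Driver actions", True),
}

# Hypothetical layout: level id -> one entry per frame, holding a single label
# (labels within a level are mutually exclusive) or None when the frame is unlabelled.
Annotations = Dict[int, List[Optional[str]]]


def find_missing_frames(annotations: Annotations, num_frames: int) -> Dict[int, List[int]]:
    """Return, for each mandatory level, the frame indices that have no label."""
    missing: Dict[int, List[int]] = {}
    for level, (_name, mandatory) in LEVELS.items():
        if not mandatory:
            continue  # optional levels may legitimately contain gaps
        labels = annotations.get(level, [None] * num_frames)
        gaps = [i for i in range(num_frames) if i >= len(labels) or labels[i] is None]
        if gaps:
            missing[level] = gaps
    return missing
```

For a 4-frame clip where frame 2 of Hands using Wheel is unlabelled, the function reports level 3 with frame index 2, while a gap in Driver is Talking is ignored because that level is optional.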
Some levels contain labels that share characteristics with labels from other levels, although they are semantically different. For instance, Drinking from the Driver actions level normally requires the driver's right hand to be occupied by the bottle, so during this action the Hands using Wheel level may be annotated as Only Left. The TaTo tool was developed to include such rules when annotating distraction-related actions, thereby speeding up the whole annotation process.
This interpolation of annotations across levels is domain-specific and depends on the semantic characteristics of the labels. For the distraction-related levels defined in the DMD dataset, the Driver actions level is the starting point: once it is annotated, the implemented rules are applied to the rest of the levels.
Applying automatic annotation interpolation is optional and can help speed up the annotation process. Be aware, however, that it can overwrite previous manual annotations, so proceed as follows to apply it safely:
1. Manually annotate the Driver actions level.
2. After finishing step 1, press X to apply the rules and generate pre-annotations in the other levels.
3. Continue annotating any of the other levels.
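The rule application in the steps above can be sketched as follows. This is a minimal illustration, not TaTo's implementation: the `RULES` table contains only the Drinking example mentioned in the text, and the function name, the `overwrite` flag and the dictionary layout are assumptions introduced here. The flag is included to show why applying the rules after manual work on other levels can discard that work.

```python
from typing import Dict, List, Optional

# Hypothetical rule table: Driver actions label -> {target level: implied label}.
# Only the Drinking rule comes from the text; the real rule set in TaTo is not reproduced here.
RULES: Dict[str, Dict[str, str]] = {
    "Drinking": {"Hands using Wheel": "Only Left"},
}


def apply_rules(
    driver_actions: List[Optional[str]],
    other_levels: Dict[str, List[Optional[str]]],
    overwrite: bool = False,
) -> Dict[str, List[Optional[str]]]:
    """Derive pre-annotations for other levels from the Driver actions level.

    Assumes every label list has the same length as driver_actions.
    Returns a new dict; with overwrite=False existing labels are preserved.
    """
    result = {level: list(labels) for level, labels in other_levels.items()}
    for frame, action in enumerate(driver_actions):
        for level, implied in RULES.get(action, {}).items():
            labels = result.setdefault(level, [None] * len(driver_actions))
            if overwrite or labels[frame] is None:
                labels[frame] = implied
    return result
```

With `overwrite=False` only unlabelled frames are filled; with `overwrite=True` a manually set label on a Drinking frame would be replaced, which is the risk the procedure above avoids by annotating Driver actions first.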
The following sections describe the criteria to follow when annotating distraction-related actions in the DMD dataset.
An occlusion is an event in which more than 50-60% of the camera view is covered by the driver's own body or another object, so that the scene is not recognizable. Since the dataset contains streams from 3 different cameras and each camera focuses on a specific part of the driver (i.e. face, body or hands), special attention should be paid to the part of the driver that each camera targets. For instance, if the hands and wheel cannot be recognized in the hands video, there is an occlusion.
To annotate this level (Occlusion in cameras), all three streams (face camera, body camera and hands camera) should be considered equally to assign the corresponding labels.
If there is a frame where there is an occlusion in one of the cameras, you should label the frame with one of the following labels:
Key | Label | Description |
---|---|---|
0 | Face occlusion | The stream from the face camera is occluded and the action the driver is performing cannot be recognized. |
1 | Body occlusion | The stream from the body camera is occluded and the action the driver is performing cannot be recognized. |
2 | Hands occlusion | The stream from the hands camera is occluded and the action the driver is performing cannot be recognized. |
✔️ Deciding whether there is an occlusion can be ambiguous, especially in the hands camera, since some actions such as talking on the phone or hair and makeup can occlude part of the scene. However, if the driver's actions can be recognized with certainty, it should not be considered an occlusion.
✔️ In this level, only one camera can be annotated as occluded. We have not observed any case of simultaneous occlusion in two or three video streams.
In this level (Gaze on Road), it is necessary to determine whether the driver is devoting all of his/her visual attention to driving. This is done by identifying whether the driver is looking at the road or driving-related zones (rear mirrors, left/right windows to check for other cars) or elsewhere (mobile phone, radio, lap, wheel, behind).
To annotate this level, the face camera is primarily used, although the body camera can be useful to resolve doubtful cases.
Key | Label | Description |
---|---|---|
0 | Looking at the road | The driver is looking at the road ahead, the rear mirrors or the left/right windows. |
1 | Not Looking at the road | The driver is not looking at the road elements. |
✔️ If the driver is continuously looking at the road and brief closed-eye events occur (i.e. fast blinks), these brief events should be considered as looking at the road.
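If per-frame gaze labels come from an automatic pre-annotation step, the blink criterion above can be approximated by absorbing very short "Not Looking at the road" runs that are flanked by "Looking at the road". The sketch below is an assumption-laden illustration: the function name and the `max_gap` value are hypothetical, since the criteria give no frame count for a fast blink.

```python
from itertools import groupby
from typing import List

ROAD = "Looking at the road"
NOT_ROAD = "Not Looking at the road"


def absorb_short_gaps(labels: List[str], max_gap: int = 5) -> List[str]:
    """Relabel short NOT_ROAD runs surrounded by ROAD runs as ROAD.

    max_gap is a hypothetical blink length in frames, not a value from the criteria.
    """
    # Run-length encode the per-frame labels: [(label, run_length), ...]
    runs = [(label, len(list(group))) for label, group in groupby(labels)]
    smoothed: List[str] = []
    for i, (label, length) in enumerate(runs):
        flanked = (
            0 < i < len(runs) - 1
            and runs[i - 1][0] == ROAD
            and runs[i + 1][0] == ROAD
        )
        if label == NOT_ROAD and length <= max_gap and flanked:
            label = ROAD  # treat the brief interruption as a blink
        smoothed.extend([label] * length)
    return smoothed
```

Longer runs, and short runs at the very start or end of the sequence, are left untouched so that genuine glances away from the road still need a manual decision.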
In this level (Driver is Talking), it is necessary to know whether the driver is talking.
To annotate this level, the face camera is primarily used, although the body camera can be useful to resolve doubtful cases.
Key | Label | Description |
---|---|---|
0 | Talking | The driver is moving his/her lips in an action that clearly corresponds to talking. |
✔️ Put a label only when the driver is talking.
In this level (Hands using Wheel), annotate how the driver's hands are involved in the driving task. This can be confusing; it helps to think about how much each hand interacts with the steering wheel.
To annotate this level, the hands camera is primarily used, although the body camera can be useful to validate the annotation.
Key | Label | Description |
---|---|---|
0 | Both | Both hands participate in the driving task. |
1 | Only Right | Only the right hand participates in the driving task. |
2 | Only Left | Only the left hand participates in the driving task. |
3 | None | No hand participates in the driving task. |
✔️ If the driver releases the steering wheel with one or both hands momentarily but their hands are positioned to perform the driving task, then it should be considered that his/her hand or hands are using the wheel.
In this level (Hand on Gear), it is necessary to know whether the driver has his/her hand on the gear.
To annotate this level, both the hands camera and the body camera can be useful.
Key | Label | Description |
---|---|---|
0 | Hand on gear | The driver has his/her hand on the gear lever. |
✔️ Put the label when the driver is either holding the gear lever or is changing the gear.
In this level (Objects in the scene), we want to know whether an object is visible in the scene (captured by any of the cameras).
To annotate this level, all three cameras should be observed carefully to check for the presence of the target objects.
Key | Label | Description |
---|---|---|
0 | Cellphone | A cellphone appears in the scene. |
1 | Hair Comb | A hair comb appears in the scene. |
2 | Bottle | A bottle appears in the scene. |
✔️ Put any of these labels only when the object is fully or partially visible in any of the three video streams.
In this level (Driver actions), we want to know which of the listed activities the driver is doing. There are 13 defined activities, each with specified characteristics. An additional label "unclassified" is available to annotate any other activity the driver might be doing that doesn't correspond to the defined actions. To annotate this level you may start from previously annotated data, in which case the goal is to correct any mistakes in the previous annotations.
To annotate this level, the body camera is the most useful stream to identify actions. However, the face camera and hands camera can support the annotation process.
Key | Label | Description |
---|---|---|
0 | Safe Driving | The driver is driving without doing any other activity. |
1 | Texting (Right) | Starts when the driver has just grabbed the phone and holds it at wheel level. Finishes when the driver intends to put the phone aside (starting the "reach side" action) and his/her hand is halfway between the wheel and the side space. |
2 | Phone call (Right) | Starts when the driver has just grabbed the phone and starts to raise it from wheel level to his/her head. Finishes when the driver puts down the phone and is about to start the "reach side" action, with his/her hand halfway between the wheel and the side space. If the driver holds the phone at wheel level for more than 15 frames, it should be labeled as Texting (Left or Right, depending on which hand is holding the cell phone). |
3 | Texting (Left) | Starts when the driver has just grabbed the phone and holds it at wheel level. Finishes when the driver intends to put the phone aside (starting the "reach side" action) and his/her hand is halfway between the wheel and the side space. |
4 | Phone call (Left) | Starts when the driver has just grabbed the phone and starts to raise it from wheel level to his/her head. Finishes when the driver puts down the phone and is about to start the "reach side" action, with his/her hand halfway between the wheel and the side space. If the driver holds the phone at wheel level for more than 15 frames, it should be labeled as Texting (Left or Right, depending on which hand is holding the cell phone). |
5 | Radio | Starts when the driver's hand is halfway between the wheel and the radio. Finishes when the driver's hand is halfway between the radio and the wheel. During the whole action the driver should be interacting with the infotainment system. |
6 | Drinking | Starts when the driver has just grabbed the bottle and starts to raise it to his/her head. Finishes when the driver puts down the bottle and is about to start the "reach side" action, with his/her hand halfway between the wheel and the side space. If the driver holds the bottle at wheel level (in the lap) for more than 15 frames, it should be labeled as unclassified. |
7 | Reach Side | Starts when the driver intends to turn right and grab/leave something from/to the side (cellphone, bottle or hair comb), and his/her hand is halfway between the wheel and the side space. Finishes when the driver returns to his/her position. This action should appear before and after the actions that involve interactions with objects. |
8 | Hair and Makeup | Starts when the driver has just grabbed the hair comb and starts moving the hand towards his/her head. Finishes when the driver has put down the hair comb and is about to start the "reach side" action to leave the hair comb aside, with his/her hand halfway between the wheel and the side space. If the driver holds the hair comb at wheel level for more than 15 frames, it should be labeled as unclassified. |
9 | Talking to passenger | Starts when the driver starts turning his/her head to look at and talk to the right passenger. Finishes when the driver turns his/her head back to look at the road (if the driver keeps talking while looking at the road, it should be considered "Safe Driving"). While driving, the driver may turn his/her head and talk to the passenger several times; each time should be considered a separate action. If the driver turns his/her head without talking and the movement lasts less than 15 frames, it should be considered "Safe Driving". |
/ | Reach Backseat | Starts when the driver starts turning his/her body to reach the backseat. Finishes when the driver returns to his/her normal position. |
* | Change Gear | Starts when the driver intends to change gear and his/her hand is halfway between the wheel and the gear. Finishes when the driver returns to his/her position and his/her hand is halfway between the gear and the wheel. If the driver leaves his/her hand on the gear, the action finishes when the arm movement has stopped. |
- | Stand Still-Waiting | Starts when the driver performs none of the above activities, has no hands on the wheel and is quiet or calm. Finishes when the driver starts performing any activity. |
+ | Unclassified | The driver is performing any other activity that doesn't correspond to those defined above (e.g. clapping at the beginning, holding an object). |
✔️ There are pre-annotations for this level; you can overwrite a frame's label if it is wrong.
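When reviewing pre-annotations for this level, some of the duration criteria can be checked against other levels. The sketch below flags "Talking to passenger" spans that last less than 15 frames and contain no Talking label in the Driver is Talking level; under the criterion above these are candidates for "Safe Driving". The function names and the per-frame list layout are assumptions made for illustration, and the result is only a list of spans for a human annotator to review.

```python
from itertools import groupby
from typing import List, Optional, Tuple

MIN_FRAMES = 15  # threshold stated in the Talking to passenger criterion


def to_intervals(labels: List[Optional[str]]) -> List[Tuple[Optional[str], int, int]]:
    """Run-length encode per-frame labels into (label, start, end), end exclusive."""
    intervals, start = [], 0
    for label, group in groupby(labels):
        length = len(list(group))
        intervals.append((label, start, start + length))
        start += length
    return intervals


def flag_short_passenger_talk(
    actions: List[Optional[str]], talking: List[Optional[str]]
) -> List[Tuple[int, int]]:
    """Return (start, end) spans labelled Talking to passenger that are shorter
    than MIN_FRAMES and have no label in the Driver is Talking level."""
    flagged = []
    for label, start, end in to_intervals(actions):
        if label != "Talking to passenger":
            continue
        silent = all(t is None for t in talking[start:end])
        if end - start < MIN_FRAMES and silent:
            flagged.append((start, end))
    return flagged
```

The same `to_intervals` helper can be reused to list the start and end frame of every action when checking that "Reach Side" appears before and after object-related actions.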