/ "OpNet: On the power of data augmentation for head pose estimation"
If you are looking for the code for the publication please note the paper
branch,
which is a special tailored snapshot for the publication.
This repository contains the code to train the neural nets for the NeuralNet tracker plugin of Opentrack. It allows head tracking with a simple webcam.
The tracker plugin is based on deep learning, i.e. neural network models optimized using data to perform their tasks. There are two parts: A localizer network, and the actual pose estimation network. The localizer tries to find a single face and generates a bounding box around it from where a crop is extracted for the pose network to analyze.
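As a rough illustration of how the two stages fit together at inference time, here is a hedged sketch using onnxruntime. The file names, pre-processing (input size, value range) and output layouts are placeholders for illustration only; the OpenTrack plugin and the export script define the real conventions.

```python
# Minimal sketch of chaining localizer and pose estimator with onnxruntime.
# File names, pre-processing and output layouts are assumptions.
import cv2
import numpy as np
import onnxruntime as ort

def infer(session, gray_image):
    """Resize a grayscale image to the network input size and run one inference."""
    inp = session.get_inputs()[0]
    h, w = inp.shape[-2], inp.shape[-1]                 # assumes a fixed NCHW input
    x = cv2.resize(gray_image, (w, h)).astype(np.float32) / 255.0
    return session.run(None, {inp.name: x[None, None]})

localizer = ort.InferenceSession("localizer.onnx")       # placeholder file names
posenet = ort.InferenceSession("pose-estimator.onnx")

frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
box_outputs = infer(localizer, frame)                    # decode these into a face bounding box
x0, y0, x1, y1 = 100, 100, 228, 228                      # dummy crop coordinates for illustration
pose_outputs = infer(posenet, frame[y0:y1, x0:x1])       # rotation (quaternion), box, etc.
```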
The following outlines the steps to reproduce the networks delivered with OpenTrack, including training and evaluation. The instructions currently focus on the pose estimator; a section on the localizer follows at the end.
Set up a Python environment with a recent PyTorch. Tested with Python 3.11 and PyTorch 2.3.0. Using Python Anaconda:
# Create and activate python environment
conda create -p <path> python=3.11
conda activate <path>
# Install dependencies
conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia
conda install -c conda-forge numpy scipy opencv kornia matplotlib tqdm h5py onnx onnxruntime strenum tabulate
# Install `trackertraincode` from this repo in developer mode (as symlink)
cd <this repo>
pip install -e .
This should support training and evaluation. To generate the datasets you also need `pytorch-minimize`, `facenet-pytorch`, `scikit-learn`, `trimesh` and `pyrender`.
At minimum, set the `DATADIR` variable:
export DATADIR=<path to preprocessing outputs>
export NUM_WORKERS=<number of cpu cores> # For the data loader
Download AFLW2000-3D from http://www.cbsr.ia.ac.cn/users/xiangyuzhu/projects/3DDFA/main.htm.
Biwi can be obtained from Kaggle https://www.kaggle.com/datasets/kmader/biwi-kinect-head-pose-database. I couldn't find a better source that is still accessible.
Download a pytorch model checkpoint.
- Baseline Ensemble: https://drive.google.com/file/d/19LrssD36COWzKDp7akxtJFeVTcltNVlR/view?usp=sharing
- Additionally trained on Face Synthetics (BL+FS): https://drive.google.com/file/d/19zN8KICVEbLnGFGB5KkKuWrPjfet-jC8/view?usp=sharing
- Labeling Ensemble (RA-300W-LP from Table 3): https://drive.google.com/file/d/13LSi6J4zWSJnEzEXwZxr5UkWndFXdjcb/view?usp=sharing
Run `scripts/AFLW20003dEvaluation.ipynb`. It should give results close to those in the paper. The face crop selection is different though, so the result won't be exactly the same.
Run the preprocessing and then the evaluation script.
# Preprocess the data. The output filename "aflw2k.h5" must match the hardcoded value in "pipelines.py"
python scripts/dsaflw2k_processing.py <path to>/AFLW2000-3D.zip $DATADIR/aflw2k.h5
# Will look in $DATADIR for aflw2k.h5.
python scripts/evaluate_pose_network.py --ds aflw2k3d <path to model(.onnx|.ckpt)>
The script supports ONNX conversions as well as PyTorch checkpoints. For PyTorch checkpoints, the script must be adapted to the concrete model configuration. If you wish to process the outputs further, e.g. for averaging as in the paper, there is an option to generate JSON files.
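If you go that route, the post-processing can be as simple as the following sketch. The schema is a guess made for illustration (a flat mapping of metric names to numbers per file); the actual layout is defined by `evaluate_pose_network.py`.

```python
# Hypothetical averaging of per-model results dumped as JSON by the evaluation
# script. The flat name -> value schema is an assumption.
import json
import sys
import numpy as np

results = [json.load(open(fn)) for fn in sys.argv[1:]]   # one JSON file per model
for name in results[0]:
    values = [r[name] for r in results]
    print(f"{name}: mean {np.mean(values):.2f}, std {np.std(values):.2f}")
```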
Evaluation on the Biwi benchmark works similarly. However, we use the annotations file from https://github.com/pcr-upm/opal23_headpose in order to adhere to the experimental protocol. It can be found under https://github.com/pcr-upm/opal23_headpose/blob/main/annotations/biwi_ann.txt.
# Preprocess the data.
python scripts/dsprocess_biwi.py --opal-annotation <path to>/biwi_ann.txt <path to>/biwi.zip $DATADIR/biwi-v3.h5
# Will look in $DATADIR for biwi-v3.h5.
python scripts/evaluate_pose_network.py --ds biwi --roi-expansion 0.8 --perspective-correction <path to model(.onnx|.ckpt)>
You want `--perspective-correction` for SOTA results. It corrects the orientation obtained from the face crop for camera perspective: with the Kinect's field of view, the assumption of orthographic projection no longer holds. That is, the pose from the crop is transformed into the global coordinate frame, and it is compared with the original labels with respect to this frame. Without the correction, the pose from the crop is compared with the labels directly.
Setting `--roi-expansion 0.8` makes the cropped area smaller relative to the bounding box annotation. That is also necessary for good results because the annotations have much larger bounding boxes than the networks were trained with.
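To illustrate the idea behind the correction (a sketch of the concept, not the repository's implementation): the rotation estimated from the crop is composed with the rotation that maps the camera's optical axis onto the viewing ray through the crop center. The intrinsics below (focal length, principal point) are made-up example values.

```python
# Conceptual sketch of perspective correction for a pose estimated from a crop.
import numpy as np
from scipy.spatial.transform import Rotation

def correct_for_perspective(q_crop, crop_center_xy, focal_length_px, principal_point_xy):
    """q_crop: rotation estimated from the crop (scipy Rotation).
    Returns the rotation expressed in the camera/world frame."""
    cx, cy = crop_center_xy
    px, py = principal_point_xy
    # Viewing ray through the crop center, in the X-right / Y-down / Z-forward convention.
    ray = np.array([cx - px, cy - py, focal_length_px])
    ray /= np.linalg.norm(ray)
    # Rotation that takes the optical axis (0, 0, 1) onto that ray.
    axis = np.cross([0.0, 0.0, 1.0], ray)
    angle = np.arccos(np.clip(ray[2], -1.0, 1.0))
    n = np.linalg.norm(axis)
    r_correction = Rotation.from_rotvec(axis / n * angle) if n > 1e-9 else Rotation.identity()
    return r_correction * q_crop

# Example: a face near the image corner of a wide-FOV camera (made-up intrinsics).
q = correct_for_perspective(Rotation.identity(), (500, 400), 520.0, (320, 240))
print(q.as_euler('YXZ', degrees=True))
```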
Choose the "Neuralnet" tracker plugin. It currently comes with some older models which don't achieve the same SOTA benchmark results but are a little bit more noise resistent and invariant to eye movements.
Rough guidelines for reproduction follow.
There should be download links for 300W-LP.zip
and AFLW2000-3D.zip
on http://www.cbsr.ia.ac.cn/users/xiangyuzhu/projects/3DDFA/main.htm.
My version of 300W-LP with custom out-of-plane rotation augmentation applied. Includes "closed-eyes" augmentation as well as directional illumination. On Google Drive https://drive.google.com/file/d/1uEqba5JCGQMzrULnPHxf4EJa04z_yHWw/view?usp=drive_link.
My pseudo / semi-automatically labeled subset of the Megaface frames from LaPa. On Google Drive https://drive.google.com/file/d/1K4CQ8QqAVXj3Cd-yUt3HU9Z8o8gDmSEV/view?usp=drive_link.
My pseudo / semi-automatically labeled subset of WFLW. On Google Drive https://drive.google.com/file/d/1SY33foUF8oZP8RUsFmcEIjq5xF5m3oJ1/view?usp=drive_link.
There should be a download link on https://github.com/microsoft/FaceSynthetics for the 100k-sample variant `dataset_100000.zip`.
python scripts/dsprocess_aflw2k.py AFLW2000-3D.zip $DATADIR/aflw2k.h5
# Optional, for training on the original 300W-LP:
python scripts/dsprocess_300wlp.py --reconstruct-head-bbox 300W-LP.zip $DATADIR/300wlp.h5
# Face Synthetics
python scripts/dsprocess_synface.py dataset_100000.zip $DATADIR/microsoft_synface_100000-v1.1.h5
# Custom datasets
unzip lapa-megaface-augmented-v2.zip -d ../$DATADIR/
unzip wflw_augmented_v4.zip -d ../$DATADIR/
unzip reproduction_300wlp-v12.zip -d ../$DATADIR/
The processed files can be inspected in the notebook `DataVisualization.ipynb`.
Now training should be possible. For the baseline it should be:
python scripts/train_poseestimator.py --lr 1.e-3 --epochs 1500 --ds "repro_300_wlp+lapa_megaface_lp:20000+wflw_lp" \
--save-plot train.pdf \
--with-swa \
--with-nll-loss \
--backbone mobilenetv1 \
--no-onnx \
--roi-override original \
--no-blurpool \
--outdir <output folder>
It will look at the environment variable `DATADIR` to find the datasets. Notable flags and settings are the following:
--with-nll-loss # Enables NLL losses
--backbone resnet18 # Use ResNet backbone
--no-blurpool # Disable Blur Pool (use learnable strided convolutions instead).
--no-imgaug # Disable image intensity augmentation
--no-pointhead # Disable landmark predictions
--raug <value in degrees> # Set the maximum angle of in-plane rotation augmentation. Zero disables it.
--ds "300wlp" # Train on original 300W-LP
--ds "300wlp:92+synface:8" # Train on Face Synthetics + 300W-LP
--ds "repro_300_wlp_woextra" # Train on the 300W-LP reproduction without the eye + illumination variations. (Needs unpublished dataset :-/)
--ds "repro_300_wlp" # Train only on the 300W-LP reproduction
--ds "repro_300_wlp+lapa_megaface_lp+wflw_lp+synface" # Train the "BL + FS" case which should give best performing models.
I use ONNX for deployment and for most evaluation purposes. There is a script for the conversion. WARNING: its code must be adapted to the model configuration. :-/ It is easy though: only the one statement where the model is instantiated needs to be changed. The script has two modes. For exports for OpenTrack use
python scripts/export_model.py --posenet <model.ckpt>
It omits the landmark predictions and renames the output tensors (for historical reasons). The script performs sanity checks to ensure the outputs from ONNX are almost equal to PyTorch outputs.
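The check is of the following kind (a generic sketch, not the script's actual code): run the PyTorch model and the exported ONNX model on the same input and compare the outputs. The input shape and the assumption that the model returns a tuple of tensors are placeholders.

```python
# Generic export sanity check (sketch): ONNX Runtime vs. PyTorch on one random input.
import numpy as np
import onnxruntime as ort
import torch

def check_export(torch_model: torch.nn.Module, onnx_path: str, input_shape=(1, 1, 129, 129)):
    torch_model.eval()
    x = torch.randn(*input_shape)                       # input shape is a placeholder
    with torch.no_grad():
        reference = [t.numpy() for t in torch_model(x)]  # assumes a tuple of output tensors
    sess = ort.InferenceSession(onnx_path)
    onnx_outputs = sess.run(None, {sess.get_inputs()[0].name: x.numpy()})
    for ref, out in zip(reference, onnx_outputs):
        assert np.allclose(ref, out, atol=1.e-4), "ONNX output deviates from PyTorch"
```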
To use the model in OpenTrack, find the directory with the other .onnx
models and copy the new one there. Then in OpenTrack, in the tracker settings, there is a button to select the model file.
For evaluation use
python scripts/export_model.py --full --posenet <model.ckpt>
The model created in this way includes all outputs.
- Preprocess LaPa and Megaface with the scripts in `scripts/`.
- Download the pseudo-labeling ensemble.
- Generate the pseudo labels.
- Find the github repository face-3d-rotation-augmentation. Install the package in it with pip.
- Use the notebooks (in this repo) `scripts/DsLapaMegafaceFitFaceModel.ipynb`, `scripts/DsLapaMegafaceLargePoseCreation.ipynb`, `scripts/DsWflwFitFaceModel.ipynb` and `scripts/DsWflwLargePoseCreation.ipynb`.
It's a right-handed system. Seen from the front, X is right, Y is down and Z points into the screen. This coordinate system is used for world space and screen space, and also as the local coordinate system of the head, although the directions as described apply of course only at zero rotation.
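A quick sanity check of this convention: with X to the right, Y down and Z into the screen, the system is right-handed, so X × Y must equal Z.

```python
# The X-right / Y-down / Z-into-screen convention is right-handed: cross(X, Y) == Z.
import numpy as np

X = np.array([1.0, 0.0, 0.0])   # right
Y = np.array([0.0, 1.0, 0.0])   # down
Z = np.array([0.0, 0.0, 1.0])   # into the screen
assert np.allclose(np.cross(X, Y), Z)
```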
Labels are stored in an HDF5 format. Input images may be stored in separate files or integrated in the same file. Here is a dump of a file with images included, where N is the number of samples:
/coords shape=(N, 3), dtype=float32
ATTRIBUTE category: "xys" (str)
/images shape=(N,), dtype=object
ATTRIBUTE category: "img" (str)
ATTRIBUTE lossy: "True" (bool_)
ATTRIBUTE storage: "varsize_image_buffer" (str)
/pt3d_68 shape=(N, 68, 3), dtype=float32
ATTRIBUTE category: "pts" (str)
/quats shape=(N, 4), dtype=float32
ATTRIBUTE category: "q" (str)
/rois shape=(N, 4), dtype=float32
ATTRIBUTE category: "roi" (str)
/shapeparams shape=(N, 50), dtype=float16
ATTRIBUTE category: "" (str)
As you can see, the top level has several HDF5 Datasets (DS) with label data. `images` is the DS with the images. The DS have attributes with metadata. The `category` attribute indicates the kind of information stored in the DS. The `images` DS has a `storage` attribute which tells whether the images are stored inline or externally. `varsize_image_buffer` means that the data type is a variable-sized buffer which holds the encoded image. When `lossy` is true, the images are encoded as JPG, else as PNG. When `storage` is set to `image_filename`, the DS contains relative paths to external files. The other fields are label data and should be relatively self-explanatory.
Relevant code for reading and writing those files can be found in `trackertraincode/datasets/dshdf5.py`, `trackertraincode/datasets/dshdf5pose.py` and the preprocessing scripts `scripts/dsprocess_*.py`.
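For a quick look at such a file without the repository's loaders, a minimal h5py sketch could look like the following. Field names are taken from the dump above; the ROI unpacking as x0, y0, x1, y1 is an assumption.

```python
# Minimal inspection of a processed file with h5py (see dshdf5.py for the real code).
import cv2
import h5py
import numpy as np

with h5py.File("aflw2k.h5", "r") as f:                      # any processed file in $DATADIR
    print({name: ds.attrs.get("category", "") for name, ds in f.items()})
    quat = f["quats"][0]                                     # rotation label of the first sample
    x0, y0, x1, y1 = f["rois"][0]                            # assumed box layout: x0, y0, x1, y1
    if f["images"].attrs["storage"] == "varsize_image_buffer":
        buffer = np.frombuffer(f["images"][0], dtype=np.uint8)
        image = cv2.imdecode(buffer, cv2.IMREAD_COLOR)       # JPG or PNG depending on 'lossy'
        print(image.shape, quat, (x0, y0, x1, y1))
```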
There is an old notebook to train the localizer network.
The training data is a processed version of the Wider Face dataset. The processing accounts for the fact that Wider Face contains images with potentially many faces. Therefore, sections which contain only one face or none are extracted.
The localizer network is trained to generate a "heatmap" with a peak where it suspects the center of a face. In addition, the parameters of a bounding box are output.
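A hypothetical decoding sketch for outputs of this kind (the actual output format is defined by the notebook and the model): take the heatmap argmax as the face center and accept it only above a score threshold.

```python
# Hypothetical decoding of a heatmap + box-parameter output pair.
import numpy as np

def decode_localizer(heatmap: np.ndarray, box_params: np.ndarray, threshold: float = 0.5):
    """heatmap: (H, W) scores; box_params: whatever encodes the bounding box."""
    iy, ix = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    if heatmap[iy, ix] < threshold:
        return None                                   # no face found
    return (ix, iy), box_params                       # peak position and box parameters
```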