This is the repository for End-to-End Audiovisual Speech Recognition. Our paper can be found here.
The video-only stream is based on T. Stafylakis and G. Tzimiropoulos's implementation. The paper can be found here.
This implementation uses a 2-layer BGRU with 1024 cells per layer, whereas Themos's implementation uses a 2-layer BLSTM with 512 cells.
Please check for our lipreading models, which can easily achieve 85.5% on the LRW dataset.
- python 2.7
- pytorch 0.3.1
- opencv-python 3.4.0
The results were obtained with the proposed model on the LRW dataset. The suggested coordinates for cropping the mouth ROI are (x1, y1, x2, y2) = (80, 116, 175, 211) in Matlab, where indices are 1-based and inclusive. In Python, where indices are 0-based and the end of a slice is exclusive, the same fixed mouth ROI crop (FxHxW) is [:, 115:211, 79:175].
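The relation between the Matlab coordinates and the Python slice can be checked with a small sketch. This is only an illustration, not code from this repository: the helper names `matlab_box_to_slices` and `crop_frames` are hypothetical, and the 2-frame 256x256 dummy video is an assumption used to show the resulting shape.

```python
def matlab_box_to_slices(x1, y1, x2, y2):
    """Convert a 1-based, inclusive Matlab box to 0-based, half-open slices.

    Returns (rows, cols), so that the crop reads video[:, rows, cols].
    """
    rows = slice(y1 - 1, y2)  # y indexes the height axis (H)
    cols = slice(x1 - 1, x2)  # x indexes the width axis (W)
    return rows, cols


def crop_frames(frames, rows, cols):
    """Crop every frame of a nested-list video shaped F x H x W."""
    return [[line[cols] for line in frame[rows]] for frame in frames]


rows, cols = matlab_box_to_slices(80, 116, 175, 211)
print(rows, cols)  # slice(115, 211, None) slice(79, 175, None)

# Dummy video: 2 frames of 256x256 zeros (size assumed for illustration).
video = [[[0] * 256 for _ in range(256)] for _ in range(2)]
roi = crop_frames(video, rows, cols)
print(len(roi), len(roi[0]), len(roi[0][0]))  # 2 96 96
```

Both conventions describe the same 96x96 region. With a NumPy array the same slices can be applied directly as `video[:, rows, cols]`.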
This is the suggested order for training the models, which applies to the video-only, audio-only and audiovisual models:
i) Start by training with the temporal convolutional backend. You can run the script:
CUDA_VISIBLE_DEVICES='' python --path '' --dataset <dataset_path> \
--mode 'temporalConv' \
--batch_size 36 --lr 3e-4 \
--epochs 30
ii) Throw away the temporal convolutional backend, freeze the parameters of the frontend and the ResNet, and train the GRU backend. You can run the script:
CUDA_VISIBLE_DEVICES='' python --path './temporalConv/' --dataset <dataset_path> \
--mode 'backendGRU' --every-frame \
--batch_size 36 --lr 3e-4 \
--epochs 5
iii) Train the whole network end-to-end. You can run the script:
CUDA_VISIBLE_DEVICES='' python --path './backendGRU/' --dataset <dataset_path> \
--mode 'finetuneGRU' --every-frame \
--batch_size 36 --lr 3e-4 \
--epochs 30
is activated when the backend module is a recurrent neural network.
need to be correctly specified before running. The code makes strong assumptions about the dataset organisation.
are the models with the best validation performance in step ii) or step iii).
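The three stages above differ only in a few arguments, and each stage passes a `--path` named after the previous stage's mode. The sketch below makes that chaining explicit by building the argument list for each stage. It is an illustration under assumptions: `build_stage_args` and `STAGES` are hypothetical names, `/data/lrw` is a placeholder dataset path, the script name is deliberately left out, and the code assumes Python 3 for `shlex.quote`. Whether the training script itself writes its checkpoints to those directories is not confirmed here.

```python
import shlex

# Stage settings copied from the three commands above.
STAGES = [
    {"mode": "temporalConv", "every_frame": False, "epochs": 30},
    {"mode": "backendGRU", "every_frame": True, "epochs": 5},
    {"mode": "finetuneGRU", "every_frame": True, "epochs": 30},
]


def build_stage_args(dataset_path, batch_size=36, lr="3e-4"):
    """Return one argument list per stage, chaining --path between stages."""
    commands = []
    prev_path = ""  # stage i) is given an empty --path
    for stage in STAGES:
        args = ["--path", prev_path, "--dataset", dataset_path,
                "--mode", stage["mode"]]
        if stage["every_frame"]:
            args.append("--every-frame")
        args += ["--batch_size", str(batch_size), "--lr", lr,
                 "--epochs", str(stage["epochs"])]
        commands.append(args)
        # The next stage points --path at a directory named after this mode.
        prev_path = "./{}/".format(stage["mode"])
    return commands


for args in build_stage_args("/data/lrw"):  # placeholder dataset path
    print(" ".join(shlex.quote(a) for a in args))
```

Each printed line would follow the script name in the corresponding command. Keeping the stage table in one place makes it harder to forget `--every-frame` for the recurrent stages or to point `--path` at the wrong stage.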
| Stream | Accuracy (%) |
|---|---|
| video-only | 83.39 |
| audio-only | 97.72 |
| audiovisual | 98.38 |
The results are slightly better than the ones reported in the ICASSP paper due to further fine-tuning of the models. To request the pre-trained models, please send an email to pingchuan.ma16 <AT> with your name and affiliation.
If the code of this repository was useful for your research, please cite our work:
title={End-to-end audiovisual speech recognition},
author={Petridis, Stavros and Stafylakis, Themos and Ma, Pingchuan and Cai, Feipeng and Tzimiropoulos, Georgios and Pantic, Maja},