ACMMM '22: Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging
This is the official PyTorch implementation for the paper Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging.
The main paper of this work was published at ACM Multimedia 2022 as a full paper [acmdl][arxiv]. Please refer to our supplementary material for more details about this work.
We release low-resolution and mid-resolution versions of SF2F. Our baseline model, voice2face (paper), which generates low-resolution images by default, is also released. All released implementations are designed for and evaluated on the HQ-VoxCeleb dataset.
| Model | Output Resolution | VGGFace Score |
|---|---|---|
| voice2face | 64 | 15.47 |
| SF2F (no fuser) | 64 | 18.59 |
| SF2F | 64 | 19.49 |
| SF2F (no fuser) | 128 | 19.31 |
| SF2F | 128 | 20.10 |
Instructions on the training and testing of the above models are provided in GETTING_STARTED.

To give users of this repo a better understanding of our implementation, we introduce the key modules below.
Voice Encoders. The baseline voice encoder from voice2face is implemented as `V2F1DCNN` in models/voice_encoders.py. As mentioned in our main paper, we designed and implemented `Inception1DBlock` to improve the performance of the voice encoder. When the parameter `inception_mode` is set to `True`, `V2F1DCNN` is automatically built with `Inception1DBlock`, which yields our proposed 1D-Inception-based voice encoder. (Jump to code)
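For intuition, here is a minimal, hypothetical sketch of a 1D inception-style block; the actual `Inception1DBlock` in models/voice_encoders.py may differ in branch design, channel widths, and normalization.

```python
import torch
import torch.nn as nn

class Inception1DBlockSketch(nn.Module):
    """Hypothetical sketch of a 1D inception-style block: parallel Conv1d
    branches with different kernel sizes, concatenated along the channel
    dimension. Branch widths and normalization are illustrative assumptions."""
    def __init__(self, in_channels, branch_channels=64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_channels, branch_channels, kernel_size=k,
                          padding=k // 2),  # odd kernels keep length
                nn.BatchNorm1d(branch_channels),
                nn.ReLU(inplace=True),
            )
            for k in (1, 3, 5, 7)  # multiple temporal receptive fields
        ])

    def forward(self, x):  # x: (batch, in_channels, time)
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Example: a mel-spectrogram batch of shape (B, n_mels, T)
block = Inception1DBlockSketch(in_channels=40)
out = block(torch.randn(8, 40, 100))  # -> (8, 256, 100)
```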
Face Decoders. The baseline face decoder is implemented as `V2FDecoder` in models/face_decoders.py. Our enhanced face decoder is implemented as `FaceGanDecoder` in the same file. (Jump to code)
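As a rough illustration only, decoders of this kind typically upsample a voice embedding into an image with transposed convolutions; the sketch below is a hypothetical simplification, not the exact `V2FDecoder` or `FaceGanDecoder` architecture.

```python
import torch
import torch.nn as nn

class FaceDecoderSketch(nn.Module):
    """Hypothetical sketch: project a voice embedding to a 4x4 feature map,
    then repeatedly upsample with ConvTranspose2d to a 64x64 RGB face."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.project = nn.Linear(embed_dim, 512 * 4 * 4)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1),  # 8x8
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),  # 16x16
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),   # 32x32
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),     # 64x64
            nn.Tanh(),  # image values in [-1, 1]
        )

    def forward(self, z):  # z: (batch, embed_dim)
        x = self.project(z).view(-1, 512, 4, 4)
        return self.up(x)

faces = FaceDecoderSketch()(torch.randn(8, 512))  # -> (8, 3, 64, 64)
```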
Embedding Fuser. Our proposed attention fuser is implemented as `AttentionFuserV1` in models/fusers.py. A graphical demonstration of the embedding fuser is shown below. (Jump to code)
Generative Models. All generative models in this repo are implemented as `EncoderDecoder` in models/encoder_decoder.py. The encoder, decoder, and fuser are initialized as attributes of the `EncoderDecoder` class. (Jump to code)
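The sketch below shows one plausible way these components compose in a forward pass; it is a simplified illustration under assumed input shapes, not the actual `EncoderDecoder` interface, which is built from a config.

```python
import torch
import torch.nn as nn

class EncoderDecoderSketch(nn.Module):
    """Hypothetical sketch of the composition: encode each voice segment,
    fuse the per-segment embeddings (or average them when no fuser is
    given), then decode the fused embedding into a face image."""
    def __init__(self, encoder, decoder, fuser=None):
        super().__init__()
        self.encoder, self.decoder, self.fuser = encoder, decoder, fuser

    def forward(self, mel_segments):  # (batch, num_segments, n_mels, time)
        b, s = mel_segments.shape[:2]
        # Encode all segments at once: (B * S, n_mels, T) -> (B * S, D)
        embeds = self.encoder(mel_segments.flatten(0, 1)).view(b, s, -1)
        fused = self.fuser(embeds) if self.fuser else embeds.mean(dim=1)
        return self.decoder(fused)  # (B, 3, H, W)
```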
FaceNet Perceptual Loss. The FaceNet perceptual loss is implemented as `FaceNetLoss` in models/perceptual.py. (Jump to code)
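For orientation, here is a simplified sketch of a FaceNet-based loss using the third-party facenet-pytorch package; it compares only the final 512-d embeddings, whereas the repo's `FaceNetLoss` may also use intermediate activations and different preprocessing.

```python
import torch
import torch.nn.functional as F
from facenet_pytorch import InceptionResnetV1  # pip install facenet-pytorch

class FaceNetEmbeddingLossSketch(torch.nn.Module):
    """Simplified sketch: penalize the distance between FaceNet embeddings
    of generated and ground-truth faces, with the recognizer frozen."""
    def __init__(self):
        super().__init__()
        self.facenet = InceptionResnetV1(pretrained='vggface2').eval()
        for p in self.facenet.parameters():
            p.requires_grad_(False)  # frozen feature extractor

    def forward(self, fake_faces, real_faces):
        # InceptionResnetV1 expects ~160x160 inputs normalized to [-1, 1]
        fake = F.interpolate(fake_faces, size=160, mode='bilinear',
                             align_corners=False)
        real = F.interpolate(real_faces, size=160, mode='bilinear',
                             align_corners=False)
        return F.l1_loss(self.facenet(fake), self.facenet(real))
```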
VGGFace Score. The VGGFace score is implemented in scripts/compute_vggface_score.py. (Jump to code)
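As a rough guide, an inception-score-style quantity over a face classifier's softmax outputs can be computed as sketched below; this assumes `probs` holds per-image identity predictions from a VGGFace classifier, and the repo's script may differ in splits and preprocessing.

```python
import numpy as np

def vggface_score_sketch(probs, num_splits=10, eps=1e-12):
    """Hypothetical sketch of an inception-score-style metric: given
    per-image identity softmax outputs `probs` of shape (N, num_classes),
    compute exp(E_x[KL(p(y|x) || p(y))]) per split and average."""
    scores = []
    for chunk in np.array_split(probs, num_splits):
        p_y = chunk.mean(axis=0, keepdims=True)  # marginal p(y) in the split
        kl = (chunk * (np.log(chunk + eps) - np.log(p_y + eps))).sum(axis=1)
        scores.append(np.exp(kl.mean()))
    return float(np.mean(scores)), float(np.std(scores))
```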
Retrieval Metrics. Retrieval metrics are implemented in utils/s2f_evaluator.py. (Jump to code)
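For illustration, the sketch below computes a recall@k-style retrieval metric by cosine similarity; it assumes query and gallery embeddings are index-aligned by identity, which may not match the exact protocol in utils/s2f_evaluator.py.

```python
import torch

def recall_at_k_sketch(query_embeds, gallery_embeds, k=10):
    """Hypothetical sketch of a retrieval metric: for each generated-face
    embedding, rank all ground-truth gallery embeddings by cosine
    similarity and count how often the matching identity (assumed to sit
    at the same index) appears in the top k."""
    q = torch.nn.functional.normalize(query_embeds, dim=1)
    g = torch.nn.functional.normalize(gallery_embeds, dim=1)
    sims = q @ g.t()                                # (N, N) cosine sims
    topk = sims.topk(k, dim=1).indices              # top-k gallery indices
    targets = torch.arange(q.size(0)).unsqueeze(1)  # true match index
    return (topk == targets).any(dim=1).float().mean().item()

# Example with random embeddings
r10 = recall_at_k_sketch(torch.randn(100, 512), torch.randn(100, 512), k=10)
```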
To learn about environment setup, data preparation, launch of training, visualization, and evaluation, please refer to GETTING_STARTED.
If you find this project useful in your research, please consider citing:
@inproceedings{bai2022speech,
title={Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging},
author={Bai, Yeqi and Ma, Tao and Wang, Lipo and Zhang, Zhenjie},
booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
pages={2042--2050},
year={2022}
}