Speech Recognition

Speech Recognition

Automatic speech recognition (ASR) is a key technology for making communication between people, and between people and machines, smoother.

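Research Methods

Historically, large vocabulary continuous speech recognition (LVCSR) was built on Gaussian mixture models (GMMs) and hidden Markov models (HMMs). In this traditional GMM-HMM combination, the HMM models how speech varies over time, and the GMM scores how well each HMM state fits the observed acoustic features.

Deep neural networks (DNNs) were then used together with HMMs, context-dependent phone models, n-gram language models, and Viterbi search algorithms.

Today, speech recognition engines are commonly built with this hybrid approach. Hybrid models are fairly complex: they need an elaborate training recipe and considerable expert knowledge to build.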
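
The pieces named above meet in decoding: the HMM supplies transition probabilities, the acoustic model (GMM or DNN) supplies per-frame state likelihoods, and Viterbi search picks the most likely state path. The following is a minimal Python sketch under stated assumptions: the two-state model and every probability in it are made-up illustration values, not parameters of any real recognizer.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely HMM state path for one utterance.

    log_init[s]     : log P(state s at frame 0)
    log_trans[p][s] : log P(next state s | current state p)
    log_emit[t][s]  : log-likelihood of frame t under state s
                      (the score a GMM or DNN acoustic model would provide)
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor of state s at frame t.
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def to_log(table):
    """Convert a nested list of probabilities to log-probabilities."""
    return [[math.log(p) for p in row] for row in table]


# Hypothetical two-state model and four frames of acoustic scores.
init = [math.log(0.9), math.log(0.1)]
trans = to_log([[0.6, 0.4], [0.1, 0.9]])
emit = to_log([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
print(viterbi(init, trans, emit))  # [0, 0, 1, 1]
```

Working in log-probabilities avoids numerical underflow on long utterances, which is why the sketch adds scores instead of multiplying probabilities.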

Survey of traditional methods

S. Karpagavalli and E. Chandra. "A Review on Automatic Speech Recognition Architecture and Approaches." International Journal of Signal Processing, Image Processing and Pattern Recognition 9, No. 4 (2016): 393-404.

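Audio File Preprocessing

SoX is used to perturb the raw speech recordings, for example by adding noise or reverberation, which increases the number of samples available for training.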

SOX SOX_CODE
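
The noise half of this augmentation can be illustrated without SoX. The sketch below is plain Python and not a SoX invocation; it assumes audio is already loaded as a list of float samples and uses white Gaussian noise. The function names and the SNR values are hypothetical choices for illustration.

```python
import math
import random


def add_noise(samples, snr_db, seed=0):
    """Return a copy of `samples` with white Gaussian noise at `snr_db` dB SNR."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in samples) / len(samples)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in samples]


def augment(samples, snrs_db=(20, 10, 5)):
    """Return the clean signal plus one noisy copy per requested SNR."""
    copies = [list(samples)]
    for i, snr_db in enumerate(snrs_db):
        copies.append(add_noise(samples, snr_db, seed=i))
    return copies


# One second of a 440 Hz tone at 16 kHz stands in for a recording.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
training_set = augment(tone)
print(len(training_set))  # 4
```

Each clean recording becomes several training samples, which is the effect the SoX step is after; reverberation would need a convolution with a room impulse response and is left out here.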

ffmpeg is used to convert recordings in various formats, such as *.m4a, *.wav, *.mp3, *.mp4, and *.3gpp, into a single *.wav format so that all data can be processed uniformly.

ffmpeg FFMPEG
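
A batch conversion can be driven from Python by building one ffmpeg command per file. This is a sketch, not the project's actual script: the 16 kHz mono target and the helper name `ffmpeg_to_wav_command` are assumptions, while `-y`, `-i`, `-ar`, and `-ac` are standard ffmpeg options (overwrite output, input file, audio sample rate, audio channel count).

```python
from pathlib import Path


def ffmpeg_to_wav_command(src, dst_dir, sample_rate=16000, channels=1):
    """Build the ffmpeg argument list that converts `src` to a WAV file in `dst_dir`.

    The sample rate and channel count are assumed defaults, not values
    prescribed by the original text.
    """
    src = Path(src)
    dst = Path(dst_dir) / (src.stem + ".wav")
    return [
        "ffmpeg", "-y",             # overwrite an existing output file
        "-i", str(src),             # input recording, any supported format
        "-ar", str(sample_rate),    # output sample rate in Hz
        "-ac", str(channels),       # number of output channels
        str(dst),
    ]


for name in ["talk.m4a", "memo.mp3", "clip.3gpp"]:
    print(" ".join(ffmpeg_to_wav_command(name, "wav")))
```

To actually run a conversion, pass the returned list to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed. Writing into a separate output directory keeps an input that is already a .wav from being overwritten by its own conversion.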

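Deep Learning Methods

Deep learning replaces complex, multi-component machine learning pipelines with a single general-purpose neural network. The network is trained by optimizing a differentiable loss (cost) function. This approach, also called the "pure" DNN approach, has been highly successful in speech recognition.

Given a large amount of training data and sufficient computing resources, a high-quality large vocabulary continuous speech recognition (LVCSR) system can be built.

Neural Network Building Blocks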

LSTM

Long short-term memory, Neural Computation 9 (8), 1735-1780, 1997. LSTM

CTC

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, ICML 2006.
Sequence Modeling With CTC, https://distill.pub/2017/ctc/.
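
The core idea of CTC can be shown with its collapsing rule: the network emits one label per frame, including a special blank, and the final text is obtained by merging repeated labels and then removing blanks. The sketch below uses greedy (best-path) decoding, which is a common approximation and not guaranteed to find the most probable transcription; the `_` blank symbol and the toy probabilities are assumptions for illustration.

```python
from itertools import groupby

BLANK = "_"  # placeholder for the CTC blank label (an assumption of this sketch)


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge runs of repeated labels, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


def ctc_greedy_decode(frame_probs, alphabet, blank=BLANK):
    """Take the most probable label in each frame, then collapse the sequence."""
    best = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    return ctc_collapse(best, blank)


print(ctc_collapse("hh_e_ll_lo"))  # hello

# Four frames over the toy alphabet {blank, a, b}.
probs = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.2, 0.7],
]
print(ctc_greedy_decode(probs, [BLANK, "a", "b"]))  # ab
```

The blank between the two `l` runs is what lets CTC output a doubled letter; without it, `ll` would merge into a single `l`.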

GRU

On the Properties of Neural Machine Translation Encoder-Decoder Approaches, SSST-8, 2014.
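
For orientation, one time step of a GRU can be written in a few lines. The sketch follows the commonly cited formulation with an update gate z and a reset gate r, without bias terms; the parameter names (`Wz`, `Uz`, and so on) and the dictionary layout are choices made here for illustration, not an API of any library.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gru_step(x, h, p):
    """Compute the next hidden state from input `x` and previous state `h`.

    p holds the weight matrices: Wz, Wr, W act on x; Uz, Ur, U act on h.
    """
    # Update gate: how much of the old state to keep.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    # Reset gate: how much of the old state feeds the candidate.
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    candidate = [math.tanh(a + b)
                 for a, b in zip(matvec(p["W"], x), matvec(p["U"], reset_h))]
    # Interpolate between the old state and the candidate state.
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h, candidate)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# With all-zero weights both gates equal 0.5 and the candidate is 0,
# so the new state is exactly half of the old one.
params = {"Wz": zeros(2, 3), "Wr": zeros(2, 3), "W": zeros(2, 3),
          "Uz": zeros(2, 2), "Ur": zeros(2, 2), "U": zeros(2, 2)}
print(gru_step([1.0, 2.0, 3.0], [1.0, -2.0], params))  # [0.5, -1.0]
```

A GRU has two gates and no separate cell state, whereas the commonly used LSTM variant has three gates plus a cell state, so a GRU step needs fewer matrix products.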

Alex Graves

Alex Graves is a research scientist at Google DeepMind and a pioneer of several speech recognition techniques.

Research Progress

Google

Google Speech

Google AI Blog: An All-Neural On-Device Speech Recognizer, https://ai.googleblog.com/2019/03/an-all-neural-on-device-speech.html.
Google Speech Processing from Mobile to Farfield, CHiME 2016. Google_Speech_Processing
Tara N. Sainath et al., "Multichannel Signal Processing with Deep Neural Networks for Automatic Speech Recognition." IEEE/ACM Transactions on Audio, Speech, and Language Processing (2017).
Rubén Zazo Candil, Tara N. Sainath, Gabor Simko, and Carolina Parada. "Feature Learning with Raw-Waveform CLDNNs for Voice Activity Detection." InterSpeech 2016.
Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Haşim Sak. "Convolutional, long short-term memory, fully connected deep neural networks." ICASSP 2015.
William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition." ICASSP 2016.
Context dependent phone models for LSTM RNN acoustic modelling, ICASSP 2015.
Learning the Speech Front-end With Raw Waveform CLDNNs, InterSpeech 2015.

Baidu

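An overview of Baidu's DeepSpeech2:

(1) The input is a spectrogram. For English, the output alphabet is {a, b, c, ..., z, space, apostrophe, blank}; for Chinese, the output set covers the Roman alphabet plus 6000 Chinese characters.

(2) Batch Normalization (BatchNorm) is used to speed up convergence.

(3) SortaGrad is used to keep CTC training stable (shorter utterances are trained first).

(4) GRUs are used: GRU and LSTM accuracy is about the same, but GRUs are faster to compute.

(5) Lookahead convolution with a unidirectional model is used, because the latency of a bidirectional LSTM does not meet the requirement.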
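
The "shorter utterances first" idea in item (3) can be sketched as a batching rule. This is one possible reading, not Baidu's implementation: batches are ordered by utterance length in the first epoch, and the shuffling of batches in later epochs is an assumption added here. The function name and the `(utterance_id, num_frames)` tuple layout are hypothetical.

```python
import random


def sortagrad_batches(utterances, batch_size, epoch, seed=0):
    """Group utterances into batches, shortest first in epoch 0.

    utterances: list of (utterance_id, num_frames) tuples.
    From epoch 1 on, the same length-grouped batches are visited in random
    order (an assumption of this sketch).
    """
    by_length = sorted(utterances, key=lambda u: u[1])
    batches = [by_length[i:i + batch_size]
               for i in range(0, len(by_length), batch_size)]
    if epoch > 0:
        random.Random(seed + epoch).shuffle(batches)
    return batches


data = [("a", 50), ("b", 10), ("c", 30), ("d", 20)]
print(sortagrad_batches(data, batch_size=2, epoch=0))
# [[('b', 10), ('d', 20)], [('c', 30), ('a', 50)]]
```

Grouping by length also reduces padding inside each batch, since utterances of similar duration end up together.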

Amazon Alexa

Cocktail party problem

JHU

Dan Povey

Parallel training of DNNs with natural gradient and parameter averaging, ICLR Workshop 2015.
Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. "Audio augmentation for speech recognition." InterSpeech 2015.

CMU

EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding, ASRU 2015.

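Reference Textbooks

Dong Yu and Li Deng (translated by Kai Yu and Yanmin Qian). 解析深度学习：语音识别实践 (Chinese edition of Automatic Speech Recognition: A Deep Learning Approach). Publishing House of Electronics Industry, July 2016.
Hang Li. 统计学习方法 (Statistical Learning Methods). Tsinghua University Press, Beijing, 2012.

Tsinghua University iCenter Artificial Intelligence Systems Lab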

BDMI-2019 Autumn AI lab
