EC2 Installation Walkthrough
In this guide I will explain how to set up OpenDcd with Kaldi on EC2 and decode open source models based on the LibriSpeech corpus. For this walkthrough I used a large instance with four cores and 15GB of memory. OpenDcd is memory efficient for both decoding and graph construction, and this is easily enough to decode the large 4-gram model.
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install -y gcc-4.9 g++-4.9 cpp-4.9 subversion make zlib1g-dev automake libtool autoconf libatlas3-base flac
Due to a bug in gcc 4.8 we installed gcc 4.9 and set symbolic links to make it the system default. The last command also points /bin/sh at bash.
sudo ln -s /usr/bin/g++-4.9 /usr/bin/g++
sudo ln -s /usr/bin/gcc-4.9 /usr/bin/gcc
sudo ln -s -f bash /bin/sh
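Before building, it may be worth confirming that the default gcc is the one just linked. The following is a minimal sketch, not part of the original recipe: the function names version_ge and check_gcc are hypothetical, and it assumes GNU sort (for the -V option) is available, as it is on Ubuntu.

```shell
# Return success if version $1 is at least version $2 (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report whether the gcc found on PATH is at least 4.9,
# the version this walkthrough installs.
check_gcc() {
    if ! command -v gcc >/dev/null 2>&1; then
        echo "gcc not found on PATH"
        return 1
    fi
    found=$(gcc -dumpversion)
    if version_ge "$found" 4.9; then
        echo "gcc $found is OK"
    else
        echo "gcc $found is older than 4.9"
        return 1
    fi
}
```

Calling check_gcc after creating the links should report the linked version; a failure here suggests the links were not created or another gcc comes first on PATH.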
svn co https://svn.code.sf.net/p/kaldi/code/trunk kaldi
cd kaldi/tools
make
cd ../src
./configure
For decent runtime performance it is essential to edit the kaldi.mk file and add the -O2 switch. Now just type make to build Kaldi, optionally specifying the number of cores to use.
make -j4
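The kaldi.mk edit mentioned above has to happen before make is run. It could be scripted roughly as follows. This is only a sketch: add_o2 is a hypothetical helper, and it assumes the generated kaldi.mk contains a line beginning with "CXXFLAGS = ", which may not hold for every Kaldi revision or platform. Inspect the file by hand if the helper reports that no such line was found.

```shell
# Add -O2 to every line that starts with "CXXFLAGS = " in a makefile fragment,
# unless the file already mentions -O2 somewhere. A backup is kept as FILE.bak.
# Usage: add_o2 path/to/kaldi.mk
add_o2() {
    mk=$1
    if grep -q -e '-O2' "$mk"; then
        echo "$mk already contains -O2"
        return 0
    fi
    if ! grep -q '^CXXFLAGS = ' "$mk"; then
        echo "no 'CXXFLAGS = ' line found in $mk; edit it by hand" >&2
        return 1
    fi
    cp "$mk" "$mk.bak"    # keep a copy of the generated file
    sed 's/^CXXFLAGS = /CXXFLAGS = -O2 /' "$mk.bak" > "$mk"
    echo "added -O2 to $mk"
}
```

From the kaldi/src directory this would be invoked as add_o2 kaldi.mk, after ./configure and before make.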
git clone https://github.com/edobashira/opendcd.git
cd opendcd/3rdparty
make
cd ../src/bin
make -j4
There are two graph construction methods. In the first, we take a set of Kaldi component transducers as the input to the build process. In the second, we take a raw language model and lexicon and build everything from scratch. In this recipe we will use the pre-built models from kaldi-asr.org and the first method.
We need three sets of models: the language model and lexicon, the acoustic model, and the models used in the iVector extractor.
The helper script makeclevel.sh will build the cascade from the models. It needs four parameters: the two locations of the model files, the directory to write the result to, and the path where Kaldi is installed.
script/makeclevel.sh lang_test_tgsmall nnet_a graph_test_tgsmall ../../kaldi
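Since the script depends on several input directories, a small pre-flight check can catch a wrong path before the build starts. The sketch below is an illustration only: require_dirs is a hypothetical helper, not part of OpenDcd, and the directory names are simply those used in the command above.

```shell
# Return non-zero, and list what is missing on stderr,
# if any of the given directories does not exist.
require_dirs() {
    missing=0
    for d in "$@"; do
        if [ ! -d "$d" ]; then
            echo "missing directory: $d" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Intended use, with the paths from the command above:
# require_dirs lang_test_tgsmall nnet_a ../../kaldi &&
#     script/makeclevel.sh lang_test_tgsmall nnet_a graph_test_tgsmall ../../kaldi
```

The output directory is left out of the check on purpose, on the assumption that it does not need to exist beforehand.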
In modern neural network based speech recognition the decoding pipeline consists of three steps: feature extraction, state likelihood computation and the search algorithm.
First we will grab a set of utterances from OpenSLR.
wget http://www.openslr.org/resources/12/test-clean.tar.gz
tar -zxf test-clean.tar.gz
Recent versions of Kaldi include a new online decoder, which comes with several tools for online decoding. In particular, online2-wav-nnet2-am-compute is perfect for our needs. It takes the raw waveform and computes the features and the neural network's output activations, which makes it ideal for connecting with OpenDcd to complete the recognition cascade.
online2bin/online2-wav-nnet2-am-compute \
--online=true \
--apply-log=true \
--config=online_nnet2_decoding.conf \
nnet_a/final.mdl \
ark:test-clean.utt2psk \
"ark:~/tools/kaldi/src/featbin/wav-copy scp,p:test-clean.scp ark:- |" \
ark:-
We first need to create several config files and the utterance lists. The OpenDcd repository contains the config files, and the utterance lists are generated by a helper script. The contents of the utterance files are briefly described here. The utterance list test-clean.scp gives the file names and the flac command to convert them to raw WAV files.
1089-134686-0011 flac -c -d -s LibriSpeech/test-clean/1089/134686//1089-134686-0011.flac |
1089-134686-0028 flac -c -d -s LibriSpeech/test-clean/1089/134686//1089-134686-0028.flac |
1089-134686-0032 flac -c -d -s LibriSpeech/test-clean/1089/134686//1089-134686-0032.flac |
...
...
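For readers who want to see how such a list can be produced, here is a minimal sketch that writes lines in the format shown above. It is not the helper script from the OpenDcd repository: make_wav_scp is a hypothetical function, and it assumes each utterance ID is the .flac file name without its extension.

```shell
# Print one "utterance-id flac-command |" line for every .flac file
# found under the directory given as the first argument.
make_wav_scp() {
    find "$1" -name '*.flac' | sort | while read -r path; do
        id=$(basename "$path" .flac)    # e.g. 1089-134686-0011
        echo "$id flac -c -d -s $path |"
    done
}
```

Run from the directory where the archive was unpacked, make_wav_scp LibriSpeech/test-clean > test-clean.scp would write the list to the file name used in this walkthrough.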
The second utterance list is the utt2spk file. We won't be using any speaker adaptation in this walkthrough, but the file is still needed; with no adaptation, each utterance ID simply maps to itself.
1089-134686-0011 1089-134686-0011
1089-134686-0028 1089-134686-0028
1089-134686-0032 1089-134686-0032
1089-134686-0012 1089-134686-0012
1089-134686-0022 1089-134686-0022
...
...
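Because every utterance maps to itself, this file can be derived from the first column of the scp list. The following sketch shows one way to do that; make_identity_utt2spk is a hypothetical name and this is not the repository's helper script.

```shell
# Read an scp list on stdin and print "utterance-id utterance-id" for each line,
# i.e. every utterance is treated as its own speaker.
make_identity_utt2spk() {
    awk '{ print $1, $1 }'
}
```

It reads from standard input, so it would be used as make_identity_utt2spk < test-clean.scp, with the output redirected to whichever file name the decoding command reads.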
In the final step we connect the feature extraction to OpenDcd to complete the decoding pipeline.
~/tools/kaldi/src/online2bin/online2-wav-nnet2-am-compute \
--online=true \
--apply-log=true \
--config=online_nnet2_decoding.conf \
nnet_a/final.mdl \
ark:test-clean.utt2psk \
"ark:~/tools/kaldi/src/featbin/wav-copy scp,p:test-clean.scp ark:- |" \
ark:- 2> feats.log |\
../src/bin/dcd-recog \
--word_symbols_table=words.txt \
--decoder_type=hmm_lattice \
--beam=15 \
--acoustic_scale=0.1 \
--fst_reset_period=1 \
graph_test_tgsmall/arcs.far \
graph_test_tgsmall/la.C.det.L.fst,graph_test_tgsmall/G.fst \
ark:- recog.far
If everything worked correctly the decoder will write output like the following:
Currently the recognition results are written in two ways: directly to stdout as part of the logging, and as an OpenFst FAR file.
farinfo recog-dynamic.far
far type sttable
arc type standard
fst type vector
# of FSTs 38
total # of states 4043
total # of arcs 4005
total # of final states 38
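One quick sanity check on this report is to compare the number of FSTs with the number of utterances that were decoded. The sketch below assumes the decoder writes one FST per utterance, which the matching counts of FSTs and final states above suggest but which is not confirmed here. The function names are hypothetical; only the farinfo row label is taken from the output shown.

```shell
# Read farinfo output on stdin and print the value of the "# of FSTs" row.
far_fst_count() {
    awk '/^# of FSTs/ { print $NF }'
}

# Compare the FST count in a farinfo report (stdin) with the number of
# lines in an utterance list (first argument).
check_far_against_list() {
    list=$1
    n_far=$(far_fst_count)
    n_utt=$(wc -l < "$list" | tr -d ' ')
    if [ "$n_far" = "$n_utt" ]; then
        echo "all $n_utt utterances have a result"
    else
        echo "farinfo reports $n_far FSTs but $list has $n_utt lines"
        return 1
    fi
}
```

The report is piped in, for example farinfo recog.far | check_far_against_list test-clean.scp. A mismatch would point to utterances that failed in feature extraction or decoding, in which case feats.log is the first place to look.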