
An announcement voice recognition service for hearing-impaired people, based on deep learning and written in Python

The service detects the keyword "stop" in subway announcements and recognizes announcements of the form "[station], [station], the doors are on your left/right".

Team Members

  • Team Leader: Nam Seoyong (Division of Computer Science, HanYang University ERICA, Student ID : 2021075478)
  • Team Member: Choi Sooyeon (Division of Computer Science, HanYang University ERICA, Student ID : 2021023118)
  • Team Member: Lee Gyulim (Division of Computer Science, HanYang University ERICA, Student ID : 2021090646)


  1. Folder Structure
  2. Development Setting
  3. Libraries & Tools
  4. Data Augmentation
  5. Noise Reduction
  6. Keyword Spotting
  7. Run Demo

Folder Structure

 ┣ 📂noise-reduction
 ┃ ┣ 📂dataloader
 ┃ ┃ ┗ 📜
 ┃ ┣ 📂models
 ┃ ┃ ┗ 📂tscn
 ┃ ┃ ┃ ┣ 📜loss_history.csv
 ┃ ┃ ┃ ┣ 📜TSCN_CME.pth
 ┃ ┃ ┃ ┗ 📜TSCN_CSR.pth
 ┃ ┣ 📂tscn
 ┃ ┃ ┣ 📜
 ┃ ┃ ┣ 📜
 ┃ ┃ ┣ 📜
 ┃ ┃ ┗ 📜
 ┃ ┣ 📂utils
 ┃ ┃ ┗ 📜
 ┃ ┣ 📜dataset.csv
 ┃ ┣ 📜
 ┃ ┣ 📜
 ┃ ┣ 📜
 ┃ ┣ 📜
 ┃ ┣ 📜requirements.txt
 ┃ ┣ 📜sd1.wav
 ┃ ┣ 📜sn1.wav
 ┃ ┗ 📜
 ┣ 📂static
 ┣ 📂templates
 ┃ ┣ 📜index.html
 ┃ ┗ 📜result.html
 ┣ 📂Torch-KWT
 ┃ ┣ 📂docs
 ┃ ┃ ┗ 📜
 ┃ ┣ 📂models
 ┃ ┃ ┣ 📜
 ┃ ┃ ┗ 📜
 ┃ ┣ 📂runs
 ┃ ┣ 📂sample_configs
 ┃ ┃ ┗ 📜base_config.yaml
 ┃ ┣ 📂utils
 ┃ ┃ ┣ 📜
 ┃ ┃ ┣ 📜
 ┃ ┃ ┣ 📜
 ┃ ┃ ┣ 📜
 ┃ ┃ ┣ 📜
 ┃ ┃ ┣ 📜
 ┃ ┃ ┣ 📜
 ┃ ┃ ┗ 📜
 ┃ ┣ 📜
 ┃ ┣ 📜
 ┃ ┣ 📜
 ┃ ┣ 📜kwt1_pretrained.ckpt
 ┃ ┣ 📜label_map.json
 ┃ ┣ 📜
 ┃ ┣ 📜preds.json
 ┃ ┣ 📜preds_clip.json
 ┃ ┣ 📜
 ┃ ┣ 📜requirements.txt
 ┃ ┣ 📜
 ┃ ┗ 📜
 ┣ 📜
 ┣ 📜Data_Augmentation.ipynb
 ┣ 📜LICENSE.txt
 ┣ 📜
 ┣ 📜preds_clip.json
 ┗ 📜

Development Setting

  • Ubuntu 20.04
  • Python 3.8.16
  • PyTorch 1.12.1+cu116
  • CUDA 12.1

Libraries & Tools

  • tqdm
  • librosa
  • pandas
  • numpy
  • matplotlib
  • pystoi
  • scipy
  • openpyxl
  • pyyaml >= 5.3.1
  • audiomentations
  • pydub
  • einops
  • etc...

Data Augmentation

To augment the data, run the Data_Augmentation.ipynb notebook. It currently processes one file at a time; directory and multi-file input are not yet implemented.

Noise Reduction


Download Dataset

How to build the dataset for denoising

python3 noise-reduction/ \
--dataset_root {datapath} \
--csv_save_path {datapath}/dataset.csv

Structure of the dataset for denoising

| clean_path | noisy_path | script_path | train_val_test |
|---|---|---|---|
| share/clean_file_1.wav | share/noisy_file_1.wav | share/script_file_1.json | TR |
| share/clean_file_2.wav | share/noisy_file_2.wav | share/script_file_2.json | VA |
| ... | ... | ... | ... |
| share/clean_file_n.wav | share/noisy_file_n.wav | share/script_file_n.json | TE |
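
As a rough illustration of how a file with this structure could be consumed, the sketch below reads the four columns and groups the clean/noisy pairs by split tag. It assumes the CSV is comma-delimited and that TR, VA and TE stand for train, validation and test (inferred from the column name `train_val_test`). The sample rows and the `split_rows` helper are hypothetical and not part of the repository.

```python
import csv
import io
from collections import defaultdict

# Column names come from the table above; the rows are made-up placeholders.
SAMPLE_CSV = """clean_path,noisy_path,script_path,train_val_test
share/clean_file_1.wav,share/noisy_file_1.wav,share/script_file_1.json,TR
share/clean_file_2.wav,share/noisy_file_2.wav,share/script_file_2.json,VA
share/clean_file_3.wav,share/noisy_file_3.wav,share/script_file_3.json,TE
"""


def split_rows(csv_file):
    """Group (clean, noisy) path pairs by their TR / VA / TE tag."""
    splits = defaultdict(list)
    for row in csv.DictReader(csv_file):
        pair = (row["clean_path"], row["noisy_path"])
        splits[row["train_val_test"]].append(pair)
    return dict(splits)


splits = split_rows(io.StringIO(SAMPLE_CSV))
for tag in ("TR", "VA", "TE"):
    print(tag, len(splits.get(tag, [])))
```

With a real dataset.csv, `open(path, newline="")` would replace the `io.StringIO` wrapper.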

Training denoise model

python noise-reduction/ \
--model=models/tscn \
--csv_file=share/dataset.csv \
--cme_epochs=40 \
--finetune_epochs=10 \
--csr_epochs=40 \
--batch_size=8

Keyword Spotting

Dataset for KWS

You can download the dataset with the provided shell script:

sh Torch-KWT/ <destination_path>

Training KWS model


python Torch-KWT/ -v <path/to/validation_list.txt> -t <path/to/testing_list.txt> -d <path/to/dataset/root> -o <output dir>

This will create the files training_list.txt, validation_list.txt, testing_list.txt and label_map.json at the specified output dir.
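
To make the relationship between the four output files concrete, here is a minimal sketch of one way such lists could be derived. It is not the repository's script. It assumes a folder-per-keyword layout (as in Speech Commands), that the training list is every .wav file not named in the validation or testing list, that folders starting with "_" are skipped, and that the label map goes from index to label. `make_data_lists` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def read_list(path):
    """Read one relative .wav path per line, ignoring blank lines."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]


def make_data_lists(dataset_root, val_list, test_list):
    """Return (train, val, test, label_map) for a folder-per-keyword dataset."""
    root = Path(dataset_root)
    val, test = read_list(val_list), read_list(test_list)
    held_out = set(val) | set(test)
    all_wavs = sorted(
        p.relative_to(root).as_posix()
        for p in root.glob("*/*.wav")
        if not p.parent.name.startswith("_")  # assumption: skip helper folders
    )
    train = [p for p in all_wavs if p not in held_out]
    labels = sorted({p.split("/")[0] for p in all_wavs})
    label_map = dict(enumerate(labels))  # assumption: index -> label
    return train, val, test, label_map


# Tiny fake dataset so the sketch runs without any real audio.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data"
    for rel in ("stop/a.wav", "stop/b.wav", "go/c.wav", "go/d.wav"):
        (root / rel).parent.mkdir(parents=True, exist_ok=True)
        (root / rel).touch()
    val_file = Path(tmp) / "validation_list.txt"
    val_file.write_text("stop/b.wav\n")
    test_file = Path(tmp) / "testing_list.txt"
    test_file.write_text("go/d.wav\n")
    train, val, test, label_map = make_data_lists(root, val_file, test_file)

print(train)
print(json.dumps(label_map))
```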

Training is straightforward: only the path to a config file is required.

python Torch-KWT/ --conf path/to/config.yaml

Refer to the example config to see what the config file looks like, and see the config explanation for a complete rundown of the config parameters.

Pretrained Checkpoints

| Model Name | Test Accuracy | Link |
|---|---|---|
| KWT-1 | 95.98* | kwt1-v01.pth |


You can use the model for inference in two ways:

  • On short clips of about 1 s, like the audio in the Speech Commands dataset.
  • On longer audio clips that may contain multiple keywords, by running inference over the audio with a sliding window.

The first command below covers the short-clip case and the second the sliding-window case.

python --conf sample_configs/base_config.yaml \
                    --ckpt <path to pretrained_model.ckpt> \
                    --inp <path to audio.wav / path to audio folder> \
                    --out <output directory> \
                    --lmap label_map.json \
                    --device cpu \
                    --batch_size 8   # should be possible to use much larger batches if necessary, like 128, 256, 512 etc.

python --conf sample_configs/base_config.yaml \
                    --ckpt <path to pretrained_model.ckpt> \
                    --inp <path to audio.wav / path to audio folder> \
                    --out <output directory> \
                    --lmap label_map.json \
                    --device cpu \
                    --wlen 1 \
                    --stride 0.5 \
                    --thresh 0.85 \
                    --mode multi
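
To illustrate what `--wlen 1` and `--stride 0.5` imply, the sketch below computes the sample ranges of overlapping windows. It assumes both values are in seconds, that the audio is sampled at 16 kHz, and that a trailing partial window is dropped; the actual script may handle these details differently. `window_bounds` is a hypothetical helper.

```python
def window_bounds(num_samples, sample_rate, wlen=1.0, stride=0.5):
    """Yield (start, end) sample indices of each full-length window."""
    win = int(wlen * sample_rate)    # window length in samples
    hop = int(stride * sample_rate)  # distance between window starts
    if win <= 0 or hop <= 0:
        raise ValueError("wlen and stride must be positive")
    start = 0
    while start + win <= num_samples:
        yield start, start + win
        start += hop


# A 2.5 s clip at 16 kHz gives four half-overlapping 1 s windows.
bounds = list(window_bounds(num_samples=40000, sample_rate=16000))
for start, end in bounds:
    print(start, end)
```

Each window would then be classified separately, and the per-window predictions combined according to `--mode`.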

Window inference has three modes:

  • multi: saves all found predictions (default)
  • max: saves the "most confident" prediction (outputs a single clip-wise prediction for the whole clip)
  • n_voting: saves the "most frequent" prediction (outputs a single clip-wise prediction for the whole clip)
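
The difference between the three modes can be sketched as follows. This is an illustrative reading of the descriptions above, not the repository's code. It assumes each window yields a (label, confidence, start sample) tuple and that predictions below the threshold are discarded before aggregation. `aggregate` is a hypothetical helper.

```python
from collections import Counter


def aggregate(window_preds, mode="multi", thresh=0.85):
    """Combine per-window (label, confidence, start_sample) predictions."""
    kept = [p for p in window_preds if p[1] >= thresh]
    if mode == "multi":
        return kept  # every prediction above the threshold
    if not kept:
        return []
    if mode == "max":
        return [max(kept, key=lambda p: p[1])]  # single most confident window
    if mode == "n_voting":
        top_label, _ = Counter(p[0] for p in kept).most_common(1)[0]
        # Report the most frequent label with its best-scoring window.
        return [max((p for p in kept if p[0] == top_label), key=lambda p: p[1])]
    raise ValueError(f"unknown mode: {mode}")


# Made-up window predictions for demonstration only.
preds = [("stop", 0.90, 0), ("left", 0.95, 8000),
         ("stop", 0.88, 16000), ("go", 0.40, 24000)]

multi_result = aggregate(preds, "multi")
max_result = aggregate(preds, "max")
vote_result = aggregate(preds, "n_voting")
print(multi_result)
print(max_result)
print(vote_result)
```

Here "max" picks "left" because it has the single highest confidence, while "n_voting" picks "stop" because it appears in the most windows.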

If you run with mode "max", the result written to preds_clip.json looks like this:

{"/home/a/SpeechRecognition/data/denoise/b.wav": ["stop", 0.9001830816268921, 25600.0]}
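
A result in this shape can be read back with the standard `json` module. The sketch below parses the sample line shown above and extracts the label and confidence per file. The README does not state what the third value means, so it is left uninterpreted here. `summarize` is a hypothetical helper.

```python
import json

# The sample output shown above, copied verbatim.
SAMPLE = ('{"/home/a/SpeechRecognition/data/denoise/b.wav": '
          '["stop", 0.9001830816268921, 25600.0]}')


def summarize(preds_json):
    """Map each audio path to (label, confidence rounded to 3 decimals)."""
    summary = {}
    for path, (label, confidence, _unspecified) in json.loads(preds_json).items():
        summary[path] = (label, round(confidence, 3))
    return summary


summary = summarize(SAMPLE)
for path, (label, confidence) in summary.items():
    print(path, label, confidence)
```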

Run demo

How to run demo

Run the demo with three arguments:

python --model_dir {this_file_dir} --noise_file {noise_file_name} --denoise_file {denoise_file_name}

The script then prints the result.

For example,

python --model_dir ~/SpeechRecognition --noise_file handae --denoise_file de_handae

produces the following output:

model_dir : /home/a/SpeechRecognition
noise_file : handae.wav
denoise_file : de_handae.wav
finish denoise 5.694255113601685 sec

finish kws 3.3709611892700195 sec

result : anyang university at anzan hanyang university at anzan the doors are on your left

4.017111778259277 sec

How to run demo with FastAPI

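  1. Install FastAPI
pip install fastapi uvicorn
  2. Start FastAPI (uvicorn serves on port 8000 by default, so the port is set explicitly to match the URL below)
uvicorn main:app --reload --port 8080
  3. Open http://localhost:8080 in a browser

  4. Enter the same three arguments as in the command-line demo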

