Deep Speech 2 with Parallel MinGRU Implementation


This repository contains an implementation of the paper Deep Speech 2: End-to-End Speech Recognition, together with the newly proposed parallel minGRU architecture from Were RNNs All We Needed?, built with PyTorch 🔥 and Lightning AI ⚡.

📜 Paper & Blogs Review

📖 Introduction

Deep Speech 2, published in 2015, was a state-of-the-art ASR model that transcribes speech into text trained end-to-end with deep learning, replacing hand-engineered pipeline components with a single neural network.

On the other hand, Were RNNs All We Needed? introduces a new RNN-based architecture with a parallelized version of the minGRU (minimal GRU), aiming to improve the efficiency of RNNs by removing the hidden-state dependency that forces sequential processing. This enables faster training and inference, making it potentially more suitable for ASR and other real-time applications.
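A minimal NumPy sketch (not code from this repository) of why the minGRU parallelizes: because the gate and candidate state depend only on the input x_t, not on h_{t-1}, the recurrence is linear in h and can be unrolled with cumulative products instead of a time loop. The paper performs this scan in log space for numerical stability; plain cumprod/cumsum is used here for clarity and is fine for short sequences.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def min_gru_sequential(x, Wz, Wh, h0):
    """minGRU step by step: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t,
    where gate z_t and candidate h~_t depend only on x_t."""
    h, out = h0, []
    for t in range(x.shape[0]):
        z = sigmoid(x[t] @ Wz)            # update gate
        h_tilde = x[t] @ Wh               # candidate state
        h = (1.0 - z) * h + z * h_tilde
        out.append(h)
    return np.stack(out)

def min_gru_parallel(x, Wz, Wh, h0):
    """Same recurrence, no time loop: h_t = a_t * h_{t-1} + b_t is a
    linear recurrence, so h_t = A_t * h0 + A_t * sum_{k<=t} b_k / A_k
    with A_t = a_1 * ... * a_t (prefix products/sums over time)."""
    z = sigmoid(x @ Wz)
    a = 1.0 - z                  # (T, H) decay coefficients
    b = z * (x @ Wh)             # (T, H) input contributions
    A = np.cumprod(a, axis=0)    # prefix products A_t
    return A * h0 + A * np.cumsum(b / A, axis=0)
```

Both functions produce identical hidden-state sequences; only the second is expressible with parallel primitives.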


Installation

  1. Clone the repository:

     ```bash
     git clone --recursive https://github.com/LuluW8071/Deep-Speech-2.git
     cd Deep-Speech-2
     ```

  2. Install PyTorch and the required dependencies:

     ```bash
     pip install -r requirements.txt
     ```

     Ensure you have PyTorch and Lightning AI installed.

Usage

Training

Important

Before training, make sure you have set your Comet ML API key and project name in the environment variable file `.env`.
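For example, a `.env` file might look like the following. The exact variable names are an assumption (Comet ML conventionally reads these names from the environment) — check how `train.py` loads them:

```env
COMET_API_KEY=your_api_key_here
COMET_PROJECT_NAME=your_project_name
```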

To train the Deep Speech 2 model with the default training configuration, run:

```bash
python3 train.py
```

Customize the training parameters by passing arguments to `train.py` to suit your needs. Refer to the table below to change hyperparameters and training configurations.

| Args | Description | Default Value |
|------|-------------|---------------|
| `-g`, `--gpus` | Number of GPUs per node | 1 |
| `-w`, `--num_workers` | Number of CPU workers | 8 |
| `-db`, `--dist_backend` | Distributed backend to use for training | `ddp_find_unused_parameters_true` |
| `--epochs` | Number of total epochs to run | 50 |
| `--batch_size` | Size of the batch | 32 |
| `-lr`, `--learning_rate` | Learning rate | 2e-4 (0.0002) |
| `--checkpoint_path` | Checkpoint path to resume training from | None |
| `--precision` | Precision of the training | 16-mixed |
```bash
python3 train.py \
  -g 4 \
  -w 8 \
  --epochs 10 \
  --batch_size 64 \
  -lr 2e-5 \
  --precision 16-mixed \
  --checkpoint_path path_to_checkpoint.ckpt
```

Results

The model was trained on the LibriSpeech training sets (train-clean-100 + train-clean-360 + train-other-500, ~960 hours) and validated on the LibriSpeech test set (~10.5 hours).
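ASR results on LibriSpeech are conventionally reported as word error rate (WER): the word-level edit distance between the reference transcript and the hypothesis, divided by the reference length. A minimal reference implementation (not code from this repository):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words,
    computed as Levenshtein distance over word tokens via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat", "the bat sat")` is 1/3: one substitution over a three-word reference.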

Citations

```bibtex
@misc{amodei2015deepspeech2endtoend,
      title  = {Deep Speech 2: End-to-End Speech Recognition in English and Mandarin},
      author = {Dario Amodei and Rishita Anubhai and Eric Battenberg and Carl Case and others},
      year   = {2015},
      url    = {https://arxiv.org/abs/1512.02595}
}

@inproceedings{Feng2024WereRA,
    title  = {Were RNNs All We Needed?},
    author = {Leo Feng and Frederick Tung and Mohamed Osama Ahmed and Yoshua Bengio and Hossein Hajimirsadeghi},
    year   = {2024},
    url    = {https://api.semanticscholar.org/CorpusID:273025630}
}
```
