We recommend using our project template.
Implement and train a neural-network speech recognition system with CTC loss. You are free to choose any model you like. We recommend you to have a look at these papers:
- DeepSpeech
- DeepSpeech 2
- Conformer
- Or you can try a simple LSTM\GRU with LayerNorm between layers
Try to avoid using implementations available on the internet.
Requirements:
-
The code should be stored in a public github (or gitlab) repository
-
All the necessary packages should be mentioned in
./requirements.txt
or be installed in dockerfile -
All necessary resources (such as model checkpoints, LMs, and logs) should be downloadable with a script. Mention the script (or lines of code) in the
README.md
-
You should implement all functions in
test.py
(for evaluation) so that one can reproduce your results -
Basically, your
test.py
andtrain.py
scripts should run without issues after running all commands in your installation guide. -
Log everything that is useful: losses, data, learning rate, gradient norm, etc.
-
Provide the logs for the training of your final model from the start of the training. We heavily recommend you to use W&B Reports feature.
-
Attach a brief report. That includes:
- How to reproduce your model? (example: train 50 epochs with config
train_1.yaml
and 50 epochs withtrain_2.yaml
) - Attach training logs to show how fast did you network train
- How did you train your final model?
- What have you tried?
- What worked and what didn't work?
- What were the major challenges?
Also attach a summary of all bonus tasks you've implemented.
- How to reproduce your model? (example: train 50 epochs with config
Score | Dataset | CER | WER | Description |
---|---|---|---|---|
1.0 | -- | -- | -- | At least you tried |
2.0 | LibriSpeech: test-clean | 50 | -- | Well, it's something |
3.0 | LibriSpeech: test-clean | 30 | -- | You can guess the target phrase if you try |
4.0 | LibriSpeech: test-clean | 20 | -- | It gets some words right |
5.0 | LibriSpeech: test-clean | -- | 40 | More than half of the words are looking fine |
6.0 | LibriSpeech: test-clean | -- | 30 | It's quite readable |
7.0 | LibriSpeech: test-clean | -- | 20 | Occasional mistakes |
8.0 | LibriSpeech: test-other | -- | 30 | Your network can handle somewhat noisy audio. |
8.5 | LibriSpeech: test-other | -- | 25 | Your network can handle somewhat noisy audio but it is still just close enough. |
9.0 | LibriSpeech: test-other | -- | 20 | Somewhat suitable for practical applications. |
10.0 | LibriSpeech: test-other | -- | 10 | Technically better than a human. Well done! |
Dataset can be found here and on Kaggle.
Important
Use only train partitions of LibriSpeech or Mozilla Common Voice and data augmentation techniques to train your model.
To calculate the metrics, you can use torchmetrics
implementation of CER and WER.
To save some coding time, you can use HuggingFace dataset library. Look how easy it is:
from datasets import load_dataset
dataset = load_dataset("librispeech_asr", split='train-clean-360')
- Use an external language model for evaluation. The choice of an LM-fusion method is up to you. You may find this library helpful. Note: implementing this part will yield a very significant quality boost (which will improve your score by a lot). We heavily recommend you to implement this part.
- BPE instead of characters. You can use SentencePiece, HuggingFace, or YouTokenToMe.