Low WER training pipeline in torchaudio with wav2letter #913

vincentqb · 2020-09-18T14:41:50Z

torchaudio is targeting speech recognition as full audio application (internal). Along this line, we implemented wav2letter pipeline to obtain a low character error rate (CER). We want to expand on this and showcase a new pipeline which also has a low word error rate (WER). To achieve this, we consider the following additions to torchaudio from higher to lower priority.

Token Decoder: Add a lexicon-constrained beam search algorithm, based on fairseq (search class, sequence generator) since it is torchscriptable.

Links: fairseq, wav2letter, ParlAI, user repository, caffe2, pyspeech.
Kaldi related: Kaldi Viterbi BeamSearch SimpleDecoder, internal, and OpenFST.
Other domains: paper for vision
Related algorithm: FAISS’s topk internal, paper and github

Acoustic Model: Add a transformer-based acoustic model, e.g. speech-transformer, comparison.

wav2vec 2.0 (paper, fairseq’s github) combines transformer and lexicon-constrained beam search
multi-speaker, streaming (paper, post)

Language Model: Add KenLM to use a 4-gram language model based on LibriSpeech Language Model, as done in paper.

wav2letter interfaces with both KenLM and ConvLM
This could land in torchtext.

Training Loss: Add the RNN Transducer loss to replace the CTC loss in the pipeline.

RNNT (paper, github)
see also internal

Transformations: SpecAugment is already available in wav2letter pipeline.

See also internal

cc @astaff @dongreenberg @cpuhrsch

scarecrow1123 · 2020-09-19T01:11:22Z

Hey @vincentqb these planned additions look great and useful! Can you clarify on the below points please?

Are these open for community contributions since there are a bunch of internal links you've linked to the description above?
For things like transducer loss and token decoder, are you planning on linking other libraries (like fairseq, warp-transducer, etc.) as dependencies to the example application? Or would they be self contained as it is done for wav2letter implementation in Example pipeline with wav2letter #632 ?

vincentqb · 2020-09-22T17:22:40Z

Hey, thanks for commenting :)

For now, we are looking for high level feedback on the plan. This has not yet been taskyfied for the community :)
We are aiming at keeping the implementations self-contained in torchaudio as much as possible.

Thoughts?

Edresson · 2020-09-27T12:24:29Z

@vincentqb Wouldn't it be interesting to create a torchaudio_ASR repository?

Support various things, an independent preprocess of features that saves the features in .pt so we can extract any feature from torchaudio and make an easy support to features from fairseq (like wav2vec).

I feel that ASR lacks a repository with torchaudio and is modular enough to accept new features. And that it is simple for people to add new models (which are different from those implemented in torchaudio (currently we only have wav2letter)). I believe that the community can like this idea and contribute to the repository.

vincentqb · 2020-09-28T14:35:10Z

@vincentqb Wouldn't it be interesting to create a torchaudio_ASR repository?

My first step is to understand what would be missing in torchaudio to serve the ASR community best :) Can you provide some examples?

Support various things, an independent preprocess of features that saves the features in .pt so we can extract any feature from torchaudio and make an easy support to features from fairseq (like wav2vec).

torch.save can save any nn.Module and tensors. Is that what you meant?

I feel that ASR lacks a repository with torchaudio and is modular enough to accept new features. And that it is simple for people to add new models (which are different from those implemented in torchaudio (currently we only have wav2letter)). I believe that the community can like this idea and contribute to the repository.

Are there models you would like to contribute to torchaudio? :)

Edresson · 2020-09-28T14:55:00Z

@vincentqb I would support the Jasper model. Right now I'm out of time so maybe soon I can send a PR :).

If that list of things has been added to torchaudio it will be very good for ASR :). It will be even simpler to build a great pipeline using only torchaudio :).

My initial suggestion is to keep external notebooks. Why not make a torchaudio_ASR repository? So some things would not need to be implemented in the torchaudio itself, but in these new repositories. There in this repository we can extract features before and save with torch.save and just write a generic dataloader that reads, this is interesting to facilitate support for new features like wav2vec. So to support new extraction methods independent of torchaudio, just write a new preprocessing class.

vincentqb · 2020-09-28T15:04:09Z

@vincentqb I would support the Jasper model. Right now I'm out of time so maybe soon I can send a PR :).

Great! Feel free to ping me when you do :)

If that list of things has been added to torchaudio it will be very good for ASR :). It will be even simpler to build a great pipeline using only torchaudio :).

My initial suggestion is to keep external notebooks.

We currently offer training examples such as wav2letter using torchaudio. The example I link shows one way of doing preprocessing. Does that help?

Why not make a torchaudio_ASR repository? So some things would not need to be implemented in the torchaudio itself, but in these new repositories. There in this repository we can extract features before and save with torch.save and just write a generic dataloader that reads, this is interesting to facilitate support for new features like wav2vec. So to support new extraction methods independent of torchaudio, just write a new preprocessing class.

Our goal with torchaudio is to provide flexible building blocks for audio-related fields, such as ASR. As such, we want to make sure we capture what would be useful to the community, and to ASR. Can you provide an example of your suggested workflow?

Edresson · 2020-09-28T15:19:58Z

We currently offer training examples such as wav2letter using torchaudio. The example I link shows one way of doing preprocessing. Does that help?

I had not seen this very good example :)

Our goal with torchaudio is to provide flexible building blocks for audio-related fields, such as ASR. As such, we want to make sure we capture what would be useful to the community, and to ASR. Can you provide an example of your suggested workflow?

Now that I've seen the example. Do you think about supporting Wav2vec?

The easiest way to do this support that I see is to change the example, separating the feature extraction (MFFC / waveform) from the model training. Basically a preprocess.py that extracts the characteristics and saves them with torch.save so the main only reads the files saved by torch.save.

Do you know a simpler way to integrate Wav2vec with torchaudio?

vincentqb · 2020-09-28T15:32:41Z

Adding an example workflow with wav2vec would be a great addition! I see you have already mentioned jasper in comment so let's move the discussion there :)

Now that I've seen the example. Do you think about supporting Wav2vec?

The easiest way to do this support that I see is to change the example, separating the feature extraction (MFFC / waveform) from the model training. Basically a preprocess.py that extracts the characteristics and saves them with torch.save so the main only reads the files saved by torch.save.

Do you know a simpler way to integrate Wav2vec with torchaudio?

We currently don't have a pipeline with wav2vec included, but this would be a great addition. torchaudio is made to be modular and uses standard pytorch operations, so using the pre-trained tensors from fairseq can be done using standard pytorch operations. Is that what you meant?

Edresson · 2020-09-28T16:17:24Z

Adding an example workflow with wav2vec would be a great addition! I see you have already mentioned jasper in comment so let's move the discussion there :)

Now that I've seen the example. Do you think about supporting Wav2vec?
The easiest way to do this support that I see is to change the example, separating the feature extraction (MFFC / waveform) from the model training. Basically a preprocess.py that extracts the characteristics and saves them with torch.save so the main only reads the files saved by torch.save.
Do you know a simpler way to integrate Wav2vec with torchaudio?

We currently don't have a pipeline with wav2vec included, but this would be a great addition. torchaudio is made to be modular and uses standard pytorch operations, so using the pre-trained tensors from fairseq can be done using standard pytorch operations. Is that what you meant?

Do you have any idea how this support would do?
Do you intend to add fairseq as a dependency on the torchaudio or example in question?

Or do you want to try to make it independent of the fairseq structure and just create a class with Wav2vec architecture and load it at the checkpoint?

vincentqb · 2020-10-14T17:26:51Z

@Edresson -- those are great questions, and thanks for sharing your thoughts :)

Do you intend to add fairseq as a dependency on the torchaudio or example in question?
Or do you want to try to make it independent of the fairseq structure and just create a class with Wav2vec architecture and load it at the checkpoint?

We do not want torchaudio to depend on fairseq, no. For the example implementation, we also aim to avoid such dependencies as much as possible.

I would aim instead for fairseq to use torchaudio building blocks in some places.

Do you have any idea how this support would do?

What I meant above about the checkpoint was really just that torchaudio uses standard pytorch, so a user can interact with through standard pytorch means. For instance, someone could pre-process with torchaudio on a torchaudio dataset and then import a model from somewhere else, and then follow a torchaudio example for training loop.

Is this what you meant?

Edresson · 2020-10-14T17:43:28Z

@Edresson -- those are great questions, and thanks for sharing your thoughts :)

Do you intend to add fairseq as a dependency on the torchaudio or example in question?
Or do you want to try to make it independent of the fairseq structure and just create a class with Wav2vec architecture and load it at the checkpoint?

We do not want torchaudio to depend on fairseq, no. For the example implementation, we also aim to avoid such dependencies as much as possible.

I would aim instead for fairseq to use torchaudio building blocks in some places.

Do you have any idea how this support would do?

What I meant above about the checkpoint was really just that torchaudio uses standard pytorch, so a user can interact with through standard pytorch means. For instance, someone could pre-process with torchaudio on a torchaudio dataset and then import a model from somewhere else, and then follow a torchaudio example for training loop.

Is this what you meant?

I believe so :). We could, for example, preprocess the dataset with torchaudio after extracting the features of these audios with wav2vec and save with torch.save. After going back to torchaudio and using the torchaudio ASR models. I think of saving with torch.save because extracting features at all times with wav2vec can be too slow, so the extraction is done only once. That makes sense?

vincentqb · 2020-10-14T19:44:49Z

I believe so :).

great!

We could, for example, preprocess the dataset with torchaudio after extracting the features of these audios with wav2vec and save with torch.save. After going back to torchaudio and using the torchaudio ASR models. I think of saving with torch.save because extracting features at all times with wav2vec can be too slow, so the extraction is done only once. That makes sense?

yup, does to me :)

vincentqb mentioned this issue Sep 28, 2020

Adding Module Models #446

Closed

vincentqb mentioned this issue Jan 8, 2021

DRAFT #1163

Closed

vincentqb mentioned this issue Jan 25, 2021

Roadmap ahead for torchaudio #1196

Closed

mthrok closed this as completed Dec 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Low WER training pipeline in torchaudio with wav2letter #913

Low WER training pipeline in torchaudio with wav2letter #913

vincentqb commented Sep 18, 2020

scarecrow1123 commented Sep 19, 2020

vincentqb commented Sep 22, 2020

Edresson commented Sep 27, 2020

vincentqb commented Sep 28, 2020

Edresson commented Sep 28, 2020

vincentqb commented Sep 28, 2020

Edresson commented Sep 28, 2020

vincentqb commented Sep 28, 2020

Edresson commented Sep 28, 2020

vincentqb commented Oct 14, 2020 •

edited

Loading

Edresson commented Oct 14, 2020

vincentqb commented Oct 14, 2020

Low WER training pipeline in torchaudio with wav2letter #913

Low WER training pipeline in torchaudio with wav2letter #913

Comments

vincentqb commented Sep 18, 2020

scarecrow1123 commented Sep 19, 2020

vincentqb commented Sep 22, 2020

Edresson commented Sep 27, 2020

vincentqb commented Sep 28, 2020

Edresson commented Sep 28, 2020

vincentqb commented Sep 28, 2020

Edresson commented Sep 28, 2020

vincentqb commented Sep 28, 2020

Edresson commented Sep 28, 2020

vincentqb commented Oct 14, 2020 • edited Loading

Edresson commented Oct 14, 2020

vincentqb commented Oct 14, 2020

vincentqb commented Oct 14, 2020 •

edited

Loading