
Add embedding scale to nn.Embedding. #17

Open
wants to merge 6 commits into master
Conversation

csukuangfj
Collaborator

Differences between conformer_ctc and conformer_ctc_embedding_scale

conformer_ctc_embedding_scale replaces nn.Embedding with a modified
embedding. The modified embedding contains two changes (a sketch is shown
after the list):

  • (1) The weight matrix is initialized to the range (-std, std) where
    std = 1 / sqrt(embedding_dim)

  • (2) The output of the embedding is scaled by sqrt(embedding_dim)

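A minimal sketch of the modified embedding described above (illustrative only; the class name ScaledEmbedding is made up here, and the actual implementation in conformer_ctc_embedding_scale may differ in details):

import math

import torch
import torch.nn as nn


class ScaledEmbedding(nn.Module):
    # Embedding whose weights are initialized to a small range and whose
    # output is scaled up by sqrt(embedding_dim).
    def __init__(self, num_embeddings: int, embedding_dim: int):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.embed = nn.Embedding(num_embeddings, embedding_dim)
        # (1) Initialize the weight matrix to (-std, std),
        #     where std = 1 / sqrt(embedding_dim)
        std = 1.0 / math.sqrt(embedding_dim)
        nn.init.uniform_(self.embed.weight, -std, std)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (2) Scale the output by sqrt(embedding_dim)
        return self.embed(x) * math.sqrt(self.embedding_dim)
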
Also, conformer_ctc_embedding_scale modifies the PositionalEncoding
in transformer.py. It replaces

self.xscale = math.sqrt(self.d_model)
x = x * self.xscale + self.pe[:, : x.size(1), :]

with

self.pos_scale = 1. / math.sqrt(self.d_model)
x = x + self.pe[:, : x.size(1), :] * self.pos_scale

You can use

diff conformer_ctc/transformer.py conformer_ctc_embedding_scale/transformer.py

to find the exact differences.
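
In context, the changed forward pass looks roughly like the following sketch of a standard sinusoidal positional encoding. This is not the real class: the PositionalEncoding in transformer.py has more machinery (e.g. dropout handling and extending self.pe on the fly) and may differ in details.

import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    # Sketch: add a sinusoidal positional encoding, scaled down by
    # 1/sqrt(d_model), to the input instead of scaling the input up.
    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.d_model = d_model
        self.pos_scale = 1.0 / math.sqrt(d_model)
        self.dropout = nn.Dropout(dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float32).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float32)
            * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model). The input is no longer multiplied by
        # sqrt(d_model); the positional encoding is scaled down instead.
        x = x + self.pe[:, : x.size(1), :] * self.pos_scale
        return self.dropout(x)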

@@ -0,0 +1,33 @@
#!/usr/bin/env python3
Collaborator


I'm not sure that we want to copy the test code across experiments?

Collaborator Author


The intention is to make the model dir itself as self-contained as possible so that it can be modified
independently, with the drawback that there is some duplication.

We can use symlinks here if others agree. @danpovey, what do you think?

@csukuangfj
Collaborator Author

csukuangfj commented Aug 26, 2021

Add Madam optimizer.

Use default parameters with Foam.

Tensorboard log:
https://tensorboard.dev/experiment/xcJ2nf6YRjGMtJdjjLH0LQ/

@csukuangfj
Collaborator Author

csukuangfj commented Sep 8, 2021

Here are the results for this pull request
(only the best WERs of each method are listed):

HLG 1best decoding (no LM rescoring, no attention-decoder rescoring)

(model averaging from epoch-40.pt to epoch-49.pt)

2021-09-08 04:35:32,223 INFO [utils.py:317] [test-clean-no_rescore] %WER 2.99% [1573 / 52576, 134 ins, 244 del, 1195 sub ]
2021-09-08 04:37:53,973 INFO [utils.py:317] [test-other-no_rescore] %WER 6.98% [3652 / 52343, 203 ins, 842 del, 2607 sub ] 

(model averaging from epoch-22.pt to epoch-49.pt)

2021-09-07 21:50:25,598 INFO [utils.py:317] [test-clean-no_rescore] %WER 3.05% [1602 / 52576, 140 ins, 287 del, 1175 sub ]
2021-09-07 21:52:46,439 INFO [utils.py:317] [test-other-no_rescore] %WER 6.87% [3597 / 52343, 189 ins, 880 del, 2528 sub ]
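
(For reference, each log line reports WER as (insertions + deletions + substitutions) / reference words; e.g. for the first test-clean line above, (134 + 244 + 1195) / 52576 = 1573 / 52576 ≈ 2.99%.)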

HLG decoding + 4-gram LM rescoring (whole lattice rescoring, without attention-decoder)

(model averaging from epoch-19.pt to epoch-48.pt)

2021-09-08 08:07:39,492 INFO [utils.py:317] [test-clean-lm_scale_0.6] %WER 2.66% [1399 / 52576, 159 ins, 126 del, 1114 sub ]
2021-09-08 08:14:59,870 INFO [utils.py:317] [test-other-lm_scale_0.7] %WER 6.00% [3140 / 52343, 234 ins, 484 del, 2422 sub ]

HLG + 4-gram LM rescoring (whole lattice rescoring) + attention-decoder rescoring

(model averaging from epoch-24.pt to epoch-49.pt)

2021-09-07 22:05:24,962 INFO [decode.py:437]
For test-clean, WER of different settings are:
ngram_lm_scale_1.0_attention_scale_0.6  2.64  best for test-clean
ngram_lm_scale_0.9_attention_scale_0.6  2.65
ngram_lm_scale_1.0_attention_scale_0.7  2.65
ngram_lm_scale_1.3_attention_scale_1.3  2.65
ngram_lm_scale_0.9_attention_scale_0.5  2.66
ngram_lm_scale_0.9_attention_scale_0.7  2.66
ngram_lm_scale_1.0_attention_scale_0.9  2.66
ngram_lm_scale_1.0_attention_scale_1.0  2.66

2021-09-07 22:17:03,741 INFO [decode.py:437]
For test-other, WER of different settings are:
ngram_lm_scale_0.9_attention_scale_0.7  5.91  best for test-other
ngram_lm_scale_1.0_attention_scale_0.9  5.91
ngram_lm_scale_0.7_attention_scale_0.6  5.92
ngram_lm_scale_0.7_attention_scale_0.7  5.92
ngram_lm_scale_0.9_attention_scale_0.5  5.92
ngram_lm_scale_0.9_attention_scale_1.1  5.92
ngram_lm_scale_0.9_attention_scale_1.2  5.92

(model averaging from epoch-33.pt to epoch-48.pt)

For test-clean, WER of different settings are:
ngram_lm_scale_0.9_attention_scale_0.5  2.73  best for test-clean
ngram_lm_scale_0.9_attention_scale_0.6  2.73
ngram_lm_scale_0.9_attention_scale_0.7  2.73
ngram_lm_scale_1.5_attention_scale_1.5  2.73
ngram_lm_scale_1.5_attention_scale_1.7  2.73
ngram_lm_scale_1.0_attention_scale_0.5  2.74
ngram_lm_scale_1.0_attention_scale_0.6  2.74

For test-other, WER of different settings are:
ngram_lm_scale_1.2_attention_scale_0.9  5.85  best for test-other
ngram_lm_scale_1.7_attention_scale_1.9  5.85
ngram_lm_scale_1.2_attention_scale_1.0  5.86
ngram_lm_scale_1.5_attention_scale_1.9  5.86
ngram_lm_scale_1.7_attention_scale_2.0  5.86
ngram_lm_scale_0.9_attention_scale_0.6  5.87
ngram_lm_scale_1.2_attention_scale_1.1  5.87
ngram_lm_scale_1.2_attention_scale_1.2  5.87
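
("Model averaging from epoch-M.pt to epoch-N.pt" above means decoding with the element-wise average of the parameters saved at those epochs, rather than with a single checkpoint. A minimal sketch of that kind of averaging follows; it is illustrative only: the recipe presumably has its own helper for this, and the assumption that each checkpoint stores its state dict under a "model" key is just that, an assumption.)

from typing import Dict, List

import torch


def average_checkpoints(filenames: List[str]) -> Dict[str, torch.Tensor]:
    # Element-wise average of the model parameters stored in the given
    # checkpoint files. Assumes each checkpoint is a dict whose "model"
    # entry holds the state dict (an assumed format, not the actual one).
    n = len(filenames)
    avg = torch.load(filenames[0], map_location="cpu")["model"]
    for f in filenames[1:]:
        state = torch.load(f, map_location="cpu")["model"]
        for k in avg:
            avg[k] = avg[k] + state[k]
    return {k: v / n for k, v in avg.items()}


# e.g. average epoch-40.pt through epoch-49.pt before decoding:
# averaged = average_checkpoints([f"exp/epoch-{i}.pt" for i in range(40, 50)])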

@csukuangfj
Collaborator Author

The results are comparable with those from the latest master
(https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md).

WERs of test-clean and test-other:

test-clean:

  • this pull-request: 2.64
  • master: 2.57

test-other:

  • this pull-request: 5.85
  • master: 5.94

@danpovey
Collaborator

danpovey commented Sep 8, 2021

I think the reason this doesn't make much difference is that, since this embedding is only used as an input, leaving it as random vectors works OK: the rest of the network can just figure out what to do with it. But I think it's probably good practice to train it regardless. There might be setups with larger vocabs where this matters.
