CUDA out of memory in decoding #70

Open
Lzhang-hub opened this issue Oct 8, 2021 · 113 comments

@Lzhang-hub

Lzhang-hub commented Oct 8, 2021

Hi, I am new to icefall. I finished the training of tdnn_lstm_ctc, but when I run the decoding steps I get the following error. I changed --max-duration, but there are still errors:

(screenshot of the CUDA out of memory error)

We set --max-duration=100 and use a Tesla V100-SXM; the GPU info is as follows:

(screenshot of the GPU info)

Would you give me some advice? Thanks!

@CSerV

CSerV commented Oct 8, 2021

A max-duration of 100 is still big. Maybe you can reduce it to 50, 30, or even less.

@Lzhang-hub
Author

A max-duration of 100 is still big. Maybe you can reduce it to 50, 30, or even less.

I have reduced max-duration to 1, but the error still exists.

@Lzhang-hub
Author

@csukuangfj We have used your advice (1) and (3), but the problem is not solved. If you can give some other advice, thank you very much!
(WeCom screenshot)

@danpovey
Collaborator

danpovey commented Oct 8, 2021 via email

@csukuangfj
Collaborator

"search_beam": 20,
"output_beam": 5,
"min_active_states": 30,
"max_active_states": 10000,

You can reduce search_beam, output_beam, or max_active_states.


By the way, does CUDA out of memory abort your decoding process? Does it continue to decode after pruning?
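
For context, these parameters are ultimately passed to k2's pruned intersection inside icefall's get_lattice(). Below is a minimal, hedged sketch (toy tensor shapes, and a plain CTC topology standing in for the recipe's HLG; not icefall's exact code) of where they enter, which is why lowering them shrinks the lattice and the GPU memory it needs:

import torch
import k2

# Fake network output: batch of 1, 20 frames, 5 output symbols (0 = blank).
vocab_size = 5
log_probs = torch.randn(1, 20, vocab_size).log_softmax(dim=-1)
# One supervision per row: (sequence_index, start_frame, num_frames).
supervision_segments = torch.tensor([[0, 0, 20]], dtype=torch.int32)
dense_fsa_vec = k2.DenseFsaVec(log_probs, supervision_segments)

# A CTC topology stands in for the HLG decoding graph here.
decoding_graph = k2.ctc_topo(max_token=vocab_size - 1)

lattice = k2.intersect_dense_pruned(
    decoding_graph,
    dense_fsa_vec,
    search_beam=15.0,        # smaller beam -> fewer active states -> less memory
    output_beam=5.0,
    min_active_states=30,
    max_active_states=7000,  # hard cap on states kept per frame
)
print("num_arcs in the pruned lattice:", lattice.num_arcs)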

@Lzhang-hub
Author

"search_beam": 20,
"output_beam": 5,
"min_active_states": 30,
"max_active_states": 10000,

You can reduce search_beam, output_beam, or max_active_states.

By the way, does CUDA out of memory abort your decoding process? Does it continue to decode after pruning?

Thanks! I will attempt to decode with your advice.

CUDA out of memory does not abort my decoding process; decoding can finish, but the results are very poor.

@danpovey
Collaborator

danpovey commented Oct 9, 2021 via email

@cdxie
Contributor

cdxie commented Oct 9, 2021

What do you mean by very poor? Is this your own data, or Librispeech? The model quality and data quality can affect the memory used in decoding.


@danpovey @csukuangfj Thanks for your reply. We are new to icefall; we just ran the LibriSpeech recipes and finished the training steps, and the above errors occurred in the decoding step. The decoding process can finish, but the WER on test-other is 59.41%.
The device we used is an NVIDIA V100 GPU (32 GB), and we followed csukuangfj's advice (1) and (3); the above errors still occur:
##############
2021-10-09 10:38:49,103 INFO [decode.py:387] Decoding started
2021-10-09 10:38:49,241 INFO [decode.py:388] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 80, 'subsampling_factor': 3, 'search_beam': 15, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 7000, 'use_double_scores': True, 'epoch': 19, 'avg': 5, 'method': 'whole-lattice-rescoring', 'num_paths': 100, 'nbest_scale': 0.5, 'export': False, 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 150, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-09 10:38:50,467 INFO [lexicon.py:113] Loading pre-compiled data/lang_phone/Linv.pt
2021-10-09 10:38:52,312 INFO [decode.py:397] device: cuda
2021-10-09 10:40:48,429 INFO [decode.py:428] Loading pre-compiled G_4_gram.pt
2021-10-09 10:43:25,546 INFO [decode.py:458] averaging ['tdnn_lstm_ctc/exp/epoch-15.pt', 'tdnn_lstm_ctc/exp/epoch-16.pt', 'tdnn_lstm_ctc/exp/epoch-17.pt', 'tdnn_lstm_ctc/exp/epoch-18.pt', 'tdnn_lstm_ctc/exp/epoch-19.pt']
2021-10-09 10:44:14,941 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 4.38 GiB (GPU 0; 31.75 GiB total capacity; 27.41 GiB already allocated; 365.75 MiB free; 30.23 GiB reserved in total by PyTorch)

2021-10-09 10:44:14,942 INFO [decode.py:732] num_arcs before pruning: 2061527
2021-10-09 10:44:14,977 INFO [decode.py:739] num_arcs after pruning: 113145
2021-10-09 10:44:16,184 INFO [decode.py:336] batch 0/?, cuts processed until now is 18
2021-10-09 10:44:16,944 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 21.89 GiB already allocated; 4.36 GiB free; 26.23 GiB reserved in total by PyTorch)

2021-10-09 10:44:16,944 INFO [decode.py:732] num_arcs before pruning: 2814753
2021-10-09 10:44:16,982 INFO [decode.py:739] num_arcs after pruning: 120129
2021-10-09 10:44:18,624 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 21.80 GiB already allocated; 1.54 GiB free; 29.05 GiB reserved in total by PyTorch)
#########################

We reduced search_beam (20->15) and max_active_states (10000->7000) a moment ago; the error is the same. We suspect the error could be caused by processing G, and we may follow https://github.com/kaldi-asr/kaldi/pull/4594 to prune our G.
We can't pinpoint the cause of the error right now, so we need help. Thanks!
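
The "num_arcs before pruning" / "num_arcs after pruning" lines in the log above come from an out-of-memory recovery loop during whole-lattice rescoring: when composing the lattice with G runs out of GPU memory, the lattice is pruned and the rescoring is retried. A rough sketch of that pattern (the function name, starting threshold, and retry schedule are illustrative assumptions, not icefall's exact code):

import torch
import k2

def rescore_with_oom_retry(lattice: k2.Fsa, rescore_fn, threshold: float = 1e-7, max_tries: int = 10):
    # rescore_fn is assumed to compose the lattice with G and may raise CUDA OOM.
    for _ in range(max_tries):
        try:
            return rescore_fn(lattice)
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise
            torch.cuda.empty_cache()
            print(f"num_arcs before pruning: {lattice.num_arcs}")
            # Drop arcs whose posterior probability is below the threshold, then retry.
            lattice = k2.prune_on_arc_post(lattice, threshold, use_double_scores=True)
            print(f"num_arcs after pruning: {lattice.num_arcs}")
            threshold *= 10.0  # prune more aggressively on the next attempt
    return None  # give up if the lattice is still too large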

@danpovey
Collaborator

danpovey commented Oct 9, 2021 via email

@cdxie
Contributor

cdxie commented Oct 9, 2021

Hm, can you show the last part of one of the training logs or point to the tensorboard log (tensorboard dev upload --logdir blah/log)? I wonder whether the model is OK.


@danpovey OK, this is the training log of tdnn_lstm_ctc:
tdnn-lstm-ctc-log-train.txt

@danpovey
Collaborator

danpovey commented Oct 9, 2021 via email

@cdxie
Contributor

cdxie commented Oct 9, 2021

Your model did not converge; loss should be something like 0.005, not 0.5. I believe when we ran it, we used --bucketing-sampler=True, that could possibly be the reason. Also we used several GPUs, but that should not really affect convergence I think. (Normally this script converges OK).


Thanks, I will modify the parameters and run again.

The GPU device I used is an NVIDIA A100 (40 GB), single GPU on a single machine. The parameters of the script ./tdnn_lstm_ctc/train.py were not modified.

@danpovey
Collaborator

danpovey commented Oct 9, 2021 via email

@cdxie
Contributor

cdxie commented Oct 9, 2021

And please show us some sample decoding output, it is written to somewhere (aligned output vs. the ref text). I want to see how the model failed. To get 59% WER is unusual; would normally be either 100% or close to 0, I'd expect.


OK, I chose the best results file (lm_scale_0.7) of the tdnn-lstm-ctc model, with decoding parameters {'search_beam': 15, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 7000}:

errs-test-clean-lm_scale_0.7.txt
errs-test-other-lm_scale_0.7.txt
recogs-test-clean-lm_scale_0.7.txt
recogs-test-other-lm_scale_0.7.txt
wer-summary-test-clean.txt
wer-summary-test-other.txt

@cdxie
Contributor

cdxie commented Oct 9, 2021

And please show us some sample decoding output, it is written to somewhere (aligned output vs. the ref text). I want to see how the model failed. To get 59% WER is unusual; would normally be either 100% or close to 0, I'd expect.


@danpovey, I also ran the LibriSpeech Conformer CTC recipe, using an NVIDIA A100 GPU (40 GB), single GPU on a single machine, and no parameters were modified during training. My model may not have converged, according to your comments. What should I change to make the loss converge?

The last part of the Conformer CTC training logs:
###########
2021-10-02 04:08:09,315 INFO [train.py:506] Epoch 34, batch 53900, batch avg ctc loss 0.0350, batch avg att loss 0.0249, batch avg loss 0.0279, total avg ctc loss: 0.0481, total avg att loss: 0.0344, total avg loss: 0.0385, batch size: 15
2021-10-02 04:08:14,123 INFO [train.py:506] Epoch 34, batch 53910, batch avg ctc loss 0.0332, batch avg att loss 0.0225, batch avg loss 0.0257, total avg ctc loss: 0.0476, total avg att loss: 0.0339, total avg loss: 0.0380, batch size: 17
2021-10-02 04:08:19,109 INFO [train.py:506] Epoch 34, batch 53920, batch avg ctc loss 0.0688, batch avg att loss 0.0451, batch avg loss 0.0522, total avg ctc loss: 0.0472, total avg att loss: 0.0334, total avg loss: 0.0375, batch size: 18
2021-10-02 04:08:24,021 INFO [train.py:506] Epoch 34, batch 53930, batch avg ctc loss 0.0530, batch avg att loss 0.0363, batch avg loss 0.0413, total avg ctc loss: 0.0476, total avg att loss: 0.0335, total avg loss: 0.0377, batch size: 15
2021-10-02 04:08:29,207 INFO [train.py:506] Epoch 34, batch 53940, batch avg ctc loss 0.0420, batch avg att loss 0.0304, batch avg loss 0.0339, total avg ctc loss: 0.0473, total avg att loss: 0.0339, total avg loss: 0.0379, batch size: 16
2021-10-02 04:08:34,475 INFO [train.py:506] Epoch 34, batch 53950, batch avg ctc loss 0.0518, batch avg att loss 0.0357, batch avg loss 0.0405, total avg ctc loss: 0.0469, total avg att loss: 0.0335, total avg loss: 0.0376, batch size: 16
2021-10-02 04:08:39,350 INFO [train.py:506] Epoch 34, batch 53960, batch avg ctc loss 0.0602, batch avg att loss 0.0414, batch avg loss 0.0471, total avg ctc loss: 0.0465, total avg att loss: 0.0333, total avg loss: 0.0373, batch size: 13
2021-10-02 04:08:44,708 INFO [train.py:506] Epoch 34, batch 53970, batch avg ctc loss 0.0495, batch avg att loss 0.0328, batch avg loss 0.0378, total avg ctc loss: 0.0462, total avg att loss: 0.0330, total avg loss: 0.0370, batch size: 16
2021-10-02 04:08:49,894 INFO [train.py:506] Epoch 34, batch 53980, batch avg ctc loss 0.0661, batch avg att loss 0.0431, batch avg loss 0.0500, total avg ctc loss: 0.0465, total avg att loss: 0.0331, total avg loss: 0.0371, batch size: 15
2021-10-02 04:08:54,981 INFO [train.py:506] Epoch 34, batch 53990, batch avg ctc loss 0.0351, batch avg att loss 0.0310, batch avg loss 0.0323, total avg ctc loss: 0.0469, total avg att loss: 0.0336, total avg loss: 0.0376, batch size: 17
2021-10-02 04:09:01,103 INFO [train.py:506] Epoch 34, batch 54000, batch avg ctc loss 0.0616, batch avg att loss 0.0432, batch avg loss 0.0487, total avg ctc loss: 0.0466, total avg att loss: 0.0334, total avg loss: 0.0374, batch size: 16
2021-10-02 04:10:01,514 INFO [train.py:565] Epoch 34, valid ctc loss 0.0642,valid att loss 0.0416,valid loss 0.0483, best valid loss: 0.0445 best valid epoch: 22
2021-10-02 04:10:06,173 INFO [train.py:506] Epoch 34, batch 54010, batch avg ctc loss 0.0651, batch avg att loss 0.0448, batch avg loss 0.0509, total avg ctc loss: 0.0551, total avg att loss: 0.0359, total avg loss: 0.0416, batch size: 15
2021-10-02 04:10:12,536 INFO [train.py:506] Epoch 34, batch 54020, batch avg ctc loss 0.0393, batch avg att loss 0.0274, batch avg loss 0.0310, total avg ctc loss: 0.0516, total avg att loss: 0.0342, total avg loss: 0.0394, batch size: 20
2021-10-02 04:10:17,708 INFO [train.py:506] Epoch 34, batch 54030, batch avg ctc loss 0.0668, batch avg att loss 0.0434, batch avg loss 0.0504, total avg ctc loss: 0.0497, total avg att loss: 0.0325, total avg loss: 0.0377, batch size: 15
2021-10-02 04:10:23,342 INFO [train.py:506] Epoch 34, batch 54040, batch avg ctc loss 0.0456, batch avg att loss 0.0283, batch avg loss 0.0335, total avg ctc loss: 0.0484, total avg att loss: 0.0340, total avg loss: 0.0383, batch size: 17
2021-10-02 04:10:28,181 INFO [train.py:506] Epoch 34, batch 54050, batch avg ctc loss 0.0455, batch avg att loss 0.0313, batch avg loss 0.0356, total avg ctc loss: 0.0488, total avg att loss: 0.0339, total avg loss: 0.0384, batch size: 14
2021-10-02 04:10:33,627 INFO [train.py:506] Epoch 34, batch 54060, batch avg ctc loss 0.0549, batch avg att loss 0.1210, batch avg loss 0.1011, total avg ctc loss: 0.0488, total avg att loss: 0.0351, total avg loss: 0.0392, batch size: 18
2021-10-02 04:10:38,395 INFO [train.py:506] Epoch 34, batch 54070, batch avg ctc loss 0.0647, batch avg att loss 0.0357, batch avg loss 0.0444, total avg ctc loss: 0.0486, total avg att loss: 0.0346, total avg loss: 0.0388, batch size: 16
2021-10-02 04:10:43,016 INFO [train.py:506] Epoch 34, batch 54080, batch avg ctc loss 0.0360, batch avg att loss 0.0266, batch avg loss 0.0294, total avg ctc loss: 0.0477, total avg att loss: 0.0338, total avg loss: 0.0380, batch size: 14
2021-10-02 04:10:47,858 INFO [train.py:506] Epoch 34, batch 54090, batch avg ctc loss 0.0496, batch avg att loss 0.0290, batch avg loss 0.0352, total avg ctc loss: 0.0474, total avg att loss: 0.0332, total avg loss: 0.0375, batch size: 15
2021-10-02 04:10:52,855 INFO [train.py:506] Epoch 34, batch 54100, batch avg ctc loss 0.0421, batch avg att loss 0.0288, batch avg loss 0.0328, total avg ctc loss: 0.0477, total avg att loss: 0.0342, total avg loss: 0.0382, batch size: 14
2021-10-02 04:10:53,829 INFO [checkpoint.py:62] Saving checkpoint to conformer_ctc/exp/epoch-34.pt
2021-10-02 04:11:36,298 INFO [train.py:708] Done!
#############

@cdxie
Contributor

cdxie commented Oct 9, 2021

@danpovey @csukuangfj Setting aside the loss convergence problem, the CUDA out of memory problems in decoding are still not solved. Could you give us more help?

@danpovey
Collaborator

danpovey commented Oct 9, 2021

The conformer model logs look normal.
Likely the memory usage in decoding is related to the convergence problems of the model.
We will rerun the TDNN+LSTM+CTC script locally to make sure there is no problem. @luomingshuang can you do this?

@luomingshuang
Collaborator

The conformer model logs look normal. Likely the memory usage in decoding is related to the convergence problems of the model. We will rerun the TDNN+LSTM+CTC script locally to make sure there is no problem. @luomingshuang can you do this?

OK, I will do it.

@cdxie
Contributor

cdxie commented Oct 9, 2021

The conformer model logs look normal. Likely the memory usage in decoding is related to the convergence problems of the model. We will rerun the TDNN+LSTM+CTC script locally to make sure there is no problem. @luomingshuang can you do this?

@danpovey @csukuangfj Sorry to trouble you again. I just ran the decoding steps of conformer-ctc and the same error occurred again (with reduced search_beam and max_active_states). Is this the same cause as with TDNN+LSTM+CTC, or is something wrong with our machine (we use a Docker environment)?:
################
2021-10-09 20:50:49,123 INFO [decode.py:538] Decoding started
2021-10-09 20:50:49,123 INFO [decode.py:539] {'lm_dir': PosixPath('data/lm'), 'subsampling_factor': 4, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'feature_dim': 80, 'nhead': 8, 'attention_dim': 512, 'num_decoder_layers': 6, 'search_beam': 13, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 7000, 'use_double_scores': True, 'epoch': 34, 'avg': 20, 'method': 'attention-decoder', 'num_paths': 100, 'nbest_scale': 0.5, 'export': False, 'exp_dir': PosixPath('conformer_ctc/exp'), 'lang_dir': PosixPath('data/lang_bpe'), 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-09 20:50:49,620 INFO [lexicon.py:113] Loading pre-compiled data/lang_bpe/Linv.pt
2021-10-09 20:50:49,795 INFO [decode.py:549] device: cuda
2021-10-09 20:50:57,672 INFO [decode.py:604] Loading pre-compiled G_4_gram.pt
2021-10-09 20:51:08,755 INFO [decode.py:640] averaging ['conformer_ctc/exp/epoch-15.pt', 'conformer_ctc/exp/epoch-16.pt', 'conformer_ctc/exp/epoch-17.pt', 'conformer_ctc/exp/epoch-18.pt', 'conformer_ctc/exp/epoch-19.pt', 'conformer_ctc/exp/epoch-20.pt', 'conformer_ctc/exp/epoch-21.pt', 'conformer_ctc/exp/epoch-22.pt', 'conformer_ctc/exp/epoch-23.pt', 'conformer_ctc/exp/epoch-24.pt', 'conformer_ctc/exp/epoch-25.pt', 'conformer_ctc/exp/epoch-26.pt', 'conformer_ctc/exp/epoch-27.pt', 'conformer_ctc/exp/epoch-28.pt', 'conformer_ctc/exp/epoch-29.pt', 'conformer_ctc/exp/epoch-30.pt', 'conformer_ctc/exp/epoch-31.pt', 'conformer_ctc/exp/epoch-32.pt', 'conformer_ctc/exp/epoch-33.pt', 'conformer_ctc/exp/epoch-34.pt']
2021-10-09 20:51:27,902 INFO [decode.py:653] Number of model parameters: 116147120
2021-10-09 20:51:30,958 INFO [decode.py:474] batch 0/?, cuts processed until now is 2
2021-10-09 20:51:44,208 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 31.75 GiB total capacity; 26.26 GiB already allocated; 1.46 GiB free; 29.13 GiB reserved in total by PyTorch)

2021-10-09 20:51:44,274 INFO [decode.py:732] num_arcs before pruning: 103742
2021-10-09 20:51:44,288 INFO [decode.py:739] num_arcs after pruning: 45225
2021-10-09 20:51:46,104 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.47 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:46,104 INFO [decode.py:732] num_arcs before pruning: 233253
2021-10-09 20:51:46,116 INFO [decode.py:739] num_arcs after pruning: 90555
2021-10-09 20:51:46,235 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:46,235 INFO [decode.py:732] num_arcs before pruning: 90555
2021-10-09 20:51:46,247 INFO [decode.py:739] num_arcs after pruning: 90414
2021-10-09 20:51:46,360 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:46,360 INFO [decode.py:732] num_arcs before pruning: 90414
2021-10-09 20:51:46,370 INFO [decode.py:739] num_arcs after pruning: 90366
2021-10-09 20:51:46,482 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:46,483 INFO [decode.py:732] num_arcs before pruning: 90366
2021-10-09 20:51:46,492 INFO [decode.py:739] num_arcs after pruning: 90366
2021-10-09 20:51:46,605 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:46,605 INFO [decode.py:732] num_arcs before pruning: 90366
2021-10-09 20:51:46,615 INFO [decode.py:739] num_arcs after pruning: 90366
2021-10-09 20:51:46,728 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:46,728 INFO [decode.py:732] num_arcs before pruning: 90366
2021-10-09 20:51:46,739 INFO [decode.py:739] num_arcs after pruning: 90366
2021-10-09 20:51:46,853 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:46,853 INFO [decode.py:732] num_arcs before pruning: 90366
2021-10-09 20:51:46,864 INFO [decode.py:739] num_arcs after pruning: 90366
2021-10-09 20:51:46,978 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:46,978 INFO [decode.py:732] num_arcs before pruning: 90366
2021-10-09 20:51:46,989 INFO [decode.py:739] num_arcs after pruning: 90366
2021-10-09 20:51:47,101 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:47,101 INFO [decode.py:732] num_arcs before pruning: 90366
2021-10-09 20:51:47,112 INFO [decode.py:739] num_arcs after pruning: 90366
2021-10-09 20:51:47,226 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:47,226 INFO [decode.py:732] num_arcs before pruning: 90366
2021-10-09 20:51:47,237 INFO [decode.py:739] num_arcs after pruning: 90366
2021-10-09 20:51:47,351 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:47,351 INFO [decode.py:732] num_arcs before pruning: 90366
2021-10-09 20:51:47,361 INFO [decode.py:739] num_arcs after pruning: 90366
2021-10-09 20:51:47,361 INFO [decode.py:743] Return None as the resulting lattice is too large
Traceback (most recent call last):
File "./conformer_ctc/decode.py", line 688, in
main()
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "./conformer_ctc/decode.py", line 664, in main
results_dict = decode_dataset(
File "./conformer_ctc/decode.py", line 447, in decode_dataset
hyps_dict = decode_one_batch(
File "./conformer_ctc/decode.py", line 365, in decode_one_batch
best_path_dict = rescore_with_attention_decoder(
File "/workspace/icefall/icefall/decode.py", line 812, in rescore_with_attention_decoder
nbest = Nbest.from_lattice(
File "/workspace/icefall/icefall/decode.py", line 209, in from_lattice
saved_scores = lattice.scores.clone()
AttributeError: 'NoneType' object has no attribute 'scores'
##################

@luomingshuang
Collaborator

I suggest you use CTC decoding to verify your model, following #71. If the results based on CTC decoding are normal, maybe the problem lies in your language model.
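
In rough terms, ctc-decoding intersects the network output with a CTC topology and takes the one-best path, so neither HLG nor the 4-gram G is involved. A hedged sketch (function and argument names are illustrative; see #71 and conformer_ctc/decode.py for the real code):

import k2

def ctc_one_best(ctc_topo: k2.Fsa, dense_fsa_vec: k2.DenseFsaVec) -> k2.Fsa:
    # Build the pruned lattice directly from the CTC topology (no HLG, no G).
    lattice = k2.intersect_dense_pruned(
        ctc_topo,
        dense_fsa_vec,
        search_beam=20.0,
        output_beam=8.0,
        min_active_states=30,
        max_active_states=10000,
    )
    # Take the one-best path; its labels are then mapped back to BPE tokens
    # and words by the recipe's tokenizer/lexicon.
    return k2.shortest_path(lattice, use_double_scores=True)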

@danpovey
Collaborator

danpovey commented Oct 9, 2021 via email

@Lzhang-hub
Author

Lzhang-hub commented Oct 11, 2021

I suggest you use CTC decoding to verify your model, following #71. If the results based on CTC decoding are normal, maybe the problem lies in your language model.

Following your advice, we used CTC decoding, but the decoding process gets stuck. The logs are as follows, and it has been in this state for more than 30 hours.
Besides, we tested it on both a Tesla V100 and an A100-SXM4-40GB. We are not sure whether it is related to the machine configuration; could you please provide your machine configuration?

#######
2021-10-09 23:30:54,225 INFO [decode.py:538] Decoding started
2021-10-09 23:30:54,225 INFO [decode.py:539] {'lm_dir': PosixPath('data/lm'), 'subsampling_factor': 4, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'feature_dim': 80, 'nhead': 8, 'attention_dim': 512, 'num_decoder_layers': 6, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 25, 'avg': 1, 'method': 'ctc-decoding', 'num_paths': 100, 'nbest_scale': 0.5, 'export': False, 'exp_dir': PosixPath('conformer_ctc/exp'), 'lang_dir': PosixPath('data/lang_bpe_5000'), 'full_libri': False, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-09 23:30:54,539 INFO [lexicon.py:113] Loading pre-compiled data/lang_bpe_5000/Linv.pt
2021-10-09 23:30:54,830 INFO [decode.py:549] device: cuda
2021-10-09 23:31:30,576 INFO [checkpoint.py:92] Loading checkpoint from conformer_ctc/exp/epoch-25.pt
2021-10-09 23:31:45,498 INFO [decode.py:653] Number of model parameters: 116147120
########
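
One way to compare machine configurations is to print the toolkit versions and GPU from the same environment where decoding runs; a small illustrative check (not part of the recipe):

import torch
import k2

print("torch:", torch.__version__)
print("k2:", k2.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))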

@csukuangfj
Collaborator

I am using NVIDIA Tesla V100 GPU with 32 GB RAM, Python 3.8 with torch 1.7.1

but the decoding progress is stuck.

We have never encountered this issue before.


Could you test the decoding script with a pre-trained model, provided by us (see https://icefall.readthedocs.io/en/latest/recipes/librispeech/conformer_ctc.html#pre-trained-model)?

$ cd egs/librispeech/ASR
$ mkdir tmp
$ cd tmp
$ git lfs install
$ git clone https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc
$ cd ..
$ ln -s $PWD/tmp/icefall_asr_librispeech_conformer_ctc/exp/pretrained.pt conformer_ctc/exp/epoch-99.pt

And you can pass --epoch 99 --avg 1 when running ./conformer_ctc/decode.py.

If it still gets stuck, there is a higher chance that there are some problems with your configuration.

@danpovey
Collaborator

danpovey commented Oct 11, 2021 via email

@danpovey
Collaborator

danpovey commented Oct 11, 2021 via email

@Lzhang-hub
Author

I am using NVIDIA Tesla V100 GPU with 32 GB RAM, Python 3.8 with torch 1.7.1

but the decoding progress is stuck.

We have never encountered this issue before.

Could you test the decoding script with a pre-trained model, provided by us (see https://icefall.readthedocs.io/en/latest/recipes/librispeech/conformer_ctc.html#pre-trained-model)?

$ cd egs/librispeech/ASR
$ mkdir tmp
$ cd tmp
$ git lfs install
$ git clone https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc
$ cd ..
$ ln -s $PWD/tmp/icefall_asr_librispeech_conformer_ctc/exp/pretrained.pt conformer_ctc/exp/epoch-99.pt

And you can pass --epoch 99 --avg 1 when running ./conformer_ctc/decode.py.

If it still gets stuck, there is a higher chance that there are some problems with your configuration.

I tested the decoding script with the pre-trained model and got the following error:

#########
2021-10-11 16:05:51,556 INFO [decode.py:538] Decoding started
2021-10-11 16:05:51,557 INFO [decode.py:539] {'lm_dir': PosixPath('data/lm'), 'subsampling_factor': 4, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'feature_dim': 80, 'nhead': 8, 'attention_dim': 512, 'num_decoder_layers': 6, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 99, 'avg': 1, 'method': 'ctc-decoding', 'num_paths': 100, 'nbest_scale': 0.5, 'export': False, 'exp_dir': PosixPath('conformer_ctc/exp'), 'lang_dir': PosixPath('data/lang_bpe_5000'), 'full_libri': False, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-11 16:05:52,080 INFO [lexicon.py:113] Loading pre-compiled data/lang_bpe_5000/Linv.pt
2021-10-11 16:05:52,495 INFO [decode.py:549] device: cuda:0
2021-10-11 16:05:56,562 INFO [checkpoint.py:92] Loading checkpoint from conformer_ctc/exp/epoch-99.pt
Traceback (most recent call last):
File "./conformer_ctc/decode.py", line 688, in
main()
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "./conformer_ctc/decode.py", line 633, in main
load_checkpoint(f"{params.exp_dir}/epoch-{params.epoch}.pt", model)
File "/workspace/icefall/icefall/checkpoint.py", line 93, in load_checkpoint
checkpoint = torch.load(filename, map_location="cpu")
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 595, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 764, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.
########

@csukuangfj
Collaborator

Please make sure that you have run git lfs install.

Also, you can check the file size of pretrained.pt, which should be 443 MB.
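
The "invalid load key, 'v'" error above is what torch.load reports when the checkpoint is still a git-lfs pointer, i.e. a tiny text stub starting with "version https://git-lfs...", rather than the real ~443 MB file. A small illustrative check (the path is an example):

from pathlib import Path

ckpt = Path("conformer_ctc/exp/epoch-99.pt")
size_mb = ckpt.stat().st_size / 1e6
with open(ckpt, "rb") as f:
    head = f.read(64)

if head.startswith(b"version https://git-lfs"):
    print(f"{ckpt} is still an LFS pointer ({size_mb:.3f} MB); run `git lfs pull` in the cloned repo.")
else:
    print(f"{ckpt} looks like a real checkpoint ({size_mb:.1f} MB).")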

@cdxie
Contributor

cdxie commented Oct 11, 2021

Please make sure that you have run git lfs install.

Also, you can check the file size of pretrained.pt, which should be 443 MB.

@danpovey @csukuangfj

We use the model tmp/icefall_asr_librispeech_conformer_ctc/exp/pretrained.pt and run: python -m pdb conformer_ctc/decode.py --epoch 99 --avg 1 --method ctc-decoding --max-duration 50.
Now this script is still stuck, and stepping through the code, the hang happens in "lattice = k2.intersect_dense_pruned(".

These are the debug steps:
####################
python -m pdb conformer_ctc/decode.py --epoch 99 --avg 1 --method ctc-decoding --max-duration 50

icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(19)<module>()
-> import argparse
(Pdb) b 443
Breakpoint 1 at icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py:443
(Pdb) c
2021-10-11 16:44:25,291 INFO [decode.py:538] Decoding started
2021-10-11 16:44:25,291 INFO [decode.py:539] {'lm_dir': PosixPath('data/lm'), 'subsampling_factor': 4, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'feature_dim': 80, 'nhead': 8, 'attention_dim': 512, 'num_decoder_layers': 6, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 99, 'avg': 1, 'method': 'ctc-decoding', 'num_paths': 100, 'nbest_scale': 0.5, 'export': False, 'exp_dir': PosixPath('conformer_ctc/exp'), 'lang_dir': PosixPath('data/lang_bpe'), 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 50, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-11 16:44:25,971 INFO [lexicon.py:113] Loading pre-compiled data/lang_bpe/Linv.pt
2021-10-11 16:44:26,111 INFO [decode.py:549] device: cuda
2021-10-11 16:44:31,376 INFO [checkpoint.py:92] Loading checkpoint from conformer_ctc/exp/epoch-99.pt
2021-10-11 16:44:32,270 INFO [decode.py:653] Number of model parameters: 116147120
icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(443)decode_dataset()
-> results = defaultdict(list)
(Pdb) n
icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(444)decode_dataset()
-> for batch_idx, batch in enumerate(dl):
(Pdb)
icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(445)decode_dataset()
-> texts = batch["supervisions"]["text"]
(Pdb)
icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(447)decode_dataset()
-> hyps_dict = decode_one_batch(
(Pdb) p texts
["THE PRESENT CHAPTERS CAN ONLY TOUCH UPON THE MORE SALIENT MOVEMENTS OF THE CIVIL WAR IN KANSAS WHICH HAPPILY WERE NOT SANGUINARY IF HOWEVER THE INDIVIDUAL AND MORE ISOLATED CASES OF BLOODSHED COULD BE DESCRIBED THEY WOULD SHOW A STARTLING AGGREGATE OF BARBARITY AND LOSS OF LIFE FOR OPINION'S SAKE", 'THEN HE RUSHED DOWN STAIRS INTO THE COURTYARD SHOUTING LOUDLY FOR HIS SOLDIERS AND THREATENING TO PATCH EVERYBODY IN HIS DOMINIONS IF THE SAILORMAN WAS NOT RECAPTURED', 'SIR HARRY TOWNE MISTER BARTLEY ALEXANDER THE AMERICAN ENGINEER', 'BUT AT THIS POINT IN THE RAPIDS IT WAS IMPOSSIBLE FOR HIM TO STAY DOWN', 'HAKON THERE SHALL BE YOUR CONSTANT COMPANION FRIEND FARMER']
.
.
.
(Pdb)
icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(447)decode_dataset()
-> hyps_dict = decode_one_batch(
(Pdb)
.
.
.
-> lattice = get_lattice(
(Pdb) s
icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(266)decode_one_batch()
-> nnet_output=nnet_output,
(Pdb)
icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(267)decode_one_batch()
-> decoding_graph=decoding_graph,
(Pdb)
icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(268)decode_one_batch()
-> supervision_segments=supervision_segments,
(Pdb)
--Call--
/workspace/icefall/icefall/decode.py(67)get_lattice()
-> def get_lattice(
(Pdb) s
/workspace/icefall/icefall/decode.py(114)get_lattice()
-> dense_fsa_vec = k2.DenseFsaVec(
(Pdb) n
/workspace/icefall/icefall/decode.py(115)get_lattice()
-> nnet_output,
(Pdb)
/workspace/icefall/icefall/decode.py(116)get_lattice()
-> supervision_segments,
(Pdb)
/workspace/icefall/icefall/decode.py(117)get_lattice()
-> allow_truncate=subsampling_factor - 1,
(Pdb)
/workspace/icefall/icefall/decode.py(114)get_lattice()
-> dense_fsa_vec = k2.DenseFsaVec(
(Pdb)
/workspace/icefall/icefall/decode.py(120)get_lattice()
-> lattice = k2.intersect_dense_pruned(
(Pdb) p dense_fsa_vec
<k2.dense_fsa_vec.DenseFsaVec object at 0x7fefd19c9a90>
(Pdb) n
/workspace/icefall/icefall/decode.py(121)get_lattice()
-> decoding_graph,
(Pdb)
/workspace/icefall/icefall/decode.py(122)get_lattice()
-> dense_fsa_vec,
(Pdb)
/workspace/icefall/icefall/decode.py(123)get_lattice()
-> search_beam=search_beam,
(Pdb)
/workspace/icefall/icefall/decode.py(124)get_lattice()
-> output_beam=output_beam,
(Pdb)
/workspace/icefall/icefall/decode.py(125)get_lattice()
-> min_active_states=min_active_states,
(Pdb)
/workspace/icefall/icefall/decode.py(126)get_lattice()
-> max_active_states=max_active_states,
(Pdb)
/workspace/icefall/icefall/decode.py(120)get_lattice()
-> lattice = k2.intersect_dense_pruned(
(Pdb)

################

and ctrl-c when it gets stuck:
#######
(Pdb)

/workspace/icefall/icefall/decode.py(120)get_lattice()
-> lattice = k2.intersect_dense_pruned(
(Pdb)

^C
Program interrupted. (Use 'cont' to resume).
--Call--

/opt/conda/lib/python3.8/bdb.py(321)set_trace()
-> def set_trace(self, frame=None):
(Pdb)
Process Process-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 171, in _worker_loop
r = index_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 107, in get
if not self._poll(timeout):
File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
r = wait([self], timeout)
File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/opt/conda/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/opt/conda/lib/python3.8/pdb.py", line 194, in sigint_handler
self.set_trace(frame)
File "/opt/conda/lib/python3.8/bdb.py", line 321, in set_trace
def set_trace(self, frame=None):
File "/opt/conda/lib/python3.8/bdb.py", line 90, in trace_dispatch
return self.dispatch_call(frame, arg)
File "/opt/conda/lib/python3.8/bdb.py", line 135, in dispatch_call
if self.quitting: raise BdbQuit
bdb.BdbQuit

###########

python3 -m k2.version

Collecting environment information...
k2 version: 1.9
Build type: Release
Git SHA1: 8694fee66f564cf750792cb30c639d3cc404c18b
Git date: Thu Sep 30 15:35:28 2021
Cuda used to build k2: 11.0
cuDNN used to build k2: 8.0.4
Python version used to build k2: 3.8
OS used to build k2:
CMake version: 3.18.0
GCC version: 7.5.0
CMAKE_CUDA_FLAGS: --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas --compiler-options -Wno-strict-overflow
CMAKE_CXX_FLAGS: -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-strict-overflow
PyTorch version used to build k2: 1.7.1
PyTorch is using Cuda: 11.0
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False

python --version
Python 3.8.5
torch.version
'1.7.1'

nvidia-smi

Mon Oct 11 17:38:22 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02 Driver Version: 440.118.02 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:3B:00.0 Off | 0 |
| N/A 32C P0 40W / 250W | 15620MiB / 16160MiB | 19% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:D8:00.0 Off | 0 |
| N/A 30C P0 35W / 250W | 1503MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

@cdxie
Copy link
Contributor

cdxie commented Oct 11, 2021

@Lzhang-hub could you share the "nvidia-smi" output for the Tesla V100 and the A100-SXM4-40GB?

@Lzhang-hub
Copy link
Author

Lzhang-hub commented Oct 11, 2021

@Lzhang-hub could you share the "nvidia-smi" output for the Tesla V100 and the A100-SXM4-40GB?

Tesla V100
image
A100-SXM4-40GB
image

@cdxie
Copy link
Contributor

cdxie commented Oct 21, 2021

... what I suspect could be happening is that, near the beginning of the file, the dynamic beams are not decreasing rapidly enough to enforce the max_active constraint (they are supposed to change dynamically to do this). I may have to find a better way to update them. If this is the case, changing the max_active would make no difference, but decreasing the search_beam, say from 20 to 15, might.

OK, we will now try your suggestions.
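For concreteness, this is how we understand "decreasing the search_beam" at the Python level: the values are passed straight to k2.intersect_dense_pruned, as the pdb trace above shows. A minimal, self-contained sketch with toy inputs (the specific numbers are only examples, not a recommendation):

import torch
import k2

# Toy inputs, just to show where the pruning knobs enter the decoding call;
# the real decode.py builds dense_fsa_vec from nnet_output and uses
# H = k2.ctc_topo(...) or HLG as the decoding graph.
device = torch.device("cpu")
H = k2.ctc_topo(max_token=10, modified=False, device=device)
log_probs = torch.randn(1, 50, 11).log_softmax(dim=-1)  # (N, T, C)
supervision_segments = torch.tensor([[0, 0, 50]], dtype=torch.int32)
dense_fsa_vec = k2.DenseFsaVec(log_probs, supervision_segments)

lattice = k2.intersect_dense_pruned(
    H,
    dense_fsa_vec,
    search_beam=15.0,        # down from the 20 used in our runs
    output_beam=8.0,
    min_active_states=30,
    max_active_states=7000,  # down from 10000
)
print("arcs in the lattice:", lattice.num_arcs)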

@Lzhang-hub
Copy link
Author

It would be nice if you could run that from gdb, and when it crashes (perhaps "catch throw" will catch the error), you could find out, in PruneTimeRange, what is begin_t and end_t. Also you can try a proposed fix: in intersect_dense_pruned.cu, at line 146, you could try replacing:

   dynamic_beams_(a_fsas.Context(), b_fsas.shape.Dim0(), search_beam)

with:

   dynamic_beams_(a_fsas.Context(), b_fsas.shape.Dim0(), std::max<float>(search_beam * 0.5, output_beam))

and see if this helps. Obviously you'll have to recompile.

This is the result of running under GDB: in PruneTimeRange, begin_t=30 and end_t=60, but we don't know what that tells us.
#70 (comment)

@danpovey
Copy link
Collaborator

Another thing you could that would help debug this is, in intersect_dense_pruned.cu, around line 599 (it may have changed),
just before " return cutoffs; ", to add the statement:
K2_LOG(INFO) << "Row-splits=" << arc_end_scores.RowSplits(1).Data() << ", cutoffs=" << cutoffs;
That may create quite a bit of output, but it will be useful.

@csukuangfj
Copy link
Collaborator

arc_end_scores.RowSplits(1).Data() is a data pointer, should it be arc_end_scores.RowSplits(1)?

@danpovey
Copy link
Collaborator

danpovey commented Oct 21, 2021 via email

@cdxie
Copy link
Contributor

cdxie commented Oct 22, 2021

Another thing you could that would help debug this is, in intersect_dense_pruned.cu, around line 599 (it may have changed), just before " return cutoffs; ", to add the statement: K2_LOG(INFO) << "Row-splits=" << arc_end_scores.RowSplits(1).Data() << ", cutoffs=" << cutoffs; That may create quite a bit of output, but it will be useful.

@danpovey @csukuangfj
We did two things that you suggested:

  1. Changing: dynamic_beams_(a_fsas.Context(), b_fsas.shape.Dim0(), std::max(search_beam * 0.5, output_beam))
  2. Adding K2_LOG(INFO) << "Row-splits=" << arc_end_scores.RowSplits(1).Data() << ", cutoffs=" << cutoffs;

These are the gdb logs:
no_gdb.txt

CUDA out of memory still occurs.

@csukuangfj
Copy link
Collaborator

Are you using data/lang_bpe and data/lm from https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc

Just want to check that you are using the correct files.

@pkufool
Copy link
Collaborator

pkufool commented Oct 22, 2021

@cdxie Would you mind emailing me your QQ or WeChat? It is more efficient to discuss this problem there. We can report the solution here once it is fixed. You can find my email in my GitHub profile, thanks.

@danpovey
Copy link
Collaborator

OK, your output is not what I expected. The numbers of arcs being printed out are quite small. The large thing being allocated implies that there are about 100 million arcs, but the numbers you are printing indicate that there should be no more than 100,000 arcs active.

Perhaps printing out old_states_offsets and old_arcs_offsets (after their data is written to) in PruneTimeRange, would clarify things.

@pzelasko
Copy link
Collaborator

pzelasko commented Oct 22, 2021

Just wanted to chime in — I’m also seeing issues with CUDA memory usage in decoding. I had to set max_duration=5 to make ctc-decoding work. I’m also using a V100 GPU with 32GB RAM.

@danpovey
Copy link
Collaborator

danpovey commented Oct 22, 2021 via email

@pkufool
Copy link
Collaborator

pkufool commented Oct 22, 2021

I will look into the intersect_dense_pruned code and try to figure it out.

@cdxie
Copy link
Contributor

cdxie commented Oct 22, 2021

@cdxie Would you mind emailing me your QQ or WeChat? It is more efficient to discuss this problem there. We can report the solution here once it is fixed. You can find my email in my GitHub profile, thanks.

@pkufool I have sent my Wechat number to you

@csukuangfj
Copy link
Collaborator

Turns out I can reproduce the issue using the pre-trained model downloaded from https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc


When I created the pull request #58 supporting CTC decoding, the model that I used for testing was trained by myself, not the one downloaded from hugging face.

I just re-tested CTC decoding using the pre-trained model from https://github.com/csukuangfj/icefall-asr-conformer-ctc-bpe-500 and everything works fine.

The commands for reproducing are

cd egs/librispeech/ASR
git clone https://github.com/csukuangfj/icefall-asr-conformer-ctc-bpe-500
cd icefall-asr-conformer-ctc-bpe-500/
git lfs pull
cd ..
ln -s $PWD/icefall-asr-conformer-ctc-bpe-500/exp/pretrained.pt conformer_ctc/exp/epoch-100.pt
./conformer_ctc/decode.py --epoch 100 --avg 1 --max-duration 300 --method ctc-decoding --lang-dir icefall-asr-conformer-ctc-bpe-500/data/lang_bpe_500

The decoding logs are

2021-10-22 20:36:13,099 INFO [decode.py:540] Decoding started
2021-10-22 20:36:13,099 INFO [decode.py:541] {'lm_dir': PosixPath('data/lm'), 'subsampling_factor': 4, 'vgg_frontend': False, 'use_fe
at_batchnorm': True, 'feature_dim': 80, 'nhead': 8, 'attention_dim': 512, 'num_decoder_layers': 6, 'search_beam': 20, 'output_beam':
8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'env_info': {'k2-version': '1.9', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '2537a3fa927671faec5e4ca56b8b151806356324', 'k2-git-date': 'Fri Oct 15 07:40:41 2021', 'lhotse-version': '0.11.0.dev+missing.version.file', 'torch-cuda-available': True, 'torch-cuda-version': '10.1', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '712ead8-clean', 'icefall-git-date': 'Fri Oct 22 19:52:25 2021', 'icefall-path': '/ceph-fj/open-source/icefall-fix-ctc', 'k2-path': '/ceph-fj/open-source/k2-ali-ctc-new/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-fj/open-source/lhotse-ali-ctc-new/lhotse/__init__.py'}, 'epoch': 100, 'avg': 1, 'method': 'ctc-decoding', 'num_paths': 100, 'nbest_scale': 0.5, 'export': False, 'exp_dir': PosixPath('conformer_ctc/exp'), 'lang_dir': PosixPath('icefall-asr-conformer-ctc-bpe-500/data/lang_bpe_500'), 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-22 20:36:13,450 INFO [lexicon.py:176] Loading pre-compiled icefall-asr-conformer-ctc-bpe-500/data/lang_bpe_500/Linv.pt
2021-10-22 20:36:13,499 INFO [decode.py:551] device: cuda:0
2021-10-22 20:36:18,229 INFO [checkpoint.py:92] Loading checkpoint from conformer_ctc/exp/epoch-100.pt
2021-10-22 20:36:18,923 INFO [decode.py:655] Number of model parameters: 109226120
2021-10-22 20:36:19,972 INFO [decode.py:476] batch 0/?, cuts processed until now is 62
2021-10-22 20:36:53,690 INFO [decode.py:497] The transcripts are stored in conformer_ctc/exp/recogs-test-clean-ctc-decoding.txt
2021-10-22 20:36:53,767 INFO [utils.py:469] [test-clean-ctc-decoding] %WER 3.01% [1580 / 52576, 137 ins, 123 del, 1320 sub ]
2021-10-22 20:36:54,002 INFO [decode.py:509] Wrote detailed error stats to conformer_ctc/exp/errs-test-clean-ctc-decoding.txt
2021-10-22 20:36:54,003 INFO [decode.py:525]
For test-clean, WER of different settings are:
ctc-decoding    3.01    best for test-clean

2021-10-22 20:36:54,801 INFO [decode.py:476] batch 0/?, cuts processed until now is 70
2021-10-22 20:37:28,191 INFO [decode.py:497] The transcripts are stored in conformer_ctc/exp/recogs-test-other-ctc-decoding.txt
2021-10-22 20:37:28,268 INFO [utils.py:469] [test-other-ctc-decoding] %WER 7.70% [4032 / 52343, 381 ins, 328 del, 3323 sub ]
2021-10-22 20:37:28,493 INFO [decode.py:509] Wrote detailed error stats to conformer_ctc/exp/errs-test-other-ctc-decoding.txt
2021-10-22 20:37:28,494 INFO [decode.py:525]
For test-other, WER of different settings are:
ctc-decoding    7.7     best for test-other

2021-10-22 20:37:28,495 INFO [decode.py:683] Done!

The only difference between the model trained by me and the pre-trained model from hugging face is that the vocab size is changed from 5000 to 500.

I suspect the OOM is caused by the large size of the CTC topo.
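To put rough numbers on that suspicion, here is a minimal sketch (CPU only, no model needed) that just compares the arc counts of the standard and modified topologies; the max_token values are assumed to correspond to the two BPE sizes discussed here:

import torch
import k2

# The standard CTC topo grows roughly quadratically with the vocabulary size,
# the modified one roughly linearly, so the 5000-token standard topo is about
# 100x larger than the 500-token one (building it below needs a few hundred MB
# of host memory).
device = torch.device("cpu")
for vocab_size in (500, 5000):
    std = k2.ctc_topo(max_token=vocab_size - 1, modified=False, device=device)
    mod = k2.ctc_topo(max_token=vocab_size - 1, modified=True, device=device)
    print(vocab_size, "standard:", std.num_arcs, "modified:", mod.num_arcs)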

If I switch to the modified CTC topo by changing

H = k2.ctc_topo(
    max_token=max_token_id,
    modified=False,
    device=device,
)

to

H = k2.ctc_topo(
    max_token=max_token_id,
    modified=True,
    device=device,
)

then ctc-decoding works with the model downloaded from hugging face.

The decoding logs are

$ ./conformer_ctc/decode.py --epoch 99 --avg 1 --max-duration 300 --method ctc-decoding --lang-dir ./tmp/icefall_asr_librispeech_conformer_ctc/data/lang_bpe/
2021-10-22 20:41:48,616 INFO [decode.py:540] Decoding started
2021-10-22 20:41:48,617 INFO [decode.py:541] {'lm_dir': PosixPath('data/lm'), 'subsampling_factor': 4, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'feature_dim': 80, 'nhead': 8, 'attention_dim': 512, 'num_decoder_layers': 6, 'search_beam': 20, 'output_beam':8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'env_info': {'k2-version': '1.9', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '2537a3fa927671faec5e4ca56b8b151806356324', 'k2-git-date': 'Fri Oct 15 07:40:41 2021', 'lhotse-version': '0.11.0.dev+missing.version.file', 'torch-cuda-available': True, 'torch-cuda-version': '10.1', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '712ead8-dirty', 'icefall-git-date': 'Fri Oct 22 19:52:25 2021', 'icefall-path': '/ceph-fj/open-source/icefall-fix-ctc', 'k2-path': '/ceph-fj/open-source/k2-ali-ctc-new/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-fj/open-source/lhotse-ali-ctc-new/lhotse/__init__.py'}, 'epoch': 99, 'avg': 1, 'method': 'ctc-decoding', 'num_paths': 100, 'nbest_scale': 0.5, 'export': False, 'exp_dir': PosixPath('conformer_ctc/exp'), 'lang_dir': PosixPath('tmp/icefall_asr_librispeech_conformer_ctc/data/lang_bpe'), 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler':True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-22 20:41:48,984 INFO [lexicon.py:176] Loading pre-compiled tmp/icefall_asr_librispeech_conformer_ctc/data/lang_bpe/Linv.pt
2021-10-22 20:41:49,031 INFO [decode.py:551] device: cuda:0
2021-10-22 20:41:54,053 INFO [checkpoint.py:92] Loading checkpoint from conformer_ctc/exp/epoch-99.pt
2021-10-22 20:41:54,880 INFO [decode.py:655] Number of model parameters: 116147120
2021-10-22 20:41:55,788 INFO [decode.py:476] batch 0/?, cuts processed until now is 62
2021-10-22 20:42:17,664 INFO [decode.py:497] The transcripts are stored in conformer_ctc/exp/recogs-test-clean-ctc-decoding.txt
2021-10-22 20:42:17,738 INFO [utils.py:469] [test-clean-ctc-decoding] %WER 3.60% [1891 / 52576, 203 ins, 128 del, 1560 sub ]
2021-10-22 20:42:17,958 INFO [decode.py:509] Wrote detailed error stats to conformer_ctc/exp/errs-test-clean-ctc-decoding.txt
2021-10-22 20:42:17,958 INFO [decode.py:525]
For test-clean, WER of different settings are:
ctc-decoding    3.6     best for test-clean

2021-10-22 20:42:18,562 INFO [decode.py:476] batch 0/?, cuts processed until now is 70
2021-10-22 20:42:39,817 INFO [decode.py:497] The transcripts are stored in conformer_ctc/exp/recogs-test-other-ctc-decoding.txt
2021-10-22 20:42:39,893 INFO [utils.py:469] [test-other-ctc-decoding] %WER 7.95% [4162 / 52343, 418 ins, 344 del, 3400 sub ]
2021-10-22 20:42:40,128 INFO [decode.py:509] Wrote detailed error stats to conformer_ctc/exp/errs-test-other-ctc-decoding.txt
2021-10-22 20:42:40,129 INFO [decode.py:525]
For test-other, WER of different settings are:
ctc-decoding    7.95    best for test-other

2021-10-22 20:42:40,129 INFO [decode.py:683] Done!

Note: My model, i.e., the one from https://github.com/csukuangfj/icefall-asr-conformer-ctc-bpe-500, does not work well with modified CTC topo. The WERs for ctc-decoding using modified CTC topo degrade rapidly, see below:

2021-10-22 20:44:51,945 INFO [lexicon.py:176] Loading pre-compiled icefall-asr-conformer-ctc-bpe-500/data/lang_bpe_500/Linv.pt
2021-10-22 20:44:51,994 INFO [decode.py:551] device: cuda:0
2021-10-22 20:44:56,796 INFO [checkpoint.py:92] Loading checkpoint from conformer_ctc/exp/epoch-100.pt
2021-10-22 20:44:57,485 INFO [decode.py:655] Number of model parameters: 109226120
2021-10-22 20:44:58,294 INFO [decode.py:476] batch 0/?, cuts processed until now is 62
2021-10-22 20:45:17,570 INFO [decode.py:497] The transcripts are stored in conformer_ctc/exp/recogs-test-clean-ctc-decoding.txt
2021-10-22 20:45:17,653 INFO [utils.py:469] [test-clean-ctc-decoding] %WER 15.39% [8092 / 52576, 665 ins, 120 del, 7307 sub ]
2021-10-22 20:45:17,965 INFO [decode.py:509] Wrote detailed error stats to conformer_ctc/exp/errs-test-clean-ctc-decoding.txt
2021-10-22 20:45:17,966 INFO [decode.py:525]
For test-clean, WER of different settings are:
ctc-decoding    15.39   best for test-clean

2021-10-22 20:45:18,477 INFO [decode.py:476] batch 0/?, cuts processed until now is 70
2021-10-22 20:45:36,718 INFO [decode.py:497] The transcripts are stored in conformer_ctc/exp/recogs-test-other-ctc-decoding.txt
2021-10-22 20:45:36,802 INFO [utils.py:469] [test-other-ctc-decoding] %WER 17.83% [9331 / 52343, 969 ins, 308 del, 8054 sub ]
2021-10-22 20:45:37,058 INFO [decode.py:509] Wrote detailed error stats to conformer_ctc/exp/errs-test-other-ctc-decoding.txt
2021-10-22 20:45:37,059 INFO [decode.py:525]
For test-other, WER of different settings are:
ctc-decoding    17.83   best for test-other

2021-10-22 20:45:37,059 INFO [decode.py:683] Done!

@csukuangfj
Copy link
Collaborator

csukuangfj commented Oct 22, 2021

As suggested by Piotr in #70 (comment),
reducing --max-duration to 5 works with the model downloaded from hugging face with the standard CTC topo:

./conformer_ctc/decode.py --epoch 99 --avg 1 --max-duration 5 --method ctc-decoding --lang-dir ./tmp/icefall_asr_librispeech_conformer_ctc/data/lang_bpe/

2021-10-22 20:58:15,988 INFO [lexicon.py:176] Loading pre-compiled tmp/icefall_asr_librispeech_conformer_ctc/data/lang_bpe/Linv.pt
2021-10-22 20:58:16,033 INFO [decode.py:551] device: cuda:0
2021-10-22 20:58:21,038 INFO [checkpoint.py:92] Loading checkpoint from conformer_ctc/exp/epoch-99.pt
2021-10-22 20:58:21,716 INFO [decode.py:655] Number of model parameters: 116147120
/ceph-fj/fangjun/open-source/lhotse-ali-ctc-new/lhotse/dataset/sampling/single_cut.py:237: UserWarning: The first cut drawn in batch
collection violates the max_frames, max_cuts, or max_duration constraints - we'll return it anyway. Consider increasing max_frames/ma
x_cuts/max_duration.
  warnings.warn(
2021-10-22 20:58:22,635 INFO [decode.py:476] batch 0/?, cuts processed until now is 1
2021-10-22 20:59:51,225 INFO [decode.py:476] batch 100/?, cuts processed until now is 101
2021-10-22 21:01:10,508 INFO [decode.py:476] batch 200/?, cuts processed until now is 201
[2021-10-22 21:02:24,409 INFO [decode.py:476] batch 300/?, cuts processed until now is 301
2021-10-22 21:03:37,517 INFO [decode.py:476] batch 400/?, cuts processed until now is 401
2021-10-22 21:04:52,246 INFO [decode.py:476] batch 500/?, cuts processed until now is 501
2021-10-22 21:06:00,438 INFO [decode.py:476] batch 600/?, cuts processed until now is 601
2021-10-22 21:07:05,814 INFO [decode.py:476] batch 700/?, cuts processed until now is 701
2021-10-22 21:08:13,473 INFO [decode.py:476] batch 800/?, cuts processed until now is 801
2021-10-22 21:09:15,777 INFO [decode.py:476] batch 900/?, cuts processed until now is 901
...

@cdxie
Copy link
Contributor

cdxie commented Oct 22, 2021

Are you using data/lang_bpe and data/lm from https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc

Just want to check that you are using the correct files.

Using the data/lang_bpe and data/lm from https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc, the OOM still occurs.
We can try:
1. changing the vocab size from 5000 to 500
2. the modified CTC topo

@cdxie
Copy link
Contributor

cdxie commented Oct 22, 2021

@csukuangfj @pkufool

We use the same resources, the same parameters, and the same GPU device as you, but we cannot decode completely, especially with --method attention-decoder | ctc-decoding and --max-duration 300.

I think this is the most important thing to clarify, and we are happy to help with verification.

@danpovey
Copy link
Collaborator

OK. I suspect there may be some logic in the code that is not correct, in that it might think it is measuring arcs but really be measuring states, or something like that.

@csukuangfj
Copy link
Collaborator

Are you using data/lang_bpe and data/lm from https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc

Just want to check that you are using the correct files.

Using the data/lang_bpe and data/lm from https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc, the OOM still occurs.

We can try:

1. changing the vocab size from 5000 to 500

2. the modified CTC topo

Have you tried that?

@danpovey
Copy link
Collaborator

Guys, I think I see the issue: the pruning beams are on min/max active states, but it's the arcs, not states, that are getting out of control. There are a few possible things we could do:

  • Change the interface of IntersectDensePruned to allow a max-arcs (possibly in addition to max-states?)
  • Reduce the beam and/or max-active for decoding when we are using "correct" CTC topo with large BPE vocab size
  • Train models with modified CTC topo so this issue doesn't arise (does this affect WER?)
    .. but do we really even need to decode with the CTC topo? I think if we use a full decoding graph we wouldn't get this problem. Perhaps someone can comment on these options?
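To be concrete about the last option: by "full decoding graph" I mean the precompiled HLG rather than the bare ctc_topo, which is what decode.py already uses for the non-ctc-decoding methods. A rough sketch, reusing the paths from this thread (illustrative only):

import torch
import k2

device = torch.device("cuda", 0)

# Load the precompiled HLG shipped with the model instead of building
# H = k2.ctc_topo(max_token_id); the rest of the decoding code
# (get_lattice / k2.intersect_dense_pruned) stays unchanged.
d = torch.load(
    "tmp/icefall_asr_librispeech_conformer_ctc/data/lang_bpe/HLG.pt",
    map_location="cpu",
)
HLG = k2.Fsa.from_dict(d).to(device)
# lattice = k2.intersect_dense_pruned(HLG, dense_fsa_vec, search_beam=20,
#                                     output_beam=8, min_active_states=30,
#                                     max_active_states=10000)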

@cdxie
Copy link
Contributor

cdxie commented Oct 23, 2021

Are you using data/lang_bpe and data/lm from https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc

Just want to check that you are using the correct files.

Using the data/lang_bpe and data/lm from https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc, the OOM still occurs.
We can try:
1. changing the vocab size from 5000 to 500
2. the modified CTC topo

Have you tried that?

@csukuangfj we tried the modified CTC topo; ctc-decoding with --max-duration 300 now works, using the https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc resources.

whole-lattice-rescoring and attention-decoder still fail during decoding, maybe caused by loading G_4_gram.fst.txt.
################################################
Changing the CTC topo:
H = k2.ctc_topo(
    max_token=max_token_id,
    modified=True,
    device=device,
)
./conformer_ctc/decode.py --epoch 99 --avg 1 --max-duration 300 --method ctc-decoding --lang-dir data/lang_bpe_5000

2021-10-23 15:02:15,830 INFO [decode.py:538] Decoding started
2021-10-23 15:02:15,831 INFO [decode.py:539] {'lm_dir': PosixPath('data/lm'), 'subsampling_factor': 4, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'feature_dim': 80, 'nhead': 8, 'attention_dim': 512, 'num_decoder_layers': 6, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 99, 'avg': 1, 'method': 'ctc-decoding', 'num_paths': 100, 'nbest_scale': 0.5, 'export': False, 'exp_dir': PosixPath('conformer_ctc/exp'), 'lang_dir': PosixPath('data/lang_bpe_5000'), 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-23 15:02:16,708 INFO [lexicon.py:113] Loading pre-compiled data/lang_bpe_5000/Linv.pt
2021-10-23 15:02:17,545 INFO [decode.py:549] device: cuda:0
2021-10-23 15:02:23,008 INFO [checkpoint.py:92] Loading checkpoint from conformer_ctc/exp/epoch-99.pt
2021-10-23 15:02:40,474 INFO [decode.py:653] Number of model parameters: 116147120
2021-10-23 15:03:01,674 INFO [decode.py:474] batch 0/?, cuts processed until now is 62
2021-10-23 15:04:16,588 INFO [decode.py:495] The transcripts are stored in conformer_ctc/exp/recogs-test-clean-ctc-decoding.txt
2021-10-23 15:04:16,839 INFO [utils.py:334] [test-clean-ctc-decoding] %WER 3.58% [1883 / 52576, 196 ins, 128 del, 1559 sub ]
2021-10-23 15:04:17,197 INFO [decode.py:507] Wrote detailed error stats to conformer_ctc/exp/errs-test-clean-ctc-decoding.txt
2021-10-23 15:04:17,479 INFO [decode.py:523]
For test-clean, WER of different settings are:
ctc-decoding 3.58 best for test-clean

2021-10-23 15:04:37,353 INFO [decode.py:474] batch 0/?, cuts processed until now is 70
2021-10-23 15:06:07,180 INFO [decode.py:495] The transcripts are stored in conformer_ctc/exp/recogs-test-other-ctc-decoding.txt
2021-10-23 15:06:08,295 INFO [utils.py:334] [test-other-ctc-decoding] %WER 7.95% [4161 / 52343, 425 ins, 359 del, 3377 sub ]
2021-10-23 15:06:09,303 INFO [decode.py:507] Wrote detailed error stats to conformer_ctc/exp/errs-test-other-ctc-decoding.txt
2021-10-23 15:06:10,161 INFO [decode.py:523]
For test-other, WER of different settings are:
ctc-decoding 7.95 best for test-other

2021-10-23 15:06:10,162 INFO [decode.py:681] Done!

@csukuangfj
Copy link
Collaborator

maybe caused by loading G_4_gram.fst.txt

Could you show us the error log?

@cdxie
Copy link
Contributor

cdxie commented Oct 23, 2021

maybe caused by loading G_4_gram.fst.txt

Could you show us the error log?

@csukuangfj the error is OOM; that's what we mentioned the first time, and then you asked us to use ctc-decoding to see whether there was any problem with the code. So, should we go back to the first error?

@csukuangfj
Copy link
Collaborator

Do you mean the log in #70 (comment) ?

2021-10-14 11:02:43,243 INFO [pretrained.py:236] device: cuda:0
2021-10-14 11:02:43,243 INFO [pretrained.py:238] Creating model
2021-10-14 11:02:47,756 INFO [pretrained.py:255] Constructing Fbank computer
2021-10-14 11:02:47,758 INFO [pretrained.py:265] Reading sound files: ['./tmp/icefall_asr_librispeech_conformer_ctc/test_wavs/1089-134686-0001.flac', './tmp/icefall_asr_librispeech_conformer_ctc/test_wavs/1221-135766-0001.flac', './tmp/icefall_asr_librispeech_conformer_ctc/test_wavs/1221-135766-0002.flac']
2021-10-14 11:02:47,969 INFO [pretrained.py:271] Decoding started
2021-10-14 11:02:48,845 INFO [pretrained.py:327] Loading HLG from ./tmp/icefall_asr_librispeech_conformer_ctc/data/lang_bpe/HLG.pt
Segmentation fault (core dumped)

Could you figure out which line causes the segfault? As your code is in Python, you can use pdb to run the script step by step.
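As a small addition: Python's built-in faulthandler module can also help locate a segfault without stepping through pdb, since it prints the Python traceback when the interpreter receives SIGSEGV. A minimal sketch:

# Put these two lines at the top of pretrained.py (or run the script with
# `python -X faulthandler ...`); on the crash, the last frame printed shows
# which Python line triggered it, e.g. the HLG loading above.
import faulthandler

faulthandler.enable()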

@danpovey
Copy link
Collaborator

danpovey commented Oct 23, 2021 via email

@csukuangfj
Copy link
Collaborator

but do we really even need to decode with the CTC topo? I think if we use a full decoding graph we wouldn't get this problem. Perhaps someone can comment on these options?

I think one advantage of CTC decoding is that it is super fast.
