Hi Quan,
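I am trying to train an ASR system on WSJ with your toolkit, using a setup similar to the one in your ICASSP paper on the swbd corpus. With plain fbank features, training and decoding both work fine. However, when I apply CNN downsampling to the fbank features before the decoder, training still works fine, but decoding triggers many CUDA assertion failures, for example:
data1/tools/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = false]: block: [64,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
The stack trace is as follows: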
  File "translate.py", line 364, in <module>
    main()
  File "translate.py", line 227, in main
    predBatch, predScore, predLength, goldScore, numGoldWords,allGoldScores = translator.translate_asr(srcBatch, tgtBatch)
  File "/data1/NMTGMinor/onmt/EnsembleTranslator.py", line 464, in translate_asr
    for n in range(self.opt.n_best)]
  File "/data1/NMTGMinor/onmt/EnsembleTranslator.py", line 464, in <listcomp>
    for n in range(self.opt.n_best)]
  File "/data1/NMTGMinor/onmt/EnsembleTranslator.py", line 273, in build_target_tokens
    tokens = self.tgt_dict.convertToLabels(pred, onmt.Constants.EOS)
  File "/data1/NMTGMinor/onmt/Dict.py", line 166, in convertToLabels
    print('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', idx)
  File "/data1/tools/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 71, in __repr__
    return torch._tensor_str._str(self)
  File "/data1/tools/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 286, in _str
    tensor_str = _tensor_str(self, indent)
  File "/data1/tools/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 201, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/data1/tools/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 83, in __init__
    value_str = '{}'.format(value)
  File "/data1/tools/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 387, in __format__
    return self.item().__format__(format_spec)
RuntimeError: CUDA error: device-side assert triggered
Do you have any idea what causes this error? The model converges to a loss similar to the one obtained with plain fbank features, so I think a problem with the model itself is unlikely. Thanks.
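For readers hitting the same message: a device-side assert on an index-select kernel means that some index was not smaller than the size of the dimension being indexed. CUDA kernels run asynchronously, so the Python traceback can point at a later line (here, printing a tensor) rather than at the failing lookup; running with the environment variable CUDA_LAUNCH_BLOCKING=1 makes the error surface at the call that caused it. A minimal sketch of an explicit range check on plain Python lists follows. The helper name `check_indices` and the toy vocabulary are hypothetical and are not part of the toolkit.

```python
def check_indices(indices, dim_size, name="index"):
    """Raise IndexError naming the first index outside [0, dim_size)."""
    for pos, idx in enumerate(indices):
        if not 0 <= idx < dim_size:
            raise IndexError(f"{name}[{pos}] = {idx} is outside [0, {dim_size})")
    return indices


# Hypothetical data: a 5-entry vocabulary and predicted token ids.
vocab = ["<pad>", "<s>", "</s>", "hello", "world"]

# Valid ids pass through unchanged and can be mapped to labels.
good = check_indices([1, 3, 4, 2], len(vocab))
labels = [vocab[i] for i in good]

# An out-of-range id is reported with its position instead of a CUDA assert.
try:
    check_indices([1, 3, 7], len(vocab), name="pred")
except IndexError as err:
    message = str(err)
```

Running a check like this on a CPU copy of the indices, just before the lookup that fails, would show which prediction is out of range and at which position.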
Hi Quan,
Thanks for the reply. I am currently using batch_size=1 and have tracked the problem down: when I print the decoder_output at the first step, only the first beam has normal log-posteriors; the log-probabilities of all other beams at this step are NaN.
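The per-beam check described above can be sketched as follows. This is only an illustration on nested Python lists with made-up scores; the helper name `nan_beams` is hypothetical, and with real tensors the same test could be written with `torch.isnan`.

```python
import math


def nan_beams(log_probs):
    """Return the indices of beams whose score row contains at least one NaN."""
    return [
        beam
        for beam, row in enumerate(log_probs)
        if any(math.isnan(score) for score in row)
    ]


nan = float("nan")

# Made-up first-step scores for 3 beams over a 3-word vocabulary:
# only beam 0 holds finite log-probabilities.
step_scores = [
    [-0.1, -2.5, -4.0],
    [nan, nan, nan],
    [nan, nan, nan],
]

bad = nan_beams(step_scores)
```

Calling a check like this after every decoding step would pinpoint the first step and beam at which NaN values appear, before they reach the index lookup.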