translator leads to "CUDA error: device-side assert triggered" when using cnn downsampling for ASR #13

guangsenw · 2019-09-23T07:37:29Z

Hi Quan,
I was trying to train an ASR system on WSJ with your toolkit, using a setup similar to the one in your ICASSP paper on the swbd corpus. With plain fbank features, training and decoding both work fine. However, when I apply CNN downsampling to the fbank features before the decoder, training still works, but decoding produces a lot of assertion failures from CUDA, for example:

data1/tools/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = false]: block: [64,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed.

The stack trace is below:

  File "translate.py", line 364, in <module>
    main()
  File "translate.py", line 227, in main
    predBatch, predScore, predLength, goldScore, numGoldWords,allGoldScores = translator.translate_asr(srcBatch, tgtBatch)
  File "/data1/NMTGMinor/onmt/EnsembleTranslator.py", line 464, in translate_asr
    for n in range(self.opt.n_best)]
  File "/data1/NMTGMinor/onmt/EnsembleTranslator.py", line 464, in <listcomp>
    for n in range(self.opt.n_best)]
  File "/data1/NMTGMinor/onmt/EnsembleTranslator.py", line 273, in build_target_tokens
    tokens = self.tgt_dict.convertToLabels(pred, onmt.Constants.EOS)
  File "/data1/NMTGMinor/onmt/Dict.py", line 166, in convertToLabels
    print('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', idx)
  File "/data1/tools/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 71, in __repr__
    return torch._tensor_str._str(self)
  File "/data1/tools/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 286, in _str
    tensor_str = _tensor_str(self, indent)
  File "/data1/tools/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 201, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/data1/tools/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 83, in __init__
    value_str = '{}'.format(value)
  File "/data1/tools/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 387, in __format__
    return self.item().__format__(format_spec)
RuntimeError: CUDA error: device-side assert triggered

Do you have any idea what causes this error? The model converges to a loss similar to the one obtained with plain fbank features, so I think a problem with the model itself is unlikely. Thanks.
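
For context, the assertion `srcIndex < srcSelectDimSize` is the bounds check of an index-select style lookup: it fires when an index is not smaller than the size of the dimension being indexed. CUDA reports such errors asynchronously, so the Python traceback tends to point at the first operation that synchronizes with the device (here, printing a tensor) rather than at the faulty lookup. Running on CPU or with the environment variable `CUDA_LAUNCH_BLOCKING=1` usually gives a more accurate location. The sketch below is plain Python and only illustrates the kind of range check that turns the assert into a readable error; `find_out_of_range`, `safe_lookup` and the toy `vocab` are hypothetical names, not part of the toolkit.

```python
def find_out_of_range(indices, size):
    """Return (position, value) pairs for indices outside [0, size)."""
    return [(pos, idx) for pos, idx in enumerate(indices)
            if not 0 <= idx < size]


def safe_lookup(table, indices):
    """Look up entries of `table`, raising a readable error on bad indices."""
    bad = find_out_of_range(indices, len(table))
    if bad:
        raise IndexError(
            f"indices out of range for size {len(table)}: {bad}")
    return [table[i] for i in indices]


# Toy vocabulary standing in for a target dictionary (hypothetical).
vocab = ["<blank>", "<unk>", "<s>", "</s>", "hello"]

print(safe_lookup(vocab, [2, 4, 3]))
print(find_out_of_range([2, 7, -1], len(vocab)))
```

Applying a check like this to the predicted token ids just before they are converted to labels would show whether the ids themselves are already invalid at that point.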


quanpn90 · 2019-09-23T09:06:48Z

Thank you for the question.

The CNN downsampling was added only recently, and I have not been able to test it for lack of time.

Possibly the mask creation step during decoding is not done correctly. You can try decoding with batch size 1 to see whether that works.
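
To illustrate what a mask mismatch could look like, here is a minimal sketch in plain Python. It assumes a downsampling factor of 4 and ceil division for the output length; both are assumptions for illustration, since the real output length depends on the kernel size, stride and padding of the convolutions, and `downsampled_length` and `padding_mask` are hypothetical helpers rather than toolkit functions.

```python
def downsampled_length(length, factor=4):
    """Frames left after downsampling by `factor` (ceil division, assumed)."""
    return -(-length // factor)


def padding_mask(lengths, factor=4):
    """Build one mask row per utterance; True marks a padded position."""
    out_lengths = [downsampled_length(n, factor) for n in lengths]
    max_len = max(out_lengths)
    return [[t >= n for t in range(max_len)] for n in out_lengths]


# Two utterances with 8 and 3 input frames (toy values).
frame_lengths = [8, 3]
mask = padding_mask(frame_lengths)
print(mask)

# Every utterance must keep at least one unmasked position.
assert all(not all(row) for row in mask)
```

A mask built from the original frame lengths would have 8 columns here and would not line up with the 2 downsampled encoder states, which is the kind of inconsistency worth ruling out.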

guangsenw · 2019-09-23T09:26:06Z

Hi Quan,
Thanks for the reply. I am already using batch_size=1, and I have tracked the problem down: when I print the decoder_output of the first step, only the first beam has normal log-posteriors; the log-probabilities of all other beams at this step are NaN.

Thanks for the hint and I will look into that.
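
One common way to end up with NaN log-probabilities is a softmax over a row in which every position is masked out with negative infinity. Whether that is what happens here is not established in this thread; the plain-Python sketch below only demonstrates the mechanism, and `log_softmax` is a hand-written illustration, not the PyTorch function.

```python
import math

NEG_INF = float("-inf")


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    # If every score is -inf, then -inf - (-inf) is NaN for each entry.
    shifted = [s - m for s in scores]
    log_z = math.log(sum(math.exp(s) for s in shifted))
    return [s - log_z for s in shifted]


def has_nan(values):
    """True if any value is NaN."""
    return any(math.isnan(v) for v in values)


partly_masked = log_softmax([1.0, NEG_INF, 0.5])
fully_masked = log_softmax([NEG_INF, NEG_INF, NEG_INF])

print(partly_masked, has_nan(partly_masked))
print(fully_masked, has_nan(fully_masked))
```

A row with at least one unmasked score stays finite where it matters, while a fully masked row becomes NaN everywhere, so checking the attention mask of each beam for rows with no valid position is a reasonable first step.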
