This repository has been archived by the owner on Oct 13, 2022. It is now read-only.

Inconsistency between the reported loss and the actual loss used for gradient computation #194

Open
pzelasko opened this issue May 12, 2021 · 2 comments


@pzelasko
Collaborator

The loss we backprop is normalized by the number of supervisions:

if att_rate != 0.0:
    loss = (- (1.0 - att_rate) * mmi_loss + att_rate * att_loss) / (len(texts) * accum_grad)
else:
    loss = (-mmi_loss) / (len(texts) * accum_grad)

But the loss we report is normalized by the number of frames:

logging.info(
    'batch {}, epoch {}/{} '
    'global average objf: {:.6f} over {} '
    'frames ({:.1f}% kept), current batch average objf: {:.6f} over {} frames ({:.1f}% kept) '
    'avg time waiting for batch {:.3f}s'.format(
        batch_idx, current_epoch, num_epochs,
        total_objf / total_frames, total_frames,
        100.0 * total_frames / total_all_frames,
        curr_batch_objf / (curr_batch_frames + 0.001),
        curr_batch_frames,
        100.0 * curr_batch_frames / curr_batch_all_frames,
        time_waiting_for_batch / max(1, batch_idx)))

It looks like this wasn't intended. The latter (normalizing by the number of frames) makes more sense to me to use in backprop (but we'd probably need to re-tune learning rates etc.) - WDYT?
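For concreteness, a minimal sketch of the frame-normalized variant (hypothetical; num_frames, the total number of frames kept in the batch, is an assumed variable, and the other names follow the snippet above):

def compute_loss(mmi_loss, att_loss, att_rate, num_frames, accum_grad):
    # Normalize by the number of frames instead of len(texts), so the loss we
    # backprop is on the same per-frame scale as the reported objf.
    if att_rate != 0.0:
        return (-(1.0 - att_rate) * mmi_loss + att_rate * att_loss) / (num_frames * accum_grad)
    return (-mmi_loss) / (num_frames * accum_grad)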

@danpovey
Contributor

Yes, I think normalizing by the number of frames would be OK. With Adam optimizers this shouldn't make a difference to the results.
My preference in the abstract would be to simply not normalize at all, which would mean we wouldn't have to account for accum_grad. But I think it's traditional in machine learning to normalize somehow, so IDK whether people feel this might be confusing to readers.
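A minimal sketch of that unnormalized variant (hypothetical, reusing the names from the snippets above): with gradient accumulation the micro-batch gradients are then simply summed, so no accum_grad division is needed, but the gradient scale grows with the amount of data per step.

def compute_loss_unnormalized(mmi_loss, att_loss, att_rate):
    # No per-batch normalization: gradients accumulated over accum_grad
    # micro-batches add up to a plain sum.
    if att_rate != 0.0:
        return -(1.0 - att_rate) * mmi_loss + att_rate * att_loss
    return -mmi_loss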

@pzelasko
Collaborator Author

It could also be surprising if we want to try out a different optimizer that doesn't have Adam-like gradient scaling.
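A quick toy check (hypothetical, assuming PyTorch; the helper below is made up for illustration) shows the difference: scaling the loss by a constant rescales an SGD update proportionally, while Adam's first update stays roughly equal to the learning rate.

import torch

def first_update_size(opt_cls, scale, lr=0.1):
    # How far does one optimizer step move a scalar parameter when the loss
    # is multiplied by `scale` (standing in for a different normalization)?
    p = torch.nn.Parameter(torch.tensor([1.0]))
    opt = opt_cls([p], lr=lr)
    loss = scale * (p ** 2).sum()
    loss.backward()
    before = p.detach().clone()
    opt.step()
    return (p.detach() - before).abs().item()

for scale in (1.0, 1000.0):
    print(scale,
          first_update_size(torch.optim.SGD, scale),   # grows with scale
          first_update_size(torch.optim.Adam, scale))  # stays ~lr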
