This repository has been archived by the owner on Oct 13, 2022. It is now read-only.

Inconsistency between the reported loss and the actual loss used for gradient computation #194

Open
pzelasko opened this issue May 12, 2021 · 2 comments


@pzelasko
Collaborator

The loss we backprop is normalized by the number of supervisions:

if att_rate != 0.0:
    loss = (- (1.0 - att_rate) * mmi_loss + att_rate * att_loss) / (len(texts) * accum_grad)
else:
    loss = (-mmi_loss) / (len(texts) * accum_grad)

But the loss we report is normalized by the number of frames:

logging.info(
    'batch {}, epoch {}/{} '
    'global average objf: {:.6f} over {} '
    'frames ({:.1f}% kept), current batch average objf: {:.6f} over {} frames ({:.1f}% kept) '
    'avg time waiting for batch {:.3f}s'.format(
        batch_idx, current_epoch, num_epochs,
        total_objf / total_frames, total_frames,
        100.0 * total_frames / total_all_frames,
        curr_batch_objf / (curr_batch_frames + 0.001),
        curr_batch_frames,
        100.0 * curr_batch_frames / curr_batch_all_frames,
        time_waiting_for_batch / max(1, batch_idx)))

It looks like this wasn't intended. The latter (normalizing by the number of frames) makes more sense to me to use in backprop (but we'd probably need to re-tune learning rates etc.) - WDYT?
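For concreteness, a minimal sketch of the frame-normalized variant (hypothetical; num_frames, the total number of frames kept in the batch, is an assumed variable, and the other names follow the snippet above):

def compute_loss(mmi_loss, att_loss, att_rate, num_frames, accum_grad):
    # Normalize by the number of frames instead of len(texts), so the loss we
    # backprop is on the same per-frame scale as the reported objf.
    if att_rate != 0.0:
        return (-(1.0 - att_rate) * mmi_loss + att_rate * att_loss) / (num_frames * accum_grad)
    return (-mmi_loss) / (num_frames * accum_grad)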

@danpovey
Contributor

Yes, I think normalizing by the number of frames would be OK. With Adam optimizers this shouldn't make a difference to the results.
My preference in the abstract would be to simply not normalize at all, which would mean we wouldn't have to account for accum_grad. But I think it's traditional in machine learning to normalize somehow, so IDK whether people feel this might be confusing to readers.
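A minimal sketch of that unnormalized variant (hypothetical, reusing the names from the snippets above): with gradient accumulation the micro-batch gradients are then simply summed, so no accum_grad division is needed, but the gradient scale grows with the amount of data per step.

def compute_loss_unnormalized(mmi_loss, att_loss, att_rate):
    # No per-batch normalization: gradients accumulated over accum_grad
    # micro-batches add up to a plain sum.
    if att_rate != 0.0:
        return -(1.0 - att_rate) * mmi_loss + att_rate * att_loss
    return -mmi_loss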

@pzelasko
Collaborator Author

It could also be surprising if we want to try out a different optimizer that doesn't have Adam-like gradient scaling.
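A quick toy check (hypothetical, assuming PyTorch; the helper below is made up for illustration) shows the difference: scaling the loss by a constant rescales an SGD update proportionally, while Adam's first update stays roughly equal to the learning rate.

import torch

def first_update_size(opt_cls, scale, lr=0.1):
    # How far does one optimizer step move a scalar parameter when the loss
    # is multiplied by `scale` (standing in for a different normalization)?
    p = torch.nn.Parameter(torch.tensor([1.0]))
    opt = opt_cls([p], lr=lr)
    loss = scale * (p ** 2).sum()
    loss.backward()
    before = p.detach().clone()
    opt.step()
    return (p.detach() - before).abs().item()

for scale in (1.0, 1000.0):
    print(scale,
          first_update_size(torch.optim.SGD, scale),   # grows with scale
          first_update_size(torch.optim.Adam, scale))  # stays ~lr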
