Hey! Thank you for your amazing work!
I've noticed that while the configuration of the large MERT-v1-330M matches e.g. wav2vec2-large-robust in terms of transformer encoder parameters, you used `feat_extract_norm="group"` instead of `feat_extract_norm="layer"`. The Wav2Vec2 authors mentioned that using `layer` provides more robust training for larger models.
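For reference, here is a quick sketch of how the two configs can be compared; the Hub checkpoint IDs `m-a-p/MERT-v1-330M` and `facebook/wav2vec2-large-robust` are my assumption for the public checkpoints being discussed:

```python
# Sketch: compare the two configs (checkpoint IDs are assumed, see note above).
from transformers import AutoConfig

mert = AutoConfig.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
w2v2 = AutoConfig.from_pretrained("facebook/wav2vec2-large-robust")

# Transformer encoder hyperparameters line up...
for name in ("hidden_size", "num_hidden_layers", "num_attention_heads", "intermediate_size"):
    print(name, getattr(mert, name), getattr(w2v2, name))

# ...but the feature-extractor normalization differs: "group" vs "layer".
print("feat_extract_norm:", mert.feat_extract_norm, w2v2.feat_extract_norm)
```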
Also, `layer` computes statistics for each time step independently, which solves the issue of different model outputs for padded and unpadded inputs (unlike `group`, where statistics are computed for each channel across all time steps, including padded ones, since there is no `attention_mask`-like mechanism in the feature extractor). This is a huge benefit because it allows running batched inference "properly".
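To make the padding effect concrete, here is a minimal sketch using the plain Hugging Face `Wav2Vec2Config`/`Wav2Vec2Model` classes with randomly initialized weights; the sizes and lengths are arbitrary and not taken from MERT:

```python
# Minimal repro sketch of the padding sensitivity (randomly initialized Wav2Vec2,
# not MERT weights; only the convolutional feature encoder is run).
import torch
from transformers import Wav2Vec2Config, Wav2Vec2Model

torch.manual_seed(0)
audio = torch.randn(1, 16000)                               # 1 s of fake 16 kHz audio
padded = torch.cat([audio, torch.zeros(1, 16000)], dim=-1)  # same audio, zero-padded to 2 s

for norm in ("group", "layer"):
    # Tiny transformer; only the conv feature encoder is actually used below.
    model = Wav2Vec2Model(Wav2Vec2Config(feat_extract_norm=norm, num_hidden_layers=1)).eval()
    with torch.no_grad():
        ref = model.feature_extractor(audio)                 # (batch, channels, frames)
        pad = model.feature_extractor(padded)[..., : ref.shape[-1]]
    # Compare only the frames that correspond to the real (unpadded) audio.
    print(f'feat_extract_norm="{norm}": max |diff| = {(ref - pad).abs().max().item():.3e}')

# Expected: a clearly non-zero difference for "group" (GroupNorm statistics include
# the padded region) and ~0 for "layer" (per-time-step LayerNorm ignores padding).
```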
Any reasons why you preferred `group` over `layer`?