Hey! Thank you for your amazing work!
I've noticed that while the configuration of the large MERT-v1-330M matches e.g. wav2vec2-large-robust in terms of transformer encoder parameters, you used `feat_extract_norm="group"` instead of `feat_extract_norm="layer"`. The Wav2Vec2 authors mentioned that using `layer` provides more robust training for larger models.
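For reference, here is a quick sketch of how the two configs can be compared; the Hub checkpoint IDs `m-a-p/MERT-v1-330M` and `facebook/wav2vec2-large-robust` are my assumption for the public checkpoints being discussed:

```python
# Sketch: compare the two configs (checkpoint IDs are assumed, see note above).
from transformers import AutoConfig

mert = AutoConfig.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
w2v2 = AutoConfig.from_pretrained("facebook/wav2vec2-large-robust")

# Transformer encoder hyperparameters line up...
for name in ("hidden_size", "num_hidden_layers", "num_attention_heads", "intermediate_size"):
    print(name, getattr(mert, name), getattr(w2v2, name))

# ...but the feature-extractor normalization differs: "group" vs "layer".
print("feat_extract_norm:", mert.feat_extract_norm, w2v2.feat_extract_norm)
```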
Also, `layer` computes statistics for each time step independently, which solves the issue of different model outputs for padded and unpadded inputs (unlike `group`, where statistics are computed for each channel across all time steps, including padded ones, since there is no `attention_mask`-like mechanism in the feature extractor). This is a huge benefit because it allows running batched inference "properly".
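To make the padding effect concrete, here is a minimal sketch using the plain Hugging Face `Wav2Vec2Config`/`Wav2Vec2Model` classes with randomly initialized weights; the sizes and lengths are arbitrary and not taken from MERT:

```python
# Minimal repro sketch of the padding sensitivity (randomly initialized Wav2Vec2,
# not MERT weights; only the convolutional feature encoder is run).
import torch
from transformers import Wav2Vec2Config, Wav2Vec2Model

torch.manual_seed(0)
audio = torch.randn(1, 16000)                               # 1 s of fake 16 kHz audio
padded = torch.cat([audio, torch.zeros(1, 16000)], dim=-1)  # same audio, zero-padded to 2 s

for norm in ("group", "layer"):
    # Tiny transformer; only the conv feature encoder is actually used below.
    model = Wav2Vec2Model(Wav2Vec2Config(feat_extract_norm=norm, num_hidden_layers=1)).eval()
    with torch.no_grad():
        ref = model.feature_extractor(audio)                 # (batch, channels, frames)
        pad = model.feature_extractor(padded)[..., : ref.shape[-1]]
    # Compare only the frames that correspond to the real (unpadded) audio.
    print(f'feat_extract_norm="{norm}": max |diff| = {(ref - pad).abs().max().item():.3e}')

# Expected: a clearly non-zero difference for "group" (GroupNorm statistics include
# the padded region) and ~0 for "layer" (per-time-step LayerNorm ignores padding).
```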
Any reasons why you preferred `group` over `layer`?