
Why was MERT-v1-330M trained with feat_extract_norm="group" instead of "layer"? #19

Open
LiableFishYS opened this issue Jan 17, 2025 · 0 comments

Hey! Thank you for your amazing work!

I've noticed that while the configuration of the large MERT-v1-330M matches, e.g., wav2vec2-large-robust in terms of transformer-encoder parameters, you used feat_extract_norm="group" instead of feat_extract_norm="layer". The Wav2Vec2 authors mentioned that using "layer" gives more robust training for larger models.
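
For reference, the difference is visible directly in the published configs. A minimal sketch, assuming both checkpoints are on the Hugging Face Hub (MERT ships custom modeling code, so trust_remote_code=True may be needed):

```python
from transformers import AutoConfig

# Compare the CNN feature-extractor normalization setting of the two checkpoints.
# MERT uses custom model code, so trust_remote_code=True may be required.
mert_cfg = AutoConfig.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
w2v2_cfg = AutoConfig.from_pretrained("facebook/wav2vec2-large-robust")

print(mert_cfg.feat_extract_norm)  # "group"
print(w2v2_cfg.feat_extract_norm)  # "layer"
```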

Also, "layer" computes statistics for each time step independently, which solves the issue of the model producing different outputs for padded and unpadded inputs (unlike "group", where statistics are computed for each channel across all time steps, including padded ones, since there is no attention_mask-like mechanism in the feature extractor). This is a huge benefit because it allows running batched inference "properly"; a rough sketch of what I mean is below.
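
To make the padding sensitivity concrete, here is roughly the experiment I have in mind (assumptions on my side: the checkpoint loads via AutoModel, input is 24 kHz, and zero-padding stands in for batch padding):

```python
import torch
from transformers import AutoModel

# Load MERT (custom code, hence trust_remote_code=True).
model = AutoModel.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
model.eval()

torch.manual_seed(0)
wav = torch.randn(1, 24000)  # 1 s of fake audio at MERT's 24 kHz rate
# Simulate batch padding: the same audio followed by 1 s of zeros.
padded = torch.cat([wav, torch.zeros(1, 24000)], dim=1)

with torch.no_grad():
    h = model(input_values=wav).last_hidden_state
    h_padded = model(input_values=padded).last_hidden_state

# With "group" norm, the zeros shift the per-channel statistics of the conv
# features, so even the frames corresponding to the real audio change.
n = h.shape[1]
print(torch.allclose(h, h_padded[:, :n], atol=1e-4))  # expected: False
```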

Any reasons why you preferred "group" over "layer"?
