Using assistant in AutomaticSpeechRecognitionPipeline with different encoder size #30637

Merged (25 commits) on May 23, 2024

@kamilakesbi (Contributor) commented on May 3, 2024

This PR fixes issue #30611:

  • First: an error is now raised if the assistant is loaded with AutoModelForCausalLM and its encoder configuration doesn't match the main model's. Such an assistant has no encoder of its own, so it has to reuse the main model's encoder outputs.

  • Second: the pipeline now works with an assistant whose encoder size differs from the main model's, as long as the assistant is loaded with AutoModelForSpeechSeq2Seq.
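A minimal, hypothetical sketch of the first check follows. It assumes configs expose `is_encoder_decoder` and Whisper-style encoder attributes (`d_model`, `encoder_layers`, `encoder_attention_heads`, `encoder_ffn_dim`), and it uses `SimpleNamespace` objects in place of real model configs. `check_assistant_compatible` is an illustrative name, not the function added in this PR. It only compares attributes that both configs define, so models without these fields are not rejected.

```python
from types import SimpleNamespace

# Assumed Whisper-style config attributes that determine encoder output shapes.
ENCODER_ATTRS = ("d_model", "encoder_layers", "encoder_attention_heads", "encoder_ffn_dim")


def check_assistant_compatible(main_config, assistant_config) -> None:
    """Hypothetical check: raise if a decoder-only assistant cannot reuse
    the main model's encoder outputs."""
    # Only relevant when the main model has an encoder but the assistant does not
    # (e.g. loaded via AutoModelForCausalLM): the assistant must then consume the
    # main model's encoder outputs as-is.
    if not (main_config.is_encoder_decoder and not assistant_config.is_encoder_decoder):
        return
    # Compare only attributes both configs define.
    shared = [a for a in ENCODER_ATTRS if hasattr(main_config, a) and hasattr(assistant_config, a)]
    mismatched = [a for a in shared if getattr(main_config, a) != getattr(assistant_config, a)]
    if mismatched:
        raise ValueError(
            "Assistant encoder settings differ from the main model "
            f"({', '.join(mismatched)}); load the assistant with its own encoder, "
            "e.g. via AutoModelForSpeechSeq2Seq."
        )


# Illustrative sizes, modeled on Whisper large and tiny encoders.
large = dict(d_model=1280, encoder_layers=32, encoder_attention_heads=20, encoder_ffn_dim=5120)
tiny = dict(d_model=384, encoder_layers=4, encoder_attention_heads=6, encoder_ffn_dim=1536)

main_config = SimpleNamespace(is_encoder_decoder=True, **large)
check_assistant_compatible(main_config, SimpleNamespace(is_encoder_decoder=False, **large))  # passes
check_assistant_compatible(main_config, SimpleNamespace(is_encoder_decoder=True, **tiny))    # passes: own encoder
try:
    check_assistant_compatible(main_config, SimpleNamespace(is_encoder_decoder=False, **tiny))
except ValueError as err:
    print(err)
```

A tiny assistant loaded as a full encoder-decoder is deliberately not checked here, because it runs its own encoder; only the decoder-only case needs matching encoder settings.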

Before this fix, using an assistant whose encoder size differs from the main model's with AutomaticSpeechRecognitionPipeline broke the pipeline with the following error:

ValueError: Whisper expects the mel input features to be of length 3000, but found 1500. Make sure to pad the input mel features to 3000.
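The two lengths in this message come from Whisper's fixed 30-second input window. The arithmetic below assumes Whisper's standard feature-extractor settings (16 kHz audio, a hop length of 160 samples) and the stride-2 second convolution at the start of its encoder. Under these assumptions, 1500 is the length of the encoder's output sequence, which is consistent with the assistant receiving an encoder-side tensor rather than raw mel features.

```python
# Assumed Whisper feature-extractor and encoder settings.
SAMPLING_RATE = 16_000  # audio samples per second
CHUNK_LENGTH_S = 30     # seconds of audio per input window
HOP_LENGTH = 160        # samples between successive mel frames
CONV2_STRIDE = 2        # the encoder's second conv layer halves the time axis

n_samples = SAMPLING_RATE * CHUNK_LENGTH_S           # samples per 30 s window
n_mel_frames = n_samples // HOP_LENGTH               # mel frames the encoder expects
n_encoder_positions = n_mel_frames // CONV2_STRIDE   # positions in the encoder output

print(n_mel_frames, n_encoder_positions)  # 3000 1500
```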

Explanation of the solution

For short-form generation, the pipeline does not pass input_features to generate; it passes the output of the main model's encoder instead.

If the main model and the assistant don't share the same encoder, the assistant cannot use these encoder outputs for generation, and an error is raised.

The fix is to also pass input_features to generate, so that the assistant can compute its own encoder outputs from them.
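The data flow of the fix can be illustrated with a toy model that only tracks tensor shapes. This is a hypothetical sketch: `ToySpeechModel`, `assisted_generate`, and the hidden sizes 1280 and 384 are illustrative and are not the pipeline's or `generate`'s actual code. The main model decodes from the precomputed encoder outputs; without `input_features`, the assistant can only reuse those outputs, which fails when its hidden size differs.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Shape = Tuple[int, ...]


@dataclass
class ToySpeechModel:
    """Toy stand-in for a Whisper-style encoder-decoder; it only tracks shapes."""
    d_model: int  # hidden size of this model's encoder output

    def encode(self, input_features: Shape) -> Shape:
        batch, n_mels, n_frames = input_features
        if n_frames != 3000:
            raise ValueError(f"expected 3000 mel frames, found {n_frames}")
        # The stride-2 convolution halves the time axis: (batch, 1500, d_model).
        return (batch, n_frames // 2, self.d_model)

    def decode_step(self, encoder_outputs: Shape) -> None:
        # Cross-attention needs encoder outputs whose last dim matches this model.
        if encoder_outputs[-1] != self.d_model:
            raise ValueError(f"encoder hidden size {encoder_outputs[-1]} != {self.d_model}")


def assisted_generate(main, assistant, encoder_outputs: Shape,
                      input_features: Optional[Shape] = None) -> str:
    """Hypothetical sketch: the main model uses the precomputed encoder outputs,
    while the assistant re-encodes input_features when they are provided."""
    main.decode_step(encoder_outputs)
    if input_features is not None:
        # With the fix: the assistant runs its own encoder on the raw features.
        assistant_encoder_outputs = assistant.encode(input_features)
    else:
        # Without the fix: the assistant can only reuse the main encoder outputs.
        assistant_encoder_outputs = encoder_outputs
    assistant.decode_step(assistant_encoder_outputs)
    return "ok"


main, assistant = ToySpeechModel(d_model=1280), ToySpeechModel(d_model=384)
feats = (1, 80, 3000)
print(assisted_generate(main, assistant, main.encode(feats), input_features=feats))
```

In the toy, `assisted_generate(main, assistant, main.encode(feats))` raises a ValueError, whereas also passing `input_features=feats` succeeds.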

Who can review?

@sanchit-gandhi

@sanchit-gandhi (Contributor) left a comment

Thanks for taking this up @kamilakesbi - left some comments below!

src/transformers/models/whisper/generation_whisper.py (outdated, resolved)
src/transformers/pipelines/automatic_speech_recognition.py (outdated, resolved)

@sanchit-gandhi (Contributor) left a comment

Small comment regarding the generality of the check: in generation/utils.py, we assume that all checks and functionality apply to every generate-compatible model in the library, not just speech recognition ones.

src/transformers/generation/utils.py (outdated, resolved)

@sanchit-gandhi (Contributor) commented

Is there a test that confirms correctness after the fix? There's likely a relevant slow pipeline test that was either failing or not rigorous enough.

@gante (Member) commented on May 9, 2024

Please have a look at #30726 for an alternative fix. IMO, the root of the problem is the ASR pipeline doing a redundant operation, not generate :)

I hope you don't mind me crashing into the issue 🙌 (I only noticed this PR after opening #30726, when trying to link all related issues)

@sanchit-gandhi (Contributor) left a comment

Pipeline changes LGTM, just some minor suggestions regarding the slow tests. Would love a second opinion from generate expert @gante on the assistant model validation!

@kamilakesbi (Contributor, Author) commented

I think this PR is ready to be merged!

cc @amyeroberts @gante if you want to have a look ;)

@kamilakesbi changed the title from "[WIP] - Using assistant in AutomaticSpeechRecognitionPipeline with different encoder size" to "Using assistant in AutomaticSpeechRecognitionPipeline with different encoder size" on May 15, 2024

@amyeroberts (Collaborator) left a comment

Thanks for working on this!

I've just done a quick pass and have some outstanding questions. I'll review again once @gante has confirmed that the update to the generation validation is OK.

src/transformers/generation/utils.py (outdated, resolved)
src/transformers/generation/utils.py (outdated, resolved)
src/transformers/generation/utils.py (outdated, resolved)
@kamilakesbi requested a review from @gante on May 16, 2024 15:20
@kamilakesbi added the Audio and Good Second Issue labels on May 17, 2024
@kamilakesbi self-assigned this on May 17, 2024

@gante (Member) left a comment

LGTM, thank you for fixing! 💪 I've added a few minor nits to help with readability.

(I see that you've included the diff from #30726, I'm going to close that PR :D)

src/transformers/generation/utils.py (outdated, resolved)
src/transformers/generation/utils.py (outdated, resolved)

@kamilakesbi (Contributor, Author) commented

Thanks @gante for the review :)
@amyeroberts, could you please merge this PR?

@amyeroberts (Collaborator) commented

@kamilakesbi It still needs a final core maintainer review and approval before merging (plus resolution of conflicts) :). I'll review now.

@amyeroberts (Collaborator) left a comment

Thanks for adding and iterating on this!

The only thing left to add is tests for _validate_assistant: they should make sure it correctly raises exceptions in both cases it checks for.

@kamilakesbi (Contributor, Author) commented

Hi @amyeroberts, I've added a slow test, which passes :) I think the failing quality checks are unrelated to this PR and will be fixed by #30932.

@amyeroberts (Collaborator) left a comment

Thanks for adding the tests - looks great!

tests/generation/test_utils.py (resolved)

@sanchit-gandhi (Contributor) commented

Quick checklist before merge:

  • Resolve all comment threads that have been addressed
  • Fix the merge conflict in automatic_speech_recognition.py
  • Rebase onto main to get the style fixes from #30932 (update ruff version) once that PR is merged
  • Ping me to get this merged!

kamilakesbi and others added 20 commits May 23, 2024 10:17
@kamilakesbi force-pushed the speculative_decoding_asr branch from 06e5839 to 3a35145 on May 23, 2024 08:17

@sanchit-gandhi (Contributor) commented

Nice work @kamilakesbi!

@sanchit-gandhi merged commit eb1a77b into huggingface:main on May 23, 2024
21 checks passed
5 participants