Add Musicgen #24109
Conversation
The documentation is not available anymore as the PR was closed or merged.
Force-pushed from 330291f to 8bc98b2
I think it's very nicely handled! As we talked offline, cool that everything stays in the modeling code instead of having logit processors!
Force-pushed from aaf8b95 to fc58b7a
Would be great to hear your thoughts on the design here @patrickvonplaten (adding the tests otherwise now). TODO:
Force-pushed from 151b1f8 to 99600f6
output["audio_encoder"] = self.audio_encoder.to_dict() | ||
output["decoder"] = self.decoder.to_dict() | ||
output["model_type"] = self.__class__.model_type | ||
return output |
clean!
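The snippet above nests each sub-config's dictionary inside the composite config's `to_dict()` output. A toy illustration of that pattern (the class names here are made up, not the PR's actual config classes):

```python
class ToySubConfig:
    def __init__(self, hidden_size=8):
        self.hidden_size = hidden_size

    def to_dict(self):
        return {"hidden_size": self.hidden_size}


class ToyCompositeConfig:
    model_type = "toy-composite"

    def __init__(self):
        self.audio_encoder = ToySubConfig(8)
        self.decoder = ToySubConfig(16)

    def to_dict(self):
        # serialize each sub-config under its own key, plus the composite model type
        output = {}
        output["audio_encoder"] = self.audio_encoder.to_dict()
        output["decoder"] = self.decoder.to_dict()
        output["model_type"] = self.__class__.model_type
        return output


print(ToyCompositeConfig().to_dict())
# {'audio_encoder': {'hidden_size': 8}, 'decoder': {'hidden_size': 16}, 'model_type': 'toy-composite'}
```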
I think the modeling design is nice. The changes to the main `generate` method are too model-specific IMO, and I also don't think it's a good idea to create a dependency via `super().generate(...)`.
I'd propose two things:
- 1.) Change the forward method to accept [`batch_size x num_codevectors`, `seq_length`] instead of [`batch_size`, `num_codevectors`, `seq_length`]. I think if we explain it nicely in the docs, there is no real disadvantage to moving `num_codevectors` to the batch dimension right away (a rough sketch of this reshape is shown below).
- 2.) I'd just directly call `sample` and `greedy_search` here: https://github.com/huggingface/transformers/pull/24109/files#r1232442770
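A rough sketch of the reshape in point 1 (shapes and values are illustrative; `num_codebooks` below plays the role of `num_codevectors` above):

```python
import torch

bsz, num_codebooks, seq_len = 2, 4, 10
codes = torch.randint(0, 2048, (bsz, num_codebooks, seq_len))

# fold the codebook dimension into the batch dimension:
# (bsz, num_codebooks, seq_len) -> (bsz * num_codebooks, seq_len)
codes_2d = codes.reshape(bsz * num_codebooks, seq_len)

# and recover the original 3D layout when needed
assert torch.equal(codes_2d.reshape(bsz, num_codebooks, seq_len), codes)
```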
inputs_embeds = torch.zeros((bsz, seq_len, self.d_model), device=input_ids.device)

for codebook in range(self.num_codebooks):
    inputs_embeds += self.embed_tokens[codebook](input[:, codebook])
Interesting!
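To make the per-codebook embedding sum above concrete, here is a standalone sketch with made-up sizes (not the PR's actual module):

```python
import torch
import torch.nn as nn

bsz, num_codebooks, seq_len, vocab_size, d_model = 2, 4, 10, 2048, 16

# one embedding table per codebook
embed_tokens = nn.ModuleList([nn.Embedding(vocab_size, d_model) for _ in range(num_codebooks)])

input_ids = torch.randint(0, vocab_size, (bsz, num_codebooks, seq_len))
inputs_embeds = torch.zeros(bsz, seq_len, d_model)
for codebook in range(num_codebooks):
    # look up each codebook's ids in its own table and sum into a single embedding stream
    inputs_embeds += embed_tokens[codebook](input_ids[:, codebook])

print(inputs_embeds.shape)  # torch.Size([2, 10, 16])
```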
return input_ids

@torch.no_grad()
def generate(
Refactored the `.generate` method as discussed @patrickvonplaten - would be great to hear your thoughts on this before committing to it for the final design! Will update the docstrings as required once we've settled on the design.
Overall, my thoughts are that this approach of explicitly calling `greedy_search` and `sample` works. It's not super compact, but it is definitely easier to understand than the `super().generate` call that we had before. It will allow us to build on top with other generation methods (e.g. assisted generation) in the future, so I think it's the way to go.
_, pattern_mask = self.build_delay_pattern_mask(
    inputs, generation_config.pad_token_id, max_length=generation_config.max_length
)
output_ids = self.apply_delay_pattern_mask(output_ids, pattern_mask)
clean!
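For intuition, an illustrative stand-in for `apply_delay_pattern_mask` (the `-1` sentinel and exact semantics are assumptions, not taken from the diff): positions where the mask holds a real token id are forced to that id, while positions marked `-1` keep the generated ids.

```python
import torch


def apply_delay_pattern_mask(output_ids: torch.LongTensor, pattern_mask: torch.LongTensor) -> torch.LongTensor:
    # trim the mask to the generated length, then overwrite all non-sentinel positions
    seq_len = output_ids.shape[-1]
    pattern_mask = pattern_mask[..., :seq_len]
    return torch.where(pattern_mask == -1, output_ids, pattern_mask)
```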
""" | ||
return self.tokenizer.decode(*args, **kwargs) | ||
|
||
def decode_audio(self, audio_values, padding_mask: Optional = None) -> List[np.ndarray]: |
FYI @sgugger this is the only new functionality since your review (the rest was just getting the doc tests to pass by fixing bugs in the docstrings).
As discussed offline, we want a method in the `processor` to strip any padding from our generated audio values. This method does exactly that, and is covered by a new processor test.
The API looks as follows:
from transformers import AutoProcessor, MusicgenForConditionalGeneration
from datasets import load_dataset
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
dataset = load_dataset("sanchit-gandhi/gtzan", split="train", streaming=True)
sample = next(iter(dataset))["audio"]
inputs = processor(
    audio=sample["array"],
    sampling_rate=sample["sampling_rate"],
    text="80s blues track with groovy saxophone",
    padding=True,
    return_tensors="pt",
)
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
# NEW: post-process to remove padding from the batched audio
audio_values = processor.decode_audio(audio_values, padding_mask=inputs.padding_mask)
Can we do `decode(text=...)` or `decode(audio=...)` instead? This would be more in line with how processors deal with multimodal inputs/outputs, and we have something similar in `__call__` and `pad` already.
Resolved in c0235d3 - we now decode the audio if audio values are passed, and decode token ids with the tokenizer otherwise.
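A rough sketch of that dispatch (not the exact code from c0235d3; the keyword names are assumptions):

```python
def processor_decode(processor, *args, audio_values=None, padding_mask=None, **kwargs):
    if audio_values is not None:
        # audio passed in -> post-process the generated waveforms (strip padding)
        return processor.decode_audio(audio_values, padding_mask=padding_mask)
    # otherwise defer to the tokenizer and decode token ids to text
    return processor.tokenizer.decode(*args, **kwargs)
```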
I had originally used the `.pad` function in the feature extractor, but this actually proved to be more complicated than a hand-written pad. This is because we want to pad the padding mask out to max length with the non-padding token:
- The padding mask is constructed based on the input ids, padding on the left based on the length of the inputs
- We then generate our input ids to a new max length, always generating new tokens on the right
- The newly generated ids are all valid, so the padding mask should not pad these out; hence we pad with the non-padding token on the right side

This means that to use the `.pad` method, we first have to flip the attributes of the feature extractor, perform the padding, then flip them back again:
- Flip the padding token
- Ensure the padding side is set to `"right"`
- Do the padding
- Flip the padding token back again
- Ensure the padding side is set back to the original

=> Overall this was more complicated than writing two lines of padding (a minimal sketch of those two lines is shown below).
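Something along the lines of the following (variable names are assumptions): the newly generated positions are all valid audio, so the mask is simply extended on the right with the non-padding token (`1`).

```python
import torch


def extend_padding_mask(padding_mask: torch.LongTensor, generated_length: int) -> torch.LongTensor:
    # pad the mask on the right up to the generated length, marking the new positions as valid
    difference = generated_length - padding_mask.shape[-1]
    return torch.nn.functional.pad(padding_mask, (0, difference), mode="constant", value=1)
```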
What does this PR do?
Adds the MusicGen model by fairseq to transformers.
This model is made of three components (a rough loading sketch is shown after the list):
- a text encoder (loaded with `AutoModelForTextEncoding`)
- a MusicGen decoder (based on `modeling_bart.py`)
- an audio encoder/decoder (loaded with `AutoModel`)
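As a rough illustration of that split (the checkpoint names below are assumptions for demonstration, not part of this PR's description):

```python
from transformers import AutoModel, AutoModelForTextEncoding

# text encoder: any encoder supported by AutoModelForTextEncoding, e.g. a T5 encoder
text_encoder = AutoModelForTextEncoding.from_pretrained("t5-base")

# audio encoder/decoder: loaded generically via AutoModel, e.g. an EnCodec checkpoint
audio_encoder = AutoModel.from_pretrained("facebook/encodec_32khz")

# the MusicGen decoder itself is the new BART-style decoder added in this PR
```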