Add Musicgen #24109

Merged: 154 commits, Jun 29, 2023
Conversation

@sanchit-gandhi (Contributor) commented Jun 8, 2023

What does this PR do?

Adds the MusicGen model by Meta AI (from the Audiocraft codebase) to transformers

This model is made of three components:

  1. T5Encoder (which we import as AutoModelForTextEncoding)
  2. MusicgenDecoder (which we copy as much as possible from modeling_bart.py)
  3. Encodec (which we import as AutoModel)
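
For orientation, a minimal sketch of how the three components compose (illustrative and simplified; the merged library exposes the decoder as MusicgenForCausalLM, and the real composite class is a PreTrainedModel rather than an nn.Module):

from torch import nn
from transformers import AutoModel, AutoModelForTextEncoding, MusicgenForCausalLM


class ComposedMusicgen(nn.Module):
    def __init__(self, config):
        super().__init__()
        # 1. T5 text encoder, instantiated generically from its sub-config
        self.text_encoder = AutoModelForTextEncoding.from_config(config.text_encoder)
        # 2. BART-style decoder that autoregressively predicts Encodec codebook tokens
        self.decoder = MusicgenForCausalLM(config.decoder)
        # 3. Encodec audio codec, instantiated generically from its sub-config
        self.audio_encoder = AutoModel.from_config(config.audio_encoder)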

@HuggingFaceDocBuilderDev commented Jun 8, 2023

The documentation is not available anymore as the PR was closed or merged.

@sanchit-gandhi changed the title from [WIP] Add Audiocraft to [WIP] Add Musicgen on Jun 12, 2023
@ArthurZucker (Collaborator) left a comment


I think it's very nicely handled! As we talked about offline, it's cool that everything stays in the modeling code instead of relying on logits processors!

Resolved review threads on src/transformers/models/musicgen/configuration_musicgen.py (2) and src/transformers/models/musicgen/modeling_musicgen.py (5).
@sanchit-gandhi (Contributor, Author) commented Jun 15, 2023

Would be great to hear your thoughts on the design here @patrickvonplaten (otherwise I'm adding the tests now)

TODO:

  • convert the medium/large checkpoints
  • handle padded tokens from Encodec (in the delay pattern mask, then again when we decode; see the sketch after this list)
  • fast tests
  • integration tests
  • add a method for unconditional generation (no need to use the processor to get input ids)
  • finish docs / docstrings
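
The delay pattern referenced above works roughly as follows: codebook k is shifted right by k steps relative to codebook 0, with the pad token filling the gaps. A minimal illustrative sketch (an assumption based on MusicGen's interleaving scheme, not the PR's actual build_delay_pattern_mask):

import torch

def delay_pattern(input_ids: torch.Tensor, pad_token_id: int, max_length: int) -> torch.Tensor:
    """input_ids: (num_codebooks, seq_len) audio codes for a single sample."""
    num_codebooks, seq_len = input_ids.shape
    out = torch.full((num_codebooks, max_length), pad_token_id, dtype=input_ids.dtype)
    for k in range(num_codebooks):
        # codebook k starts k steps later than codebook 0
        length = min(seq_len, max_length - k)
        out[k, k : k + length] = input_ids[k, :length]
    return out

# e.g. with 2 codebooks, codes [[1, 2, 3], [4, 5, 6]] and pad token 0:
# [[1, 2, 3, 0, 0],
#  [0, 4, 5, 6, 0]]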

Quoted from src/transformers/models/musicgen/configuration_musicgen.py:

    output["audio_encoder"] = self.audio_encoder.to_dict()
    output["decoder"] = self.decoder.to_dict()
    output["model_type"] = self.__class__.model_type
    return output
A contributor commented: clean!

@patrickvonplaten (Contributor) left a comment

I think the modeling design is nice. The changes to the main generate method are too model-specific IMO and I also don't think it's a good idea to create a dependency via super().generate(...).

I'd propose two things:

  1. Change the forward method to accept [batch_size x num_codevectors, seq_length] instead of [batch_size, num_codevectors, seq_length]. If we explain it nicely in the docs, there is no real disadvantage to moving num_codevectors into the batch dimension right away (see the reshape sketch after this list).
  2. I'd directly call sample and greedy_search here: https://github.com/huggingface/transformers/pull/24109/files#r1232442770
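
For illustration, a quick sketch of the shape change proposed in 1. (values are made up):

import torch

batch_size, num_codevectors, seq_length = 2, 4, 8
input_ids = torch.randint(0, 2048, (batch_size, num_codevectors, seq_length))

# fold num_codevectors into the batch dimension:
# [batch_size, num_codevectors, seq_length] -> [batch_size x num_codevectors, seq_length]
flat_input_ids = input_ids.reshape(-1, seq_length)
assert flat_input_ids.shape == (batch_size * num_codevectors, seq_length)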

Quoted from src/transformers/models/musicgen/modeling_musicgen.py:

    # start from zeros and sum the embedding of each codebook's token at every position
    inputs_embeds = torch.zeros((bsz, seq_len, self.d_model), device=input_ids.device)

    for codebook in range(self.num_codebooks):
        inputs_embeds += self.embed_tokens[codebook](input[:, codebook])
A contributor commented: Interesting!

Quoted from src/transformers/models/musicgen/modeling_musicgen.py:

        return input_ids

    @torch.no_grad()
    def generate(
@sanchit-gandhi (Contributor, Author) commented:
Refactored the .generate method as discussed @patrickvonplaten - would be great to hear your thoughts on this before committing to it for the final design! Will update the docstrings as required once we've settled on the design.

Overall, my thoughts are that this approach of explicitly calling greedy_search and sample works. It's not super compact, but it is definitely easier to understand than the super().generate call we had before. It will also let us build on top with other generation methods (e.g. assisted generation) in the future, so I think it's the way to go.
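
Schematically, the refactored dispatch looks something like the following (a simplified sketch, not the exact code in this PR; sample and greedy_search were public GenerationMixin methods in transformers at the time):

def generate(self, input_ids, generation_config, logits_processor, stopping_criteria, **model_kwargs):
    # dispatch explicitly instead of deferring to super().generate(...)
    if generation_config.do_sample:
        logits_warper = self._get_logits_warper(generation_config)
        return self.sample(
            input_ids,
            logits_processor=logits_processor,
            logits_warper=logits_warper,
            stopping_criteria=stopping_criteria,
            pad_token_id=generation_config.pad_token_id,
            eos_token_id=generation_config.eos_token_id,
            **model_kwargs,
        )
    return self.greedy_search(
        input_ids,
        logits_processor=logits_processor,
        stopping_criteria=stopping_criteria,
        pad_token_id=generation_config.pad_token_id,
        eos_token_id=generation_config.eos_token_id,
        **model_kwargs,
    )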

Quoted from src/transformers/models/musicgen/modeling_musicgen.py:

    _, pattern_mask = self.build_delay_pattern_mask(
        inputs, generation_config.pad_token_id, max_length=generation_config.max_length
    )
    output_ids = self.apply_delay_pattern_mask(output_ids, pattern_mask)
A contributor commented: clean!

"""
return self.tokenizer.decode(*args, **kwargs)

def decode_audio(self, audio_values, padding_mask: Optional = None) -> List[np.ndarray]:
@sanchit-gandhi (Contributor, Author) commented:
FYI @sgugger this is the only new functionality since your review (the rest was just getting the doc tests to pass by fixing bugs in the docstrings)

As discussed offline, we want a method in the processor to strip any padding from our generated audio values. This method does exactly that, and is tested for with a new processor test.

The API looks as follows:

from transformers import AutoProcessor, MusicgenForConditionalGeneration
from datasets import load_dataset

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

dataset = load_dataset("sanchit-gandhi/gtzan", split="train", streaming=True)
sample = next(iter(dataset))["audio"]

inputs = processor(
    audio=sample["array"],
    sampling_rate=sample["sampling_rate"],
    text="80s blues track with groovy saxophone",
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)

# NEW: post-process to remove padding from the batched audio
audio_values = processor.decode_audio(audio_values, padding_mask=inputs.padding_mask)

A collaborator commented:
Can we do decode(text=...) or decode(audio=...) instead? This would be more in line with how processors deal with multimodal inputs/outputs, and we have something similar in __call__ and pad already.

@sanchit-gandhi (Contributor, Author) commented Jun 29, 2023
Resolved in c0235d3: we now decode the audio if audio values are passed, and decode token ids with the tokenizer otherwise.
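
A rough sketch of the resolved dispatch (illustrative and simplified, not the exact code from c0235d3):

def decode(self, *args, audio_values=None, padding_mask=None, **kwargs):
    # route to audio post-processing when audio values are passed;
    # otherwise fall back to the tokenizer's text decoding
    if audio_values is not None:
        return self.decode_audio(audio_values, padding_mask=padding_mask)
    return self.tokenizer.decode(*args, **kwargs)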

I had originally used the .pad function in the feature extractor, but this actually proved more complicated than a hand-written pad. This is because we want to pad the padding mask out to the max length with the non-padding token:

  1. The padding mask is constructed based on the input ids, padding left based on the length of the inputs
  2. We then generate our input ids to a new max length, generating new tokens on the right always
  3. The newly generated ids are all valid, so the padding mask should not pad these out, hence we pad with the non-padding token on the right side

This means that to use the .pad method, we would first have to flip the attributes of the feature extractor, perform the padding, then flip them back again:

  1. Flip the padding token
  2. Ensure the padding side is set to "right"
  3. Do the padding
  4. Flip the padding token back again
  5. Ensure the padding side is set to the original

=> overall this was more complicated than writing two lines of padding
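
For reference, a minimal sketch of what the hand-written pad amounts to (an assumption, not the exact merged code; values are made up):

import numpy as np

# hypothetical mask for one sample: 0 = padded, 1 = valid; inputs were left-padded
padding_mask = np.array([[0, 0, 1, 1, 1]])
generated_length = 8  # length of the generated sequence

# the "two lines": extend the mask on the right with the non-padding value (1)
# so that newly generated positions are kept when stripping padding
difference = generated_length - padding_mask.shape[-1]
padding_mask = np.pad(padding_mask, ((0, 0), (0, difference)), mode="constant", constant_values=1)
# -> [[0, 0, 1, 1, 1, 1, 1, 1]]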

@sanchit-gandhi merged commit 1c1c907 into huggingface:main on Jun 29, 2023
@sanchit-gandhi deleted the audiocraft branch on June 29, 2023 at 13:49
@ylacombe mentioned this pull request on Jul 20, 2023