Add Musicgen #24109
Conversation
The documentation is not available anymore as the PR was closed or merged.
Force-pushed from 330291f to 8bc98b2
I think it's very nicely handled! As we talked offline, cool that everything stays in the modeling code instead of having logit processors!
Force-pushed from aaf8b95 to fc58b7a
Would be great to hear your thoughts on the design here @patrickvonplaten (adding the tests otherwise now). TODO:
Force-pushed from 151b1f8 to 99600f6
output["audio_encoder"] = self.audio_encoder.to_dict() | ||
output["decoder"] = self.decoder.to_dict() | ||
output["model_type"] = self.__class__.model_type | ||
return output |
clean!
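The snippet above nests each sub-config's dictionary inside the composite config's `to_dict()` output. A toy illustration of that pattern (the class names here are made up, not the PR's actual config classes):

```python
class ToySubConfig:
    def __init__(self, hidden_size=8):
        self.hidden_size = hidden_size

    def to_dict(self):
        return {"hidden_size": self.hidden_size}


class ToyCompositeConfig:
    model_type = "toy-composite"

    def __init__(self):
        self.audio_encoder = ToySubConfig(8)
        self.decoder = ToySubConfig(16)

    def to_dict(self):
        # serialize each sub-config under its own key, plus the composite model type
        output = {}
        output["audio_encoder"] = self.audio_encoder.to_dict()
        output["decoder"] = self.decoder.to_dict()
        output["model_type"] = self.__class__.model_type
        return output


print(ToyCompositeConfig().to_dict())
# {'audio_encoder': {'hidden_size': 8}, 'decoder': {'hidden_size': 16}, 'model_type': 'toy-composite'}
```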
I think the modeling design is nice. The changes to the main `generate` method are too model-specific IMO, and I also don't think it's a good idea to create a dependency via `super().generate(...)`.
I'd propose two things:
- 1.) Change the forward method to accept [`batch_size x num_codevectors`, `seq_length`] instead of [`batch_size`, `num_codevectors`, `seq_length`]. I think if we explain it nicely in the docs, there is no real disadvantage to moving `num_codevectors` to the batch dimension right away (a rough sketch of this reshape is shown below).
- 2.) I'd just directly call `sample` and `greedy_search` here: https://github.com/huggingface/transformers/pull/24109/files#r1232442770
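A rough sketch of the reshape in point 1 (shapes and values are illustrative; `num_codebooks` below plays the role of `num_codevectors` above):

```python
import torch

bsz, num_codebooks, seq_len = 2, 4, 10
codes = torch.randint(0, 2048, (bsz, num_codebooks, seq_len))

# fold the codebook dimension into the batch dimension:
# (bsz, num_codebooks, seq_len) -> (bsz * num_codebooks, seq_len)
codes_2d = codes.reshape(bsz * num_codebooks, seq_len)

# and recover the original 3D layout when needed
assert torch.equal(codes_2d.reshape(bsz, num_codebooks, seq_len), codes)
```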
inputs_embeds = torch.zeros((bsz, seq_len, self.d_model), device=input_ids.device)

for codebook in range(self.num_codebooks):
    inputs_embeds += self.embed_tokens[codebook](input[:, codebook])
Interesting!
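To make the per-codebook embedding sum above concrete, here is a standalone sketch with made-up sizes (not the PR's actual module):

```python
import torch
import torch.nn as nn

bsz, num_codebooks, seq_len, vocab_size, d_model = 2, 4, 10, 2048, 16

# one embedding table per codebook
embed_tokens = nn.ModuleList([nn.Embedding(vocab_size, d_model) for _ in range(num_codebooks)])

input_ids = torch.randint(0, vocab_size, (bsz, num_codebooks, seq_len))
inputs_embeds = torch.zeros(bsz, seq_len, d_model)
for codebook in range(num_codebooks):
    # look up each codebook's ids in its own table and sum into a single embedding stream
    inputs_embeds += embed_tokens[codebook](input_ids[:, codebook])

print(inputs_embeds.shape)  # torch.Size([2, 10, 16])
```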
return input_ids

@torch.no_grad()
def generate(
Refactored the `.generate` method as discussed @patrickvonplaten - would be great to hear your thoughts on this before committing to it for the final design! Will update the docstrings as required once we've settled on the design.
Overall, my thoughts are that this approach of explicitly calling `greedy_search` and `sample` works. It's not super compact, but it is definitely easier to understand than the `super().generate` call that we had before. It will allow us to build on top with other generation methods (e.g. assisted generation) in the future, so I think it's the way to go.
_, pattern_mask = self.build_delay_pattern_mask(
    inputs, generation_config.pad_token_id, max_length=generation_config.max_length
)
output_ids = self.apply_delay_pattern_mask(output_ids, pattern_mask)
clean!
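For intuition, an illustrative stand-in for `apply_delay_pattern_mask` (the `-1` sentinel and exact semantics are assumptions, not taken from the diff): positions where the mask holds a real token id are forced to that id, while positions marked `-1` keep the generated ids.

```python
import torch


def apply_delay_pattern_mask(output_ids: torch.LongTensor, pattern_mask: torch.LongTensor) -> torch.LongTensor:
    # trim the mask to the generated length, then overwrite all non-sentinel positions
    seq_len = output_ids.shape[-1]
    pattern_mask = pattern_mask[..., :seq_len]
    return torch.where(pattern_mask == -1, output_ids, pattern_mask)
```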
""" | ||
return self.tokenizer.decode(*args, **kwargs) | ||
|
||
def decode_audio(self, audio_values, padding_mask: Optional = None) -> List[np.ndarray]: |
FYI @sgugger this is the only new functionality since your review (the rest was just getting the doc tests to pass by fixing bugs in the docstrings).
As discussed offline, we want a method in the `processor` to strip any padding from our generated audio values. This method does exactly that, and is covered by a new processor test.
The API looks as follows:
from transformers import AutoProcessor, MusicgenForConditionalGeneration
from datasets import load_dataset
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
dataset = load_dataset("sanchit-gandhi/gtzan", split="train", streaming=True)
sample = next(iter(dataset))["audio"]
inputs = processor(
    audio=sample["array"],
    sampling_rate=sample["sampling_rate"],
    text="80s blues track with groovy saxophone",
    padding=True,
    return_tensors="pt",
)
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
# NEW: post-process to remove padding from the batched audio
audio_values = processor.decode_audio(audio_values, padding_mask=inputs.padding_mask)
Can we do `decode(text=...)` or `decode(audio=...)` instead? This would be more in line with how processors deal with multimodal inputs/outputs, and we have something similar in `__call__` and `pad` already.
Resolved in c0235d3 - we now decode the audio if audio values are passed, and decode token ids with the tokenizer otherwise.
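A rough sketch of that dispatch (not the exact code from c0235d3; the keyword names are assumptions):

```python
def processor_decode(processor, *args, audio_values=None, padding_mask=None, **kwargs):
    if audio_values is not None:
        # audio passed in -> post-process the generated waveforms (strip padding)
        return processor.decode_audio(audio_values, padding_mask=padding_mask)
    # otherwise defer to the tokenizer and decode token ids to text
    return processor.tokenizer.decode(*args, **kwargs)
```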
I had originally used the `.pad` function in the feature extractor, but this actually proved to be more complicated than a hand-written pad. This is because we want to pad the padding mask out to max length with the non-padding token:
- The padding mask is constructed based on the input ids, padding on the left based on the length of the inputs
- We then generate our input ids to a new max length, always generating new tokens on the right
- The newly generated ids are all valid, so the padding mask should not pad these out; hence we pad with the non-padding token on the right side

This means that to use the `.pad` method, we first have to flip the attributes of the feature extractor, perform the padding, then flip them back again:
- Flip the padding token
- Ensure the padding side is set to `"right"`
- Do the padding
- Flip the padding token back again
- Ensure the padding side is set back to the original

=> Overall this was more complicated than writing two lines of padding (a minimal sketch of those two lines is shown below).
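Something along the lines of the following (variable names are assumptions): the newly generated positions are all valid audio, so the mask is simply extended on the right with the non-padding token (`1`).

```python
import torch


def extend_padding_mask(padding_mask: torch.LongTensor, generated_length: int) -> torch.LongTensor:
    # pad the mask on the right up to the generated length, marking the new positions as valid
    difference = generated_length - padding_mask.shape[-1]
    return torch.nn.functional.pad(padding_mask, (0, difference), mode="constant", value=1)
```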
What does this PR do?
Adds the MusicGen model by fairseq to transformers.
This model is made of three components (a rough loading sketch is shown after the list):
- a text encoder (loaded with `AutoModelForTextEncoding`)
- a MusicGen decoder (based on `modeling_bart.py`)
- an audio encoder/decoder (loaded with `AutoModel`)
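As a rough illustration of that split (the checkpoint names below are assumptions for demonstration, not part of this PR's description):

```python
from transformers import AutoModel, AutoModelForTextEncoding

# text encoder: any encoder supported by AutoModelForTextEncoding, e.g. a T5 encoder
text_encoder = AutoModelForTextEncoding.from_pretrained("t5-base")

# audio encoder/decoder: loaded generically via AutoModel, e.g. an EnCodec checkpoint
audio_encoder = AutoModel.from_pretrained("facebook/encodec_32khz")

# the MusicGen decoder itself is the new BART-style decoder added in this PR
```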