Parler TTS release #1
Conversation
Well done on getting the repo cleaned up, it's looking in good shape.

My opinion is that the main README should bring visibility to both inference and training, since we want to push this as a toolkit for training new TTS models. In that regard, it would be great to include the training steps in the main README, or at least a reference to the training README if they run too long.

For the training README, let's also try to keep it quite streamlined: one command to run per section, with the args set to reproduce the v0.1 checkpoint. That way, it'll be easier for the community to build on your work.
Leaving this as another TODO here: revisit the generation code (forcing ids and ensuring we don't get early stopping)
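As a concrete pointer on that TODO, here is a minimal sketch of the kind of guard against early stopping, using stock `transformers` generation kwargs; `model` and `input_ids` are assumed to be set up as elsewhere in this thread, and the numbers are placeholders:

```python
# Sketch only: `min_new_tokens` puts a floor on the number of generated tokens,
# so an early EOS can't end generation immediately. `model` is assumed to be a
# loaded Parler-TTS checkpoint and `input_ids` a prepared prompt.
generation = model.generate(
    input_ids=input_ids,
    min_new_tokens=10,    # hypothetical floor against early stopping
    max_new_tokens=2580,  # hypothetical cap
)
```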
```md
Contrary to standard TTS models, Parler-TTS allows you to directly describe the speaker characteristics with a simple text description, where you can modulate gender, pitch, speaking style, accent, etc.

## Usage
```
What could be good before the section on Usage is having a quick index (like with Bark: https://github.com/suno-ai/bark#-quick-index).
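As an aside for readers of this thread, here is a short usage sketch of the description-driven API the excerpt above refers to. The checkpoint name and the `prompt_input_ids` kwarg are taken from the v0.1 release code and should be treated as assumptions:

```python
import soundfile as sf
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

# Assumed v0.1 checkpoint name.
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1")
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

# The description controls speaker characteristics; the prompt is what gets spoken.
description = "A female speaker with a slightly low-pitched, expressive voice."
prompt = "Hey, how are you doing today?"

input_ids = tokenizer(description, return_tensors="pt").input_ids
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids

audio = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
sf.write("parler_tts_out.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)
```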
```diff
@@ -0,0 +1,1759 @@
#!/usr/bin/env python
```
This looks really good!
"head_mask": head_mask, | ||
"cross_attn_head_mask": cross_attn_head_mask, | ||
"past_key_values": past_key_values, | ||
"use_cache": use_cache, | ||
} | ||
|
||
# Ignore copy | ||
def build_delay_pattern_mask(self, input_ids: torch.LongTensor, pad_token_id: int, max_length: int = None): | ||
def build_delay_pattern_mask( |
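For context on what this code builds, here is a toy sketch of the delay-pattern idea in the MusicGen style: codebook `k` is shifted right by `k` steps, with `pad_token_id` filling the gaps. This is illustrative only, not the repo's implementation:

```python
import torch

def toy_delay_pattern(input_ids: torch.LongTensor, pad_token_id: int) -> torch.LongTensor:
    """Toy delay pattern: row (codebook) k is shifted right by k steps."""
    num_codebooks, seq_len = input_ids.shape
    out = torch.full(
        (num_codebooks, seq_len + num_codebooks - 1), pad_token_id, dtype=torch.long
    )
    for k in range(num_codebooks):
        out[k, k : k + seq_len] = input_ids[k]
    return out

# 2 codebooks over 3 steps: the second row lags the first by one step.
print(toy_delay_pattern(torch.tensor([[1, 2, 3], [4, 5, 6]]), pad_token_id=0))
# tensor([[1, 2, 3, 0],
#         [0, 4, 5, 6]])
```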
Is there a reason we essentially make this a function instead of a method?
Yes indeed, I'm using it in a `Dataset.map`, and this avoids hashing issues!
I think you can do something hacky in your training script where you do:

```python
# Bind the method to a module-level name so Dataset.map can hash it cleanly.
build_delay_pattern_mask = model.build_delay_pattern_mask

def preprocess(batch):
    # do some preprocessing with the pattern mask
    batch["labels"] = build_delay_pattern_mask(batch["labels"])
    return batch

dataset = dataset.map(preprocess)
```
But this is a sufficient workaround as well.
```python
            output_values.append(sample.transpose(0, 2))
        else:
            output_values.append(torch.zeros((1, 1, 1)).to(self.device))
    # TODO: we should keep track of output length as well. Not really straightforward tbh
```
Return a tensor of shape `(bsz,)` that contains the output lengths as integers?
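Something like this minimal sketch, assuming `output_values` is the list of per-sample audio tensors built in the loop above (names are illustrative):

```python
import torch

# One integer length per batch element, taken from the last (time) dimension.
output_lengths = torch.tensor(
    [sample.shape[-1] for sample in output_values], dtype=torch.long
)  # shape (bsz,)
```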
This was more about passing a batch to DAC instead of sequential decoding!
Ah I see - yeah, passing a mask is quite involved; let's leave this as a TODO. Having the LM work with SDPA/FA2 is going to bring far greater inference speed gains than batched decoding with DAC.
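For reference, the stock `transformers` switch being referred to, assuming the Parler-TTS classes end up supporting it (an untested assumption, not current repo behaviour):

```python
from parler_tts import ParlerTTSForConditionalGeneration

# Assumption: once the LM supports SDPA/FA2, the standard attn_implementation
# kwarg from transformers applies. "flash_attention_2" requires flash-attn installed.
model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler_tts_mini_v0.1",
    attn_implementation="sdpa",
)
```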
Co-authored-by: Sanchit Gandhi <[email protected]>
Prepare release