
Parler TTS release #1

Closed · wants to merge 84 commits
Conversation

@ylacombe (Collaborator) commented Apr 9, 2024

Prepare release

@sanchit-gandhi (Collaborator) left a comment

Well done on getting the repo cleaned up, it's looking in good shape.

My opinion is that the main README should bring visibility to both inference and training, since we want to push this as a toolkit for training new TTS models. In that regard, it would be great to include the training steps in the main README, or at least a reference to the training README if they run too long.

For the training README, let's also try to keep it quite streamlined: one command to run per section, with the args set to reproduce the v0.1 checkpoint. This way, it'll be easier for the community to build on your work.

Leaving this as another TODO here: revisit the generation code (forcing ids and ensuring we don't get early stopping)


Unlike standard TTS models, Parler-TTS lets you directly describe the speaker's characteristics with a simple text description, in which you can modulate gender, pitch, speaking style, accent, etc.

## Usage
Collaborator:

What could be good before the Usage section is a quick index (like with Bark: https://github.com/suno-ai/bark#-quick-index).

@@ -0,0 +1,1759 @@
#!/usr/bin/env python
Collaborator:

This looks really good!

"head_mask": head_mask,
"cross_attn_head_mask": cross_attn_head_mask,
"past_key_values": past_key_values,
"use_cache": use_cache,
}

# Ignore copy
def build_delay_pattern_mask(self, input_ids: torch.LongTensor, pad_token_id: int, max_length: int = None):
def build_delay_pattern_mask(
Collaborator:
Is there a reason we essentially make this a function instead of a method?

Collaborator (author):
Yes indeed, I'm using it in a Dataset.map and this avoids hashing issues!

@sanchit-gandhi (Collaborator) commented Apr 9, 2024:

I think you can do something hacky in your training script where you do:

build_delay_pattern_mask = model.build_delay_pattern_mask

def preprocess(batch):
    # do some preprocessing with the pattern mask
    batch["labels"] = build_delay_pattern_mask(batch["labels"])
    return batch

dataset = dataset.map(preprocess)

But this is a sufficient workaround as well.
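The delay pattern under discussion can be sketched as a standalone function, which is what makes it hashable for Dataset.map. This is a simplified, pure-Python illustration of the MusicGen-style delay pattern (not the actual Parler-TTS implementation, which works on padded tensors):

```python
def build_delay_pattern(codes, pad_token_id):
    """Shift codebook k right by k steps, padding the gap.

    `codes` is a list of codebook streams (lists of token ids); stream k
    is delayed by k positions so all codebooks can be predicted
    autoregressively with a fixed offset, as in MusicGen.
    """
    num_codebooks = len(codes)
    seq_len = len(codes[0])
    delayed = []
    for k, stream in enumerate(codes):
        # prepend k pad tokens, drop the k tokens that fall off the end
        delayed.append([pad_token_id] * k + stream[: seq_len - k])
    return delayed


# Example: 3 codebooks of 4 steps each, pad id 0
codes = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(build_delay_pattern(codes, pad_token_id=0))
# [[1, 2, 3, 4], [0, 5, 6, 7], [0, 0, 9, 10]]
```

Because this is a plain function with no bound `self`, datasets can fingerprint it deterministically, which is the hashing concern raised above.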

    output_values.append(sample.transpose(0, 2))
else:
    output_values.append(torch.zeros((1, 1, 1)).to(self.device))
# TODO: we should keep track of output length as well. Not really straightforward tbh
Collaborator:
Return a tensor of shape (bsz,) that contains the output lengths as integers?

Collaborator (author):
This was more about passing batch to DAC instead of sequential decoding!

Collaborator:
Ah, I see. Yeah, passing a mask is quite involved, so let's leave this as a TODO. Having the LM work with SDPA/FA2 is going to bring far greater inference speed gains than batched decoding with DAC.
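The shape-(bsz,) output-lengths suggestion above could look roughly like this hypothetical helper (a pure-Python sketch over padded audio, not code from the PR; the real version would operate on torch tensors):

```python
def output_lengths(batch, pad_value=0.0):
    """Return per-sample output lengths for a padded batch.

    `batch` is a list of audio samples (lists of floats) padded with
    `pad_value`; each sample's length is the index just past its last
    non-pad element, giving one integer length per batch entry.
    """
    lengths = []
    for sample in batch:
        length = 0
        for i, x in enumerate(sample):
            if x != pad_value:
                length = i + 1
        lengths.append(length)
    return lengths


# Two samples padded to 4 steps: true lengths 2 and 3
print(output_lengths([[0.1, 0.2, 0.0, 0.0], [0.3, 0.0, 0.5, 0.0]]))
# [2, 3]
```

A caveat for real audio: a genuine zero sample inside the waveform would be indistinguishable from padding, which is part of why tracking lengths explicitly (rather than inferring them) is the cleaner long-term fix.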

@ylacombe ylacombe closed this Apr 10, 2024
@ylacombe ylacombe mentioned this pull request Apr 10, 2024
ylacombe added a commit that referenced this pull request Jul 8, 2024
2 participants