
Parler TTS release #1

Closed · wants to merge 84 commits
Conversation

@ylacombe (Collaborator) commented Apr 9, 2024

Prepare release

@sanchit-gandhi (Collaborator) left a comment

Well done on getting the repo cleaned up, it's looking in good shape.

My opinion is that the main README should bring visibility to both inference and training, since we want to push this as a toolkit for training new TTS models. In that regard, it would be great to include the training steps in the main README, or at least a reference to the training README if they run too long.

For the training README, let's also try to keep it quite streamlined: one command to run per section, with the args set to reproduce the v0.1 checkpoint. This way, it'll be easier for the community to build on your work.

Leaving this as another TODO here: revisit the generation code (forcing ids and ensuring we don't get early stopping)


Unlike standard TTS models, Parler-TTS lets you directly describe the speaker's characteristics with a simple text description, in which you can modulate gender, pitch, speaking style, accent, etc.

## Usage
Collaborator:

What could be good before the Usage section is a quick index (like with Bark: https://github.com/suno-ai/bark#-quick-index).

@@ -0,0 +1,1759 @@
#!/usr/bin/env python
Collaborator:

This looks really good!

"head_mask": head_mask,
"cross_attn_head_mask": cross_attn_head_mask,
"past_key_values": past_key_values,
"use_cache": use_cache,
}

# Ignore copy
def build_delay_pattern_mask(self, input_ids: torch.LongTensor, pad_token_id: int, max_length: int = None):
def build_delay_pattern_mask(
Collaborator:
Is there a reason we essentially make this a function instead of a method?

Collaborator (author):
Yes indeed, I'm using it in a Dataset.map and this avoids hashing issues!

@sanchit-gandhi (Collaborator) commented Apr 9, 2024:

I think you can do something hacky in your training script where you do:

build_delay_pattern_mask = model.build_delay_pattern_mask

def preprocess(batch):
    # do some preprocessing with the pattern mask
    batch["labels"] = build_delay_pattern_mask(batch["labels"])
    return batch

dataset = dataset.map(preprocess)

But this is a sufficient workaround as well.
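The delay pattern under discussion can be sketched as a standalone function, which is what makes it hashable for Dataset.map. This is a simplified, pure-Python illustration of the MusicGen-style delay pattern (not the actual Parler-TTS implementation, which works on padded tensors):

```python
def build_delay_pattern(codes, pad_token_id):
    """Shift codebook k right by k steps, padding the gap.

    `codes` is a list of codebook streams (lists of token ids); stream k
    is delayed by k positions so all codebooks can be predicted
    autoregressively with a fixed offset, as in MusicGen.
    """
    num_codebooks = len(codes)
    seq_len = len(codes[0])
    delayed = []
    for k, stream in enumerate(codes):
        # prepend k pad tokens, drop the k tokens that fall off the end
        delayed.append([pad_token_id] * k + stream[: seq_len - k])
    return delayed


# Example: 3 codebooks of 4 steps each, pad id 0
codes = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(build_delay_pattern(codes, pad_token_id=0))
# [[1, 2, 3, 4], [0, 5, 6, 7], [0, 0, 9, 10]]
```

Because this is a plain function with no bound `self`, datasets can fingerprint it deterministically, which is the hashing concern raised above.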

    output_values.append(sample.transpose(0, 2))
else:
    output_values.append(torch.zeros((1, 1, 1)).to(self.device))
# TODO: we should keep track of output length as well. Not really straightforward tbh
Collaborator:
Return a tensor of shape (bsz,) that contains the output lengths as integers?

Collaborator (author):
This was more about passing batch to DAC instead of sequential decoding!

Collaborator:
Ah, I see. Yeah, passing a mask is quite involved, so let's leave this as a TODO. Having the LM work with SDPA/FA2 is going to bring far greater inference speed gains than batched decoding with DAC.
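The shape-(bsz,) output-lengths suggestion above could look roughly like this hypothetical helper (a pure-Python sketch over padded audio, not code from the PR; the real version would operate on torch tensors):

```python
def output_lengths(batch, pad_value=0.0):
    """Return per-sample output lengths for a padded batch.

    `batch` is a list of audio samples (lists of floats) padded with
    `pad_value`; each sample's length is the index just past its last
    non-pad element, giving one integer length per batch entry.
    """
    lengths = []
    for sample in batch:
        length = 0
        for i, x in enumerate(sample):
            if x != pad_value:
                length = i + 1
        lengths.append(length)
    return lengths


# Two samples padded to 4 steps: true lengths 2 and 3
print(output_lengths([[0.1, 0.2, 0.0, 0.0], [0.3, 0.0, 0.5, 0.0]]))
# [2, 3]
```

A caveat for real audio: a genuine zero sample inside the waveform would be indistinguishable from padding, which is part of why tracking lengths explicitly (rather than inferring them) is the cleaner long-term fix.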

@ylacombe ylacombe closed this Apr 10, 2024
@ylacombe ylacombe mentioned this pull request Apr 10, 2024
ylacombe added a commit that referenced this pull request Jul 8, 2024
2 participants