update finetune, test=tts
liangym committed Sep 21, 2022
1 parent 4ab07cd commit 237991c
Showing 11 changed files with 681 additions and 109 deletions.
70 changes: 32 additions & 38 deletions examples/other/tts_finetune/tts3/README.md
This example shows how to finetune your own AM based on FastSpeech2 with AISHELL-3.

We use AISHELL-3 to train a multi-speaker fastspeech2 model. You can refer to [examples/aishell3/tts3](https://github.com/lym0302/PaddleSpeech/tree/develop/examples/aishell3/tts3) to train a multi-speaker fastspeech2 model from scratch.


## Prepare
### Download Pretrained Fastspeech2 model
Assume the path to the model is `./pretrained_models`. Download pretrained fastspeech2 model with aishell3: [fastspeech2_aishell3_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_1.1.0.zip).
### Download Pretrained model
Assume the path to the model is `./pretrained_models`. Download the pretrained fastspeech2 model trained on AISHELL-3, [fastspeech2_aishell3_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_1.1.0.zip), for finetuning, and the pretrained HiFiGAN model trained on AISHELL-3, [hifigan_aishell3_ckpt_0.2.0](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip), for synthesis.

```bash
mkdir -p pretrained_models && cd pretrained_models
# pretrained fastspeech2 model
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_1.1.0.zip
unzip fastspeech2_aishell3_ckpt_1.1.0.zip
# pretrained hifigan model
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip
unzip hifigan_aishell3_ckpt_0.2.0.zip
cd ../
```

### Download MFA tools and pretrained model
Assume the path to the MFA tool is `./tools`. Download [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz) and pretrained MFA models with aishell3: [aishell3_model.zip](https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip).
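A possible way to fetch these, mirroring the pretrained-models step above (the exact layout under `./tools` is an assumption, not taken from this commit):

```bash
mkdir -p tools && cd tools
# MFA tool
wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz
tar xvf montreal-forced-aligner_linux.tar.gz
# pretrained MFA model trained on aishell3
wget https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip
unzip aishell3_model.zip
cd ../
```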

When "Prepare" is done, the structure of the current directory is listed below.
```

### Set finetune.yaml
`finetune.yaml` contains some configurations for fine-tuning. You can try various options to get a better result.
`conf/finetune.yaml` contains some configurations for fine-tuning. You can try various options to get a better result. The value of `frozen_layers` can be changed according to `conf/fastspeech2_layers.txt`, which lists the model layers of fastspeech2.

Arguments:
- `batch_size`: finetune batch size. Default: -1, which means 64, the same as the pretrained model.
- `learning_rate`: learning rate. Default: 0.0001
- `num_snapshots`: number of saved models. Default: -1, which means 5, the same as the pretrained model.
- `frozen_layers`: frozen layers, must be a list. If you don't want to freeze any layer, set it to [].
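A minimal sketch of what `conf/finetune.yaml` can look like with the defaults above (the `frozen_layers` value here is only an example; pick prefixes from `conf/fastspeech2_layers.txt`):

```yaml
# finetune configuration (sketch based on the defaults listed above)
batch_size: -1        # -1 keeps the pretrained model's batch size (64)
learning_rate: 0.0001
num_snapshots: -1     # -1 keeps the pretrained model's setting (5)

# frozen_layers should be a list; entries are name prefixes from
# conf/fastspeech2_layers.txt. Use [] to finetune all layers.
frozen_layers: ["encoder", "duration_predictor"]
```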



## Get Started
Run the command below to
1. **source path**.
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage`.
Finetune a FastSpeech2 model.

```bash
./run.sh --stage 0 --stop-stage 0
./run.sh --stage 0 --stop-stage 5
```
`stage 0` of `run.sh` calls `finetune.py`; here's the complete help message.
`stage 5` of `run.sh` calls `local/finetune.py`; here's the complete help message.

```text
-usage: finetune.py [-h] [--input_dir INPUT_DIR] [--pretrained_model_dir PRETRAINED_MODEL_DIR]
-                   [--mfa_dir MFA_DIR] [--dump_dir DUMP_DIR]
-                   [--output_dir OUTPUT_DIR] [--lang LANG]
-                   [--ngpu NGPU]
+usage: finetune.py [-h] [--pretrained_model_dir PRETRAINED_MODEL_DIR]
+                   [--dump_dir DUMP_DIR] [--output_dir OUTPUT_DIR] [--ngpu NGPU]
+                   [--epoch EPOCH] [--finetune_config FINETUNE_CONFIG]
 optional arguments:
-  -h, --help            show this help message and exit
-  --input_dir INPUT_DIR
-                        directory containing audio and label file
+  -h, --help            Show this help message and exit
   --pretrained_model_dir PRETRAINED_MODEL_DIR
                         Path to pretrained model
-  --mfa_dir MFA_DIR     directory to save aligned files
   --dump_dir DUMP_DIR   directory to save feature files and metadata
   --output_dir OUTPUT_DIR
-                        directory to save finetune model
-  --lang LANG           Choose input audio language, zh or en
-  --ngpu NGPU           if ngpu=0, use cpu
-  --epoch EPOCH         the epoch of finetune
-  --batch_size BATCH_SIZE
-                        the batch size of finetune, default -1 means same as pretrained model
+                        Directory to save finetune model
+  --ngpu NGPU           The number of gpu, if ngpu=0, use cpu
+  --epoch EPOCH         The epoch of finetune
+  --finetune_config FINETUNE_CONFIG
+                        Path to finetune config file
```
1. `--input_dir` is the directory containing the audio and label file.
2. `--pretrained_model_dir` is the directory including the pretrained fastspeech2_aishell3 model.
3. `--mfa_dir` is the directory to save the alignment results from the pretrained MFA_aishell3 model.
4. `--dump_dir` is the directory including audio features and metadata.
5. `--output_dir` is the directory to save the finetuned model.
6. `--lang` is the language of the input audio, zh or en.
7. `--ngpu` is the number of gpus.
8. `--epoch` is the number of finetune epochs.
9. `--batch_size` is the finetune batch size.

1. `--pretrained_model_dir` is the directory including the pretrained fastspeech2_aishell3 model.
2. `--dump_dir` is the directory including audio features and metadata.
3. `--output_dir` is the directory to save the finetuned model.
4. `--ngpu` is the number of gpus; if ngpu=0, use cpu.
5. `--epoch` is the number of finetune epochs.
6. `--finetune_config` is the path to the finetune config file.
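As a rough illustration (the actual values are filled in by `run.sh`, and the paths and numbers below are assumptions), a direct call to `local/finetune.py` might look like this:

```bash
# illustrative paths and values -- check run.sh stage 5 for the real ones
python3 local/finetune.py \
    --pretrained_model_dir=./pretrained_models/fastspeech2_aishell3_ckpt_1.1.0 \
    --dump_dir=./dump \
    --output_dir=./exp/default \
    --ngpu=1 \
    --epoch=100 \
    --finetune_config=./conf/finetune.yaml
```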


### Synthesizing
We use [HiFiGAN](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5) as the neural vocoder.
Assume the path to the hifigan model is `./pretrained_models`. Download the pretrained HiFiGAN model from [hifigan_aishell3_ckpt_0.2.0](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) and unzip it.

```bash
cd pretrained_models
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip
unzip hifigan_aishell3_ckpt_0.2.0.zip
cd ../
```

The HiFiGAN checkpoint contains the files listed below.
```text
hifigan_aishell3_ckpt_0.2.0
```
Modify `ckpt` in `run.sh` to the final model in `exp/default/checkpoints`.
```bash
./run.sh --stage 1 --stop-stage 1
./run.sh --stage 6 --stop-stage 6
```
`stage 6` of `run.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveforms from a text file.

optional arguments:
--output_dir OUTPUT_DIR
output dir.
```

1. `--am` is the acoustic model type, with the format {model_name}_{dataset}.
2. `--am_config`, `--am_ckpt`, `--am_stat`, `--phones_dict`, and `--speaker_dict` are arguments for the acoustic model, which correspond to the 5 files in the fastspeech2 pretrained model.
3. `--voc` is the vocoder type, with the format {model_name}_{dataset}.
7. `--output_dir` is the directory to save synthesized audio files.
8. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
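For orientation, a direct call might look like the sketch below. It is not copied from `run.sh`: the checkpoint filenames (`snapshot_iter_*.pdz`), the vocoder flags (`--voc_config`, `--voc_ckpt`, `--voc_stat`), and `--lang`/`--text` are assumptions following the usual PaddleSpeech `synthesize_e2e.py` interface, so check `run.sh` stage 6 and the unzipped model directories for the exact arguments and filenames.

```bash
# illustrative; ${ckpt} is the finetuned checkpoint chosen in run.sh,
# and "xxxxx" stands for the real iteration number in the unzipped vocoder dir
python3 ${BIN_DIR}/../synthesize_e2e.py \
    --am=fastspeech2_aishell3 \
    --am_config=pretrained_models/fastspeech2_aishell3_ckpt_1.1.0/default.yaml \
    --am_ckpt=exp/default/checkpoints/${ckpt}.pdz \
    --am_stat=pretrained_models/fastspeech2_aishell3_ckpt_1.1.0/speech_stats.npy \
    --phones_dict=pretrained_models/fastspeech2_aishell3_ckpt_1.1.0/phone_id_map.txt \
    --speaker_dict=pretrained_models/fastspeech2_aishell3_ckpt_1.1.0/speaker_id_map.txt \
    --voc=hifigan_aishell3 \
    --voc_config=pretrained_models/hifigan_aishell3_ckpt_0.2.0/default.yaml \
    --voc_ckpt=pretrained_models/hifigan_aishell3_ckpt_0.2.0/snapshot_iter_xxxxx.pdz \
    --voc_stat=pretrained_models/hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
    --lang=zh \
    --text=sentences.txt \
    --output_dir=exp/default/test_e2e \
    --ngpu=1
```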


### Tips
If you want to get better audio quality, you can use more audio data for finetuning.
More finetune results can be found on [finetune-fastspeech2-for-csmsc](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html#finetune-fastspeech2-for-csmsc).
216 changes: 216 additions & 0 deletions examples/other/tts_finetune/tts3/conf/fastspeech2_layers.txt
epoch
iteration
main_params
main_optimizer
spk_embedding_table.weight
encoder.embed.0.weight
encoder.embed.1.alpha
encoder.encoders.0.self_attn.linear_q.weight
encoder.encoders.0.self_attn.linear_q.bias
encoder.encoders.0.self_attn.linear_k.weight
encoder.encoders.0.self_attn.linear_k.bias
encoder.encoders.0.self_attn.linear_v.weight
encoder.encoders.0.self_attn.linear_v.bias
encoder.encoders.0.self_attn.linear_out.weight
encoder.encoders.0.self_attn.linear_out.bias
encoder.encoders.0.feed_forward.w_1.weight
encoder.encoders.0.feed_forward.w_1.bias
encoder.encoders.0.feed_forward.w_2.weight
encoder.encoders.0.feed_forward.w_2.bias
encoder.encoders.0.norm1.weight
encoder.encoders.0.norm1.bias
encoder.encoders.0.norm2.weight
encoder.encoders.0.norm2.bias
encoder.encoders.1.self_attn.linear_q.weight
encoder.encoders.1.self_attn.linear_q.bias
encoder.encoders.1.self_attn.linear_k.weight
encoder.encoders.1.self_attn.linear_k.bias
encoder.encoders.1.self_attn.linear_v.weight
encoder.encoders.1.self_attn.linear_v.bias
encoder.encoders.1.self_attn.linear_out.weight
encoder.encoders.1.self_attn.linear_out.bias
encoder.encoders.1.feed_forward.w_1.weight
encoder.encoders.1.feed_forward.w_1.bias
encoder.encoders.1.feed_forward.w_2.weight
encoder.encoders.1.feed_forward.w_2.bias
encoder.encoders.1.norm1.weight
encoder.encoders.1.norm1.bias
encoder.encoders.1.norm2.weight
encoder.encoders.1.norm2.bias
encoder.encoders.2.self_attn.linear_q.weight
encoder.encoders.2.self_attn.linear_q.bias
encoder.encoders.2.self_attn.linear_k.weight
encoder.encoders.2.self_attn.linear_k.bias
encoder.encoders.2.self_attn.linear_v.weight
encoder.encoders.2.self_attn.linear_v.bias
encoder.encoders.2.self_attn.linear_out.weight
encoder.encoders.2.self_attn.linear_out.bias
encoder.encoders.2.feed_forward.w_1.weight
encoder.encoders.2.feed_forward.w_1.bias
encoder.encoders.2.feed_forward.w_2.weight
encoder.encoders.2.feed_forward.w_2.bias
encoder.encoders.2.norm1.weight
encoder.encoders.2.norm1.bias
encoder.encoders.2.norm2.weight
encoder.encoders.2.norm2.bias
encoder.encoders.3.self_attn.linear_q.weight
encoder.encoders.3.self_attn.linear_q.bias
encoder.encoders.3.self_attn.linear_k.weight
encoder.encoders.3.self_attn.linear_k.bias
encoder.encoders.3.self_attn.linear_v.weight
encoder.encoders.3.self_attn.linear_v.bias
encoder.encoders.3.self_attn.linear_out.weight
encoder.encoders.3.self_attn.linear_out.bias
encoder.encoders.3.feed_forward.w_1.weight
encoder.encoders.3.feed_forward.w_1.bias
encoder.encoders.3.feed_forward.w_2.weight
encoder.encoders.3.feed_forward.w_2.bias
encoder.encoders.3.norm1.weight
encoder.encoders.3.norm1.bias
encoder.encoders.3.norm2.weight
encoder.encoders.3.norm2.bias
encoder.after_norm.weight
encoder.after_norm.bias
spk_projection.weight
spk_projection.bias
duration_predictor.conv.0.0.weight
duration_predictor.conv.0.0.bias
duration_predictor.conv.0.2.weight
duration_predictor.conv.0.2.bias
duration_predictor.conv.1.0.weight
duration_predictor.conv.1.0.bias
duration_predictor.conv.1.2.weight
duration_predictor.conv.1.2.bias
duration_predictor.linear.weight
duration_predictor.linear.bias
pitch_predictor.conv.0.0.weight
pitch_predictor.conv.0.0.bias
pitch_predictor.conv.0.2.weight
pitch_predictor.conv.0.2.bias
pitch_predictor.conv.1.0.weight
pitch_predictor.conv.1.0.bias
pitch_predictor.conv.1.2.weight
pitch_predictor.conv.1.2.bias
pitch_predictor.conv.2.0.weight
pitch_predictor.conv.2.0.bias
pitch_predictor.conv.2.2.weight
pitch_predictor.conv.2.2.bias
pitch_predictor.conv.3.0.weight
pitch_predictor.conv.3.0.bias
pitch_predictor.conv.3.2.weight
pitch_predictor.conv.3.2.bias
pitch_predictor.conv.4.0.weight
pitch_predictor.conv.4.0.bias
pitch_predictor.conv.4.2.weight
pitch_predictor.conv.4.2.bias
pitch_predictor.linear.weight
pitch_predictor.linear.bias
pitch_embed.0.weight
pitch_embed.0.bias
energy_predictor.conv.0.0.weight
energy_predictor.conv.0.0.bias
energy_predictor.conv.0.2.weight
energy_predictor.conv.0.2.bias
energy_predictor.conv.1.0.weight
energy_predictor.conv.1.0.bias
energy_predictor.conv.1.2.weight
energy_predictor.conv.1.2.bias
energy_predictor.linear.weight
energy_predictor.linear.bias
energy_embed.0.weight
energy_embed.0.bias
decoder.embed.0.alpha
decoder.encoders.0.self_attn.linear_q.weight
decoder.encoders.0.self_attn.linear_q.bias
decoder.encoders.0.self_attn.linear_k.weight
decoder.encoders.0.self_attn.linear_k.bias
decoder.encoders.0.self_attn.linear_v.weight
decoder.encoders.0.self_attn.linear_v.bias
decoder.encoders.0.self_attn.linear_out.weight
decoder.encoders.0.self_attn.linear_out.bias
decoder.encoders.0.feed_forward.w_1.weight
decoder.encoders.0.feed_forward.w_1.bias
decoder.encoders.0.feed_forward.w_2.weight
decoder.encoders.0.feed_forward.w_2.bias
decoder.encoders.0.norm1.weight
decoder.encoders.0.norm1.bias
decoder.encoders.0.norm2.weight
decoder.encoders.0.norm2.bias
decoder.encoders.1.self_attn.linear_q.weight
decoder.encoders.1.self_attn.linear_q.bias
decoder.encoders.1.self_attn.linear_k.weight
decoder.encoders.1.self_attn.linear_k.bias
decoder.encoders.1.self_attn.linear_v.weight
decoder.encoders.1.self_attn.linear_v.bias
decoder.encoders.1.self_attn.linear_out.weight
decoder.encoders.1.self_attn.linear_out.bias
decoder.encoders.1.feed_forward.w_1.weight
decoder.encoders.1.feed_forward.w_1.bias
decoder.encoders.1.feed_forward.w_2.weight
decoder.encoders.1.feed_forward.w_2.bias
decoder.encoders.1.norm1.weight
decoder.encoders.1.norm1.bias
decoder.encoders.1.norm2.weight
decoder.encoders.1.norm2.bias
decoder.encoders.2.self_attn.linear_q.weight
decoder.encoders.2.self_attn.linear_q.bias
decoder.encoders.2.self_attn.linear_k.weight
decoder.encoders.2.self_attn.linear_k.bias
decoder.encoders.2.self_attn.linear_v.weight
decoder.encoders.2.self_attn.linear_v.bias
decoder.encoders.2.self_attn.linear_out.weight
decoder.encoders.2.self_attn.linear_out.bias
decoder.encoders.2.feed_forward.w_1.weight
decoder.encoders.2.feed_forward.w_1.bias
decoder.encoders.2.feed_forward.w_2.weight
decoder.encoders.2.feed_forward.w_2.bias
decoder.encoders.2.norm1.weight
decoder.encoders.2.norm1.bias
decoder.encoders.2.norm2.weight
decoder.encoders.2.norm2.bias
decoder.encoders.3.self_attn.linear_q.weight
decoder.encoders.3.self_attn.linear_q.bias
decoder.encoders.3.self_attn.linear_k.weight
decoder.encoders.3.self_attn.linear_k.bias
decoder.encoders.3.self_attn.linear_v.weight
decoder.encoders.3.self_attn.linear_v.bias
decoder.encoders.3.self_attn.linear_out.weight
decoder.encoders.3.self_attn.linear_out.bias
decoder.encoders.3.feed_forward.w_1.weight
decoder.encoders.3.feed_forward.w_1.bias
decoder.encoders.3.feed_forward.w_2.weight
decoder.encoders.3.feed_forward.w_2.bias
decoder.encoders.3.norm1.weight
decoder.encoders.3.norm1.bias
decoder.encoders.3.norm2.weight
decoder.encoders.3.norm2.bias
decoder.after_norm.weight
decoder.after_norm.bias
feat_out.weight
feat_out.bias
postnet.postnet.0.0.weight
postnet.postnet.0.1.weight
postnet.postnet.0.1.bias
postnet.postnet.0.1._mean
postnet.postnet.0.1._variance
postnet.postnet.1.0.weight
postnet.postnet.1.1.weight
postnet.postnet.1.1.bias
postnet.postnet.1.1._mean
postnet.postnet.1.1._variance
postnet.postnet.2.0.weight
postnet.postnet.2.1.weight
postnet.postnet.2.1.bias
postnet.postnet.2.1._mean
postnet.postnet.2.1._variance
postnet.postnet.3.0.weight
postnet.postnet.3.1.weight
postnet.postnet.3.1.bias
postnet.postnet.3.1._mean
postnet.postnet.3.1._variance
postnet.postnet.4.0.weight
postnet.postnet.4.1.weight
postnet.postnet.4.1.bias
postnet.postnet.4.1._mean
postnet.postnet.4.1._variance

examples/other/tts_finetune/tts3/conf/finetune.yaml
num_snapshots: -1

# frozen_layers should be a list
# if you don't need to freeze, set frozen_layers to []
# fastspeech2 layers can be found in conf/fastspeech2_layers.txt
# example: frozen_layers: ["encoder", "duration_predictor"]
frozen_layers: ["encoder"]