update finetune, test=tts
liangym committed Sep 21, 2022
1 parent 4ab07cd commit 237991c
Showing 11 changed files with 681 additions and 109 deletions.
70 changes: 32 additions & 38 deletions examples/other/tts_finetune/tts3/README.md
This example shows how to finetune your own AM based on FastSpeech2 with AISHELL-3.

We use AISHELL-3 to train a multi-speaker fastspeech2 model. You can refer to [examples/aishell3/tts3](https://github.com/lym0302/PaddleSpeech/tree/develop/examples/aishell3/tts3) to train a multi-speaker fastspeech2 model from scratch.


## Prepare
### Download Pretrained Fastspeech2 model
Assume the path to the model is `./pretrained_models`. Download pretrained fastspeech2 model with aishell3: [fastspeech2_aishell3_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_1.1.0.zip).
### Download Pretrained model
Assume the path to the model is `./pretrained_models`. Download the pretrained fastspeech2 model trained on AISHELL-3, [fastspeech2_aishell3_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_1.1.0.zip), for finetuning, and the pretrained HiFiGAN model trained on AISHELL-3, [hifigan_aishell3_ckpt_0.2.0](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip), for synthesis.

```bash
mkdir -p pretrained_models && cd pretrained_models
# pretrained fastspeech2 model
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_1.1.0.zip
unzip fastspeech2_aishell3_ckpt_1.1.0.zip
# pretrained hifigan model
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip
unzip hifigan_aishell3_ckpt_0.2.0.zip
cd ../
```

### Download MFA tools and pretrained model
Assume the path to the MFA tool is `./tools`. Download [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz) and pretrained MFA models with aishell3: [aishell3_model.zip](https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip).
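A possible way to fetch these, mirroring the pretrained-models step above (the exact layout under `./tools` is an assumption, not taken from this commit):

```bash
mkdir -p tools && cd tools
# MFA tool
wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz
tar xvf montreal-forced-aligner_linux.tar.gz
# pretrained MFA model trained on aishell3
wget https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip
unzip aishell3_model.zip
cd ../
```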

When "Prepare" is done, the structure of the current directory is listed below.
```

### Set finetune.yaml
`finetune.yaml` contains some configurations for fine-tuning. You can try various options to get a better result.
`conf/finetune.yaml` contains some configurations for fine-tuning. You can try various options to get a better result. The value of `frozen_layers` can be changed according to `conf/fastspeech2_layers.txt`, which lists the model layers of fastspeech2.

Arguments:
- `batch_size`: finetune batch size. Default: -1, which means 64, the same as the pretrained model.
- `learning_rate`: learning rate. Default: 0.0001
- `num_snapshots`: number of saved models. Default: -1, which means 5, the same as the pretrained model.
- `frozen_layers`: frozen layers, must be a list. If you don't want to freeze any layer, set it to [].
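A minimal sketch of what `conf/finetune.yaml` can look like with the defaults above (the `frozen_layers` value here is only an example; pick prefixes from `conf/fastspeech2_layers.txt`):

```yaml
# finetune configuration (sketch based on the defaults listed above)
batch_size: -1        # -1 keeps the pretrained model's batch size (64)
learning_rate: 0.0001
num_snapshots: -1     # -1 keeps the pretrained model's setting (5)

# frozen_layers should be a list; entries are name prefixes from
# conf/fastspeech2_layers.txt. Use [] to finetune all layers.
frozen_layers: ["encoder", "duration_predictor"]
```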



## Get Started
Run the command below to
1. **source path**.
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage`.
Finetune a FastSpeech2 model.

```bash
./run.sh --stage 0 --stop-stage 0
./run.sh --stage 0 --stop-stage 5
```
`stage 0` of `run.sh` calls `finetune.py`; here's the complete help message.
`stage 5` of `run.sh` calls `local/finetune.py`; here's the complete help message.

```text
-usage: finetune.py [-h] [--input_dir INPUT_DIR] [--pretrained_model_dir PRETRAINED_MODEL_DIR]
-                   [--mfa_dir MFA_DIR] [--dump_dir DUMP_DIR]
-                   [--output_dir OUTPUT_DIR] [--lang LANG]
-                   [--ngpu NGPU]
+usage: finetune.py [-h] [--pretrained_model_dir PRETRAINED_MODEL_DIR]
+                   [--dump_dir DUMP_DIR] [--output_dir OUTPUT_DIR] [--ngpu NGPU]
+                   [--epoch EPOCH] [--finetune_config FINETUNE_CONFIG]
 optional arguments:
-  -h, --help            show this help message and exit
-  --input_dir INPUT_DIR
-                        directory containing audio and label file
+  -h, --help            Show this help message and exit
   --pretrained_model_dir PRETRAINED_MODEL_DIR
                         Path to pretrained model
-  --mfa_dir MFA_DIR     directory to save aligned files
   --dump_dir DUMP_DIR   directory to save feature files and metadata
   --output_dir OUTPUT_DIR
-                        directory to save finetune model
-  --lang LANG           Choose input audio language, zh or en
-  --ngpu NGPU           if ngpu=0, use cpu
-  --epoch EPOCH         the epoch of finetune
-  --batch_size BATCH_SIZE
-                        the batch size of finetune, default -1 means same as pretrained model
+                        Directory to save finetune model
+  --ngpu NGPU           The number of gpu, if ngpu=0, use cpu
+  --epoch EPOCH         The epoch of finetune
+  --finetune_config FINETUNE_CONFIG
+                        Path to finetune config file
```
1. `--input_dir` is the directory containing the audio and label file.
2. `--pretrained_model_dir` is the directory including the pretrained fastspeech2_aishell3 model.
3. `--mfa_dir` is the directory to save the alignment results from the pretrained MFA_aishell3 model.
4. `--dump_dir` is the directory including audio features and metadata.
5. `--output_dir` is the directory to save the finetuned model.
6. `--lang` is the language of the input audio, zh or en.
7. `--ngpu` is the number of gpus.
8. `--epoch` is the number of finetune epochs.
9. `--batch_size` is the finetune batch size.

1. `--pretrained_model_dir` is the directory including the pretrained fastspeech2_aishell3 model.
2. `--dump_dir` is the directory including audio features and metadata.
3. `--output_dir` is the directory to save the finetuned model.
4. `--ngpu` is the number of gpus; if ngpu=0, use cpu.
5. `--epoch` is the number of finetune epochs.
6. `--finetune_config` is the path to the finetune config file.
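As a rough illustration (the actual values are filled in by `run.sh`, and the paths and numbers below are assumptions), a direct call to `local/finetune.py` might look like this:

```bash
# illustrative paths and values -- check run.sh stage 5 for the real ones
python3 local/finetune.py \
    --pretrained_model_dir=./pretrained_models/fastspeech2_aishell3_ckpt_1.1.0 \
    --dump_dir=./dump \
    --output_dir=./exp/default \
    --ngpu=1 \
    --epoch=100 \
    --finetune_config=./conf/finetune.yaml
```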


### Synthesizing
We use [HiFiGAN](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5) as the neural vocoder.
Assume the path to the hifigan model is `./pretrained_models`. Download the pretrained HiFiGAN model from [hifigan_aishell3_ckpt_0.2.0](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) and unzip it.

```bash
cd pretrained_models
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip
unzip hifigan_aishell3_ckpt_0.2.0.zip
cd ../
```

The HiFiGAN checkpoint contains the files listed below.
```text
hifigan_aishell3_ckpt_0.2.0
```
Modify `ckpt` in `run.sh` to the final model in `exp/default/checkpoints`.
```bash
./run.sh --stage 1 --stop-stage 1
./run.sh --stage 6 --stop-stage 6
```
`stage 6` of `run.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveforms from a text file.

optional arguments:
--output_dir OUTPUT_DIR
output dir.
```

1. `--am` is the acoustic model type, with the format {model_name}_{dataset}.
2. `--am_config`, `--am_ckpt`, `--am_stat`, `--phones_dict`, and `--speaker_dict` are arguments for the acoustic model, which correspond to the 5 files in the fastspeech2 pretrained model.
3. `--voc` is the vocoder type, with the format {model_name}_{dataset}.
7. `--output_dir` is the directory to save synthesized audio files.
8. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
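For orientation, a direct call might look like the sketch below. It is not copied from `run.sh`: the checkpoint filenames (`snapshot_iter_*.pdz`), the vocoder flags (`--voc_config`, `--voc_ckpt`, `--voc_stat`), and `--lang`/`--text` are assumptions following the usual PaddleSpeech `synthesize_e2e.py` interface, so check `run.sh` stage 6 and the unzipped model directories for the exact arguments and filenames.

```bash
# illustrative; ${ckpt} is the finetuned checkpoint chosen in run.sh,
# and "xxxxx" stands for the real iteration number in the unzipped vocoder dir
python3 ${BIN_DIR}/../synthesize_e2e.py \
    --am=fastspeech2_aishell3 \
    --am_config=pretrained_models/fastspeech2_aishell3_ckpt_1.1.0/default.yaml \
    --am_ckpt=exp/default/checkpoints/${ckpt}.pdz \
    --am_stat=pretrained_models/fastspeech2_aishell3_ckpt_1.1.0/speech_stats.npy \
    --phones_dict=pretrained_models/fastspeech2_aishell3_ckpt_1.1.0/phone_id_map.txt \
    --speaker_dict=pretrained_models/fastspeech2_aishell3_ckpt_1.1.0/speaker_id_map.txt \
    --voc=hifigan_aishell3 \
    --voc_config=pretrained_models/hifigan_aishell3_ckpt_0.2.0/default.yaml \
    --voc_ckpt=pretrained_models/hifigan_aishell3_ckpt_0.2.0/snapshot_iter_xxxxx.pdz \
    --voc_stat=pretrained_models/hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
    --lang=zh \
    --text=sentences.txt \
    --output_dir=exp/default/test_e2e \
    --ngpu=1
```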


### Tips
If you want to get better audio quality, you can use more audio data for finetuning.
More finetune results can be found on [finetune-fastspeech2-for-csmsc](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html#finetune-fastspeech2-for-csmsc).
216 changes: 216 additions & 0 deletions examples/other/tts_finetune/tts3/conf/fastspeech2_layers.txt
epoch
iteration
main_params
main_optimizer
spk_embedding_table.weight
encoder.embed.0.weight
encoder.embed.1.alpha
encoder.encoders.0.self_attn.linear_q.weight
encoder.encoders.0.self_attn.linear_q.bias
encoder.encoders.0.self_attn.linear_k.weight
encoder.encoders.0.self_attn.linear_k.bias
encoder.encoders.0.self_attn.linear_v.weight
encoder.encoders.0.self_attn.linear_v.bias
encoder.encoders.0.self_attn.linear_out.weight
encoder.encoders.0.self_attn.linear_out.bias
encoder.encoders.0.feed_forward.w_1.weight
encoder.encoders.0.feed_forward.w_1.bias
encoder.encoders.0.feed_forward.w_2.weight
encoder.encoders.0.feed_forward.w_2.bias
encoder.encoders.0.norm1.weight
encoder.encoders.0.norm1.bias
encoder.encoders.0.norm2.weight
encoder.encoders.0.norm2.bias
encoder.encoders.1.self_attn.linear_q.weight
encoder.encoders.1.self_attn.linear_q.bias
encoder.encoders.1.self_attn.linear_k.weight
encoder.encoders.1.self_attn.linear_k.bias
encoder.encoders.1.self_attn.linear_v.weight
encoder.encoders.1.self_attn.linear_v.bias
encoder.encoders.1.self_attn.linear_out.weight
encoder.encoders.1.self_attn.linear_out.bias
encoder.encoders.1.feed_forward.w_1.weight
encoder.encoders.1.feed_forward.w_1.bias
encoder.encoders.1.feed_forward.w_2.weight
encoder.encoders.1.feed_forward.w_2.bias
encoder.encoders.1.norm1.weight
encoder.encoders.1.norm1.bias
encoder.encoders.1.norm2.weight
encoder.encoders.1.norm2.bias
encoder.encoders.2.self_attn.linear_q.weight
encoder.encoders.2.self_attn.linear_q.bias
encoder.encoders.2.self_attn.linear_k.weight
encoder.encoders.2.self_attn.linear_k.bias
encoder.encoders.2.self_attn.linear_v.weight
encoder.encoders.2.self_attn.linear_v.bias
encoder.encoders.2.self_attn.linear_out.weight
encoder.encoders.2.self_attn.linear_out.bias
encoder.encoders.2.feed_forward.w_1.weight
encoder.encoders.2.feed_forward.w_1.bias
encoder.encoders.2.feed_forward.w_2.weight
encoder.encoders.2.feed_forward.w_2.bias
encoder.encoders.2.norm1.weight
encoder.encoders.2.norm1.bias
encoder.encoders.2.norm2.weight
encoder.encoders.2.norm2.bias
encoder.encoders.3.self_attn.linear_q.weight
encoder.encoders.3.self_attn.linear_q.bias
encoder.encoders.3.self_attn.linear_k.weight
encoder.encoders.3.self_attn.linear_k.bias
encoder.encoders.3.self_attn.linear_v.weight
encoder.encoders.3.self_attn.linear_v.bias
encoder.encoders.3.self_attn.linear_out.weight
encoder.encoders.3.self_attn.linear_out.bias
encoder.encoders.3.feed_forward.w_1.weight
encoder.encoders.3.feed_forward.w_1.bias
encoder.encoders.3.feed_forward.w_2.weight
encoder.encoders.3.feed_forward.w_2.bias
encoder.encoders.3.norm1.weight
encoder.encoders.3.norm1.bias
encoder.encoders.3.norm2.weight
encoder.encoders.3.norm2.bias
encoder.after_norm.weight
encoder.after_norm.bias
spk_projection.weight
spk_projection.bias
duration_predictor.conv.0.0.weight
duration_predictor.conv.0.0.bias
duration_predictor.conv.0.2.weight
duration_predictor.conv.0.2.bias
duration_predictor.conv.1.0.weight
duration_predictor.conv.1.0.bias
duration_predictor.conv.1.2.weight
duration_predictor.conv.1.2.bias
duration_predictor.linear.weight
duration_predictor.linear.bias
pitch_predictor.conv.0.0.weight
pitch_predictor.conv.0.0.bias
pitch_predictor.conv.0.2.weight
pitch_predictor.conv.0.2.bias
pitch_predictor.conv.1.0.weight
pitch_predictor.conv.1.0.bias
pitch_predictor.conv.1.2.weight
pitch_predictor.conv.1.2.bias
pitch_predictor.conv.2.0.weight
pitch_predictor.conv.2.0.bias
pitch_predictor.conv.2.2.weight
pitch_predictor.conv.2.2.bias
pitch_predictor.conv.3.0.weight
pitch_predictor.conv.3.0.bias
pitch_predictor.conv.3.2.weight
pitch_predictor.conv.3.2.bias
pitch_predictor.conv.4.0.weight
pitch_predictor.conv.4.0.bias
pitch_predictor.conv.4.2.weight
pitch_predictor.conv.4.2.bias
pitch_predictor.linear.weight
pitch_predictor.linear.bias
pitch_embed.0.weight
pitch_embed.0.bias
energy_predictor.conv.0.0.weight
energy_predictor.conv.0.0.bias
energy_predictor.conv.0.2.weight
energy_predictor.conv.0.2.bias
energy_predictor.conv.1.0.weight
energy_predictor.conv.1.0.bias
energy_predictor.conv.1.2.weight
energy_predictor.conv.1.2.bias
energy_predictor.linear.weight
energy_predictor.linear.bias
energy_embed.0.weight
energy_embed.0.bias
decoder.embed.0.alpha
decoder.encoders.0.self_attn.linear_q.weight
decoder.encoders.0.self_attn.linear_q.bias
decoder.encoders.0.self_attn.linear_k.weight
decoder.encoders.0.self_attn.linear_k.bias
decoder.encoders.0.self_attn.linear_v.weight
decoder.encoders.0.self_attn.linear_v.bias
decoder.encoders.0.self_attn.linear_out.weight
decoder.encoders.0.self_attn.linear_out.bias
decoder.encoders.0.feed_forward.w_1.weight
decoder.encoders.0.feed_forward.w_1.bias
decoder.encoders.0.feed_forward.w_2.weight
decoder.encoders.0.feed_forward.w_2.bias
decoder.encoders.0.norm1.weight
decoder.encoders.0.norm1.bias
decoder.encoders.0.norm2.weight
decoder.encoders.0.norm2.bias
decoder.encoders.1.self_attn.linear_q.weight
decoder.encoders.1.self_attn.linear_q.bias
decoder.encoders.1.self_attn.linear_k.weight
decoder.encoders.1.self_attn.linear_k.bias
decoder.encoders.1.self_attn.linear_v.weight
decoder.encoders.1.self_attn.linear_v.bias
decoder.encoders.1.self_attn.linear_out.weight
decoder.encoders.1.self_attn.linear_out.bias
decoder.encoders.1.feed_forward.w_1.weight
decoder.encoders.1.feed_forward.w_1.bias
decoder.encoders.1.feed_forward.w_2.weight
decoder.encoders.1.feed_forward.w_2.bias
decoder.encoders.1.norm1.weight
decoder.encoders.1.norm1.bias
decoder.encoders.1.norm2.weight
decoder.encoders.1.norm2.bias
decoder.encoders.2.self_attn.linear_q.weight
decoder.encoders.2.self_attn.linear_q.bias
decoder.encoders.2.self_attn.linear_k.weight
decoder.encoders.2.self_attn.linear_k.bias
decoder.encoders.2.self_attn.linear_v.weight
decoder.encoders.2.self_attn.linear_v.bias
decoder.encoders.2.self_attn.linear_out.weight
decoder.encoders.2.self_attn.linear_out.bias
decoder.encoders.2.feed_forward.w_1.weight
decoder.encoders.2.feed_forward.w_1.bias
decoder.encoders.2.feed_forward.w_2.weight
decoder.encoders.2.feed_forward.w_2.bias
decoder.encoders.2.norm1.weight
decoder.encoders.2.norm1.bias
decoder.encoders.2.norm2.weight
decoder.encoders.2.norm2.bias
decoder.encoders.3.self_attn.linear_q.weight
decoder.encoders.3.self_attn.linear_q.bias
decoder.encoders.3.self_attn.linear_k.weight
decoder.encoders.3.self_attn.linear_k.bias
decoder.encoders.3.self_attn.linear_v.weight
decoder.encoders.3.self_attn.linear_v.bias
decoder.encoders.3.self_attn.linear_out.weight
decoder.encoders.3.self_attn.linear_out.bias
decoder.encoders.3.feed_forward.w_1.weight
decoder.encoders.3.feed_forward.w_1.bias
decoder.encoders.3.feed_forward.w_2.weight
decoder.encoders.3.feed_forward.w_2.bias
decoder.encoders.3.norm1.weight
decoder.encoders.3.norm1.bias
decoder.encoders.3.norm2.weight
decoder.encoders.3.norm2.bias
decoder.after_norm.weight
decoder.after_norm.bias
feat_out.weight
feat_out.bias
postnet.postnet.0.0.weight
postnet.postnet.0.1.weight
postnet.postnet.0.1.bias
postnet.postnet.0.1._mean
postnet.postnet.0.1._variance
postnet.postnet.1.0.weight
postnet.postnet.1.1.weight
postnet.postnet.1.1.bias
postnet.postnet.1.1._mean
postnet.postnet.1.1._variance
postnet.postnet.2.0.weight
postnet.postnet.2.1.weight
postnet.postnet.2.1.bias
postnet.postnet.2.1._mean
postnet.postnet.2.1._variance
postnet.postnet.3.0.weight
postnet.postnet.3.1.weight
postnet.postnet.3.1.bias
postnet.postnet.3.1._mean
postnet.postnet.3.1._variance
postnet.postnet.4.0.weight
postnet.postnet.4.1.weight
postnet.postnet.4.1.bias
postnet.postnet.4.1._mean
postnet.postnet.4.1._variance

examples/other/tts_finetune/tts3/conf/finetune.yaml
num_snapshots: -1

# frozen_layers should be a list
# if you don't need to freeze, set frozen_layers to []
# fastspeech2 layers can be found in conf/fastspeech2_layers.txt
# example: frozen_layers: ["encoder", "duration_predictor"]
frozen_layers: ["encoder"]