SLM Adversarial Training did not start when finetuning #227
Same issue.
You seem to be missing the configuration options for when the second training starts. See lines 6 and 7 in the LibriTTS config file. You should be able to kick off second-stage training by loading your current model checkpoint and setting `epochs_1st` to 0.
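In config terms, that suggestion would look something like the sketch below (the checkpoint path is a placeholder, the epoch count is illustrative, and the key names are assumed from the stock configs):

```yaml
epochs_1st: 0     # skip first-stage training entirely
epochs_2nd: 100   # number of epochs for second-stage training
pretrained_model: "Models/LibriTTS/your_checkpoint.pth"  # placeholder path
load_only_params: true  # load model weights only, not optimizer state
```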
Thanks, do I also need the
This did not fix the issue, unfortunately.
I've managed to get to where things go sour: if you have a batch size of 2, then it will always be 1, meaning SLMADV never starts. You need to change `batch_percentage`.
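If the batch selection works the way this comment describes (an assumption about the training code, not a quote from it), the arithmetic is roughly `int(batch_size * batch_percentage)`: with the stock `batch_percentage` of 0.5 and a batch size of 2, that rounds down to 1. A config sketch of the workaround:

```yaml
# Assumed mechanism, paraphrasing the comment above: the SLM adversarial
# step only receives int(batch_size * batch_percentage) samples per batch,
# so int(2 * 0.5) = 1 with the stock batch_percentage of 0.5.
batch_size: 2
slmadv_params:
  batch_percentage: 1   # hand the full batch to SLM adversarial training
```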
@78Alpha
They're going to be zero for a while unless the conditions it's looking for are met. Over about 1 epoch of training, my TensorBoard only showed 60 steps' worth of SLM training when I set `batch_percentage` to 1. I don't know what exactly it's looking for.
@78Alpha
Yeah, that's what it should look like. All graphs being filled is the sign that all parts are working.
I tinkered around with the config_ft.yml file: I set `max_len` to 120, `batch_percentage` to 1, `slmadv_params` `min_len` to 100, and `slmadv_params` `max_len` to 120, with the batch size set to 2. Now the DiscLM and GenLM loss stats are no longer at 0. I'm using an RX 7900 XTX. Note that I'm training a model with style diffusion in one fine-tuning session and adversarial training in another session. Here's a pic from my TensorBoard folder.

Edit: I discovered I can do style diffusion and SLM adversarial training together in one session. I set `max_len` to 252, epochs to 100, `batch_size` to 2, `batch_percentage` to 1, `slmadv_params` `min_len` to 180, `slmadv_params` `max_len` to 190, `diff_epoch` to 10, and `joint_epoch` to 50, using the Vokan model as the base model. I also rented an H100 from RunPod with `slmadv_params` at the default settings (`min_len: 400` and `max_len: 500`), batch size at 2, and `batch_percentage` at 1, and SLM adversarial training never started. So I tinkered with config_ft.yml again: `max_len` set to 252, `batch_percentage` to 1, `slmadv_params` `min_len` to 100, `slmadv_params` `max_len` to 500, batch size 2. Now the DiscLM and GenLM loss stats are only occasionally at 0, all in one session.

Second edit: Turns out this is bad. I had set `slmadv_params` `min_len` even higher to get better quality. Here's a screenshot of the VRAM usage.
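Pulling the reported one-session values together, the relevant part of config_ft.yml would look roughly like this (a sketch assembled from the numbers above, with key placement assumed from the stock fine-tuning config; the Vokan path is a placeholder):

```yaml
epochs: 100
batch_size: 2
max_len: 252                  # training segment length
pretrained_model: "Vokan.pth" # placeholder path to the Vokan base model

loss_params:
  diff_epoch: 10              # style diffusion starts here
  joint_epoch: 50             # SLM adversarial (joint) training starts here

slmadv_params:
  min_len: 180                # kept below max_len, or SLMADV never kicks in
  max_len: 190
  batch_percentage: 1         # use the whole batch for SLM training
```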
This is what I did in RunPod:

- I install the required packages.
- I use the `pwd` command to find directory/filepath information.
- I put the training dataset in a zip file and upload it to either https://catbox.moe/ or https://litterbox.catbox.moe/ (which lets you upload a 1 GB file).
- I upload the Vokan base model to gofile.io and then download it on the pod with gofile-downloader.
- I unzip the dataset file.
- I download the gofile upload script file.
- I give the script execute permissions.
- I upload the resulting .pth file to https://gofile.io/.
Third edit: the combinations I tried:

- `max_len` set to 252, `slmadv_params` `min_len` set to 180, `slmadv_params` `max_len` set to 190.
- `max_len` set to 252, `slmadv_params` `min_len` set to 252, `slmadv_params` `max_len` set to 252.
- `max_len` set to 260, `slmadv_params` `min_len` set to 260, `slmadv_params` `max_len` set to 260.
- `max_len` set to 280, `slmadv_params` `min_len` set to 280, `slmadv_params` `max_len` set to 280.
Which one is the best source to train a StyleTTS2 model?
@PriyamJha0124
I tried to do fine-tuning on a small dataset with 2 speakers. I set `epochs=25`, `diff_epoch=8`, and `joint_epoch=15`. The style diffusion training started as expected, but SLM adversarial training never started throughout the entire fine-tuning process.
My config is
What have I missed? Thanks!
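For reference, the epoch settings described above map onto config_ft.yml roughly like this (a sketch assuming the stock key layout; the actual attached config is not reproduced here, and the `slmadv_params` shown are the stock defaults):

```yaml
epochs: 25
loss_params:
  diff_epoch: 8     # style diffusion started here, as observed
  joint_epoch: 15   # SLM adversarial training should start here but never did
slmadv_params:
  min_len: 400      # stock defaults; per the discussion above, SLMADV may be
  max_len: 500      # skipped when these exceed the training max_len
```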