
VoiceCraft Fine-tune dataset preparation #138

Open

rikabi89 opened this issue Jun 11, 2024 · 4 comments

@rikabi89

Hello

I've been going through the instructions on the GitHub page, however I was not able to figure out how one prepares their own dataset for fine-tuning.

Could anyone share this if they have been successful?

@zmy1116

zmy1116 commented Aug 9, 2024

The instructions are pretty straightforward; which part do you have a question about?

[two screenshots attached]

@rikabi89 (Author)

> The instructions are pretty straightforward; which part do you have a question about?

I kinda gave up on this a long time ago. But I got to a point where I was able to figure out most of it in terms of generating the transcripts, phoneme sequences, and metadata. But I still don't feel I understand the whole dataset structure, like where the .wav files are supposed to go and which files correspond to which directories.

Also "make sure you modify the weights loading part" - I have no idea how this is done. I tried ChatGPT and other AIs but I was going around in circles. So in the end I was not able to start training, because either I didn't prepare the dataset right or something went wrong with the weight loading.

@zmy1116

zmy1116 commented Aug 11, 2024

> where the .wav files are supposed to go and which files correspond to which directories

I would suggest you go through data/gigaspeech.py. A lot of other parts of the repository look a bit intimidating, but based on how the dataset is defined you'll realize that, at least for dataset preparation, it's not difficult. All you need per training example is:

  • the phonemes of the transcription text
  • the encodec codes of the audio

Since phonemes are treated as text tokens, you need to load a map from phoneme character to index so the model can generate phoneme embeddings; an example phoneme map is stored in any of the pretrained weights as model['phn2num'].
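
For reference, a minimal Python sketch of pulling that map out of a checkpoint (the file name giga830M.pth is a placeholder, and I'm assuming phn2num sits at the top level of the checkpoint dict, as described above):

      import torch

      # Load a pretrained VoiceCraft checkpoint on CPU (placeholder path).
      ckpt = torch.load("giga830M.pth", map_location="cpu")

      # phn2num maps each phoneme string to its integer token index.
      phn2num = ckpt["phn2num"]
      print(len(phn2num))               # e.g. 80 for giga830m.pt
      print(list(phn2num.items())[:5])  # a few (phoneme, index) pairs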

You don't need to follow the exact file structure the original dataset requires, as long as the __getitem__ of your dataset returns:

      return {
          "x": torch.LongTensor(x),
          "x_len": x_len,
          "y": torch.LongTensor(y),
          "y_len": y_len
      }

where x is the phoneme token indices and y is the encodec codes of the audio:

  • suppose you have text_phonemes and the map phn2num; then x is simply [phn2num[p] for p in text_phonemes] (a full dataset sketch follows below)
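
To make this concrete, here is a minimal sketch of such a dataset; the example layout (a list of dicts with text_phonemes and codes keys) is hypothetical, and only the returned dict matches what the trainer expects:

      import torch
      from torch.utils.data import Dataset

      class MyFinetuneDataset(Dataset):
          def __init__(self, examples, phn2num):
              # examples: list of dicts with "text_phonemes" (list of phoneme
              # strings) and "codes" (encodec indices) -- hypothetical layout.
              self.examples = examples
              self.phn2num = phn2num

          def __len__(self):
              return len(self.examples)

          def __getitem__(self, idx):
              ex = self.examples[idx]
              # x: phoneme token indices looked up through phn2num
              x = [self.phn2num[p] for p in ex["text_phonemes"]]
              # y: encodec code indices, e.g. shape [n_codebooks, T]
              y = torch.LongTensor(ex["codes"])
              return {
                  "x": torch.LongTensor(x),
                  "x_len": len(x),
                  "y": y,
                  "y_len": y.shape[-1],
              }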

Once you have defined a dataset that outputs the above, you can pretty much run the training as-is by swapping the gigaspeech dataset for yours.

Phoneme and encodec generation are both in data/phonemize_encodec_encode_hf.py; you run one module on the text transcripts and the other on the audio files.
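
If you prefer to script the two steps yourself rather than adapt that file, they can be sketched with the standalone phonemizer and encodec packages; VoiceCraft ships its own encodec checkpoint, so treat this as an illustration of the idea, not the repo's exact pipeline:

      import torch
      import torchaudio
      from phonemizer import phonemize
      from encodec import EncodecModel
      from encodec.utils import convert_audio

      # Step 1: transcript -> phoneme string (requires the espeak backend).
      phonemes = phonemize("hello world", language="en-us",
                           backend="espeak", strip=True)

      # Step 2: audio -> encodec code indices.
      model = EncodecModel.encodec_model_24khz()
      model.set_target_bandwidth(6.0)  # bandwidth selects the codebook count

      wav, sr = torchaudio.load("example.wav")  # placeholder path
      wav = convert_audio(wav, sr, model.sample_rate, model.channels)

      with torch.no_grad():
          frames = model.encode(wav.unsqueeze(0))
      # Concatenate per-frame codes into one [n_codebooks, T] tensor.
      codes = torch.cat([c for c, _ in frames], dim=-1).squeeze(0)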

> make sure you modify the weights loading part

So, look at whichever pretrained model you want to start fine-tuning from: model['phn2num'] records the set of text phonemes the model handles. For instance, giga830m.pt handles 80 phonemes. Once you have finished generating the text phonemes of your own dataset, you can build its phoneme set, for example:

      phonemes_set = set(sum([x['text_phonemes'] for x in data], []))

You then need to compare your phoneme set with the model's. If there are new phonemes introduced in your dataset, you need to:

  • expand the phoneme-to-token-index map phn2num so that the model can recognize your new phonemes
  • the model generates text phoneme embeddings using an embedding layer; the weight is stored at model['text_embedding.word_embeddings.weight']. For giga830m.pt this weight has shape 101x2048. Although phn2num shows only 80 phoneme tokens, if you look at the norm of each embedding row you'll see that only the first 80 out of 101 are trained; the remaining 21 are random, and the one at index 100 represents the pad token.

So what you need to do is:

  • once you have expanded the phn2num map with your new phoneme tokens, create an embedding matrix that is at least as large as len(phn2num) + 1, copy the first 80 rows from model['text_embedding.word_embeddings.weight'] (new_weight[:80] = old_weight[:80]), update model['text_embedding.word_embeddings.weight'] = new_weight, and save the weights (a sketch of the whole procedure follows below)
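
Putting the expansion together, a minimal sketch under the same assumptions (placeholder paths; the checkpoint keys follow the names quoted above, though in some checkpoints the weight may sit under a nested state dict):

      import torch

      ckpt = torch.load("giga830M.pth", map_location="cpu")  # placeholder
      phn2num = ckpt["phn2num"]
      old_weight = ckpt["text_embedding.word_embeddings.weight"]  # 101x2048

      n_trained = len(phn2num)  # e.g. 80 trained phoneme tokens

      # phonemes_set is the set computed from your own data (see above);
      # append each phoneme the model has not seen to the end of the map.
      for p in sorted(phonemes_set - set(phn2num)):
          phn2num[p] = len(phn2num)

      # New matrix: one row per phoneme token plus a pad row at the end.
      # The random-init scale here is an arbitrary choice.
      new_weight = torch.randn(len(phn2num) + 1, old_weight.shape[1]) * 0.02
      new_weight[:n_trained] = old_weight[:n_trained]  # keep trained rows

      ckpt["phn2num"] = phn2num
      ckpt["text_embedding.word_embeddings.weight"] = new_weight
      torch.save(ckpt, "giga830M_expanded.pth")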

Finally, in your training script, set:

  • text_vocab_size : new_weight.shape[0] - 1 (because the last token embedding is the meaningless padding token)
  • text_pad_token : new_weight.shape[0] - 1 (say you have 110 true phoneme tokens; then the weight matrix has shape 111x2048, and weight[110] is the embedding for the meaningless padding token)
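
In numbers, following the convention above where the pad row is the last row of the expanded matrix:

      # With 110 true phoneme tokens, new_weight has shape 111x2048
      # and the pad row is the last one, index 110.
      text_vocab_size = new_weight.shape[0] - 1  # 110 real phoneme tokens
      text_pad_token = new_weight.shape[0] - 1   # index of the pad row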

hope this helps

@rikabi89 (Author)

> [quotes zmy1116's full explanation above]

Well, thanks for taking the time to explain this. I will try again sometime in the future and see how it goes!
