
VoiceCraft Fine-tune dataset preparation #138

Open

rikabi89 opened this issue Jun 11, 2024 · 4 comments

@rikabi89

Hello

I've been going through the instructions on the GitHub page, however I was not able to figure out how one prepares their own dataset for fine-tuning.

Could anyone share this if they have been successful?

@zmy1116

zmy1116 commented Aug 9, 2024

The instructions are pretty straightforward; which part do you have a question about?

[two screenshots attached]

@rikabi89 (Author)

> The instructions are pretty straightforward; which part do you have a question about?

I kinda gave up on this a long time ago. But I got to a point where I was able to figure out most of it in terms of generating the transcripts, phoneme sequences, and metadata. But I still don't feel I understand the whole dataset structure, like where the .wav files are supposed to go and which files correspond to which directories.

Also "make sure you modify the weights loading part" - I have no idea how this is done. I tried ChatGPT and other AIs but I was going around in circles. So in the end I was not able to start training, because either I didn't prepare the dataset right or something went wrong with the weight loading.

@zmy1116

zmy1116 commented Aug 11, 2024

> where the .wav files are supposed to go and which files correspond to which directories

I would suggest you go through data/gigaspeech.py. A lot of other parts of the repository look a bit intimidating, but based on how the dataset is defined you'll realize that, at least for dataset preparation, it's not difficult. All you need per training example is:

  • the phonemes of the transcription text
  • the encodec codes of the audio

Since phonemes are treated as text tokens, you need to load a map from phoneme character to index so the model can generate phoneme embeddings; an example phoneme map is stored in any of the pretrained weights as model['phn2num'].
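
For reference, a minimal Python sketch of pulling that map out of a checkpoint (the file name giga830M.pth is a placeholder, and I'm assuming phn2num sits at the top level of the checkpoint dict, as described above):

      import torch

      # Load a pretrained VoiceCraft checkpoint on CPU (placeholder path).
      ckpt = torch.load("giga830M.pth", map_location="cpu")

      # phn2num maps each phoneme string to its integer token index.
      phn2num = ckpt["phn2num"]
      print(len(phn2num))               # e.g. 80 for giga830m.pt
      print(list(phn2num.items())[:5])  # a few (phoneme, index) pairs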

You don't need to follow the exact file structure the original dataset requires, as long as the __getitem__ of your dataset returns:

      return {
          "x": torch.LongTensor(x),
          "x_len": x_len,
          "y": torch.LongTensor(y),
          "y_len": y_len
      }

where x is the phoneme token indices and y is the encodec codes of the audio:

  • suppose you have text_phonemes and the map phn2num; then x is simply [phn2num[p] for p in text_phonemes] (a full dataset sketch follows below)
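
To make this concrete, here is a minimal sketch of such a dataset; the example layout (a list of dicts with text_phonemes and codes keys) is hypothetical, and only the returned dict matches what the trainer expects:

      import torch
      from torch.utils.data import Dataset

      class MyFinetuneDataset(Dataset):
          def __init__(self, examples, phn2num):
              # examples: list of dicts with "text_phonemes" (list of phoneme
              # strings) and "codes" (encodec indices) -- hypothetical layout.
              self.examples = examples
              self.phn2num = phn2num

          def __len__(self):
              return len(self.examples)

          def __getitem__(self, idx):
              ex = self.examples[idx]
              # x: phoneme token indices looked up through phn2num
              x = [self.phn2num[p] for p in ex["text_phonemes"]]
              # y: encodec code indices, e.g. shape [n_codebooks, T]
              y = torch.LongTensor(ex["codes"])
              return {
                  "x": torch.LongTensor(x),
                  "x_len": len(x),
                  "y": y,
                  "y_len": y.shape[-1],
              }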

Once you have defined a dataset that outputs the above, you can pretty much run the training as-is by swapping the gigaspeech dataset for yours.

Phoneme and encodec generation are both in data/phonemize_encodec_encode_hf.py; you run one module on the text transcripts and the other on the audio files.
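
If you prefer to script the two steps yourself rather than adapt that file, they can be sketched with the standalone phonemizer and encodec packages; VoiceCraft ships its own encodec checkpoint, so treat this as an illustration of the idea, not the repo's exact pipeline:

      import torch
      import torchaudio
      from phonemizer import phonemize
      from encodec import EncodecModel
      from encodec.utils import convert_audio

      # Step 1: transcript -> phoneme string (requires the espeak backend).
      phonemes = phonemize("hello world", language="en-us",
                           backend="espeak", strip=True)

      # Step 2: audio -> encodec code indices.
      model = EncodecModel.encodec_model_24khz()
      model.set_target_bandwidth(6.0)  # bandwidth selects the codebook count

      wav, sr = torchaudio.load("example.wav")  # placeholder path
      wav = convert_audio(wav, sr, model.sample_rate, model.channels)

      with torch.no_grad():
          frames = model.encode(wav.unsqueeze(0))
      # Concatenate per-frame codes into one [n_codebooks, T] tensor.
      codes = torch.cat([c for c, _ in frames], dim=-1).squeeze(0)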

> make sure you modify the weights loading part

So, look at whichever pretrained model you want to start fine-tuning from: model['phn2num'] records the set of text phonemes the model handles. For instance, giga830m.pt handles 80 phonemes. Once you have finished generating the text phonemes of your own dataset, you can build its phoneme set, for example:

      phonemes_set = set(sum([x['text_phonemes'] for x in data], []))

You then need to compare your phoneme set with the model's. If there are new phonemes introduced in your dataset, you need to:

  • expand the phoneme-to-token-index map phn2num so that the model can recognize your new phonemes
  • the model generates text phoneme embeddings using an embedding layer; the weight is stored at model['text_embedding.word_embeddings.weight']. For giga830m.pt this weight has shape 101x2048. Although phn2num shows only 80 phoneme tokens, if you look at the norm of each embedding row you'll see that only the first 80 out of 101 are trained; the remaining 21 are random, and the one at index 100 represents the pad token.

So what you need to do is:

  • once you have expanded the phn2num map with your new phoneme tokens, create an embedding matrix that is at least as large as len(phn2num) + 1, copy the first 80 rows from model['text_embedding.word_embeddings.weight'] (new_weight[:80] = old_weight[:80]), update model['text_embedding.word_embeddings.weight'] = new_weight, and save the weights (a sketch of the whole procedure follows below)
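
Putting the expansion together, a minimal sketch under the same assumptions (placeholder paths; the checkpoint keys follow the names quoted above, though in some checkpoints the weight may sit under a nested state dict):

      import torch

      ckpt = torch.load("giga830M.pth", map_location="cpu")  # placeholder
      phn2num = ckpt["phn2num"]
      old_weight = ckpt["text_embedding.word_embeddings.weight"]  # 101x2048

      n_trained = len(phn2num)  # e.g. 80 trained phoneme tokens

      # phonemes_set is the set computed from your own data (see above);
      # append each phoneme the model has not seen to the end of the map.
      for p in sorted(phonemes_set - set(phn2num)):
          phn2num[p] = len(phn2num)

      # New matrix: one row per phoneme token plus a pad row at the end.
      # The random-init scale here is an arbitrary choice.
      new_weight = torch.randn(len(phn2num) + 1, old_weight.shape[1]) * 0.02
      new_weight[:n_trained] = old_weight[:n_trained]  # keep trained rows

      ckpt["phn2num"] = phn2num
      ckpt["text_embedding.word_embeddings.weight"] = new_weight
      torch.save(ckpt, "giga830M_expanded.pth")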

Finally, in your training script, set:

  • text_vocab_size : new_weight.shape[0] - 1 (because the last token embedding is the meaningless padding token)
  • text_pad_token : new_weight.shape[0] - 1 (say you have 110 true phoneme tokens; then the weight matrix has shape 111x2048, and weight[110] is the embedding for the meaningless padding token)
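
In numbers, following the convention above where the pad row is the last row of the expanded matrix:

      # With 110 true phoneme tokens, new_weight has shape 111x2048
      # and the pad row is the last one, index 110.
      text_vocab_size = new_weight.shape[0] - 1  # 110 real phoneme tokens
      text_pad_token = new_weight.shape[0] - 1   # index of the pad row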

hope this helps

@rikabi89 (Author)

> [quotes zmy1116's full explanation above]

Well, thanks for taking the time to explain this. I will try again sometime in the future and see how it goes!
