Trying to finetune model for new speaker #405
Comments
It says that fine-tuning speaker change detection (SCD) and speaker embedding (EMB) with just one speaker does not really make sense either:
The pipeline needs to be adapted to these new SAD, SCD, and EMB models as well. See this tutorial.
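For illustration, an adapted pipeline configuration could look roughly like the sketch below, assuming the pyannote.audio 1.x config.yml layout where the SpeakerDiarization pipeline takes sad_scores, scd_scores, and embedding parameters; the parameter names and paths are assumptions, so the linked tutorial remains the reference:

```yaml
# Sketch of a pipeline config pointing at fine-tuned models
# (parameter names and paths are assumptions, not taken from this thread).
pipeline:
  name: pyannote.audio.pipeline.speaker_diarization.SpeakerDiarization
  params:
    sad_scores: /path/to/finetuned/sad   # fine-tuned speech activity detection
    scd_scores: /path/to/finetuned/scd   # fine-tuned speaker change detection
    embedding: /path/to/finetuned/emb    # fine-tuned speaker embedding
```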
Ok, now it seems to be clearer :) Could you please estimate how much audio (speech, non-speech) is required to fine-tune the models for a new speaker's voice?
It sounds like you might be misunderstanding what speaker diarization is. If you are trying to detect a particular speaker (using an existing recording of this speaker as enrollment), what you want is speaker tracking, not speaker diarization. Can you please describe precisely what your final task is?
Well, the task is to gather many hours of a particular speaker talking (to feed that data to a TTS like Tacotron 2 and train it to speak with a new voice).
I suggest you have a look at this issue that is very similar to what you are trying to achieve.
For now I see the problem that when I use a random video, for example https://www.youtube.com/watch?v=5m8SSt4gp7A (sorry for the Russian, but the language itself does not matter here), there are actually 2 people talking (I.Kirillov most of the time and some other guy at the end of the video), yet the SpeakerDiarization pipeline reports only one speaker talking the whole time (it treats both speakers as the same person). I thought fine-tuning the model on I.Kirillov's voice would make it possible to distinguish his voice from other speakers.
Hi @hbredin, @marlon-br;
I got the following error:
Using cache found in /Users/xx.yy/.cache/torch/hub/pyannote_pyannote-audio_develop
File "/usr/local/lib/python3.7/site-packages/pyannote/audio/applications/pyannote_audio.py", line 366, in main
File "/usr/local/lib/python3.7/site-packages/pyannote/audio/applications/base.py", line 198, in train
TypeError: get_protocol() got an unexpected keyword argument 'progress'
knowing that I've installed pyannote.db.voxceleb. Am I missing something else?
This may happen when you directly apply the dia/ami trained models on your own data, or when you fine-tune them using a very small training set and/or for just a couple of epochs!
The API of …
Thanks for your prompt answer!
Yes, this is due to the latest version of …
I've added the .lst file and now it works!
Data are downsampled on the fly. But it probably does not hurt efficiency to downsample them first.
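If downsampling beforehand, a standard way to do it is a one-liner with ffmpeg; the 16 kHz mono target is an assumption based on the models' usual input format, and the file names are placeholders:

```bash
# Convert the source recording to 16 kHz mono WAV before training
# (sample rate, channel count, and file names are assumptions).
ffmpeg -i kirilov_source.wav -ar 16000 -ac 1 kirilov.wav
```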
This one happens because your list of speakers used for fine-tuning differs from the one used for the original pretraining.
This is due to an additional safety check that happens in speaker diarization protocols: https://github.com/pyannote/pyannote-database/blob/b6e855710dd8e4336de2d0e1c95361c405852534/pyannote/database/protocol/speaker_diarization.py#L100-L102. It looks like some of the provided RTTM annotations are outside of the actual file extent (or of the provided UEM).
@hbredin
It could be, indeed.
Yes, you can mix pretrained and fine-tuned models.
Thanks @hbredin for your reply, |
There is currently no way to constrain the number of speakers. Instead, you should tune the pipeline hyper-parameters so that the clustering thresholds and stopping criterion somehow learn the type of data (here, a limited number of speakers). Closing this issue as it has diverged from the original. Please open a new one if needed.
I am trying to fine-tune the models to support one more speaker, but it looks like I am doing something wrong.
I want to use the "dia_hard" pipeline, so I need to fine-tune these models: {sad_dihard, scd_dihard, emb_voxceleb}.
For my speaker I have one WAV file with a duration of more than 1 hour.
So, I created a database.yml file:
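For illustration, such a database.yml could look roughly like the sketch below, assuming the custom-protocol YAML syntax of pyannote.database; all keys and paths here are assumptions rather than the author's exact file:

```yaml
# Sketch of a database.yml for a single-file "kirilov" protocol
# (keys and paths are assumptions).
Databases:
  kirilov: /content/fine/kirilov/{uri}.wav   # {uri} "kirilov" resolves to kirilov.wav

Protocols:
  kirilov:
    SpeakerDiarization:
      kirilov:
        train:
          uri: train.lst          # list of file URIs, one per line
          annotation: train.rttm  # "who speaks when" reference
          annotated: train.uem    # regions that are actually annotated
```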
and put additional files next to database.yml:
train.lst:
kirilov
train.rttm:
SPEAKER kirilov 1 0.0 3600.0 <NA> <NA> Kirilov <NA> <NA>
train.uem:
kirilov NA 0.0 3600.0
I assume this tells the trainer to use the kirilov.wav file and take 3600 seconds of audio from it for training.
Now I fine-tune the models; the current folder is /content/fine/kirilov, so database.yml is taken from the current directory:
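As a rough sketch of this step, assuming the pyannote-audio 1.x command-line interface from the fine-tuning tutorial, a protocol named kirilov.SpeakerDiarization.kirilov matching the database.yml sketch above, and hypothetical experiment directories sad/, scd/, and emb/ (every flag, path, and epoch count here is an assumption):

```bash
# Hypothetical fine-tuning commands; flags, directories, and epoch counts are assumptions.
cd /content/fine/kirilov   # so that ./database.yml is picked up

pyannote-audio sad train --pretrained=sad_dihard   --subset=train --to=10 sad kirilov.SpeakerDiarization.kirilov
pyannote-audio scd train --pretrained=scd_dihard   --subset=train --to=10 scd kirilov.SpeakerDiarization.kirilov
pyannote-audio emb train --pretrained=emb_voxceleb --subset=train --to=10 emb kirilov.SpeakerDiarization.kirilov
```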
Output looks like:
Etc.
And try to run the pipeline with the new .pt's:
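As a minimal sketch of applying a diarization pipeline, assuming the pyannote.audio 1.x torch.hub interface; note that pointing the pipeline at the fine-tuned .pt weights is assumed to go through an adapted pipeline config as in the tutorial above, which is not shown here:

```python
import torch

# Load a pretrained speaker diarization pipeline via torch.hub
# (pyannote.audio 1.x style; using fine-tuned models would require
# an adapted pipeline config, not shown in this sketch).
pipeline = torch.hub.load('pyannote/pyannote-audio', 'dia')

# Apply it to the test file and print speaker turns.
diarization = pipeline({'audio': 'new.wav'})
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```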
The result is that for my new.wav the whole audio is recognized as a speaker talking without any pauses, so I assume the models got broken. And it does not matter whether I train for 1 epoch or for 100.
In case I use:
or
everything is ok and the result is similar to
Could you please advise what could be wrong with my training/fine-tuning process?