Trying to finetune model for new speaker #405

Closed
marlon-br opened this issue Jun 19, 2020 · 17 comments

Comments

@marlon-br

I am trying to finetune models to support one more speaker, but it looks like I am doing something wrong.

I want to use the "dia_dihard" pipeline, so I need to finetune these models: {sad_dihard, scd_dihard, emb_voxceleb}.

For my speaker I have one WAV file with a duration of more than 1 hour.

So, I created database.yml file:

Databases:
   IK: /content/fine/kirilov/{uri}.wav

Protocols:
    IK:
       SpeakerDiarization:
          kirilov:
            train:
               uri: train.lst
               annotation: train.rttm
               annotated: train.uem

and put additional files near database.yml:

kirilov
├── database.yml
├── kirilov.wav
├── train.lst
├── train.rttm
└── train.uem

train.lst:
kirilov

train.rttm:
SPEAKER kirilov 1 0.0 3600.0 <NA> <NA> Kirilov <NA> <NA>

train.uem:
kirilov NA 0.0 3600.0

I assume this tells the trainer to use the kirilov.wav file and take 3600 seconds of audio from it for training.

Now I finetune the models, current folder is /content/fine/kirilov, so database.yml is taken from the current directory:

!pyannote-audio sad train --pretrained=sad_dihard --subset=train --to=1 --parallel=4 "/content/fine/sad" IK.SpeakerDiarization.kirilov
!pyannote-audio scd train --pretrained=scd_dihard --subset=train --to=1 --parallel=4 "/content/fine/scd" IK.SpeakerDiarization.kirilov
!pyannote-audio emb train --pretrained=emb_voxceleb --subset=train --to=1 --parallel=4 "/content/fine/emb" IK.SpeakerDiarization.kirilov

Output looks like:

Using cache found in /root/.cache/torch/hub/pyannote_pyannote-audio_develop
Loading labels: 0file [00:00, ?file/s]/usr/local/lib/python3.6/dist-packages/pyannote/database/protocol/protocol.py:128: UserWarning:

Existing key "annotation" may have been modified.

Loading labels: 1file [00:00, 20.49file/s]
/usr/local/lib/python3.6/dist-packages/pyannote/audio/train/trainer.py:128: UserWarning:

Did not load optimizer state (most likely because current training session uses a different loss than the one used for pre-training).

2020-06-19 15:35:26.763592: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Training:   0%|                                        | 0/1 [00:00<?, ?epoch/s]
Epoch #1:   0%|                                       | 0/29 [00:00<?, ?batch/s]
Epoch #1:   0%|                           | 0/29 [00:00<?, ?batch/s, loss=0.676]
Epoch #1:   3%|▋                  | 1/29 [00:00<00:26,  1.04batch/s, loss=0.676]

Etc.

Then I try to run the pipeline with the new .pt files:

import os
import torch
from pyannote.audio.pipeline import SpeakerDiarization
pipeline = SpeakerDiarization(embedding = "/content/fine/emb/train/IK.SpeakerDiarization.kirilov.train/weights/0001.pt", 
                              sad_scores = "/content/fine/sad/train/IK.SpeakerDiarization.kirilov.train/weights/0001.pt",
                              scd_scores = "/content/fine/scd/train/IK.SpeakerDiarization.kirilov.train/weights/0001.pt",
                              method= "affinity_propagation")

#params from dia_dihard\train\X.SpeakerDiarization.DIHARD_Official.development\params.yml
pipeline.load_params("/content/drive/My Drive/pyannote/params.yml")
FILE = {'audio': "/content/groundtruth/new.wav"}
diarization = pipeline(FILE)
diarization

The result is that for my new.wav the whole audio is recognized as one speaker talking without pauses. So I assume the models were broken by the fine-tuning. It does not matter whether I train for 1 epoch or for 100.

In case I use:

  1. 0000.pt - I assume these are the original models
pipeline = SpeakerDiarization(embedding = "/content/fine/emb/train/IK.SpeakerDiarization.kirilov.train/weights/0000.pt", 
                              sad_scores = "/content/fine/sad/train/IK.SpeakerDiarization.kirilov.train/weights/0000.pt",
                              scd_scores = "/content/fine/scd/train/IK.SpeakerDiarization.kirilov.train/weights/0000.pt",
                              method= "affinity_propagation")

or

  2. weights from the original models
pipeline = SpeakerDiarization(embedding = "/content/drive/My Drive/pyannote/emb_voxceleb/train/X.SpeakerDiarization.VoxCeleb.train/weights/0326.pt", 
                             sad_scores = "/content/drive/My Drive/pyannote/sad_dihard/sad_dihard/train/X.SpeakerDiarization.DIHARD_Official.train/weights/0231.pt",
                             scd_scores = "/content/drive/My Drive/pyannote/scd_dihard/train/X.SpeakerDiarization.DIHARD_Official.train/weights/0421.pt",
                             method= "affinity_propagation")

everything is ok and the result is similar to

pipeline = torch.hub.load('pyannote/pyannote-audio', 'dia_dihard')
FILE = {'audio': "/content/groundtruth/new.wav"}
diarization = pipeline(FILE)
diarization

Could you please advise what could be wrong with my training\finetuning process?

@hbredin
Member

hbredin commented Jun 22, 2020

I am trying to finetune models to support one more speaker, but it looks like I am doing something wrong.

I want to use the "dia_dihard" pipeline, so I need to finetune these models: {sad_dihard, scd_dihard, emb_voxceleb}.

For my speaker I have one WAV file with a duration of more than 1 hour.

So, I created database.yml file:

Databases:
   IK: /content/fine/kirilov/{uri}.wav

Protocols:
    IK:
       SpeakerDiarization:
          kirilov:
            train:
               uri: train.lst
               annotation: train.rttm
               annotated: train.uem

and put additional files near database.yml:

kirilov
├── database.yml
├── kirilov.wav
├── train.lst
├── train.rttm
└── train.uem

train.lst:
kirilov

train.rttm:
SPEAKER kirilov 1 0.0 3600.0 <NA> <NA> Kirilov <NA> <NA>

train.uem:
kirilov NA 0.0 3600.0

I assume this tells the trainer to use the kirilov.wav file and take 3600 seconds of audio from it for training.

It tells pyannote.audio that speaker Kirilov speaks for the whole hour of kirilov.wav (only speech, no non-speech). Therefore, fine-tuning speech activity detection (SAD) will most likely lead the model to always return the speech class. You need both speech and non-speech regions for fine-tuning to make sense.
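To make the point above concrete, here is a minimal sketch (plain Python, not part of pyannote.audio) that writes one RTTM line per speech region instead of a single hour-long segment. The region timestamps are made up for illustration; real ones would have to come from manual annotation. Note that the two numeric RTTM fields are onset and duration, not start and end. Any time covered by the UEM but not by an RTTM line is then treated as non-speech.

```python
def rttm_lines(uri, speaker, regions):
    """Format one RTTM line per (start, end) speech region, in seconds."""
    lines = []
    for start, end in regions:
        if end <= start:
            raise ValueError(f"empty or reversed region: {(start, end)}")
        # RTTM stores onset and duration, so convert end -> duration.
        lines.append(
            f"SPEAKER {uri} 1 {start:.3f} {end - start:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return lines


# Hypothetical speech regions; the gaps between them are non-speech.
regions = [(0.0, 12.5), (14.0, 40.2), (43.7, 61.0)]
lines = rttm_lines("kirilov", "Kirilov", regions)
print("\n".join(lines))
```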

Fine-tuning speaker change detection (SCD) and speaker embedding (EMB) with just one speaker does not really make sense either:

  • having just one speaker means that there won't be any speaker change to train SCD from
  • training EMB aims at discriminating speakers from each other: with just one speaker, you cannot do that.
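
As a quick sanity check before fine-tuning SCD or EMB, one could count the distinct speakers and the speaker changes in the RTTM annotation. The helper below is a sketch under the assumption that the RTTM follows the usual layout (speaker label in the 8th field); it is not a pyannote.audio function, and the second annotation is a made-up example.

```python
def rttm_stats(rttm_text):
    """Return (number of distinct speakers, number of speaker changes)."""
    turns = []
    for line in rttm_text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue  # skip blank or non-SPEAKER lines
        turns.append((float(fields[3]), fields[7]))  # (onset, speaker)
    turns.sort()  # order turns by onset time
    speakers = {speaker for _, speaker in turns}
    # A change is counted whenever two consecutive turns have different labels.
    changes = sum(1 for (_, a), (_, b) in zip(turns, turns[1:]) if a != b)
    return len(speakers), changes


single = "SPEAKER kirilov 1 0.0 3600.0 <NA> <NA> Kirilov <NA> <NA>"
mixed = "\n".join([
    "SPEAKER talk 1 0.0 10.0 <NA> <NA> A <NA> <NA>",
    "SPEAKER talk 1 10.5 5.0 <NA> <NA> B <NA> <NA>",
    "SPEAKER talk 1 16.0 4.0 <NA> <NA> A <NA> <NA>",
])
print(rttm_stats(single))
print(rttm_stats(mixed))
```

With the single-segment annotation from this issue, both SCD (no changes) and EMB (one speaker) have nothing to learn from.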

Then I try to run the pipeline with the new .pt files:

import os
import torch
from pyannote.audio.pipeline import SpeakerDiarization
pipeline = SpeakerDiarization(embedding = "/content/fine/emb/train/IK.SpeakerDiarization.kirilov.train/weights/0001.pt", 
                              sad_scores = "/content/fine/sad/train/IK.SpeakerDiarization.kirilov.train/weights/0001.pt",
                              scd_scores = "/content/fine/scd/train/IK.SpeakerDiarization.kirilov.train/weights/0001.pt",
                              method= "affinity_propagation")

The pipeline needs to be adapted to these new SAD, SCD, and EMB models as well. See this tutorial.

@hbredin hbredin transferred this issue from pyannote/pyannote-database Jun 22, 2020
@marlon-br
Author

Ok, now it seems clearer :)

Could you please estimate how much audio (speech and non-speech) is required to fine-tune the model for a new speaker's voice?

@hbredin
Member

hbredin commented Jun 22, 2020

It sounds like you might be misunderstanding what speaker diarization is.

If you are trying to detect a particular speaker (using an existing recording of this speaker as enrollment), what you want is speaker tracking, not speaker diarization.
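
The idea of tracking one enrolled speaker can be sketched as follows: compare an embedding of the enrollment recording against an embedding of each candidate segment and keep the segments that are similar enough. This is only an illustration. The 3-dimensional vectors, the segment boundaries, and the 0.7 threshold are all made up, and in practice the embeddings would come from a speaker embedding model; no specific pyannote.audio API is implied here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def track_speaker(enrollment, segments, threshold=0.7):
    """Keep (start, end) of segments whose embedding matches the enrollment."""
    return [
        (start, end)
        for start, end, embedding in segments
        if cosine_similarity(enrollment, embedding) >= threshold
    ]


# Toy embeddings (hypothetical); real ones have hundreds of dimensions.
enrollment = [1.0, 0.0, 0.0]
segments = [
    (0.0, 5.0, [0.9, 0.1, 0.0]),   # close to the enrolled speaker
    (5.0, 9.0, [0.0, 1.0, 0.2]),   # a different speaker
    (9.0, 15.0, [0.8, 0.0, 0.3]),  # close to the enrolled speaker
]
matches = track_speaker(enrollment, segments)
print(matches)
```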

Can you please describe precisely what your final task is?

@marlon-br
Author

Well,

The task is to gather many hours of a particular speaker talking (to feed that data to a TTS system like Tacotron 2 and train it to speak with a new voice).
So the idea is to download a lot of video\audio files with that speaker and other people talking, and detect\extract all audio segments where my speaker is talking. To do this, I am trying to use pyannote-audio and finetune the model to distinguish my speaker's voice from others.

@hbredin
Member

hbredin commented Jun 22, 2020

I suggest you have a look at this issue that is very similar to what you are trying to achieve.

@marlon-br
Author

marlon-br commented Jun 22, 2020

The problem I see now is that when I use a random video, for example https://www.youtube.com/watch?v=5m8SSt4gp7A (sorry for the Russian, but the language itself does not matter here), there are actually 2 people talking (I.Kirillov most of the time and some other guy at the end of the video). But the SpeakerDiarization pipeline reports only one speaker talking the whole time (it treats both speakers as the same person). I thought fine-tuning the model on I.Kirillov's voice would let it distinguish his voice from other speakers.

@ooza

ooza commented Jul 16, 2020

Hi @hbredin, @marlon-br;
I'm trying to fine-tune the dia models using my training data. It works fine for sad and scd, but when it comes to the emb model:

$pyannote-audio emb train --pretrained=emb_voxceleb --subset=train --to=1 --parallel=4 "experiments/train_outputs/emb" ADVANCE.SpeakerDiarization.advComp01

I got the following error:

/usr/local/lib/python3.7/site-packages/pyannote/database/database.py:51: UserWarning: Ignoring deprecated 'preprocessors' argument in MUSAN.__init__. Pass it to 'get_protocol' instead.
  warnings.warn(msg)

Using cache found in /Users/xx.yy/.cache/torch/hub/pyannote_pyannote-audio_develop
Traceback (most recent call last):
  File "/usr/local/bin/pyannote-audio", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/pyannote/audio/applications/pyannote_audio.py", line 366, in main
    app.train(protocol, **params)
  File "/usr/local/lib/python3.7/site-packages/pyannote/audio/applications/base.py", line 198, in train
    protocol_name, progress=True, preprocessors=preprocessors
TypeError: get_protocol() got an unexpected keyword argument 'progress'

I have installed pyannote.db.voxceleb. Am I missing something else?

@ooza

ooza commented Jul 16, 2020

@marlon-br

The problem I see now is that when I use a random video, for example https://www.youtube.com/watch?v=5m8SSt4gp7A (sorry for the Russian, but the language itself does not matter here), there are actually 2 people talking (I.Kirillov most of the time and some other guy at the end of the video). But the SpeakerDiarization pipeline reports only one speaker talking the whole time (it treats both speakers as the same person). I thought fine-tuning the model on I.Kirillov's voice would let it distinguish his voice from other speakers.

This may happen when you apply the dia/ami trained models directly to your own data, or when you fine-tune them using a very small training set and/or for just a couple of epochs!

@hbredin
Member

hbredin commented Jul 16, 2020

TypeError: get_protocol() got an unexpected keyword argument 'progress'

I have installed pyannote.db.voxceleb. Am I missing something else?

The API of pyannote.database.get_protocol has changed recently.
Can you try with the latest version of pyannote.audio (develop branch)?
This problem should be fixed in c3791bc.

@ooza

ooza commented Jul 16, 2020

Thanks for your prompt answer!
The problem was resolved but I got a new one:
ValueError: Missing mandatory 'uri' entry in ADVANCE.SpeakerDiarization.advComp01.train
Actually, yesterday I got the same problem after cloning the latest dev version of the project, and I thought it might be linked to the GPU server.
Now I'm facing the same error on my local machine.
Any idea how I can solve this?

@hbredin
Member

hbredin commented Jul 16, 2020

Yes, this is due to the latest version of pyannote.database: the syntax for defining custom speaker diarization protocols has also changed a bit.
The data preparation tutorial has been updated accordingly: https://github.com/pyannote/pyannote-audio/tree/develop/tutorials/data_preparation

@ooza

ooza commented Jul 16, 2020

I've added the .lst file and now it works!
Thanks a lot
I have just a few questions regarding the input data and the outputs:

  • my data are stereo with a 44 kHz sample rate; should I downsample them to 16 kHz?
  • the system outputs some warnings:
    Did not load optimizer state (most likely because current training session uses a different loss than the one used for pre-training).
    Existing precomputed key "annotation" has been modified by a preprocessor. warnings.warn(msg.format(key=key))
    How can I deal with them?

@hbredin
Member

hbredin commented Jul 19, 2020

I've added the .lst file and now it works!
Thanks a lot
I have just a few questions regarding the input data and the outputs:

  • my data are stereo with a 44 kHz sample rate; should I downsample them to 16 kHz?

Data are downsampled on-the-fly. But it probably does not hurt efficiency to downsample them first.
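
If you prefer to downsample ahead of time, one common route is ffmpeg. The sketch below only builds the command line; the file names are placeholders, and downmixing to mono (`-ac 1`) is an assumption on my part rather than something stated in this thread. Running it requires ffmpeg to be installed and on the PATH.

```python
import subprocess


def resample_command(src, dst, sample_rate=16000):
    """Build an ffmpeg command that downmixes to mono and resamples."""
    return ["ffmpeg", "-i", src, "-ac", "1", "-ar", str(sample_rate), dst]


def convert(src, dst, sample_rate=16000):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(resample_command(src, dst, sample_rate), check=True)


# Placeholder file names, for illustration only.
cmd = resample_command("input_44k_stereo.wav", "output_16k_mono.wav")
print(" ".join(cmd))
```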

  • the system outputs some warnings:
    Did not load optimizer state (most likely because current training session uses a different loss than the one used for pre-training).

This one happens because your list of speakers used for fine-tuning differs from the one used for the original pretraining.
Hence the final classification layer has a different shape... This is just a warning: you can simply ignore this.

    Existing precomputed key "annotation" has been modified by a preprocessor. warnings.warn(msg.format(key=key))
    How can I deal with them?

This is due to an additional safety check that happens in speaker diarization protocols: https://github.com/pyannote/pyannote-database/blob/b6e855710dd8e4336de2d0e1c95361c405852534/pyannote/database/protocol/speaker_diarization.py#L100-L102. It looks like some of the provided RTTM annotations are outside of the actual file extent (or of the provided UEM).
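
To find the offending lines, a small stand-alone check like the following may help. It is a sketch, not the check pyannote.database performs: it assumes UEM lines of the form `uri channel start end` and RTTM lines with onset and duration in the 4th and 5th fields, and it flags any segment that does not fit entirely inside a single UEM region of the same file. The second RTTM line is a made-up example that runs past the end of the UEM.

```python
def out_of_bounds(rttm_text, uem_text):
    """List (uri, start, end) of RTTM segments not fully inside a UEM region."""
    annotated = {}
    for line in uem_text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        uri, _, start, end = fields
        annotated.setdefault(uri, []).append((float(start), float(end)))

    problems = []
    for line in rttm_text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        uri = fields[1]
        start = float(fields[3])
        end = start + float(fields[4])  # RTTM stores onset + duration
        inside = any(s <= start and end <= e for s, e in annotated.get(uri, []))
        if not inside:
            problems.append((uri, start, end))
    return problems


uem = "kirilov NA 0.0 3600.0"
rttm = "\n".join([
    "SPEAKER kirilov 1 10.0 5.0 <NA> <NA> Kirilov <NA> <NA>",
    "SPEAKER kirilov 1 3590.0 20.0 <NA> <NA> Kirilov <NA> <NA>",
])
print(out_of_bounds(rttm, uem))
```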

@ooza

ooza commented Sep 9, 2020

@hbredin
Dear Hervé,
I trained the diarization pipeline after fine-tuning the sub-models (on my own data) and extracting the raw scores. The results on the test set are quite good. However, I got a high confusion score on some audio files.
I'm wondering if this is linked to the speaker embedding module?
Another question: can I use a pretrained model (e.g. EMB) along with the fine-tuned ones (e.g. SAD and SCD)? If yes, please tell me how.

@hbredin
Member

hbredin commented Sep 17, 2020

I trained the diarization pipeline after fine-tuning the sub-models (on my own data) and extracting the raw scores. The results on the test set are quite good. However, I got a high confusion score on some audio files.
I'm wondering if this is linked to the speaker embedding module?

It could be, indeed.

Another question: can I use a pretrained model (e.g. EMB) along with the fine-tuned ones (e.g. SAD and SCD)? If yes, please tell me how.

Yes, you can mix pretrained and fine-tuned models.
See related issues #439 and #430.

@ooza

ooza commented Sep 23, 2020

Thanks @hbredin for your reply,
My audio files have only two speakers, so I'm wondering whether forcing the model to always assume 2 speakers could help improve the DER. If so, where can I set this parameter? Is it the "number of speakers per batch" (per_fold) parameter in the embedding's config file?

@hbredin
Member

hbredin commented Sep 28, 2020

There is currently no way to constrain the number of speakers.

Instead, you should tune the pipeline hyper-parameters so that the clustering thresholds and stopping criterion somehow learn the type of data (here, a limited number of speakers).

Closing this issue as it has diverged from the original. Please open a new one if needed.

@hbredin hbredin closed this as completed Sep 28, 2020