Unable to create a correct diarization #1567

Closed
fablau opened this issue Nov 27, 2023 · 25 comments · Fixed by #1574

Comments

@fablau

fablau commented Nov 27, 2023

Hello again.

I have just installed the latest version of pyannote, and discussed CPU/GPU issues on this thread, but now I am facing a major problem. It looks like this new version of pyannote is unable to perform diarization of very simple files like the attached one.

What I get is this:

start=0.30s stop=6.54s speaker_SPEAKER_00
start=7.84s stop=21.94s speaker_SPEAKER_00
start=22.28s stop=24.68s speaker_SPEAKER_00
start=24.95s stop=27.68s speaker_SPEAKER_00
start=28.02s stop=30.06s speaker_SPEAKER_00
start=30.82s stop=33.52s speaker_SPEAKER_00
start=33.95s stop=44.44s speaker_SPEAKER_00
start=44.49s stop=68.31s speaker_SPEAKER_00
start=68.51s stop=82.95s speaker_SPEAKER_00
start=83.76s stop=87.28s speaker_SPEAKER_00
start=88.11s stop=90.25s speaker_SPEAKER_00
start=90.77s stop=96.05s speaker_SPEAKER_00
start=96.88s stop=99.62s speaker_SPEAKER_00

This clearly shows that the system has not recognized that there are two different speakers (and they have quite different voices).

Here is the code I am using:


import sys
from pyannote.audio import Pipeline
import torch

inputAudioFile = sys.argv[1] 
spkrsNo = int(sys.argv[2])
fileDiary = sys.argv[3]

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="xxxxxxxxxxx")


pipeline.to(torch.device("cuda"))

diarization = pipeline(inputAudioFile, num_speakers=spkrsNo)


with open(fileDiary, mode='w') as file_object:
	for turn, _, speaker in diarization.itertracks(yield_label=True):		
		print(f"start={turn.start:.2f}s stop={turn.end:.2f}s speaker_{speaker}", file=file_object)

I have tested the system with other short files like this one, but I keep getting just a single speaker returned.

Thoughts?

TwoSpeakers.mp3.zip
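A quick way to check a result file like the one above is to count the distinct speaker labels it contains. The following is a minimal sketch, assuming the `start=... stop=... speaker_...` line format written by the script in this comment; `parse_turns` and `count_speakers` are hypothetical helper names, not part of pyannote.

```python
import re

# Matches lines written by the script above, e.g.
# "start=0.30s stop=6.54s speaker_SPEAKER_00"
LINE_RE = re.compile(r"start=([\d.]+)s stop=([\d.]+)s speaker_(\S+)")


def parse_turns(text):
    """Return a list of (start, stop, label) tuples, skipping unparseable lines."""
    turns = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            start, stop, label = match.groups()
            turns.append((float(start), float(stop), label))
    return turns


def count_speakers(text):
    """Number of distinct speaker labels found in the result text."""
    return len({label for _, _, label in parse_turns(text)})


# First three lines of the output posted above.
sample = """\
start=0.30s stop=6.54s speaker_SPEAKER_00
start=7.84s stop=21.94s speaker_SPEAKER_00
start=22.28s stop=24.68s speaker_SPEAKER_00
"""
print(count_speakers(sample))  # 1, although num_speakers=2 was requested
```

A count lower than the requested `num_speakers` is the symptom described in this issue.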


Thank you for your issue.
We found the following entry in the FAQ which you may find helpful:

Feel free to close this issue if you found an answer in the FAQ.

If your issue is a feature request, please read this first and update your request accordingly, if needed.

If your issue is a bug report, please provide a minimum reproducible example as a link to a self-contained Google Colab notebook containing everything needed to reproduce the bug:

  • installation
  • data preparation
  • model download
  • etc.

Providing an MRE will increase your chance of getting an answer from the community (either maintainers or other power users).

Companies relying on pyannote.audio in production may contact me via email regarding:

  • paid scientific consulting around speaker diarization and speech processing in general;
  • custom models and tailored features (via the local tech transfer office).

This is an automated reply, generated by FAQtory

@fablau
Author

fablau commented Nov 27, 2023

I can confirm that if I change this line:

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="xxxxxxxxxxx")

Back to using the previous model:

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token="xxxxxxxxxxx")

The speakers are correctly recognized. How to fix this?

@fablau
Author

fablau commented Nov 27, 2023

Strangely, I tried again and now I get this error (complete output below):

python diarizationOnly.py TwoSpeakers.mp3 2 out.txt
config.yaml: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 469/469 [00:00<00:00, 991kB/s]
pytorch_model.bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 5.91M/5.91M [00:00<00:00, 21.3MB/s]
config.yaml: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 399/399 [00:00<00:00, 910kB/s]
pytorch_model.bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 26.6M/26.6M [00:01<00:00, 21.9MB/s]
config.yaml: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 221/221 [00:00<00:00, 503kB/s]
Traceback (most recent call last):
  File "/workspace/diarizationOnly.py", line 14, in <module>
    diarization = pipeline(fileOutWav, num_speakers=spkrsNo)
  File "/usr/local/lib/python3.10/site-packages/pyannote/audio/core/pipeline.py", line 325, in __call__
    return self.apply(file, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pyannote/audio/pipelines/speaker_diarization.py", line 514, in apply
    embeddings = self.get_embeddings(
  File "/usr/local/lib/python3.10/site-packages/pyannote/audio/pipelines/speaker_diarization.py", line 343, in get_embeddings
    waveform_batch = torch.vstack(waveforms)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 160000 but got size 159112 for tensor number 14 in the list.

The program "diarizationOnly.py" is the one I posted above, here it is again:

import sys
from pyannote.audio import Pipeline
import torch

inputAudioFile = sys.argv[1] 
spkrsNo = int(sys.argv[2])
fileDiary = sys.argv[3]

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="xxxxxxxxxxx")


pipeline.to(torch.device("cuda"))

diarization = pipeline(inputAudioFile, num_speakers=spkrsNo)


with open(fileDiary, mode='w') as file_object:
	for turn, _, speaker in diarization.itertracks(yield_label=True):		
		print(f"start={turn.start:.2f}s stop={turn.end:.2f}s speaker_{speaker}", file=file_object)
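The `RuntimeError` above comes from stacking chunks of unequal length: 160000 samples were expected and 159112 were found, a shortfall of 888 samples. A minimal sketch of the general idea, zero-padding every chunk to a common length before stacking, is shown below with plain Python lists. This is an illustration only and not necessarily what the hack in #1324 or the eventual fix does; `pad_chunks` is a hypothetical name. With real tensors, `torch.nn.functional.pad` would be the usual tool.

```python
def pad_chunks(chunks, target_length, fill=0.0):
    """Right-pad each chunk with `fill` so all chunks have `target_length` samples.

    Raises ValueError if a chunk is already longer than the target.
    """
    padded = []
    for chunk in chunks:
        missing = target_length - len(chunk)
        if missing < 0:
            raise ValueError(
                f"chunk has {len(chunk)} samples, more than {target_length}"
            )
        padded.append(list(chunk) + [fill] * missing)
    return padded


# Sizes taken from the error message above.
expected, got = 160000, 159112
chunks = [[0.1] * expected, [0.1] * got]

padded = pad_chunks(chunks, expected)
print(expected - got)            # number of samples that had to be added
print({len(c) for c in padded})  # all chunks now share one length
```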

@AntoineBlanot

Facing the same issues. Diarization is really not doing well...
I can't even reproduce this blog post: https://herve.niderb.fr/fastpages/2022/10/23/One-speaker-segmentation-model-to-rule-them-all
(blog uses only segmentation)

@fablau
Author

fablau commented Nov 28, 2023

I could fix the error above with this hack:

#1324 (comment)

But I still have the problem at the first point above for diarization.

Can that be fixed any time soon?

@hbredin
Member

hbredin commented Nov 28, 2023

The obvious solution is to switch back to pyannote/speaker-diarization in your use case.

> I could fix the error above with this hack:
>
> #1324 (comment)
>
> But I still have the problem at the first point above for diarization.
>
> Can that be fixed any time soon?

Short answer: no.
Long answer: Why don't you switch back to pyannote/speaker-diarization if it works better for your use case? Why don't you finetune the pipeline on your own data to get the best possible performance?

@fablau
Author

fablau commented Nov 28, 2023

The problem with the previous version is that it looks like it is using the CPU instead of the GPU. Thoughts on that?

@fablau
Author

fablau commented Nov 28, 2023

Never mind, it looks like it uses the GPU correctly. I think I'll stick to pyannote/speaker-diarization for now. Thanks!

@fablau
Author

fablau commented Nov 28, 2023

Oh, you may want to apply the hack above to the repository to avoid the error I posted above.

Thanks again.

@thomasmol

I am also getting slightly degraded performance with 3.1 compared to 3.0.

A file I use to test with 2 speakers always was consistently diarized (near perfect) with version 2.1 and 3.0. But with 3.1 almost all segments indicate a single speaker, but only when I manually set the num_speakers.
Sample of output:

[ 00:00:03.013 -->  00:00:06.052] A SPEAKER_01
[ 00:00:06.154 -->  00:00:09.516] B SPEAKER_01
[ 00:00:09.889 -->  00:00:14.405] C SPEAKER_01
[ 00:00:15.458 -->  00:00:22.045] D SPEAKER_01
[ 00:00:23.438 -->  00:00:23.964] E SPEAKER_01
[ 00:00:24.269 -->  00:00:39.006] F SPEAKER_01
[ 00:00:39.584 -->  00:01:01.876] G SPEAKER_01
[ 00:01:02.775 -->  00:01:50.331] H SPEAKER_01
[ 00:01:51.044 -->  00:02:00.331] I SPEAKER_01
[ 00:02:01.095 -->  00:02:08.429] J SPEAKER_01
[ 00:02:09.125 -->  00:02:45.713] K SPEAKER_01
[ 00:02:47.071 -->  00:02:48.531] L SPEAKER_01
[ 00:02:50.755 -->  00:02:51.960] M SPEAKER_01
[ 00:02:52.775 -->  00:02:53.302] N SPEAKER_01
[ 00:02:54.354 -->  00:03:01.536] O SPEAKER_01
[ 00:03:01.994 -->  00:03:16.646] P SPEAKER_01
[ 00:03:16.748 -->  00:03:27.682] Q SPEAKER_01

When I let pyannote auto detect number of speakers the diarization is almost perfect again (with the exception of a small number of segments identified with a 3rd speaker).

The speaker change at 39s is spot on:

[ 00:00:03.013 -->  00:00:06.052] A SPEAKER_01
[ 00:00:06.154 -->  00:00:09.516] B SPEAKER_01
[ 00:00:09.889 -->  00:00:14.405] C SPEAKER_01
[ 00:00:15.458 -->  00:00:22.045] D SPEAKER_01
[ 00:00:23.438 -->  00:00:23.964] E SPEAKER_01
[ 00:00:24.269 -->  00:00:39.006] F SPEAKER_01
[ 00:00:39.584 -->  00:01:01.876] G SPEAKER_02
[ 00:01:02.775 -->  00:01:50.331] H SPEAKER_02
[ 00:01:51.044 -->  00:02:00.331] I SPEAKER_02
[ 00:02:01.095 -->  00:02:08.429] J SPEAKER_02
[ 00:02:09.125 -->  00:02:45.713] K SPEAKER_02
[ 00:02:47.071 -->  00:02:48.531] L SPEAKER_02
[ 00:02:50.755 -->  00:02:51.960] M SPEAKER_02
[ 00:02:52.775 -->  00:02:53.302] N SPEAKER_02
[ 00:02:54.354 -->  00:03:01.536] O SPEAKER_02
[ 00:03:01.994 -->  00:03:16.646] P SPEAKER_02

In some other files I tested, it looks like the speaker with label SPEAKER_0 is 'ignored' (as if other speakers are 'preferred').
I will do some more tests and see if I can find what is causing this, or if I am just misconfiguring something 👍

@hbredin
Member

hbredin commented Nov 29, 2023

@thomasmol @fablau @AntoineBlanot, could you report on the performance at this commit? Also, if you could share the faulty files (ideally with expected manual labels), that would help me set up a CI benchmark and make sure it does not break your use cases in the future.

@flyingleafe do you think this could be due to this PR?

@flyingleafe
Contributor

@hbredin will test that; from the messages above, the issue is not present with 2.0 models, so I highly suspect that if my changes are involved, then it is due to some buggy interplay between them and powerset encoding output conversion to the discrete diarization.

@thomasmol

thomasmol commented Nov 30, 2023

Okay, I did some testing with a sample of a file I use often, available here: https://thomasmol.com/recordings/mark-lex-short.mp3 . Two people are speaking, with a clear speaker change at around 37s.

I tested different combinations of the pre-trained model version, the Python package version, and manually setting num_speakers. So here goes:


Pretrained pipeline pyannote/speaker-diarization-3.1
pyannote version @commit bbc8044

num_speakers manually to 2
no speaker change

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
 (<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
 (<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
 (<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
 (<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
 (<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
 (<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
 (<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_00'),
 (<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_00')]

num_speakers auto detect
speaker change at 39s

 [(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
 (<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
 (<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
 (<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
 (<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
 (<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
 (<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
 (<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_01'),
 (<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_01')]

Pretrained pipeline pyannote/speaker-diarization-3.1
pyannote version @commit 23001a7

num_speakers manually to 2
no speaker change

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
 (<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
 (<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
 (<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
 (<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
 (<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
 (<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
 (<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_00'),
 (<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_00')]

num_speakers auto detect
speaker change at 39s

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
 (<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
 (<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
 (<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
 (<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
 (<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
 (<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
 (<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_01'),
 (<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_01')]

Pretrained pipeline pyannote/speaker-diarization-3.0
pyannote version @commit bbc8044

num_speakers manually to 2
no speaker change at 39s [Edit: should say "speaker change at 39s"]

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
 (<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
 (<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
 (<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
 (<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
 (<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
 (<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
 (<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_01'),
 (<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_01')]

num_speakers auto detect
speaker change at 39s

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
 (<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
 (<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
 (<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
 (<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
 (<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
 (<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
 (<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_01'),
 (<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_01')]

Pretrained pipeline pyannote/speaker-diarization-3.0
pyannote version @commit 23001a7

num_speakers manually to 2
speaker change at 39s

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
 (<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
 (<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
 (<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
 (<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
 (<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
 (<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
 (<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_01'),
 (<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_01')]

num_speakers auto detect
speaker change at 39s

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
 (<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
 (<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
 (<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
 (<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
 (<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
 (<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
 (<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_01'),
 (<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_01')]

So what I notice from this testing is that the diarization is only incorrect when using a combination of manually setting num_speakers and specifically using the 3.1 pre-trained model, regardless of commit/python package version. Let me know if this helps!
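The pattern reported above can be checked mechanically by looking for the first point where the speaker label changes. A minimal sketch, assuming turns are given as `(start, end, label)` tuples copied from the outputs in this comment; `first_speaker_change` is a hypothetical helper, not a pyannote function.

```python
def first_speaker_change(turns):
    """Return the start time of the first turn whose label differs from the
    previous turn's label, or None if the label never changes."""
    for (_, _, prev_label), (start, _, label) in zip(turns, turns[1:]):
        if label != prev_label:
            return start
    return None


# Last three turns from the outputs above (3.1 pipeline, commit bbc8044).
forced_two_speakers = [
    (29.584, 39.0407, "SPEAKER_00"),
    (39.584, 61.9949, "SPEAKER_00"),
    (62.7589, 79.4312, "SPEAKER_00"),
]
auto_detected = [
    (29.584, 39.0407, "SPEAKER_00"),
    (39.584, 61.9949, "SPEAKER_01"),
    (62.7589, 79.4312, "SPEAKER_01"),
]

print(first_speaker_change(forced_two_speakers))  # None: no change found
print(first_speaker_change(auto_detected))        # 39.584
```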

@fablau
Author

fablau commented Nov 30, 2023

Thank you @thomasmol that's what I tried first: not defining the number of speakers and letting the system guess, but for me the problem was still there. At least for the example I uploaded in my first post above.

@hbredin
Member

hbredin commented Dec 1, 2023

@thomasmol can you please check your conclusion about pyannote/speaker-diarization-3.0 + bbc8044 + num_speakers=2? You write "no speaker change at 39s" but from the output you pasted, it does seem to work. Can you please double check?

@thomasmol

Ah, my label is wrong, it should say "speaker change at 39s". Just checked, the output is definitely from pyannote/speaker-diarization-3.0 + bbc8044 + num_speakers=2. So basically, with pyannote/speaker-diarization-3.0 the output is always correct, and with pyannote/speaker-diarization-3.1 it is not, but only when manually setting num_speakers.

@hbredin
Member

hbredin commented Dec 1, 2023

OK, so here is my summary of @thomasmol's experiment.

| Commit | Pipeline | num_speakers=2 | num_speakers=None |
|---|---|---|---|
| 23001a7 | 3.0 | ✅ | ✅ |
| bbc8044 | 3.0 | ✅ | ✅ |
| 23001a7 | 3.1 | 🚫 | ✅ |
| bbc8044 | 3.1 | 🚫 | ✅ |

@flyingleafe this shows that your PR has nothing to do with the problem.

The difference must come from the switch from hbredin/wespeaker-voxceleb-resnet34-LM to pyannote/wespeaker-voxceleb-resnet34-LM. Will look into this (I think I know the reason, no ETA though...)

hbredin added a commit that referenced this issue Dec 1, 2023
hbredin linked a pull request Dec 1, 2023 that will close this issue
@hbredin
Member

hbredin commented Dec 1, 2023

ETA just got closer :)
Can you please check that #1574 fixes the problem?

@thomasmol

thomasmol commented Dec 1, 2023

It's fixed! Tested e80b542 with pyannote/speaker-diarization-3.1 and diarization = diarization_model('mark-lex-short.mp3', num_speakers=2), gives this (correct) output again:

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
 (<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
 (<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
 (<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
 (<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
 (<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
 (<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
 (<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_01'),
 (<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_01')]

Same output when no num_speakers is set as well.
Awesome 👍

@hbredin
Member

hbredin commented Dec 1, 2023

Great. Currently running my usual benchmark with this fix.
Will release next if everything goes well.

@hbredin
Member

hbredin commented Dec 1, 2023

I just released 3.1.1 fixing this issue.

@fablau
Author

fablau commented Dec 2, 2023

Yes, I can confirm that it now works for the example I uploaded in my first post above. I also tested it with other material, and it works fine. But I still find the older version (pre 3.1) to be a little bit more accurate when people speak close to each other.

Try the attached mp3 file. The new version 3.1.1 gives me this:

start=0.01s stop=5.24s speaker_SPEAKER_01
start=5.51s stop=24.00s speaker_SPEAKER_01
start=24.00s stop=26.04s speaker_SPEAKER_00
start=26.44s stop=45.81s speaker_SPEAKER_00

Whereas the older version (pre 3.1) gives me this:

start=0.01s stop=27.01s speaker_SPEAKER_01
start=27.01s stop=37.01s speaker_SPEAKER_00
start=37.01s stop=40.16s speaker_SPEAKER_01
start=40.16s stop=46.00s speaker_SPEAKER_00

As you can see, the alternating voices are better recognized with the older version. But they both missed a lot of short sentences in between, and I am wondering if there is a way to have Pyannote detect those missed short sentences. If so, how?

In any case, thanks for fixing this!

CybertruckShort.mp3.zip
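One way to put a number on the difference between the two outputs above is to count how often the speaker label changes between consecutive turns. A minimal sketch, assuming `(start, stop, label)` tuples copied from this comment; `count_speaker_changes` is a hypothetical helper. Note that more changes does not by itself mean a more accurate result: that can only be judged against manual reference labels.

```python
def count_speaker_changes(turns):
    """Count how many times the label differs between consecutive turns."""
    labels = [label for _, _, label in turns]
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)


# Output of version 3.1.1, as posted above.
new_3_1_1 = [
    (0.01, 5.24, "SPEAKER_01"),
    (5.51, 24.00, "SPEAKER_01"),
    (24.00, 26.04, "SPEAKER_00"),
    (26.44, 45.81, "SPEAKER_00"),
]
# Output of the older (pre 3.1) version, as posted above.
older = [
    (0.01, 27.01, "SPEAKER_01"),
    (27.01, 37.01, "SPEAKER_00"),
    (37.01, 40.16, "SPEAKER_01"),
    (40.16, 46.00, "SPEAKER_00"),
]

print(count_speaker_changes(new_3_1_1))  # 1
print(count_speaker_changes(older))      # 3
```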

@fablau
Author

fablau commented Dec 9, 2023

No thoughts on the performance issue above?

@lucienhughes

I still have this issue from above:

> A file I use to test with 2 speakers always was consistently diarized (near perfect) with version 2.1 and 3.0. But with 3.1 almost all segments indicate a single speaker, but only when I manually set the num_speakers. Sample of output:

When using pyannote/speaker-diarization-3.1 and not specifying a num_speakers, it recognises speaker changes (but unfortunately identifies 3 speakers instead of two in my audio), but if I do specify a num_speakers, it consistently fails to identify a change in speaker, even when they are very distinct (male and female with different accents).

3.0 works much better with a fixed num_speakers but runs 7-8x slower than 3.1 on my M3 Pro with torch.device("mps"), making it borderline unusable.

@Isuxiz

Isuxiz commented Oct 10, 2024

Same for me. model_name="pyannote/speaker-diarization" gets much better performance than the default model_name "pyannote/speaker-diarization-3.1". This is very counter-intuitive. Has anyone figured out why?
