Unable to create a correct diarization #1567
I can confirm that if I change this line:

back to using the previous model:

the speakers are correctly recognized. How can this be fixed?
Strangely, I tried again and now I get this error (complete output below):

The program "diarizationOnly.py" is the one I posted above; here it is again:
Facing the same issues. Diarization is really not doing well...
I could fix the error above with this hack:

But I still have the problem at the first point above for diarization. Can that be fixed any time soon?
The obvious solution is to switch back to
Short answer: no.
The problem with the previous version is that it looks like it is using the CPU instead of the GPU. Thoughts on that?
Never mind, it looks like it uses the GPU correctly. I think I'll stick with pyannote/speaker-diarization for now. Thanks!
Oh, you may want to apply the hack above to the repository to avoid the error I posted above. Thanks again.
I am also getting slightly degraded performance with 3.1 compared to 3.0. A file I use for testing, with 2 speakers, was consistently diarized (near perfect) with versions 2.1 and 3.0. With 3.1, however, almost all segments indicate a single speaker, but only when I manually set the

When I let pyannote auto-detect the number of speakers, the diarization is almost perfect again (with the exception of a small number of segments identified as a 3rd speaker). The speaker change at 39s is spot on:

In some other files I tested, it looks like the speaker with label
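As an aside, the reported change point can be read off such output programmatically. Below is a minimal sketch in plain Python; the (start, end, label) triples are transcribed by hand from a subset of the auto-detect run above, and the `first_speaker_change` helper is purely illustrative, not a pyannote API:

```python
def first_speaker_change(segments):
    """Return the start time of the first segment whose label differs
    from the previous segment's label, or None if there is no change."""
    for (_, _, prev_label), (start, _, label) in zip(segments, segments[1:]):
        if label != prev_label:
            return start
    return None

# (start, end, label) triples transcribed from the auto-detect output above.
auto = [
    (3.03056, 9.53311, "SPEAKER_00"),
    (29.584, 39.0407, "SPEAKER_00"),
    (39.584, 61.9949, "SPEAKER_01"),
    (62.7589, 79.4312, "SPEAKER_01"),
]
print(first_speaker_change(auto))  # -> 39.584
```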
@thomasmol @fablau @AntoineBlanot, could you report on the performance at this commit? Also, if you could share the faulty files (ideally with expected manual labels), that would help me set up a CI benchmark and make sure future changes do not break your use cases. @flyingleafe, do you think this could be due to this PR?
@hbredin I will test that; judging from the messages above, the issue is not present with the 2.0 models, so I highly suspect that if my changes are involved, it is due to some buggy interplay between them and the conversion of the powerset encoding output to the discrete diarization.
Okay, I did some testing with a sample of a file I use often; I put it here: https://thomasmol.com/recordings/mark-lex-short.mp3. Two people speaking, with a clear speaker change at around 37s. I tested different combinations of the pretrained pipeline version, the Python package version, and manually setting the number of speakers.

Pretrained pipeline pyannote/speaker-diarization-3.1, num_speakers manually set to 2:

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
(<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
(<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
(<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
(<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
(<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
(<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
(<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_00'),
(<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_00')]

num_speakers auto-detected:

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
(<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
(<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
(<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
(<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
(<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
(<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
(<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_01'),
(<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_01')]

Pretrained pipeline pyannote/speaker-diarization-3.1, num_speakers manually set to 2:

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
(<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
(<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
(<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
(<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
(<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
(<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
(<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_00'),
(<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_00')]

num_speakers auto-detected:

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
(<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
(<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
(<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
(<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
(<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
(<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
(<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_01'),
(<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_01')]

Pretrained pipeline pyannote/speaker-diarization-3.0, num_speakers manually set to 2:

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
(<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
(<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
(<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
(<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
(<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
(<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
(<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_01'),
(<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_01')]

num_speakers auto-detected:

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
(<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
(<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
(<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
(<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
(<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
(<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
(<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_01'),
(<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_01')]

Pretrained pipeline pyannote/speaker-diarization-3.0, num_speakers manually set to 2:

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
(<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
(<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
(<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
(<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
(<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
(<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
(<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_01'),
(<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_01')]

num_speakers auto-detected:

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
(<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
(<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
(<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
(<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
(<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
(<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
(<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_01'),
(<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_01')]

So what I notice from this testing is that the diarization is only incorrect when using a combination of manually setting
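The pattern in these runs can be summarized mechanically: the 3.1 runs with a fixed speaker count collapse to a single label, while the auto-detect runs find two. A minimal sketch in plain Python (the labels are copied from the runs above; the `distinct_speakers` helper is illustrative, not a pyannote API):

```python
def distinct_speakers(segments):
    """Number of distinct speaker labels in a diarization output,
    given (start, end, label) triples."""
    return len({label for _, _, label in segments})

# Labels from the runs above (segment times zeroed out for brevity).
manual_31 = [(0, 0, "SPEAKER_00")] * 9                               # 3.1, num_speakers=2
auto_31 = [(0, 0, "SPEAKER_00")] * 7 + [(0, 0, "SPEAKER_01")] * 2    # 3.1, auto-detect

print(distinct_speakers(manual_31))  # -> 1  (the reported bug: one speaker)
print(distinct_speakers(auto_31))    # -> 2  (correct)
```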
Thank you @thomasmol, that's what I tried first: not defining the number of speakers and letting the system guess. But for me the problem was still there, at least for the example I uploaded in my first post above.
@thomasmol, can you please check your conclusion about
Ah, my label is wrong; it should say "speaker change at 39s". Just checked: the output is definitely from
OK, so here is my summary of @thomasmol's experiment.
@flyingleafe this shows that your PR has nothing to do with the problem. The difference must come from the switch from
ETA just got closer :)
It's fixed! Tested e80b542 with:

[(<Segment(3.03056, 9.53311)>, 'A', 'SPEAKER_00'),
(<Segment(9.88964, 14.4228)>, 'B', 'SPEAKER_00'),
(<Segment(15.4584, 22.0458)>, 'C', 'SPEAKER_00'),
(<Segment(23.438, 23.9643)>, 'D', 'SPEAKER_00'),
(<Segment(24.2869, 28.8031)>, 'E', 'SPEAKER_00'),
(<Segment(28.9559, 29.5331)>, 'F', 'SPEAKER_00'),
(<Segment(29.584, 39.0407)>, 'G', 'SPEAKER_00'),
(<Segment(39.584, 61.9949)>, 'H', 'SPEAKER_01'),
(<Segment(62.7589, 79.4312)>, 'I', 'SPEAKER_01')]

Same output when no
Great. Currently running my usual benchmark with this fix.
I just released 3.1.1, fixing this issue.
Yes, I can confirm that it is now working for the example I uploaded in my first post above. I also tested it with other material, and it works fine. But I still find the older version (pre-3.1) to be a little more accurate when people speak close to each other. Try the attached mp3 file. The new version 3.1.1 gives me this:

Whereas the older version (pre-3.1) gives me this:

As you can see, the alternating voices are better recognized by the older version. But both missed a lot of short sentences in between, and I am wondering if there is a way to have pyannote detect those missed short sentences. If so, how? In any case, thanks for fixing this!
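On the missed short sentences: one rough way to locate where they might be is to look for stretches of audio that received no speaker label at all, i.e. gaps between consecutive segments. A minimal sketch in plain Python (the `unassigned_gaps` helper and the 0.5s threshold are purely illustrative, not a pyannote feature; the triples are from the outputs earlier in this thread):

```python
def unassigned_gaps(segments, min_gap=0.5):
    """Gaps between consecutive (start, end, label) segments longer than
    min_gap seconds; candidate locations for short missed utterances."""
    ordered = sorted(segments, key=lambda seg: seg[0])
    gaps = []
    for (_, end, _), (start, _, _) in zip(ordered, ordered[1:]):
        if start - end >= min_gap:
            gaps.append((end, start))
    return gaps

segs = [(3.03056, 9.53311, "SPEAKER_00"),
        (9.88964, 14.4228, "SPEAKER_00"),
        (15.4584, 22.0458, "SPEAKER_00")]
print(unassigned_gaps(segs))  # -> [(14.4228, 15.4584)]
```

Each reported gap could then be listened to, or re-run through a more sensitive voice activity detection pass.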
No thoughts on the performance issue above?
I still have this issue from above:

When using pyannote/speaker-diarization-3.1 and not specifying num_speakers, it recognizes speaker changes (but unfortunately identifies 3 speakers instead of two in my audio). If I do specify num_speakers, it consistently fails to identify a change in speaker, even when the speakers are very distinct (male and female with different accents). 3.0 works much better with a fixed num_speakers, but runs 7-8x slower than 3.1 on my M3 Pro with torch.device("mps"), making it borderline unusable.
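For completeness, fixing the count and bounding it are passed differently: in pyannote.audio 3.x the pipeline call accepts num_speakers as well as min_speakers/max_speakers keyword arguments. A minimal sketch of forwarding these hints (the `diarize` wrapper is hypothetical, and the stub below stands in for a real loaded pipeline):

```python
def diarize(pipeline, audio_path, num_speakers=None,
            min_speakers=None, max_speakers=None):
    """Call a diarization pipeline, forwarding only the speaker-count
    hints that were actually given (keywords as in pyannote.audio 3.x)."""
    kwargs = {}
    if num_speakers is not None:
        kwargs["num_speakers"] = num_speakers
    else:
        if min_speakers is not None:
            kwargs["min_speakers"] = min_speakers
        if max_speakers is not None:
            kwargs["max_speakers"] = max_speakers
    return pipeline(audio_path, **kwargs)

# A stub pipeline that just echoes the hints it would receive.
stub = lambda path, **kw: kw
print(diarize(stub, "audio.mp3", num_speakers=2))
# -> {'num_speakers': 2}
print(diarize(stub, "audio.mp3", min_speakers=2, max_speakers=2))
# -> {'min_speakers': 2, 'max_speakers': 2}
```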
Same for me.
Hello again.

I have just installed the latest version of pyannote and discussed CPU/GPU issues in this thread, but now I am facing a major problem. It looks like this new version of pyannote is unable to perform diarization of very simple files like the attached one.

What I get is this:

which clearly shows that the system hasn't understood that these are two different persons (and they have quite different voices).

Here is the code I am using:

I have tested the system with other short files like this one, but I keep getting just a single speaker returned.

Thoughts?
TwoSpeakers.mp3.zip