
Speaker reidentification #1383

Closed
Vermeille opened this issue May 24, 2023 · 6 comments

@Vermeille

Vermeille commented May 24, 2023

Hello,

The project I am currently working on needs to identify speakers. Having the audio segments labeled "Speaker 0" to "Speaker N" is cool, but labels such as "Yann LeCun", "Yoshua Bengio", etc. are much better.

I am willing to contribute the feature and was thinking of implementing it the following way:

  1. Have the user provide a directory with speaker samples (speakers/Yann_LeCun.wav, speakers/Yoshua_Bengio.wav, etc.)
  2. Get a speaker embedding for each file (Time avg pool? Max pool? Attention is out of the equation since it needs)
  3. Optional: store the embeddings on disk (safetensors)
  4. When doing the speaker clustering, assign each reference speaker embedding to a cluster centroid (Hungarian algorithm?) and use the speaker names as labels.
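Steps 2 and 4 above could be sketched roughly as follows. This is a hypothetical illustration, not pyannote code: `pool_embedding` and `assign_names` are made-up names, frame-level embeddings are assumed to already be available as a NumPy array (extracting them with a real embedding model is out of scope), and step 3 (storage) is omitted. Instead of the Hungarian algorithm, the assignment uses an exhaustive search, which finds the same optimal one-to-one matching but only scales to a handful of speakers.

```python
from __future__ import annotations

from itertools import permutations

import numpy as np


def pool_embedding(frames: np.ndarray, mode: str = "mean") -> np.ndarray:
    """Collapse frame-level embeddings of shape (num_frames, dim) into one vector."""
    if mode == "mean":
        pooled = frames.mean(axis=0)  # time-average pooling
    elif mode == "max":
        pooled = frames.max(axis=0)  # max pooling over time
    else:
        raise ValueError(f"unknown pooling mode: {mode}")
    # L2-normalise so that a dot product equals cosine similarity
    return pooled / np.linalg.norm(pooled)


def assign_names(centroids: np.ndarray, references: dict[str, np.ndarray]) -> dict[int, str]:
    """Map cluster indices to reference names via an optimal one-to-one assignment.

    Exhaustive search over permutations: fine for a few speakers, exponential
    otherwise. Clusters left without a reference are absent from the result.
    """
    names = list(references)
    refs = np.stack([references[n] for n in names])
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    r = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    cost = 1.0 - c @ r.T  # cosine distance, shape (num_clusters, num_refs)

    # Work on a matrix with no more rows than columns so permutations cover all rows
    transpose = cost.shape[0] > cost.shape[1]
    m = cost.T if transpose else cost
    rows, cols = m.shape

    best_total, best_perm = np.inf, ()
    for perm in permutations(range(cols), rows):
        total = m[np.arange(rows), list(perm)].sum()
        if total < best_total:
            best_total, best_perm = total, perm

    # Convert (row, col) pairs back to (cluster, reference) pairs
    pairs = [(j, i) if transpose else (i, j) for i, j in enumerate(best_perm)]
    return {cluster: names[ref] for cluster, ref in pairs}
```

For more than a few speakers, `scipy.optimize.linear_sum_assignment(cost)` solves the same (possibly rectangular) assignment problem in polynomial time. Clusters with no matching sample simply keep their generic "Speaker N" label.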

What do you think?

Related to #1310. This could serve as a starting point for an implementation and kill two birds with one stone.

@github-actions

We found the following entry in the FAQ which you may find helpful:

Feel free to close this issue if you found an answer in the FAQ. Otherwise, please give us a little time to review.

This is an automated reply, generated by FAQtory

@hbredin
Member

hbredin commented May 25, 2023

Thanks. This is definitely a recurrent request and would make a nice addition to pyannote.

A related ongoing PR by @flyingleafe would make a good starting point.
It makes the speaker diarization pipeline output one speaker embedding per speaker.

```python
diarization, embedding = pipeline("audio.wav", return_embedding=True)
```

You could maybe start by having a look at this PR and the three of us (@flyingleafe, @Vermeille, and I) can continue the discussion there.

@hbredin
Member

hbredin commented May 25, 2023

To back up my "recurrent request" claim, here is a poll that I recently posted on Twitter (where speaker tracking ~= speaker reidentification):

https://twitter.com/hbredin/status/1647146368782270464

@flyingleafe
Contributor

@hbredin By the way, I believe I addressed all your comments on the PR mentioned above a while ago. If you have no further comments, maybe we can merge it so that people can start using this apparently much-requested feature?

@flyingleafe
Contributor

@Vermeille In the meantime, yes, you can use the PR branch: compare the embeddings returned by the pipeline with the known embeddings using cosine distance, then relabel the annotation accordingly.
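A minimal sketch of that suggestion, assuming `embeddings` holds one row per diarization label in the same order as `labels` (worth checking against the PR branch). `relabel` and its `threshold` default are hypothetical; the threshold would need tuning on real data. This is a plain nearest-neighbour match rather than the Hungarian assignment from the original proposal, so two labels could end up with the same name.

```python
import numpy as np


def relabel(labels, embeddings, known, threshold=0.5):
    """Map diarization labels to known speaker names by nearest cosine distance.

    labels:     diarization labels, e.g. ["SPEAKER_00", "SPEAKER_01"]
    embeddings: array of shape (len(labels), dim), one row per label (assumed order)
    known:      dict of name -> reference embedding of shape (dim,)
    threshold:  hypothetical maximum cosine distance for accepting a match
    """
    names = list(known)
    refs = np.stack([known[n] for n in names])
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    r = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    dist = 1.0 - e @ r.T  # cosine distance, shape (num_labels, num_known)

    mapping = {}
    for i, label in enumerate(labels):
        j = int(np.argmin(dist[i]))
        # Keep the generic label when even the closest known speaker is too far away
        mapping[label] = names[j] if dist[i, j] <= threshold else label
    return mapping
```

The resulting `{old_label: new_name}` dictionary should be usable with `pyannote.core`'s `Annotation.rename_labels`, e.g. `diarization.rename_labels(mapping)`.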


stale bot commented Nov 25, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


3 participants