Feasability of the implementation of a Speaker Enrollment pipeline. #391

Closed
hadware opened this issue Jun 1, 2020 · 6 comments

@hadware
Contributor

hadware commented Jun 1, 2020

Is your feature request related to a problem? Please describe.
The title is pretty self-explanatory. I'd just like to know how much work would be needed to implement a pipeline for a speaker unrolling task: in your opinion, are all the required building blocks already here? If it doesn't involve too much digging, I'd probably be willing to do it myself :)

@hbredin
Member

hbredin commented Jun 2, 2020

Can you please define "speaker unrolling"? I am not familiar with this wording.

Did you mean "speaker enrollment"?
If so, what do you have in mind exactly? Speaker identification?

@hadware hadware changed the title Feasability of the implementation of a Speaker Unrolling pipeline. Feasability of the implementation of a Speaker Unrollment pipeline. Jun 3, 2020
@hadware hadware changed the title Feasability of the implementation of a Speaker Unrollment pipeline. Feasability of the implementation of a Speaker Enrollment pipeline. Jun 3, 2020
@Rachine

Rachine commented Jun 3, 2020

Hello,

Yes, it would be speaker enrollment.
Given a variable number of target speakers, find all the segments in the audio spoken by these speakers.
We were thinking of a pipeline that looks like this: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41939.pdf

@hbredin
Member

hbredin commented Jun 4, 2020

I didn't go through the paper but I think the SpeechTurnClosestAssignment pipeline might get you started.

Enrollment

Basic idea: gather all speaker embeddings for each target and take their average.

import numpy as np

# assumed here: `targets` is a pyannote.core.Annotation of enrollment segments and
# `embedding` is a SlidingWindowFeature of speaker embeddings for the same file

# gather targets embedding
labels = targets.labels()
X_targets, targets_labels = [], []
for label in labels:
    timeline = targets.label_timeline(label, copy=False)
    # be more and more permissive until we have
    # at least one embedding for current target
    for mode in ["strict", "center", "loose"]:
        x = embedding.crop(timeline, mode=mode)
        if len(x) > 0:
            break
    # skip labels so small we don't have any embedding for it
    if len(x) < 1:
        continue
    targets_labels.append(label)
    # one average embedding per target speaker
    X_targets.append(np.mean(x, axis=0))
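
The averaging step above can also be tried in isolation. The following is only a minimal sketch with plain NumPy, not pyannote code: the `enroll` function, its dict input, and the toy 2-dimensional "embeddings" are all hypothetical and stand in for whatever `embedding.crop` would return.

```python
import numpy as np

def enroll(embeddings_by_speaker):
    """Average each target's embeddings into a single reference vector.

    embeddings_by_speaker (hypothetical input): dict mapping a speaker label
    to an array of shape (n_embeddings, dimension).
    Targets without any embedding are skipped, as in the snippet above.
    """
    labels, centroids = [], []
    for label, x in embeddings_by_speaker.items():
        x = np.asarray(x, dtype=float)
        if len(x) < 1:
            continue
        labels.append(label)
        centroids.append(np.mean(x, axis=0))
    return labels, np.vstack(centroids)

# toy data: "carol" has no usable embedding and is therefore dropped
toy = {
    "alice": np.array([[1.0, 0.0], [0.8, 0.2]]),
    "bob": np.array([[0.0, 1.0]]),
    "carol": np.empty((0, 2)),
}
targets_labels, X_targets = enroll(toy)
```

Here `targets_labels` and `X_targets` play the same role as in the snippet above: one label and one averaged vector per enrolled speaker.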

Recognition

Basic idea: for each test speech turn (or, here, speaker cluster), find the closest target speaker (by comparing their average embeddings). You might also want to consider a reject option for when even the closest target speaker is too far away.

# excerpt from inside a pipeline method (hence `self` and `return`);
# reuses X_targets and targets_labels from the enrollment step

# gather speech turns embedding
labels = speech_turns.labels()
X, assigned_labels, skipped_labels = [], [], []
for label in labels:
    timeline = speech_turns.label_timeline(label, copy=False)
    # be more and more permissive until we have
    # at least one embedding for current speech turn
    for mode in ["strict", "center", "loose"]:
        x = embedding.crop(timeline, mode=mode)
        if len(x) > 0:
            break
    # skip labels so small we don't have any embedding for it
    if len(x) < 1:
        skipped_labels.append(label)
        continue
    assigned_labels.append(label)
    X.append(np.mean(x, axis=0))

# assign speech turns to closest class
assignments = self.closest_assignment(np.vstack(X_targets), np.vstack(X))
# a negative index means "rejected": that label is left unchanged
mapping = {
    label: targets_labels[k]
    for label, k in zip(assigned_labels, assignments)
    if k >= 0
}
return speech_turns.rename_labels(mapping=mapping)
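
To make the assignment and reject option concrete, here is a rough standalone sketch of what a closest-assignment step could look like. It is not the pipeline's actual implementation: the `closest_assignment` function below, the cosine distance, and the `threshold=0.5` value are assumptions chosen for illustration. It follows the same convention as the mapping above, where a negative index means the speech turn was rejected.

```python
import numpy as np

def closest_assignment(X_targets, X, threshold=0.5):
    """For each row of X, return the index of the closest target, or -1.

    Uses cosine distance (an assumption); a speech turn is rejected (-1)
    when even its closest target is farther than `threshold`.
    """
    X_targets = np.asarray(X_targets, dtype=float)
    X = np.asarray(X, dtype=float)
    # L2-normalize so that the dot product is the cosine similarity
    T = X_targets / np.linalg.norm(X_targets, axis=1, keepdims=True)
    Q = X / np.linalg.norm(X, axis=1, keepdims=True)
    distance = 1.0 - Q @ T.T  # shape: (n_speech_turns, n_targets)
    closest = np.argmin(distance, axis=1)
    too_far = distance[np.arange(len(X)), closest] > threshold
    closest[too_far] = -1
    return closest

# toy data: two enrolled targets, three speech turns (hypothetical labels)
targets_labels = ["alice", "bob"]
X_targets = np.array([[1.0, 0.0], [0.0, 1.0]])
assigned_labels = ["A", "B", "C"]
X = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0]])

assignments = closest_assignment(X_targets, X)
mapping = {
    label: targets_labels[k]
    for label, k in zip(assigned_labels, assignments)
    if k >= 0
}
```

In this toy case, turn "C" points away from both targets and is rejected, so it does not appear in `mapping`. The right threshold depends on the embedding model and the data, and would have to be tuned on a development set.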

@Rachine

Rachine commented Jun 10, 2020

Amazing! Thank you! We will let you know how this goes and if it works on our 'special' data.

@hbredin
Member

hbredin commented Sep 4, 2020

Closing this issue as I believe the original question has been answered.
I'd still be interested in knowing how it went 👀

@hbredin hbredin closed this as completed Sep 4, 2020
@Rachine

Rachine commented Sep 4, 2020

Hey! This is still ongoing. We have the pipeline up and running, but we are having a hard time properly fine-tuning a speaker embedding model on our smallish dataset 😿
