Feasability of the implementation of a Speaker Enrollment pipeline. #391

hadware · 2020-06-01T19:34:54Z

Is your feature request related to a problem? Please describe.
The title is pretty self-explanatory. I'd just like to know how much work would be needed to implement a pipeline for a speaker unrolling task: are all the required building blocks already here, in your opinion? If there isn't too much digging involved, I'd probably be willing to do it myself :)

hbredin · 2020-06-02T07:01:53Z

Can you please define "speaker unrolling"? I am not familiar with this wording.

Did you mean "speaker enrollment"?
If so, what do you have in mind exactly? Speaker identification?

Rachine · 2020-06-03T15:30:08Z

Hello,

Yes, it would be speaker enrollment.
Given a variable number of target speakers, find all the segments in the audio spoken by these speakers.
We were thinking of a pipeline that looks like this: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41939.pdf

hbredin · 2020-06-04T07:09:57Z

I didn't go through the paper but I think the SpeechTurnClosestAssignment pipeline might get you started.

Enrollment

Basic idea: gather all speaker embeddings for each target and take their average.

pyannote-audio/pyannote/audio/pipeline/speech_turn_assignment.py

Lines 94 to 113 in 06f76a2

    # gather targets embedding
    labels = targets.labels()
    X_targets, targets_labels = [], []
    for l, label in enumerate(labels):

        timeline = targets.label_timeline(label, copy=False)

        # be more and more permissive until we have
        # at least one embedding for current speech turn
        for mode in ["strict", "center", "loose"]:
            x = embedding.crop(timeline, mode=mode)
            if len(x) > 0:
                break

        # skip labels so small we don't have any embedding for it
        if len(x) < 1:
            continue

        targets_labels.append(label)
        X_targets.append(np.mean(x, axis=0))
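
To make the enrollment idea concrete outside of the pipeline, here is a minimal NumPy-only sketch. It assumes you already have the embeddings of each target speaker as a 2-D array (one row per embedding); the function name `enroll` and the dictionary input are hypothetical and are not part of pyannote-audio.

```python
import numpy as np


def enroll(embeddings_per_target):
    """Average the embeddings of each target speaker into one reference vector.

    `embeddings_per_target` (hypothetical input format) maps a speaker label
    to an array-like of shape (n_embeddings, dimension).
    """
    labels, centroids = [], []
    for label, x in embeddings_per_target.items():
        x = np.asarray(x, dtype=float)
        # skip targets for which no embedding is available
        if x.ndim != 2 or len(x) < 1:
            continue
        labels.append(label)
        centroids.append(x.mean(axis=0))
    if not centroids:
        return labels, np.empty((0, 0))
    return labels, np.vstack(centroids)


targets_labels, X_targets = enroll(
    {
        "alice": [[1.0, 0.0], [3.0, 0.0]],
        "bob": [[0.0, 2.0]],
        "nobody": [],  # no embedding: skipped, as in the excerpt above
    }
)
```

Targets without any embedding are dropped, which mirrors the `continue` branch of the excerpt.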

Recognition

Basic idea: for each test speech turn (or, here, speaker cluster), find the closest target speaker (by comparing their average embeddings). You might also want to consider a reject option for when even the closest target speaker is too far.

pyannote-audio/pyannote/audio/pipeline/speech_turn_assignment.py

Lines 115 to 144 in 06f76a2

    # gather speech turns embedding
    labels = speech_turns.labels()
    X, assigned_labels, skipped_labels = [], [], []
    for l, label in enumerate(labels):

        timeline = speech_turns.label_timeline(label, copy=False)

        # be more and more permissive until we have
        # at least one embedding for current speech turn
        for mode in ["strict", "center", "loose"]:
            x = embedding.crop(timeline, mode=mode)
            if len(x) > 0:
                break

        # skip labels so small we don't have any embedding for it
        if len(x) < 1:
            skipped_labels.append(label)
            continue

        assigned_labels.append(label)
        X.append(np.mean(x, axis=0))

    # assign speech turns to closest class
    assignments = self.closest_assignment(np.vstack(X_targets), np.vstack(X))
    mapping = {
        label: targets_labels[k]
        for label, k in zip(assigned_labels, assignments)
        if not k < 0
    }
    return speech_turns.rename_labels(mapping=mapping)
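
The internals of `self.closest_assignment` are not shown in this thread, so the following is only a stand-in sketch of the recognition step with a reject option, not the library's implementation. It assumes cosine distance between average embeddings and a hypothetical `max_distance` threshold that you would need to tune on your own data. Rejected turns get index -1, which is compatible with the `if not k < 0` filter in the excerpt.

```python
import numpy as np


def assign_closest(X_targets, X, max_distance=0.5):
    """Return, for each row of X, the index of the closest target or -1.

    X_targets: (n_targets, dimension) average embedding of each target
    X:         (n_turns, dimension) average embedding of each speech turn
    max_distance: hypothetical cosine-distance threshold for rejection
    """
    X_targets = np.asarray(X_targets, dtype=float)
    X = np.asarray(X, dtype=float)

    # L2-normalize so that the dot product is the cosine similarity
    T = X_targets / np.linalg.norm(X_targets, axis=1, keepdims=True)
    Q = X / np.linalg.norm(X, axis=1, keepdims=True)
    distance = 1.0 - Q @ T.T  # shape (n_turns, n_targets)

    closest = distance.argmin(axis=1)
    # reject option: even the closest target is too far
    too_far = distance[np.arange(len(Q)), closest] > max_distance
    closest[too_far] = -1
    return closest


targets_labels = ["alice", "bob"]
assigned_labels = ["turn_A", "turn_B", "turn_C"]
assignments = assign_closest(
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.1], [0.1, 3.0], [-1.0, -1.0]],
)
mapping = {
    label: targets_labels[k]
    for label, k in zip(assigned_labels, assignments)
    if not k < 0
}
```

Here `turn_C` points away from both targets, so it is rejected and keeps its original label.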

Rachine · 2020-06-10T14:27:23Z

Amazing! Thank you! We will let you know how this goes and if it works on our 'special' data.

hbredin · 2020-09-04T13:26:03Z

Closing this issue as I believe the original question has been answered.
I'd still be interested in knowing how it went 👀

Rachine · 2020-09-04T13:32:05Z

Hey! This is still ongoing. We have the pipeline up and running, but we are having a hard time correctly fine-tuning a speaker embedding model on our smallish dataset 😿

hadware changed the title ~~Feasability of the implementation of a Speaker Unrolling pipeline.~~ Feasability of the implementation of a Speaker Unrollment pipeline. Jun 3, 2020

hadware changed the title ~~Feasability of the implementation of a Speaker Unrollment pipeline.~~ Feasability of the implementation of a Speaker Enrollment pipeline. Jun 3, 2020

hbredin mentioned this issue Jun 22, 2020

Trying to finetune model for new speaker #405

Closed

hbredin closed this as completed Sep 4, 2020

hadware mentioned this issue Mar 23, 2021

Performing Speaker Verification #651

Closed

This was referenced May 18, 2021

Universal Unique Id #669

Closed

Live Stream audio #670

Closed

pagdot mentioned this issue Nov 25, 2022

Transcriptions JupiterBroadcasting/jupiterbroadcasting.com#301

Open
