Is there any speaker diarization documentation and already trained model? #2523
Any updates on this? I'm also looking for proper documentation and usage examples. I'm mostly interested in some hints on how to get the pre-trained model running.
@bluemonk482 there is a pretrained model at http://kaldi-asr.org/models/m6. We agree that there needs to be better documentation for this. We're discussing how best to do this. @iacoshoria For now, the best usage example for the pretrained model is the recipe that generated it: https://github.com/kaldi-asr/kaldi/blob/master/egs/callhome_diarization/v2/run.sh . Have you looked at this recipe? Could you tell us what you've tried and where (e.g., which stage) you're getting lost? |
@david-ryan-snyder Since the x-vector DNN in the model is already trained, I'm currently trying to run the recipe against my own test data (random audio sequences of multiple speakers). I've found that the only stages I need to run are computing features and extracting and clustering x-vectors, but I have difficulties getting the data into the required format. Are there any examples or hints for getting something similar to work? I would greatly appreciate it.
@iacoshoria the recipe is not bound to this dataset. We are talking about making a diarization recipe based on some freely available dataset (such as AMI), but that will probably not be in the main branch of Kaldi any time soon (@mmaciej2, do you have an easy-to-follow AMI recipe somewhere you can point to?). I recently ran diarization on the "Speakers in the Wild" dataset. I'll show you a few lines of each of the files I created in order to diarize it, and then go over the steps needed to diarize a generic dataset. Hopefully you can generalize this to your own. The data directory needs four files: wav.scp, segments (regardless of what you use to compute the speech segments), utt2spk, and spk2utt, which should look something like the examples below.
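As a rough illustration (recording ids, segment ids, and paths are made up; these are standard Kaldi data-directory formats, and for diarization the utt2spk typically just maps each speech segment to its recording):

```
# wav.scp: <recording-id> <wav-path or pipe>
rec1 /path/to/rec1.wav
rec2 /path/to/rec2.wav

# segments: <segment-id> <recording-id> <start-seconds> <end-seconds>
rec1-0000000-0000435 rec1 0.00 4.35
rec1-0000441-0000812 rec1 4.41 8.12

# utt2spk: <segment-id> <recording-id>
rec1-0000000-0000435 rec1
rec1-0000441-0000812 rec1

# spk2utt: generated from utt2spk with utils/utt2spk_to_spk2utt.pl
rec1 rec1-0000000-0000435 rec1-0000441-0000812
```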
Now that your data is prepared, I'll try to walk you through the remaining steps. I'm using the variable $name in place of your dataset, so you might be able to just copy and paste these lines of code and set the $name variable to whatever your dataset is called.

Make Features: compute MFCCs for your data, then run local/nnet3/xvector/prepare_feats.sh to apply sliding-window CMVN to the data and dump it to disk.

Extract Embeddings: extract x-vectors over a sliding window within each speech segment.

Perform PLDA scoring: score the pairs of x-vectors within each recording with the PLDA backend.

Cluster Speakers: if you know how many speakers are in your recordings (say it's summed-channel telephone speech, so you can assume there are probably 2 speakers), you can supply a file called reco2num_spk to the option --reco2num-spk. This is a file of the form <recording-id> <number-of-speakers>. Or, you may not know how many speakers are in a recording (say it's a random online video). Then you'll need to specify a threshold at which to stop clustering, e.g., once the pair-wise similarity of the embeddings drops below this threshold, stop clustering. You might obtain this by finding a threshold that minimizes the diarization error rate (DER) on a development set. But this won't be possible if you don't have segment-level labels for a dataset. If you don't have these labels, @dpovey suggested clustering a set of in-domain data and tuning the threshold until it gives you the average number of speakers per recording that you expect (e.g., you might expect that there are on average 2 speakers per recording, but sometimes more or less).

Diarized Speech: the clustering step writes the diarization results as an rttm file in its output directory. A sketch of the corresponding commands is below.
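A rough sketch of those steps, adapted from egs/callhome_diarization/v2/run.sh (option values are illustrative, $train_cmd comes from cmd.sh, and $nnet_dir is assumed to point at the extractor from the pretrained model, which also ships the callhome PLDA backends):

```bash
name=mydata                     # your dataset under data/$name
nnet_dir=exp/xvector_nnet_1a    # x-vector extractor directory

# Make Features: MFCCs, then sliding-window CMVN dumped to disk
steps/make_mfcc.sh --mfcc-config conf/mfcc.conf --nj 4 --cmd "$train_cmd" \
  --write-utt2num-frames true data/$name exp/make_mfcc/$name mfcc
utils/fix_data_dir.sh data/$name
local/nnet3/xvector/prepare_feats.sh --nj 4 --cmd "$train_cmd" \
  data/$name data/${name}_cmn exp/${name}_cmn
cp data/$name/segments data/${name}_cmn/
utils/fix_data_dir.sh data/${name}_cmn

# Extract Embeddings: x-vectors from a sliding window over each segment
diarization/nnet3/xvector/extract_xvectors.sh --cmd "$train_cmd --mem 4G" \
  --nj 4 --window 1.5 --period 0.75 --apply-cmn false --min-segment 0.5 \
  $nnet_dir data/${name}_cmn $nnet_dir/xvectors_$name

# Perform PLDA scoring: score pairs of x-vectors within each recording
diarization/nnet3/xvector/score_plda.sh --cmd "$train_cmd --mem 4G" --nj 4 \
  $nnet_dir/xvectors_callhome1 $nnet_dir/xvectors_$name \
  $nnet_dir/xvectors_$name/plda_scores

# Cluster Speakers: either give the number of speakers per recording ...
diarization/cluster.sh --cmd "$train_cmd --mem 4G" --nj 4 \
  --reco2num-spk data/$name/reco2num_spk \
  $nnet_dir/xvectors_$name/plda_scores \
  $nnet_dir/xvectors_$name/plda_scores_num_spk

# ... or stop at a tuned similarity threshold
diarization/cluster.sh --cmd "$train_cmd --mem 4G" --nj 4 --threshold 0.5 \
  $nnet_dir/xvectors_$name/plda_scores \
  $nnet_dir/xvectors_$name/plda_scores_threshold

# Diarized Speech: the cluster output directory contains an rttm file
```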
@david-ryan-snyder I do not currently have an easy-to-follow AMI recipe. I was in the process of reworking it to be fairly simple and use VoxCeleb training data when I got distracted by more pressing work. There is an AMI recipe here:
@david-ryan-snyder I started off on the wrong foot right in the data preparation step, since I needed a version of the data that doesn't know anything about speech segments in order to run the segmentation step. One more thing: I'm using your pre-trained model, which has the PLDA models split between the two data halves. In the recipe there's a final step that combines the results from the two and evaluates them together; should this be the case? Below are snippets from the two result sets:
So, my question is: Should I only run the evaluation against a single PLDA model? Also, there are a few false positives, such as the following (from the first example):
Mostly short, under half a second. Should I increase the window/min-segment threshold, to filter out such entries? |
You only need to use one of those PLDA models for your system. Also, if you have enough in-domain training data, you'll have better results training a new PLDA model. If your data is wideband microphone data, you might even have better luck using a different x-vector system, such as this one: http://kaldi-asr.org/models/m7. It was developed for speaker recognition, but it should work just fine for diarization as well. In the egs/callhome_diarization, we split the evaluation dataset into two halves so that we can use one half as a development set for the other half. Callhome is split into callhome1 and callhome2. We then train a PLDA backend (let's call it backend1) on callhome1, and tune the stopping threshold so that it minimizes the error on callhome1. Then backend1 is used to diarize callhome2. Next, we do the same thing for callhome2: backend2 is developed on callhome2, and evaluated on callhome1. The concatenation at the end is so that we can evaluate on the entire dataset. It doesn't matter that the two backends would assign different labels to different speakers, since they diarized different recordings. Regarding the short segment, I think the issue is that your SAD has determined that there's a speech segment from 24.99 to 25.43 and a separate speech segment starting at 25.51. It might be a good idea to smooth these SAD decisions earlier in the pipeline (e.g., in your SAD system itself) to avoid having adjacent segments with small gaps between them. Increasing the min-segment threshold might cause the diarization system to throw out this segment, but to me it seems preferable to keep it, and just merge it with the adjacent segment. But this stuff requires a lot of tuning to get right, and it's hard to say what the optimal strategy is without playing with the data myself. By the way, what is this "nasa_telescopes" dataset you're using? |
Thank you for the suggested approach, I will try switching the SAD system altogether. The data is some random video I found having clean segments of voice between different speakers (e.g. https://youtu.be/UkaNtpmoVI0?t=4250). I believe this video was one of the best-case scenarios I found to test the diarization on. Thank you for your support!
Hi there! I am trying to test the pre-trained model (http://kaldi-asr.org/models/m7) on 16 kHz speech audio, but I am getting an error when I run nnet3-copy on it.
Do you have an idea as to what might be causing this error? When I run the same command on the models in http://kaldi-asr.org/models/m6, it works like a charm.
Many thanks! |
I just downloaded the model, and nnet3-copy works using a newer version of Kaldi. Could you create a new branch with the latest changes from upstream, and see if you still have this issue there? |
Many thanks! I took the latest Kaldi master branch version (8e30fdd) and recompiled everything. It works fine now. I know that OP has already asked this question, but I wanted to add something about the segments and rttm output I am getting.
You'll need to post the actual error/warning for me to look into it more. From your description, it sounds like it's only a warning, not an error. |
It looks like your VAD is probably not giving good output. You should check its output before passing it to segmentation.pl to make sure it is reasonable in the first place. It's unlikely that the very long speech segments with few silence regions are a bug in the segmentation.pl script; it is probably an issue with the VAD (e.g. bad parameters or just a data mismatch of some kind). As for the bad diarization output, it's hard to say what is going wrong there given the problem with the initial segmentation. It might be a data mismatch that is degrading performance. It could also be that, since the segmentation is bad, there are a lot of "bad" (non-speech?) regions getting put into the clustering algorithm and overwhelming the speaker clusters.
Hi, many thanks for your inputs! I had a very high |
Hi, super useful thread - thanks all! I've actually gotten all of @david-ryan-snyder's steps to technically run, but my diarization output is very wrong, and I suspect it has to do with my initial segments file. My questions are:
These segments are definitely wrong, though, as they clearly miss some speaker changes. Is this the right process to get segments? Where could it be going wrong?
This gives me an entirely different segments file, equally or more wrong:
Where is this going wrong? Thank you very much in advance! |
Hi @pbirsinger, For question 1, there is no initial segmentation. The make_mfcc script does not require initial segments. If there is no segments file available, it treats each recording in the wav.scp as one giant segment and computes features for the full recording. We do compute_vad_decision followed by vad_to_segments to create segmentation using a very naive speech activity detection method, i.e. just looking at the energy in each frame. It will not produce particularly good results unless the recording is very high-quality, i.e. strong signal and no noise, and even then the parameters might need to be tuned to get proper output. It's also worth noting that this initial segmentation is supposed to be speech-activity-detection–style segmentation, not speech-recognition–style segmentation, so it should not be detecting speaker changes if there is no appreciable silence between the speakers' utterances. As for question 3, it's not immediately clear what is going wrong, especially without taking a closer look at everything. Perhaps there is some kind of mismatch? Have you tried loading the SAD labels into something like Audacity to view while listening to the file? That might help you narrow it down to whether or not it's doing what it is supposed to, just extremely poorly, or if it's not even doing something appropriate at all (e.g. does it ever label silence where there is speech, are there ever silence regions that are properly labeled as silence, etc.). |
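A minimal sketch of that naive, energy-based segmentation, assuming the egs/callhome_diarization recipe layout (where the sid/ and diarization/ script directories are available) and that MFCCs have already been computed for data/$name; the thresholds in conf/vad.conf usually need tuning for anything other than clean telephone speech:

```bash
name=mydata

# Frame-level speech/non-speech decisions from per-frame energy
# (requires feats.scp, i.e. run steps/make_mfcc.sh first)
sid/compute_vad_decision.sh --nj 4 --cmd "$train_cmd" \
  --vad-config conf/vad.conf data/$name exp/make_vad/$name mfcc

# Convert the frame-level decisions into a segments file in a new data dir
diarization/vad_to_segments.sh --nj 4 --cmd "$train_cmd" \
  data/$name data/${name}_segmented
```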
Hi @mmaciej2 - really appreciate the fast response! After some further experimentation, the latter approach gave me a more reasonable segments file. However, when proceeding with the commands with default values that @david-ryan-snyder posted, the resulting diarization attributes almost everything to a single speaker, when the actual result should be two speakers. I also set the number of speakers to 2 in reco2num_spk. I'm a bit at a loss to figure out where the process is going wrong now - any ideas? Thanks!
One thing you should do is combine segments that are connected, if the speech activity detection system produces them. I believe the m4 ASpIRE model will split up long speech segments, which is undesirable behavior for diarization, though I'm not 100% sure it will do that.

Short segments are not inherently undesirable, but they can be problematic. If a segment is shorter than the sliding window used in x-vector extraction, it will result in a less-reliable embedding, in addition to having a slight mismatch in how many frames went into the embedding. At the same time, if there is a short turn, the sliding-window extraction might miss the speaker change, while if the speech activity detection segments it out, there's a chance you'll catch it. There's a bit of a trade-off going on, but in general I'd lean toward suggesting having longer segments. You can either tune the speech activity detection system, or you can even do something simpler like merging segments if the silence between them is shorter than some threshold (see the sketch below).

As for the diarization output being off, again, it's hard to say what's going on. I'd recommend again looking at the labels along with the audio and seeing if you can figure out why almost everything is being attributed to the same speaker, i.e. whether there is anything special about the parts that didn't get grouped with the rest. It's very possible that there is something like a laugh, which gets marked as speech by the speech activity detection system, but ends up being very dissimilar from the rest of the recording.

This can illustrate one of the downsides of using a cluster stop criterion of the "correct" number of speakers as opposed to using a tuned threshold. Since the system is based on a similarity metric, it's possible that the difference between the two speakers ends up being smaller than the difference between a speaker and something like a laugh by that speaker. As a result, if you cluster to 2 speakers specifically, you're asking it to segment into the two most dissimilar categories, which is not the two speakers. In contrast, if you are clustering according to a threshold tuned to approximate the boundary between same- and different-speaker speech, it is more likely to find 3 clusters, which, despite being incorrect, results in more accurate output. Now, I'm not saying that that is what is happening with your setup, but something like that can definitely happen and would manifest in recordings that are almost entirely attributed to a single speaker.
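A rough sketch of the simple gap-merging idea mentioned above (not from the recipe; the 0.5 s gap and the regenerated segment ids are arbitrary choices):

```bash
# Merge adjacent segments of the same recording when the silence between
# them is shorter than $gap seconds. Lines are "<utt-id> <reco-id> <start> <end>".
gap=0.5
sort -k2,2 -k3,3n data/$name/segments | awk -v gap=$gap '
  {
    if ($2 == reco && $3 - end <= gap) {
      if ($4 > end) end = $4          # small gap: extend the merged segment
    } else {
      if (reco != "")
        printf "%s-%07d-%07d %s %.2f %.2f\n", reco, start*100, end*100, reco, start, end
      reco = $2; start = $3; end = $4
    }
  }
  END {
    if (reco != "")
      printf "%s-%07d-%07d %s %.2f %.2f\n", reco, start*100, end*100, reco, start, end
  }' > data/$name/segments_merged
```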
Thanks @david-ryan-snyder this was very helpful
The usage message for this script says that the arguments are the PLDA directory, the x-vector directory, and the output directory. So the first argument is the directory containing the PLDA model. The second argument is the name of the directory containing the x-vectors we're going to compare, and the last argument is the output directory, where the score matrices are written.
So if I use the model from http://kaldi-asr.org/models/m6 , I should use $nnet_dir/xvectors_callhome1 or $nnet_dir/xvectors_callhome2 ? |
@itaipee, you can use either the $nnet_dir/xvectors_callhome1 or $nnet_dir/xvectors_callhome2 directory for the PLDA. In the callhome recipe, since we do not have a held-out set to whiten with, we divide the evaluation set in two. We whiten on callhome1 to score callhome2 and vice-versa in order to score the full set fairly. In theory the two PLDA models should be comparable. |
Hello There! |
@SNSOHANI, our basic diarization systems are in egs/callhome_diarization. The v1 directory uses i-vectors, the v2 directory uses x-vectors (two methods of extracting speaker-identification vectors, the latter using a neural network). Both directories refer to a paper at the top of the run.sh file. I would read those papers and take a look at the run.sh scripts to try to get an understanding of how these things work.
@mmaciej2 |
Hello, is there a problem if recordings (actual wav files) have been segmented by speaker, so that in each wav file there is just one speaker?
Hi, I ran with the pretrained model and it works fine when I do not specify the number of speakers. However, when I use reco2num_spk to set the number of speakers to 3, clustering fails. The command is: diarization/cluster.sh --cmd run.pl --mem 4G --nj 1 --reco2num-spk data/reco2num_spk 0006_callhome_diarization_v2_1a/exp/xvector_nnet_1a//xvectors_name/plda_scores 0006_callhome_diarization_v2_1a/exp/xvector_nnet_1a//xvectors_name/plda_scores_num_speakers, and the log error is ERROR (agglomerative-cluster[5.5.337~1-35f96]:Value():util/kaldi-table-inl.h:2402) Value() called but no such key data/test.wav in archive data/reco2num_spk. My wav.scp is "data/test.wav data/test.wav" and my reco2num_spk is "data/test.wav 3". Could you please help me with that?
Probably there is a mismatch in the recording-ids between two different
data sources.
If you have that then I'd say you don't need to do diarization because the
work has already been done.
Perhaps what you need is just voice activity detection.
@danpovey even for training? They are part of the same audio which I've segmented by speaker to have shorter files. I want to use them for speaker diarization training. Should I merge them back? |
This should be fine for training the x-vector DNN, unless the segments are really short. If they're less than 3 seconds, you may need to merge them together in some way, since we usually train the x-vector DNN on segments of that length. You also need to ensure that all segments from the same speaker share the same class label. Whatever works for training the x-vector DNN should also be fine for PLDA training. Our diarization system doesn't require multi-speaker audio to train on. It's trained as if it will be used for single-speaker speaker recognition. We utilize it for diarizing multi-speaker recordings by extracting embeddings from short segments (and we generally assume one speaker per segment) and then cluster them to identify where the different speakers appear. |
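For example (hypothetical ids), an utt2spk for x-vector/PLDA training data in which several short single-speaker files were cut from the same speakers might look like this; what matters is that every cut from a given speaker shares that speaker's label:

```
spkA-file1 spkA
spkA-file2 spkA
spkB-file1 spkB
spkB-file2 spkB
```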
agglomerative-cluster fails for me with: ERROR (agglomerative-cluster[5.5.347~1-8b54]:Value():util/kaldi-table-inl.h:2402) Value() called but no such key 2_d_S1 in archive data/iphone_data/reco2num_spk. My reco2num_spk file has entries in the form <rec_id> <num_speakers>. My speaker ID is <rec_id>_<speaker_tag>, which I believe is acceptable (S1 is a speaker tag in my data). I was also wondering why spk2utt is used for any comparison at all in the clustering step; the agglomerative-cluster binary says reco2utt is required, not spk2utt. Please clarify which is to be used. All the files are in the correct Kaldi formats, and I would appreciate help in trying to debug the issue. EDIT: I got it to work by creating a reco2num_spk file that has the format <spk_id> 1 for all speakers. But I don't think this is the right way to do it, as all segments are labelled 1.
In general, Kaldi programs are not supposed to call Value() without
checking that the key exists in the table. It should be rewritten to check
that, and probably print a warning if it's not present; and probably
accumulate some kind of count of how many were not present and if it was
more than half, exit with error status at the end. If you have time to
make a PR it would be good.
@danpovey , Thanks for the feedback.
@david-ryan-snyder , would appreciate your feedback too, if any. |
@RSSharma246, There seem to be some basic misunderstandings here, and I want to try to clear them up before you lead yourself further down an incorrect path. It's hard to figure out what is going wrong with your setup, but hopefully I can give you the tools to figure it out yourself.

First of all, I'd recommend looking at the usage documentation for agglomerative-cluster and the diarization/cluster.sh script that is running it. It explains how these things work and what format the inputs should be in.

I think the misunderstanding is that there is nothing inherently special about reco2num_spk, spk2utt, etc. They are essentially just tables in text format. They are named certain ways to make things more easily understandable, but, for example, there is no difference between reco2num_spk and utt2num_spk besides that we choose to call the keys in reco2num_spk recordings and the keys in utt2num_spk utterances. So, don't get too caught up in what the names of the files are while debugging. The names can be useful, but it's what the files contain that's important. Read the documentation of the higher-level scripts (for example diarization/cluster.sh) and the C++ programs they call (for example agglomerative-cluster.cc) and make sure you understand what the tables are being used for, so you can verify that their contents are consistent with what they should be.

But, to more specifically address your questions/comments: regardless of what the names of the files are (for example, that the reco2utt file is called spk2utt), they should contain the information those scripts expect. You should check to make sure that the files contain the right things. If they don't, then you need to figure out what created them, because that is probably where the bug came from, not agglomerative-cluster.
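As a hypothetical illustration of consistent keys (the ids and paths below are made up): the reco2num_spk passed to --reco2num-spk, the reco2utt that cluster.sh happens to name spk2utt, and the PLDA scores archive should all be keyed by the same recording ids, with one square score matrix (segments by segments) per recording:

```
# reco2num_spk: <recording-id> <number-of-speakers>
rec1 2
rec2 3

# reco2utt (the file cluster.sh calls spk2utt): <recording-id> <segment-ids ...>
rec1 rec1-0000000-0000150 rec1-0000155-0000410
rec2 rec2-0000000-0000220 rec2-0000230-0000500

# scores.scp: <recording-id> <archive offset of its score matrix>
rec1 exp/xvectors/plda_scores/scores.1.ark:42
rec2 exp/xvectors/plda_scores/scores.1.ark:9061
```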
@RSSharma246, |
@mmaciej2 @danpovey - I did look at all the files again, and I think there is a gap somewhere either in my understanding or the current setup.
Two possibilities:
Which of the two cases is true? Understanding that should help fix my problems |
@RSSharma246, |
Issue resolved! |
I started off with your guide and created a wav.scp file. I wanted to create a segments file and hence called compute_vad_decision.sh with wav.scp, but it returned an error about a missing feats.scp file. How can I generate this file, given that there are no raw MFCC features present in the pre-trained model? BTW, I am using the SRE16 pre-trained model (3rd on the Kaldi models page).
You have to generate the features (which are MFCCs) yourself. They're never distributed as part of a pretrained model. |
Hi, Thank you. |
With just one recording (wav file) I create the reco2num_spk file as "recording1 2" and the wav.scp file as "recording1 absolyte/path/to/file". I get an error when clustering: ERROR (agglomerative-cluster[5.5.794~1-2b62]:Value():util/kaldi-table-inl.h:2402) Value() called but no such key recoding1-0000000-0000659 in archive /XXXXXXXX/0003_sre16_v2_1a/data/reco2num_spk. Any clues?
The scores should, it seems, be indexed by recording, not utterance-id, so
an entry in the `scores` archive should, I think,
be something like: recording-id matrix-of-scores,#utts by #utts.
Yours seems to be indexed by utterance.
I don't understand what you call a recording. The actual wav file? These are my files:
wav.scp: recording1 /absolute/path/to/wav
segments: utt_0001 recording1 0.00 1.23, utt_0002 recording1 1.45 4.56, ...
reco2num_spk: recording1 2
utt2spk: utt_0001 utt_0001, utt_0002 utt_0002
I guess I should obtain the speaker for each utterance (utt_0001, utt_0002, ...).
I think it creates utterance-ids within those recordings, that may have
time marks on them. But yes, `recording1` is the recording,
and recording1-xxxxxxx-xxxxxx would be the utterance. I am not super
familiar with the diarization scripts.
That's why I don't understand why specifying two speakers for recording1 gives the error I mentioned. If I update the reco2num_spk file so that it is keyed by those utterance ids instead, it works. However, it seems strange to say an utterance has two speakers.
Another thing to notice is that the final rttm has timestamp values that differ from those in the segments file, so the two don't really match. Is there any tool to map the rttm information back to a segments file?
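For reference, the rttm written by the clustering step follows the standard RTTM convention of one SPEAKER line per segment, with the start time and duration in seconds; the times come from the sub-segments used for x-vector extraction, which is likely why they do not exactly match the original segments file. A hypothetical line (channel and speaker label are illustrative):

```
SPEAKER recording1 1 4.41 3.71 <NA> <NA> 2 <NA> <NA>
```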
I dont follow this stuff super closely. There could be a bug when you
specify 1 speaker (b/c if there is only 1 speaker you don't really need
diarization, so code might not have been tested for that).
I didn't specify just 1 speaker. I'm trying to do a two speaker diarization. However, recording id seems to be utterance id as set in the first column of the segments file. |
I think you are correct that recording-id, as referred to in diarization
code, is really utterance id and what is called "utterances" in that code
may be
sub-segments of utterances. The idea may be that at the start, your
utterances and recordings are one and the same.
Well, thank you! I finally managed to cluster speakers. However, the results are not good. I guess this model, http://kaldi-asr.org/models/m6, is based on telephone conversations. For a more general, open-domain scenario, should I train a model with my own data? If so, which is the most straightforward recipe? Thanks
Mm. It's possible to adapt these systems while retaining the same x-vector
extractor by re-training the PLDA on your own data, but you need
speaker-labeled data.
Sorry I dont recall where an example of that would be.
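For reference, this is roughly how the callhome recipe trains its PLDA backend; below is a hedged adaptation for a hypothetical speaker-labeled in-domain set whose x-vectors (with a spk2utt grouping segments by speaker) have already been extracted to $nnet_dir/xvectors_indomain, reusing the whitening transform shipped with the pretrained model:

```bash
"$train_cmd" $nnet_dir/xvectors_indomain/log/plda.log \
  ivector-compute-plda ark:$nnet_dir/xvectors_indomain/spk2utt \
    "ark:ivector-subtract-global-mean \
      scp:$nnet_dir/xvectors_indomain/xvector.scp ark:- \
      | transform-feats $nnet_dir/xvectors_callhome1/transform.mat ark:- ark:- \
      | ivector-normalize-length ark:- ark:- |" \
    $nnet_dir/xvectors_indomain/plda
```

To score with the new backend, point score_plda.sh at a directory that contains this plda along with a matching mean.vec and transform.mat (the recipe computes those from the whitening data with ivector-mean and est-pca).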
@david-ryan-snyder Hi David, if I want to train an 8 kHz SID model without much of that kind of data, is downsampling the wideband data a possible solution? Thanks in advance.
Hi, I am still new to Kaldi. I would like to perform diarization on some of the speech samples from my own dataset, which do not have any speaker labels available, so I would have to listen and compare the audio to what the diarization outputs. I have a question on this: a) Does it make sense to use a pre-trained model, such as the callhome_v2 model, given that there may be different recording conditions, dialects, and possibly languages? Or are we assuming that the pretrained model has learned generalizable features (x-vectors), so that it works well even on an unseen dataset? Thanks in advance
Hi, I am working on speaker diarization. I have gotten to the clustering step and the rttm file is successfully created. After that, when I try to cluster the PLDA scores using the code:
@maham7621 |
Hi there, thanks for Kaldi :)
I want to perform speaker diarization on a set of audio recordings. I believe Kaldi recently added the speaker diarization feature. I have managed to find this link; however, I have not been able to figure out how to use it since there is very little documentation. Also, may I ask, is there an already trained model on conversations in English that I can use off-the-shelf, please?
Thanks a lot!