Total duration of segments after filtering bad segments is less than result in paper #13

Open
ngocson1804 opened this issue Jun 8, 2022 · 3 comments

ngocson1804 commented Jun 8, 2022

Hi,
I ran steps 4 and 5 using the file /data/ja/202103.csv you provided. I got more than 10M files with a total duration of over 10,000 hours for all segments. But after filtering bad segments with min_confidence_score=-0.3, the total number of good segments is only about 480,000, with a total duration of 351 hours. So the yield is roughly 3.5%, and the total duration is much less than what you mentioned in the paper (1,300 hours). Do you know the possible reasons?

@vebmaylrie
Member

Please decrease the threshold. We used -3.0 to obtain more than 1,300 hours of data.

@ngocson1804
Author

Thank you for the suggestion! I tried using the threshold -3.0 and got 5.7 million segments with a total duration of 6,046 hours, which is way more than 1,300 hours. So I checked your paper more carefully, and it seems that you applied the -3.0 threshold only to the top 15k videos and the single-speaker subset to get 1,376 hours. Meanwhile, I got a total of over 100,000 videos. So, is there any reason to use only the top 15k videos? Should I use all of the 100k videos to get 6,046 hours of segments with a confidence score over -3.0?
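
For anyone comparing numbers, here is a minimal sketch of how segment count and total hours change with the threshold and with a video subset. It is not the repository's script: the `(video_id, duration_sec, score)` tuple layout, the `summarize` function, and the `video_ids` argument are all hypothetical names chosen for illustration, and the subset of "top" videos has to be supplied by the caller because how it was ranked is not stated here.

```python
def summarize(segments, min_confidence_score, video_ids=None):
    """Return (segment_count, total_hours) for the segments that are kept.

    segments: iterable of (video_id, duration_sec, score) tuples.
              This layout is an assumption, not the repository's format.
    min_confidence_score: keep segments whose score is >= this value.
    video_ids: optional set of video IDs to restrict to (e.g. a chosen
               subset of videos); None means use every video.
    """
    count = 0
    total_sec = 0.0
    for video_id, duration_sec, score in segments:
        if score < min_confidence_score:
            continue  # drop low-confidence segments
        if video_ids is not None and video_id not in video_ids:
            continue  # drop segments outside the chosen video subset
        count += 1
        total_sec += duration_sec
    return count, total_sec / 3600.0
```

Calling `summarize(segments, -0.3)` and `summarize(segments, -3.0)` on the same data shows how much the threshold alone changes the yield, and passing `video_ids` shows the additional effect of restricting to a subset of videos.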

Also, is there any chance you could share the dev_easy_jun21, eval_easy_jun21, dev_normal_jun21 and eval_normal_jun21 sets?

@vebmaylrie
Member

> I got a total of over 100,000 videos. So, is there any reason to use only the top 15k videos?

There is no special reason. Our experiments using 15k videos were pilot studies.
