Add preprocessing scripts for the librilight datasets #107
Thanks for your efforts. Please check out the comments.
…/Amphion into libri-light-preprocess Conflicts: preprocessors/librilight.py
Thank you so much for reviewing my PR. I have addressed your concerns; please see my most recent commits.
LGTM.
Sure, I'll test them then.
✨ Description
This update introduces preprocessing scripts for the Libri-Light datasets, enhancing their usability and compatibility with our processing workflows.
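As a rough illustration of the filtering idea behind Steps 2 & 3 in the list of changes, the sketch below reads a WAV header and drops files above a duration limit. It is not the code from utils/mfa_prepare.py: the function names `wav_info` and `select_short_files` and the `max_seconds` parameter are hypothetical, and it assumes the inputs are uncompressed PCM WAV files readable by Python's standard `wave` module.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, bits_per_sample, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        bits = wf.getsampwidth() * 8          # sample width is reported in bytes
        duration = wf.getnframes() / rate     # frames per channel divided by frame rate
    return rate, bits, duration


def select_short_files(paths, max_seconds):
    """Keep only the files whose duration does not exceed max_seconds (hypothetical limit)."""
    return [p for p in paths if wav_info(p)[2] <= max_seconds]
```

The same `wav_info` call can also be used to check whether a file is already 16 kHz, 16-bit before deciding to convert it. The actual duration threshold used in the PR is not stated here, so the caller has to supply one.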
🚧 Related Issues
No related issues.
👨💻 Changes Proposed
- preprocessors/librilight.py for preprocessing Libri-Light datasets.
- utils/cut_by_vad.py script to segment audio files using multiprocessing (Step 1: Segmentation).
- utils/mfa_prepare.py script to convert audio files to 16 kHz, 16-bit PCM and to filter out longer audio files (Steps 2 & 3: Filter and Preprocess).
- utils/whisper_transcription.py for audio transcription using distilled-whisper, a more efficient variant of Whisper, with text preprocessing functions for the transcriptions (Steps 4 & 5: Transcription & Text-Preprocess).
- preprocessors/librilight.py (Step 6: Alignment).
- preprocessors/librilight.py (Steps 7-9).
🧑🤝🧑 Who Can Review?
🛠 TODO
✅ Checklist