Add preprocessing scripts for the librilight datasets #107

HarryHe11 · 2024-01-14T09:41:06Z

✨ Description

This update introduces preprocessing scripts for the Libri-Light datasets, enhancing their usability and compatibility with our processing workflows.

🚧 Related Issues

No related issues.

👨‍💻 Changes Proposed

Implemented the main workflow preprocessors/librilight.py for preprocessing Libri-Light datasets.
Developed a utils/cut_by_vad.py script to segment audio files using multiprocessing (Step 1: Segmentation).
Created a utils/mfa_prepare.py script to convert audio files to 16 kHz, 16-bit PCM and to filter out longer audio files (Steps 2 & 3: Filter and Preprocess).
Added utils/whisper_transcription.py for audio transcription using distilled-whisper, a more efficient variant of Whisper, and included text preprocessing functions for these transcriptions (Steps 4 & 5: Transcription & Text-Preprocess).
Integrated an MFA alignment function specifically tailored for Libri-Light in preprocessors/librilight.py (Step 6: Alignment).
Enabled data splitting into train/dev/eval sets, along with statistics calculation and metadata construction for Libri-Light in preprocessors/librilight.py (Steps 7-9).
Provided support for different subsets of Libri-Light, including "tiny", "small", "medium", and "large".
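
To illustrate the filtering and splitting steps listed above (Steps 3 and 7-8), here is a minimal sketch in plain Python. It is not the code from this PR: the function names, the `(utterance_id, duration)` record shape, the 30-second threshold, and the split ratios are all assumptions chosen for illustration.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical utterance record: (utterance_id, duration_in_seconds).
Utterance = Tuple[str, float]


def filter_by_duration(utts: List[Utterance], max_seconds: float) -> List[Utterance]:
    """Keep only utterances no longer than max_seconds (assumed threshold)."""
    return [u for u in utts if u[1] <= max_seconds]


def split_dataset(
    utts: List[Utterance],
    dev_ratio: float = 0.05,
    eval_ratio: float = 0.05,
    seed: int = 0,
) -> Dict[str, List[Utterance]]:
    """Shuffle deterministically, carve out dev and eval, keep the rest as train."""
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(utts)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_ratio)
    n_eval = int(len(shuffled) * eval_ratio)
    return {
        "dev": shuffled[:n_dev],
        "eval": shuffled[n_dev:n_dev + n_eval],
        "train": shuffled[n_dev + n_eval:],
    }


def total_hours(utts: List[Utterance]) -> float:
    """Sum utterance durations and convert seconds to hours."""
    return sum(duration for _, duration in utts) / 3600.0


if __name__ == "__main__":
    # Synthetic durations between 1 and 40 seconds, for demonstration only.
    utts = [(f"utt_{i:03d}", float(i % 40 + 1)) for i in range(100)]
    kept = filter_by_duration(utts, max_seconds=30.0)
    splits = split_dataset(kept, dev_ratio=0.1, eval_ratio=0.1)
    for name, items in splits.items():
        print(name, len(items), round(total_hours(items), 4))
```

Seeding a local `random.Random` instance keeps the split reproducible across runs without touching the global random state, which matters when the statistics and metadata are computed in a later step.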

🧑‍🤝‍🧑 Who Can Review?

🛠 TODO

Test on Libri-Light-tiny (custom split from Libri-Light-small).
Test on Libri-Light-small.
Test on Libri-Light-medium.
Test on Libri-Light-large.

✅ Checklist

Code has been reviewed
Code complies with the project's code standards and best practices
Code has passed all tests
Code does not affect the normal use of existing features
Code has been commented properly
Documentation has been updated (if applicable)
Demo/checkpoint has been attached (if applicable)

HarryHe11 · 2024-01-14T09:50:03Z

In this comment, I provide screenshots from testing the implemented scripts on Libri-Light-tiny (a custom split from Libri-Light-small).

The Running Process

preprocessors/librilight.py

[Screenshots of the run, parts 1 to 5 of 5]

The Outcome:

[Screenshot: processed data]

[Screenshot: metadata]

lmxue

Thanks for your efforts. Please check out the comments.

egs/datasets/README.md

preprocessors/librilight.py

utils/whisper_transcription.py

HarryHe11 · 2024-01-15T03:03:11Z

> Thanks for your efforts. Please check out the comments.

Thank you for reviewing my PR. I have addressed your concerns; please see my most recent commits.

lmxue

LGTM.

lmxue · 2024-01-17T07:04:06Z

LGTM.
P.S. This PR has been tested on Libri-Light-tiny. However, the three other subsets still need to be tested, as listed in the TODO. You may need to test them when the dataset is ready.

HarryHe11 · 2024-01-17T07:22:38Z

> LGTM.
>
> P.S. This PR has been tested on Libri-Light-tiny. However, three other subdatasets need to be tested as listed in TODO. You may need to test them when the dataset is ready.

Sure, I will test them then.

HarryHe11 added 2 commits January 14, 2024 17:07

Add preprocessor for librilight dataset

7bfd349

Merge branch 'libri-light-preprocess' of https://github.com/HarryHe11…

12c8df2

…/Amphion into libri-light-preprocess

HarryHe11 requested review from RMSnow, lmxue and HeCheng0625 January 14, 2024 09:41

formatted codes with black

55c0b1e

HarryHe11 mentioned this pull request Jan 14, 2024

Add VALL-E pre-trained model trained on 6k-hour Librilight #101

Merged

9 tasks

HarryHe11 added 2 commits January 14, 2024 18:07

Update README.md

7a67b67

Fix a bug in librilight.py

1f82bd3

lmxue requested changes Jan 14, 2024

HarryHe11 and others added 5 commits January 15, 2024 10:55

update codes according to feedback

34b0bd7

Merge branch 'libri-light-preprocess' of https://github.com/HarryHe11…

45f02a5

…/Amphion into libri-light-preprocess Conflicts: preprocessors/librilight.py

Update README.md

0cd07ce

Update librilight.py

28dc333

Update README.md

b100058

HarryHe11 requested a review from lmxue January 15, 2024 07:27

lmxue approved these changes Jan 17, 2024

lmxue merged commit 9da5a24 into open-mmlab:main Jan 17, 2024
1 check passed

HarryHe11 deleted the libri-light-preprocess branch August 22, 2024 06:16
