Add preprocessing scripts for the librilight datasets #107

Merged
lmxue merged 10 commits into open-mmlab:main from the libri-light-preprocess branch
Jan 17, 2024

Conversation

HarryHe11
Collaborator

@HarryHe11 HarryHe11 commented Jan 14, 2024

✨ Description

This update introduces preprocessing scripts for the Libri-Light datasets, enhancing their usability and compatibility with our processing workflows.

🚧 Related Issues

No related issues.

👨‍💻 Changes Proposed

  • Implemented the main workflow preprocessors/librilight.py for preprocessing Libri-Light datasets.
  • Developed a utils/cut_by_vad.py script to segment audio files using multiprocessing (Step 1: Segmentation).
  • Created a utils/mfa_prepare.py script to convert audio files to 16 kHz, 16-bit PCM and to filter out longer audio files (Steps 2 & 3: Filter and Preprocess).
  • Added utils/whisper_transcription.py to transcribe audio with Distil-Whisper, a more efficient distilled variant of Whisper, and included text preprocessing functions for these transcriptions (Steps 4 & 5: Transcription & Text-Preprocess).
  • Integrated an MFA alignment function specifically tailored for Libri-Light in preprocessors/librilight.py (Step 6: Alignment).
  • Enabled data splitting into train/dev/eval sets, along with statistics calculation and metadata construction for Libri-Light in preprocessors/librilight.py (Steps 7-9).
  • Provided support for different subsets of Libri-Light, including "tiny", "small", "medium", and "large".
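
For readers who want a feel for the filtering, text-preprocessing, and splitting steps listed above, here is a minimal, standard-library-only sketch. It is not the code in this PR: the function names (`wav_duration_seconds`, `filter_by_duration`, `normalize_transcript`, `split_ids`), the 30-second default threshold, the normalization rules, and the 5%/5% dev/eval ratios are all assumptions chosen for illustration.

```python
import random
import re
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def filter_by_duration(paths, max_seconds=30.0):
    """Keep only files no longer than max_seconds (assumed threshold)."""
    return [p for p in paths if wav_duration_seconds(p) <= max_seconds]


def normalize_transcript(text):
    """Uppercase, keep letters/digits/apostrophes, collapse whitespace.

    These rules are an assumption, not the PR's actual text preprocessing.
    """
    text = re.sub(r"[^A-Z0-9' ]+", " ", text.upper())
    return re.sub(r"\s+", " ", text).strip()


def split_ids(utt_ids, dev_ratio=0.05, eval_ratio=0.05, seed=0):
    """Deterministically split utterance IDs into train/dev/eval sets."""
    ids = sorted(utt_ids)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)
    n_dev = int(len(ids) * dev_ratio)
    n_eval = int(len(ids) * eval_ratio)
    return {
        "dev": ids[:n_dev],
        "eval": ids[n_dev:n_dev + n_eval],
        "train": ids[n_dev + n_eval:],
    }
```

Sorting before a seeded shuffle keeps the split stable across runs regardless of the order in which files were discovered, which matters when segmentation runs in parallel.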

🧑‍🤝‍🧑 Who Can Review?

🛠 TODO

  • Test on Libri-Light-tiny (custom split from Libri-Light-small).
  • Test on Libri-Light-small.
  • Test on Libri-Light-medium.
  • Test on Libri-Light-large.

✅ Checklist

  • Code has been reviewed
  • Code complies with the project's code standards and best practices
  • Code has passed all tests
  • Code does not affect the normal use of existing features
  • Code has been commented properly
  • Documentation has been updated (if applicable)
  • Demo/checkpoint has been attached (if applicable)

@HarryHe11
Collaborator Author

In this comment, I provide screenshots from testing the implemented scripts on Libri-Light-tiny (a custom split from Libri-Light-small).

The Running Process

preprocessors/librilight.py

Part 1 of 5:

[Screenshot]

Part 2 of 5:

[Screenshot]

Part 3 of 5:

[Screenshot]

Part 4 of 5:

[Screenshot]

Part 5 of 5:

[Screenshot]

The Outcome:

Processed Data:

[Screenshot]

MetaData:

[Screenshot]

Collaborator

@lmxue lmxue left a comment

Thanks for your efforts. Please check out the comments.

@HarryHe11
Collaborator Author

> Thanks for your efforts. Please check out the comments.

Thank you for reviewing my PR. I have addressed your concerns; please see my most recent commits.

@HarryHe11 HarryHe11 requested a review from lmxue January 15, 2024 07:27
Collaborator

@lmxue lmxue left a comment

LGTM.

@lmxue
Collaborator

lmxue commented Jan 17, 2024

LGTM.
P.S. This PR has been tested on Libri-Light-tiny. However, the three other subsets listed in the TODO still need to be tested. You may need to test them once the datasets are ready.

@HarryHe11
Collaborator Author

> LGTM.
>
> P.S. This PR has been tested on Libri-Light-tiny. However, the three other subsets listed in the TODO still need to be tested. You may need to test them once the datasets are ready.

Sure, I will test them then.

@lmxue lmxue merged commit 9da5a24 into open-mmlab:main Jan 17, 2024
1 check passed
@HarryHe11 HarryHe11 deleted the libri-light-preprocess branch August 22, 2024 06:16