Add speech transcriptions #4920

Open
wants to merge 2 commits into
base: ust
9 changes: 9 additions & 0 deletions examples/speech_matrix/README.md
@@ -48,6 +48,15 @@ Audios are saved to ${SAVE_ROOT}/audios/. For example, English audios are compre

Speech alignments are saved to ${SAVE_ROOT}/aligned_speech/. For example, en-fr.tsv.gz contains a pair of aligned audio paths in English and French respectively together with their alignment score in each line.

## Speech Transcriptions

While SpeechMatrix focuses on speech-only data mining and translation, we provide transcriptions for the mined speech in case they are needed for future research. The transcriptions are generated with [Whisper](https://github.com/openai/whisper): we use medium.en to transcribe English and medium for other languages. Currently, transcriptions are provided for the target speech in these language directions: {"cs", "de", "en", "es", "et", "fi", "fr", "hu", "it", "lt", "nl", "pl", "pt", "ro", "sl"}-{"de", "en", "es", "fr", "nl"}.

```bash
# SAVE_ROOT: the directory to save mined data
python mined_train_sets/download_transcriptions.py \
    --save-root ${SAVE_ROOT}
```

## Speech-to-Unit Data

8 changes: 8 additions & 0 deletions examples/speech_matrix/data_helper/data_cfg.py
@@ -139,3 +139,11 @@
hubert_key = "hubert"
vocoder_key = "vocoder"
s2s_key = "s2s_models"
trans_key = "transcriptions"
# langs with transcriptions
TRANS_SRC_LANGS = [
    "cs", "de", "en", "es", "et", "fi",
    "fr", "hu", "it", "lt", "nl", "pl",
    "pt", "ro", "sl"
]
TRANS_TGT_LANGS = ["de", "en", "es", "fr", "nl"]
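
The two lists above define the directions for which transcriptions exist: every source language paired with every target language, skipping identical pairs. A minimal sketch that enumerates them, with the lists copied from this diff; `transcription_directions` is a hypothetical helper name, not something defined in the PR:

```python
# Language lists copied from the data_cfg.py diff above.
TRANS_SRC_LANGS = [
    "cs", "de", "en", "es", "et", "fi",
    "fr", "hu", "it", "lt", "nl", "pl",
    "pt", "ro", "sl",
]
TRANS_TGT_LANGS = ["de", "en", "es", "fr", "nl"]


def transcription_directions(src_langs, tgt_langs):
    """Hypothetical helper: return (src, tgt) pairs, skipping src == tgt."""
    return [(s, t) for s in src_langs for t in tgt_langs if s != t]


directions = transcription_directions(TRANS_SRC_LANGS, TRANS_TGT_LANGS)
# 15 * 5 combinations minus the 5 identical pairs
print(len(directions))  # 70
```

This mirrors the nested loop in `download_transcriptions.py`, so the count is also the number of files the script attempts to fetch.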
34 changes: 34 additions & 0 deletions examples/speech_matrix/mined_train_sets/download_transcriptions.py
@@ -0,0 +1,34 @@
import os
import argparse
from examples.speech_matrix.data_helper.data_cfg import (
    DOWNLOAD_HUB,
    VP_LANGS,
    trans_key,
    TRANS_SRC_LANGS,
    TRANS_TGT_LANGS
)


def download_transcriptions(src_lang, tgt_lang, save_root):
    save_dir = os.path.join(save_root, trans_key)
    os.makedirs(save_dir, exist_ok=True)

    sorted_src_lang, sorted_tgt_lang = sorted([src_lang, tgt_lang])
    s2s_dl = f"{DOWNLOAD_HUB}/{trans_key}/{sorted_src_lang}-{sorted_tgt_lang}_{tgt_lang}.tsv.gz"
    os.system(f"wget {s2s_dl} -P {save_dir}")
Contributor
☹️
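
The comment does not say what the concern is. One plausible reading is that it targets the `os.system` call, which passes an interpolated string through the shell and ignores the exit status. A minimal sketch of one alternative, assuming `wget` stays the download tool; `build_wget_command` and `fetch` are hypothetical names, not part of this PR:

```python
import os
import subprocess


def build_wget_command(url, save_dir):
    """Hypothetical helper: wget arguments as a list, so no shell parsing occurs."""
    return ["wget", url, "-P", save_dir]


def fetch(url, save_dir):
    """Hypothetical helper: download url into save_dir, raising on failure."""
    os.makedirs(save_dir, exist_ok=True)
    # check=True raises subprocess.CalledProcessError on a non-zero exit code.
    subprocess.run(build_wget_command(url, save_dir), check=True)
```

With `check=True`, a failed download raises an exception instead of passing silently, and the list form keeps a save directory containing spaces as a single argument.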



if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument("--save-root", type=str, required=True)
    args = parser.parse_args()

    # download transcriptions of ta
    for src_lang in TRANS_SRC_LANGS:
        for tgt_lang in TRANS_TGT_LANGS:
            if src_lang == tgt_lang:
                continue
            download_transcriptions(
                src_lang, tgt_lang, save_root=args.save_root
            )
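
The remote file name built in `download_transcriptions` sorts the two language codes and then appends the target language, so both directions of a pair share a prefix and differ only in the suffix. A small sketch of just that naming step; `transcription_filename` is a hypothetical helper introduced for illustration:

```python
def transcription_filename(src_lang, tgt_lang):
    """Hypothetical helper reproducing the file-name pattern from the diff."""
    first, second = sorted([src_lang, tgt_lang])
    return f"{first}-{second}_{tgt_lang}.tsv.gz"


print(transcription_filename("fr", "en"))  # en-fr_en.tsv.gz
print(transcription_filename("en", "fr"))  # en-fr_fr.tsv.gz
```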