created new script for converting bilingual captions to monolingual caption #399

Conversation

tyisme614
Contributor

This pull request contains a new Python script for converting bilingual caption files to monolingual caption files.

Introduction

This script removes the English captions from a bilingual caption file to make it monolingual.

By convention, the English caption is the last line of each caption block in an srt file, so the script removes the last line of every caption block in the committed bilingual caption files.
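The idea described above can be sketched roughly as follows. This is not the script from this PR, only a minimal illustration under stated assumptions: the function name `strip_last_caption_line` is hypothetical, blocks are assumed to be separated by blank lines, and a block with a single text line is left untouched (an assumption made here so that monolingual blocks are not emptied).

```python
import re

# Matches an SRT timestamp line such as "00:00:01,000 --> 00:00:04,000".
TIMESTAMP = re.compile(
    r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def strip_last_caption_line(srt_text: str) -> str:
    """Drop the final text line of every caption block with 2+ text lines.

    Hypothetical helper for illustration; not the function used in the PR.
    """
    # Normalize line endings, then split into blocks on blank lines.
    normalized = srt_text.replace("\r\n", "\n").strip()
    blocks = re.split(r"\n\s*\n", normalized)
    converted = []
    for block in blocks:
        lines = block.split("\n")
        # Locate the timestamp line; everything after it is caption text.
        ts_index = next(
            (i for i, line in enumerate(lines) if TIMESTAMP.search(line)), None
        )
        if ts_index is None:
            converted.append(block)  # not a recognizable block; keep as-is
            continue
        header, text = lines[: ts_index + 1], lines[ts_index + 1 :]
        if len(text) >= 2:
            text = text[:-1]  # assumption: the last line is the English caption
        converted.append("\n".join(header + text))
    return "\n\n".join(converted) + "\n"
```

Keeping single-line blocks unchanged is a design choice of this sketch; the actual script may behave differently on such input.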

Env

This script was developed and tested with Python 3.11.

Usage

python3 convert_bilingual_monolingual.py -i <input_file> -o <output_file>

Example

For instance, if the input file is named "test.cn.en.srt" and you want the output file to be named "output_test.cn.srt":

python3 convert_bilingual_monolingual.py -i test.cn.en.srt -o output_test.cn.srt
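For readers curious how a command line like the one above might be parsed, here is a minimal sketch using the standard library's `argparse`. Only the `-i` and `-o` flags come from the usage line; the long option names `--input` and `--output`, the help strings, and the `parse_args` helper are assumptions made for illustration and may not match the merged script.

```python
import argparse


def parse_args(argv=None):
    """Parse the -i/-o flags shown in the usage line (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Convert a bilingual .srt file to a monolingual one."
    )
    # The long names below are hypothetical; the PR only documents -i and -o.
    parser.add_argument("-i", "--input", required=True,
                        help="path to the bilingual .srt file")
    parser.add_argument("-o", "--output", required=True,
                        help="path for the monolingual .srt file")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"input={args.input} output={args.output}")
```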

created script for converting bilingual caption to monolingual caption
updated quality check configuration
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Dec 5, 2022

The documentation is not available anymore as the PR was closed or merged.

@xianbaoqian xianbaoqian requested a review from lewtun December 6, 2022 02:50
@xianbaoqian
Contributor

Hi @tyisme614, thanks so much for contributing the script. It looks good to me, but I'll let Lewis double-check.

Also, thanks for flagging the GitHub CI issue: Python 3.6 is no longer supported. I raised #402 to fix that. Do you mind removing the file from the PR? It's OK for the quality check to fail in this PR, as the failure is not related to your script.

@tyisme614
Contributor Author

sure.

@tyisme614
Contributor Author

@xianbaoqian I reverted my modification to quality.yml, so this PR is clean now.

Member

@lewtun lewtun left a comment

Thank you very much for providing this conversion script @tyisme614 🔥 !!

Would you mind adding your instructions for running it to the README here? https://github.com/huggingface/course/tree/main/subtitles

Also we just merged a PR to fix the quality issues with the CI - would you mind rebasing / merging the main branch on yours and pushing again?

@tyisme614
Contributor Author

no prob~

merged quality.yml
added instruction of converting bilingual subtitle to monolingual
@lewtun
Member

lewtun commented Dec 8, 2022

Thanks, this looks great!

@lewtun lewtun merged commit b983b66 into huggingface:main Dec 8, 2022
lewtun added a commit that referenced this pull request May 11, 2023
* Refactor tokenization of targets for transformers v4.22 (#316)

* Refactor tokenization of targets for transformers v4.22

* typo fix (#319)

line no.49 Changed _SQuAD_it-text.json_-> _SQuAD_it-test.json_

* [FR] Many corrections (#318)

* Fix URL to the Pile (#324)

* Fix URL to the Pile

* [RU] ch5  (#317)

* fix: book url (#323)

* zh-CN - Chapter 7,8,9finished (#315)

Co-authored-by: Lewis Tunstall <[email protected]>

* Refactor events (#261)

* Fix whole word masking labels (#326)

* Fix question answering indices (#327)

* Add translation checker (#329)

* [FR] Refactor events (#330)

* Translation Chapter 4 (#325)

* update author list (de) (#331)

* Fix Russian ToC (#332)

* Refactor dataset upload in Chapter 5 / section 5 (#334)

* Fix id2label types (#337)

* Fix keywords in de quiz chapter 3 (#338)

Noticed two `undefined` in the new render, because the `text` key was capitalized.

* Tweak course validator (#340)

* [Italian] Added Ch2/3 and Ch2/4 (#322)

* Completes chapter 1 (#341)

* Create 5.mdx and translate it into Japanese.

* Create 6.mdx and translate it into Japanese.

* done chapter1.2,1.3

* Create 4.mdx and translate it into Japanese.

* Slightly modified

* Slightly modified

* Slightly modified

* TF generation fixes (#344)

* Fixes to chapter 7

Co-authored-by: lewtun <[email protected]>

* i18n: ES - translate file chapter2/6.mdx (#346)

* Typo in russian translation (#349)

It should be "Обучающий цикл" not "Обучающий цикла"

* Remove translated string (#350)

* [It] Ch2/5, Ch2/6, Ch2/7 (#353)

* Add FAQ (#354)

* i18n: ES - translate file chapter2/7.mdx (#347)

* [id] Add translation to Bahasa Indonesia for chapter0 & some of chapter1 (#351)

* i18n: ES - chapter2/8.mdx (#352)

* Update 4.mdx based on the advice.

* [de] Translation Chapter 1 (#336)

* Update 1.mdx (#356)

* Update 1.mdx (#357)

* removed original english texts to open pull request

* removed original english texts to open pull request

* removed original english texts to open pull request

* add lines for chap1/4 to 6

* Slightly modified

* modify 2.mdx, 3.mdx

* modify _toctree.yml

* Update pr docs actions (#369)

* Add Python syntax highlighting (#370)

* [FR] Add FAQ and more (#367)


Co-authored-by: Lewis Tunstall <[email protected]>

* [RU] Chapter 6 (1/2) finished (#368)

* Spanish translation of Chapter 5 (#366)



Co-authored-by: Lewis Tunstall <[email protected]>

* Add Japanese trasnlation of chapter 1/ 7 to 10  (#359)

* Adding Portuguese Translation to Chapter3 (#361)

* make style

* Typo in Chapter 2, Section 2 (#364)

Replace "inputs" with "outputs".

* Revert "Update pr docs actions (#369)"

This reverts commit 44f77be.

* Typo (#374)

* Chapter 9 - Italian (#373)

* Fix notebook link (#378)

* docs: feat: chapter2-1 in Korean (#375)

Review by @lewtun 22/11/22
docs: fix: remove commented toc for future contributors

* Migrate Spaces URLs to new domain (#379)

* docs: feat: same links across languages (#380)

Added custom anchors using double square brackets, e.g. [[formatted-anchor]]

* Add video transcripts  (#150)

* docs: fix: Accurate for the origin (English) subtitles (#384)

* docs: i18n: add zh-CN machine translation (#385)

* [FR] Notebooks links (#386)

* Upgrade python version in the workflow (#402)

* Update README.md (#389)

Add that preview does not work with windows

* translated chapter2_1-3 (#392)

* fixes small typos (#397)

* Add Chap2/4.mdx and 5.mdx (#391)

Co-authored-by: 長澤春希 <[email protected]>

* created new script for converting bilingual captions to monolingual caption (#399)

* Add French YouTube videos transcription (#410)

* docs(zh-cn): Reviewed 56_data-processing-for-masked-language-modeling.srt (#400)

* docs(zh-cn): Reviewed 57_what-is-perplexity.srt (#401)

* reviewed ep.58 (#405)

* reviewed ep.59 (#406)

* docs(zh-cn): Reviewed 60_what-is-the-bleu-metric.srt (#407)

* finished review (#408)

* docs(zh-cn): Reviewed 61_data-processing-for-summarization.srt (#409)

* Fix subtitle - translation data processing (#411)

* [FR] Final PR (#412)

* [ko] Add chapter 8 translation (#417)

* docs(zh-cn): Reviewed 62_what-is-the-rouge-metric.srt (#419)

* finished review

* fixed errors in original english subtitle

* fixed errors (#420)

* docs(zh-cn): Reviewed 63_data-processing-for-causal-language-modeling.srt (#421)

* Update 63_data-processing-for-causal-language-modeling.srt

* finished review

* Update 63_data-processing-for-causal-language-modeling.srt

* docs(zh-cn): Reviewed 65_data-processing-for-question-answering.srt (#423)

* finished review

* finished review

* finished review (#422)

* Add Ko chapter2 2.mdx (#418)

* Add Ko chapter2 2.mdx

Co-authored-by: IL-GU KIM <[email protected]>
Co-authored-by: Yuan <[email protected]>

* update textbook link (#427)

* Visual fixes (#428)

* finish first round review (#429)

* Fix French subtitles + refactor conversion script (#431)

* Fix subtitles and scripts

* Fix subtitle

* Add tokenizer to MLM Trainer (#432)

* Fix FR video descriptions (#433)

* Fix FR video descriptions

* Rename file

* Fix dead GPT model docs link. (#430)

* Translate into Korean: 2-3 (#434)

Co-authored-by: “Ryan” <“[email protected]”>

* Add korean translation of chapter5 (1,2) (#441)

update toctree for chapter 5 (1, 2)
ensure same title for 5-2
add updates from upstream English with custom anchors

Co-Authored-By: Minho Ryu <[email protected]>

Co-authored-by: Meta Learner응용개발팀 류민호 <[email protected]>
Co-authored-by: Minho Ryu <[email protected]>

* Update 3.mdx (#444)

* docs(zh-cn): Reviewed 67_the-post-processing-step-in-question-answering-(tensorflow).srt (#447)

* Update 67_the-post-processing-step-in-question-answering-(tensorflow).srt

* finished review

* docs(zh-cn): Reviewed 66_the-post-processing-step-in-question-answering-(pytorch).srt (#448)

* Update 66_the-post-processing-step-in-question-answering-(pytorch).srt

* finished review

* refined translation

* docs(zh-cn): Reviewed 01_the-pipeline-function.srt (#452)

* finish review

* Update subtitles/zh-CN/01_the-pipeline-function.srt

Co-authored-by: Luke Cheng <[email protected]>

Co-authored-by: Luke Cheng <[email protected]>

* finish review (#453)

* Revise some unnatural translations (#458)

Some unnatural translations have been revised to use expressions more popular with Chinese readers

* Fix chapter 5 links (#461)

* fix small typo (#460)

* Add Ko chapter2 3~8.mdx & Modify Ko chapter2 2.mdx typo (#446)

* Add captions for tasks videos (#464)

* Add captions for tasks videos

* Fix script

* [FR] Add 🤗  Tasks videos (#468)

* Synchronous Chinese course update

Update the Chinese Course document to
sha:f71cf6c3b4cb235bc75a14416c6e8a57fc3d00a7
sha date: 2023/01/06 00:02:26 UTC+8

* review sync

* Update 3.mdx

* format zh_CN

* format all mdx

* Remove temp folder

* finished review (#449)

* docs(zh-cn): Reviewed 31_navigating-the-model-hub.srt (#451)

* docs(zh-cn): Reviewed No. 08 - What happens inside the pipeline function? (PyTorch) (#454)

* docs(zh-cn): Reviewed 03_what-is-transfer-learning.srt (#457)

* docs(zh-cn): 32_managing-a-repo-on-the-model-hub.srt (#469)

* docs(zh-cn): Reviewed No. 10 - Instantiate a Transformers model (PyTorch) (#472)

* update Chinese translation

有一些英文句子与中文语序是相反的,我直接按照最终的中文语序排列了,这样是否可以?

* finish first round review

* finish second round review

* finish second round review

* branch commit

* Update subtitles/zh-CN/10_instantiate-a-transformers-model-(pytorch).srt

Co-authored-by: Luke Cheng <[email protected]>

* Update subtitles/zh-CN/10_instantiate-a-transformers-model-(pytorch).srt

Co-authored-by: Luke Cheng <[email protected]>

---------

Co-authored-by: Luke Cheng <[email protected]>

* docs(zh-cn): 33_the-push-to-hub-api-(pytorch).srt (#473)

* docs(zh-cn): Reviewed 34_the-push-to-hub-api-(tensorflow).srt (#479)

* running python utils/code_formatter.py

* review 05 cn translations

* review 06 cn translations

* Review No.11

* translate no.24

* review 06 cn translations

* review 07 cn translations

* Update 23_what-is-dynamic-padding.srt

* Update 23_what-is-dynamic-padding.srt

* Update 23_what-is-dynamic-padding.srt

* Update subtitles/zh-CN/23_what-is-dynamic-padding.srt

Co-authored-by: Luke Cheng <[email protected]>

* Update subtitles/zh-CN/23_what-is-dynamic-padding.srt

Co-authored-by: Luke Cheng <[email protected]>

* add blank

* Review No. 11, No. 12

* Review No. 13

* Review No. 12

* Review No. 14

* finished review

* optimized translation

* optimized translation

* docs(zh-cn): Reviewed No. 29 - Write your training loop in PyTorch

* Review 15

* Review 16

* Review 17

* Review 18

* Review ch 72 translation

* Update 72 cn translation

* To be reviewed No.42-No.54

* No.11 check-out

* No.12 check-out

* No. 13 14 check-out

* No. 15 16 check-out

* No. 17 18 check-out

* Add note for "token-*"

* Reviewed No.8, 9, 10

* Reviewed No.42

* Review No.43

* finished review

* optimized translation

* finished review

* optimized translation

* Review 44(need refine)

* Review 45(need refine)

* Review No. 46 (need refine)

* Review No.47

* Review No.46

* Review No.45

* Review No.44

* Review No.48

* Review No.49

* Review No.50

* Modify Ko chapter2 8.mdx (#465)

* Add Ko chapter2 2.mdx

* Add Ko chapter2 2.mdx

* Add Ko chapter2 3.mdx & 4.mdx

* Modify Ko chapter2 3.mdx & 4.mdx

* Modify Ko chapter2 3.mdx & 4.mdx

* Modify Ko chapter2 3.mdx & 4.mdx

* Modify _toctree.yml

* Add Ko chapter2 5.mdx

* Modify Ko chapter2 4.mdx

* Add doc-builder step

* Add Ko chapter2 6~8.mdx & Modify Ko chapter2 2.mdx typo

* Modify Ko _toctree.yml

* Modify Ko chapter2 8.mdx & README.md

* Fixed typo (#471)

* fixed subtitle errors (#474)

timestamp: 00:00:26,640 --> 00:00:28,620
modification: notification --> authentication

timestamp: 00:04:21,113 --> 00:04:22,923
modification: of --> or

* Fixed a typo (#475)

* Update 3.mdx (#526)

Fix typo

* [zh-TW] Added chapters 1-9 (#477)

The translation is based on Simplified Chinese version, converted via OpenCC and fixed some formatting issues.

* finished review

* Explain why there are more tokens, than reviews (#476)

* Explain why there are more tokens, than reviews

* Update chapters/en/chapter5/3.mdx

---------

Co-authored-by: lewtun <[email protected]>

* [RU] Subtitles for Chapter 1 of the video course (#489)

* Created a directory for the russian subtitles.

Created a folder for Russian subtitles for the video course and published a translation of the introductory video from chapter 1.

* Uploaded subtitles for chapter 1

Uploaded subtitles for the remaining videos for chapter 1 of the video course.

* Added subtitles for chapter 2 of the video course

Added STR subtitle files for the second chapter of the YouTube video course.

* Delete subtitles/ru directory

Removed the old translation. Incorrect timestamping.

* Create 00_welcome-to-the-hugging-face-course.srt

Create a directory and upload a subtitle file for the introductory video of the course.

* Add files via upload

Upload subtitle files for the first chapter of the course.

* Review No.52

* [ru] Added the glossary and translation guide (#490)

* Added the glossary and translation guide

* Fixed casing

* Minor fixes

* Updated glossary

* Glossary update

* Glossary update

* Glossary update

* [ru] Chapters 0 and 1 proofreading, updating and translating missing sections (#491)

* Chapter 0 proofreading

* Chapter 1 Section 1 proofreading
- Added new people from English version;
- Added links to creator's pages;
- Added FAQ translation;

* Chapter 1 Sections 2-5 proofreading

* Chapter 1 Sections 6-9 proofreading

* Final proofreading and added missing quiz section

* Minor spelling corrections

* Review No.51

* Review No.53

* Review No.54

* finished review

* modified translation

* modified translation

* modified subtitle

use the same text appeared in video

* translated

* Fix typo (#532)

* review chapter4/2

* review chapter4/2

* review chapter4/2

* Review 75

* Review No.20, need review some

* docs(zh-cn): Reviewed Chapter 7/1

* Update 1.mdx

* Review No.22

* Review No.21 (need refinement)

* Review No.30, need review: 26 27 28 30 73 74

* Review 30 (good)

* Review 20

* Review 21 (refine)

* Review 21

* Review 22

* Review 26

* Review 27

* Review 28

* Review 30

* Review 73

* Review 74

* Review 26-28, 42-54, 73-75

* Demo link fixes (#562)

* demo link fixes

* minor demo fix

---------

Co-authored-by: Aravind Kumar <[email protected]>
Co-authored-by: lbourdois <[email protected]>
Co-authored-by: Pavel <[email protected]>
Co-authored-by: buti1021 <[email protected]>
Co-authored-by: 1375626371 <[email protected]>
Co-authored-by: Fabrizio Damicelli <[email protected]>
Co-authored-by: Jesper Dramsch <[email protected]>
Co-authored-by: Acciaro Gennaro Daniele <[email protected]>
Co-authored-by: Caterina Bonan <[email protected]>
Co-authored-by: Haruki Nagasawa <[email protected]>
Co-authored-by: blackdoor571 <[email protected]>
Co-authored-by: Matt <[email protected]>
Co-authored-by: Angel Mendez <[email protected]>
Co-authored-by: Artem Vysotsky <[email protected]>
Co-authored-by: Gusti Adli Anshari <[email protected]>
Co-authored-by: Marcus Fraaß <[email protected]>
Co-authored-by: Christopher Akiki <[email protected]>
Co-authored-by: Mishig <[email protected]>
Co-authored-by: David Gilbertson <[email protected]>
Co-authored-by: Camilo Martínez Burgos <[email protected]>
Co-authored-by: Hiroaki Funayama <[email protected]>
Co-authored-by: Cesar0106 <[email protected]>
Co-authored-by: Younes Belkada <[email protected]>
Co-authored-by: Filippo Broggini <[email protected]>
Co-authored-by: Mishig <[email protected]>
Co-authored-by: Nanachi <[email protected]>
Co-authored-by: Edoardo Abati <[email protected]>
Co-authored-by: Wonhyeong Seo <[email protected]>
Co-authored-by: Luke Cheng <[email protected]>
Co-authored-by: xianbaoqian <[email protected]>
Co-authored-by: Thomas Simonini <[email protected]>
Co-authored-by: Subaru Kimura <[email protected]>
Co-authored-by: Carlos Santos Garcia <[email protected]>
Co-authored-by: 長澤春希 <[email protected]>
Co-authored-by: Yuan <[email protected]>
Co-authored-by: IL-GU KIM <[email protected]>
Co-authored-by: Kim Bo Geum <[email protected]>
Co-authored-by: Bartosz Szmelczynski <[email protected]>
Co-authored-by: Shawn Lee <[email protected]>
Co-authored-by: Naveen Reddy D <[email protected]>
Co-authored-by: rainmaker <[email protected]>
Co-authored-by: “Ryan” <“[email protected]”>
Co-authored-by: Meta Learner응용개발팀 류민호 <[email protected]>
Co-authored-by: Minho Ryu <[email protected]>
Co-authored-by: richardachen <[email protected]>
Co-authored-by: beyondguo <[email protected]>
Co-authored-by: bsenst <[email protected]>
Co-authored-by: 1375626371 <[email protected]>
Co-authored-by: yaoqih <[email protected]>
Co-authored-by: 李洋 <[email protected]>
Co-authored-by: PowerChina <[email protected]>
Co-authored-by: chenglu99 <[email protected]>
Co-authored-by: iCell <[email protected]>
Co-authored-by: Qi Zhang <[email protected]>
Co-authored-by: researcher <[email protected]>
Co-authored-by: simpleAI <[email protected]>
Co-authored-by: FYJNEVERFOLLOWS <[email protected]>
Co-authored-by: zhangchaosd <[email protected]>
Co-authored-by: TK Buristrakul <[email protected]>
Co-authored-by: Carlos Aguayo <[email protected]>
Co-authored-by: ateliershen <[email protected]>
Co-authored-by: Pavel Nesterov <[email protected]>
Co-authored-by: Artyom Boyko <[email protected]>
Co-authored-by: Kirill Milintsevich <[email protected]>
Co-authored-by: jybarnes21 <[email protected]>
Co-authored-by: gxy-gxy <[email protected]>
Co-authored-by: iLeGend <[email protected]>
Co-authored-by: Maria Khalusova <[email protected]>
4 participants