From a977f4a627ef06382dd08a49a7b63b6f288bcaac Mon Sep 17 00:00:00 2001 From: Bartosz Szmelczynski <43574448+Bearnardd@users.noreply.github.com> Date: Tue, 27 Dec 2022 02:08:05 +0100 Subject: [PATCH 1/5] update textbook link (#427) --- chapters/en/chapter7/6.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/chapter7/6.mdx b/chapters/en/chapter7/6.mdx index 2a42aa6b4..7a498a863 100644 --- a/chapters/en/chapter7/6.mdx +++ b/chapters/en/chapter7/6.mdx @@ -36,7 +36,7 @@ This is actually showcasing the model that was trained and uploaded to the Hub u ## Gathering the data[[gathering-the-data]] -Python code is abundantly available from code repositories such as GitHub, which we can use to create a dataset by scraping for every Python repository. This was the approach taken in the [Transformers textbook](https://learning.oreilly.com/library/view/natural-language-processing/9781098103231/) to pretrain a large GPT-2 model. Using a GitHub dump of about 180 GB containing roughly 20 million Python files called `codeparrot`, the authors built a dataset that they then shared on the [Hugging Face Hub](https://huggingface.co/datasets/transformersbook/codeparrot). +Python code is abundantly available from code repositories such as GitHub, which we can use to create a dataset by scraping for every Python repository. This was the approach taken in the [Transformers textbook](https://learning.oreilly.com/library/view/natural-language-processing/9781098136789/) to pretrain a large GPT-2 model. Using a GitHub dump of about 180 GB containing roughly 20 million Python files called `codeparrot`, the authors built a dataset that they then shared on the [Hugging Face Hub](https://huggingface.co/datasets/transformersbook/codeparrot). However, training on the full corpus is time- and compute-consuming, and we only need the subset of the dataset concerned with the Python data science stack. 
So, let's start by filtering the `codeparrot` dataset for all files that include any of the libraries in this stack. Because of the dataset's size, we want to avoid downloading it; instead, we'll use the streaming feature to filter it on the fly. To help us filter the code samples using the libraries we mentioned earlier, we'll use the following function: From 44277ebc8e55465855ab949c85663e054cc81928 Mon Sep 17 00:00:00 2001 From: lbourdois <58078086+lbourdois@users.noreply.github.com> Date: Tue, 27 Dec 2022 10:14:18 +0100 Subject: [PATCH 2/5] Visual fixes (#428) --- chapters/en/chapter1/1.mdx | 19 ++++++++++--------- chapters/en/chapter1/4.mdx | 2 ++ chapters/fr/chapter7/6.mdx | 2 +- chapters/vi/chapter1/1.mdx | 18 +++++++++--------- 4 files changed, 22 insertions(+), 19 deletions(-) diff --git a/chapters/en/chapter1/1.mdx b/chapters/en/chapter1/1.mdx index 5136d0fe4..30c992371 100644 --- a/chapters/en/chapter1/1.mdx +++ b/chapters/en/chapter1/1.mdx @@ -37,23 +37,23 @@ After you've completed this course, we recommend checking out DeepLearning.AI's About the authors: -**Abubakar Abid** completed his PhD at Stanford in applied machine learning. During his PhD, he founded [Gradio](https://github.com/gradio-app/gradio), an open-source Python library that has been used to build over 600,000 machine learning demos. Gradio was acquired by Hugging Face, which is where Abubakar now serves as a machine learning team lead. +[**Abubakar Abid**](https://huggingface.co/abidlabs) completed his PhD at Stanford in applied machine learning. During his PhD, he founded [Gradio](https://github.com/gradio-app/gradio), an open-source Python library that has been used to build over 600,000 machine learning demos. Gradio was acquired by Hugging Face, which is where Abubakar now serves as a machine learning team lead. -**Matthew Carrigan** is a Machine Learning Engineer at Hugging Face. 
He lives in Dublin, Ireland and previously worked as an ML engineer at Parse.ly and before that as a post-doctoral researcher at Trinity College Dublin. He does not believe we're going to get to AGI by scaling existing architectures, but has high hopes for robot immortality regardless. +[**Matthew Carrigan**](https://huggingface.co/Rocketknight1) is a Machine Learning Engineer at Hugging Face. He lives in Dublin, Ireland and previously worked as an ML engineer at Parse.ly and before that as a post-doctoral researcher at Trinity College Dublin. He does not believe we're going to get to AGI by scaling existing architectures, but has high hopes for robot immortality regardless. -**Lysandre Debut** is a Machine Learning Engineer at Hugging Face and has been working on the 🤗 Transformers library since the very early development stages. His aim is to make NLP accessible for everyone by developing tools with a very simple API. +[**Lysandre Debut**](https://huggingface.co/lysandre) is a Machine Learning Engineer at Hugging Face and has been working on the 🤗 Transformers library since the very early development stages. His aim is to make NLP accessible for everyone by developing tools with a very simple API. -**Sylvain Gugger** is a Research Engineer at Hugging Face and one of the core maintainers of the 🤗 Transformers library. Previously he was a Research Scientist at fast.ai, and he co-wrote _[Deep Learning for Coders with fastai and PyTorch](https://learning.oreilly.com/library/view/deep-learning-for/9781492045519/)_ with Jeremy Howard. The main focus of his research is on making deep learning more accessible, by designing and improving techniques that allow models to train fast on limited resources. +[**Sylvain Gugger**](https://huggingface.co/sgugger) is a Research Engineer at Hugging Face and one of the core maintainers of the 🤗 Transformers library. 
Previously he was a Research Scientist at fast.ai, and he co-wrote _[Deep Learning for Coders with fastai and PyTorch](https://learning.oreilly.com/library/view/deep-learning-for/9781492045519/)_ with Jeremy Howard. The main focus of his research is on making deep learning more accessible, by designing and improving techniques that allow models to train fast on limited resources. -**Dawood Khan** is a Machine Learning Engineer at Hugging Face. He's from NYC and graduated from New York University studying Computer Science. After working as an iOS Engineer for a few years, Dawood quit to start Gradio with his fellow co-founders. Gradio was eventually acquired by Hugging Face. +[**Dawood Khan**](https://huggingface.co/dawoodkhan82) is a Machine Learning Engineer at Hugging Face. He's from NYC and graduated from New York University studying Computer Science. After working as an iOS Engineer for a few years, Dawood quit to start Gradio with his fellow co-founders. Gradio was eventually acquired by Hugging Face. -**Merve Noyan** is a developer advocate at Hugging Face, working on developing tools and building content around them to democratize machine learning for everyone. +[**Merve Noyan**](https://huggingface.co/merve) is a developer advocate at Hugging Face, working on developing tools and building content around them to democratize machine learning for everyone. -**Lucile Saulnier** is a machine learning engineer at Hugging Face, developing and supporting the use of open source tools. She is also actively involved in many research projects in the field of Natural Language Processing such as collaborative training and BigScience. +[**Lucile Saulnier**](https://huggingface.co/SaulLu) is a machine learning engineer at Hugging Face, developing and supporting the use of open source tools. She is also actively involved in many research projects in the field of Natural Language Processing such as collaborative training and BigScience. 
-**Lewis Tunstall** is a machine learning engineer at Hugging Face, focused on developing open-source tools and making them accessible to the wider community. He is also a co-author of the O’Reilly book [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/). +[**Lewis Tunstall**](https://huggingface.co/lewtun) is a machine learning engineer at Hugging Face, focused on developing open-source tools and making them accessible to the wider community. He is also a co-author of the O’Reilly book [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/). -**Leandro von Werra** is a machine learning engineer in the open-source team at Hugging Face and also a co-author of the O’Reilly book [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/). He has several years of industry experience bringing NLP projects to production by working across the whole machine learning stack.. +[**Leandro von Werra**](https://huggingface.co/lvwerra) is a machine learning engineer in the open-source team at Hugging Face and also a co-author of the O’Reilly book [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/). He has several years of industry experience bringing NLP projects to production by working across the whole machine learning stack. ## FAQ[[faq]] @@ -100,6 +100,7 @@ Of course! The course is released under the permissive [Apache 2 license](https: } ``` +## Let's Go Are you ready to roll? 
In this chapter, you will learn: * How to use the `pipeline()` function to solve NLP tasks such as text generation and classification diff --git a/chapters/en/chapter1/4.mdx b/chapters/en/chapter1/4.mdx index 7097771f9..80f692852 100644 --- a/chapters/en/chapter1/4.mdx +++ b/chapters/en/chapter1/4.mdx @@ -81,6 +81,8 @@ Imagine if each time a research team, a student organization, or a company wante This is why sharing language models is paramount: sharing the trained weights and building on top of already trained weights reduces the overall compute cost and carbon footprint of the community. +By the way, you can estimate the carbon footprint of your models' training with several tools, for example [ML CO2 Impact](https://mlco2.github.io/impact/) or [Code Carbon](https://codecarbon.io/), which is integrated in 🤗 Transformers. To learn more about this, you can read this [blog post](https://huggingface.co/blog/carbon-emissions-on-the-hub), which shows you how to generate an `emissions.csv` file with an estimate of the footprint of your training, as well as the [documentation](https://huggingface.co/docs/hub/model-cards-co2) of 🤗 Transformers addressing this topic. + ## Transfer Learning[[transfer-learning]] diff --git a/chapters/fr/chapter7/6.mdx b/chapters/fr/chapter7/6.mdx index e8dd9a3cd..91c90a96c 100644 --- a/chapters/fr/chapter7/6.mdx +++ b/chapters/fr/chapter7/6.mdx @@ -41,7 +41,7 @@ Il s'agit d'une présentation du modèle qui a été entraîné à l'aide du cod ## Collecte des données -On peut trouver du code Python en abondance dans les dépôts de code tels que GitHub, que nous pouvons utiliser pour créer un jeu de données en récupérant chaque dépôt Python. C'est l'approche adoptée dans le [livre *Natural Language Processing with Transformers*](https://learning.oreilly.com/library/view/natural-language-processing/9781098103231/) pour pré-entraîner un grand GPT-2. 
En utilisant un dépôt GitHub d'environ 180 Go contenant approximativement 20 millions de fichiers Python, les auteurs du livre ont construit un jeu de données appelé `codeparrot` qu'ils ont ensuite partagé sur le [*Hub*](https://huggingface.co/datasets/transformersbook/codeparrot). +On peut trouver du code Python en abondance dans les dépôts de code tels que GitHub, que nous pouvons utiliser pour créer un jeu de données en récupérant chaque dépôt Python. C'est l'approche adoptée dans le [livre *Natural Language Processing with Transformers*](https://learning.oreilly.com/library/view/natural-language-processing/9781098136789/) pour pré-entraîner un grand GPT-2. En utilisant un dépôt GitHub d'environ 180 Go contenant approximativement 20 millions de fichiers Python, les auteurs du livre ont construit un jeu de données appelé `codeparrot` qu'ils ont ensuite partagé sur le [*Hub*](https://huggingface.co/datasets/transformersbook/codeparrot). Cependant, entraîner sur l'ensemble du corpus prend beaucoup de temps et demande beaucoup de ressources de calculs. Dans notre cas, nous n'avons besoin que du sous-ensemble du jeu de données qui est relatif aux codes portant sur la science des données. Commençons donc par filtrer le jeu de données `codeparrot` en ne gardant que les fichiers incluant l'une des bibliothèques de science des données énumérées précédemment. En raison de la taille du jeu de données, nous voulons éviter de le télécharger. Nous utiliserons donc la fonctionnalité de *streaming* de 🤗 *Datasets* afin de le filtrer à la volée. 
Pour nous aider à filtrer les échantillons de code utilisant les bibliothèques que nous avons mentionnées précédemment, nous utilisons la fonction suivante : diff --git a/chapters/vi/chapter1/1.mdx b/chapters/vi/chapter1/1.mdx index 0dffa6bd9..fe0bb2bcf 100644 --- a/chapters/vi/chapter1/1.mdx +++ b/chapters/vi/chapter1/1.mdx @@ -36,23 +36,23 @@ Sau khi bạn hoàn thành khóa học này, chúng tôi khuyến khích bạn x Giới thiệu về tác giả: -**Abubakar Abid** đã hoàn thành chương trình Tiến sĩ về học máy ứng dụng tại Stanford. Trong thời gian học tiến sĩ, anh ấy đã tạo ra [Gradio](https://github.com/gradio-app/gradio), một thư viện Python mã nguồn mở được sử dụng để xây dựng hơn 600,000 bản demo học máy. Gradio được mua lại bởi Hugging Face, nơi Abubakar hiện đóng vai trò là trưởng nhóm học máy. +[**Abubakar Abid**](https://huggingface.co/abidlabs) đã hoàn thành chương trình Tiến sĩ về học máy ứng dụng tại Stanford. Trong thời gian học tiến sĩ, anh ấy đã tạo ra [Gradio](https://github.com/gradio-app/gradio), một thư viện Python mã nguồn mở được sử dụng để xây dựng hơn 600,000 bản demo học máy. Gradio được mua lại bởi Hugging Face, nơi Abubakar hiện đóng vai trò là trưởng nhóm học máy. -**Matthew Carrigan** là một Kỹ sư Học máy tại Hugging Face. Anh ấy sống ở Dublin, Ireland, trước đây là kỹ sư Học máy tại Parse.ly và trước đó là nhà nghiên cứu sau tiến sĩ tại Trinity College Dublin. Anh ấy không tin rằng chúng ta sẽ đạt được AGI bằng cách mở rộng các kiến ​​trúc hiện có, nhưng có niềm tin vào sự bất tử của robot. +[**Matthew Carrigan**](https://huggingface.co/Rocketknight1) là một Kỹ sư Học máy tại Hugging Face. Anh ấy sống ở Dublin, Ireland, trước đây là kỹ sư Học máy tại Parse.ly và trước đó là nhà nghiên cứu sau tiến sĩ tại Trinity College Dublin. Anh ấy không tin rằng chúng ta sẽ đạt được AGI bằng cách mở rộng các kiến ​​trúc hiện có, nhưng có niềm tin vào sự bất tử của robot. 
-**Lysandre Debut** là một Kỹ sư Học máy tại Hugging Face và đã làm việc với thư viện 🤗 Transformers từ những giai đoạn đầu phát triển. Mục tiêu của anh ấy là làm cho NLP có thể dễ dàng truy cập được từ tất cả mọi người bằng cách phát triển các công cụ với một API rất đơn giản. +[**Lysandre Debut**](https://huggingface.co/lysandre) là một Kỹ sư Học máy tại Hugging Face và đã làm việc với thư viện 🤗 Transformers từ những giai đoạn đầu phát triển. Mục tiêu của anh ấy là làm cho NLP có thể dễ dàng truy cập được từ tất cả mọi người bằng cách phát triển các công cụ với một API rất đơn giản. -**Sylvain Gugger** là Kỹ sư nghiên cứu tại Hugging Face và là một trong những thành viên cốt lõi của thư viện 🤗 Transformers. Trước đây, anh ấy là Nhà nghiên cứu khoa học tại fast.ai và anh ấy là đồng sáng tác đầu sách _[Deep Learning for Coders with fastai and PyTorch](https://learning.oreilly.com/library/view/deep-learning-for/9781492045519/)_ cùng với Jeremy Howard. Hướng nghiên cứu chính của anh ấy là làm cho việc học sâu trở nên dễ tiếp cận hơn, bằng cách thiết kế và cải tiến các kỹ thuật cho phép các mô hình huấn luyện nhanh trên các tài nguyên hạn chế. +[**Sylvain Gugger**](https://huggingface.co/sgugger) là Kỹ sư nghiên cứu tại Hugging Face và là một trong những thành viên cốt lõi của thư viện 🤗 Transformers. Trước đây, anh ấy là Nhà nghiên cứu khoa học tại fast.ai và anh ấy là đồng sáng tác đầu sách _[Deep Learning for Coders with fastai and PyTorch](https://learning.oreilly.com/library/view/deep-learning-for/9781492045519/)_ cùng với Jeremy Howard. Hướng nghiên cứu chính của anh ấy là làm cho việc học sâu trở nên dễ tiếp cận hơn, bằng cách thiết kế và cải tiến các kỹ thuật cho phép các mô hình huấn luyện nhanh trên các tài nguyên hạn chế. -**Dawood Khan** là một Kỹ sư Học máy tại Hugging Face. Anh ấy đến từ New York và tốt nghiệp Đại học New York chuyên ngành Khoa học máy tính. 
Sau khi làm việc với tư cách là Kỹ sư iOS trong một vài năm, Dawood đã nghỉ việc để bắt đầu phát triển Gradio cùng với những người đồng sáng lập của mình. Gradio cuối cùng đã được mua lại bởi Hugging Face. +[**Dawood Khan**](https://huggingface.co/dawoodkhan82) là một Kỹ sư Học máy tại Hugging Face. Anh ấy đến từ New York và tốt nghiệp Đại học New York chuyên ngành Khoa học máy tính. Sau khi làm việc với tư cách là Kỹ sư iOS trong một vài năm, Dawood đã nghỉ việc để bắt đầu phát triển Gradio cùng với những người đồng sáng lập của mình. Gradio cuối cùng đã được mua lại bởi Hugging Face. -**Merve Noyan** là Chuyên gia về Quan hệ lập trình viên tại Hugging Face, hiện đang phát triển các công cụ và xây dựng nội dung xung quanh chúng để tất cả mọi người có thể tiếp cận học máy dễ dàng hơn. +[**Merve Noyan**](https://huggingface.co/merve) là Chuyên gia về Quan hệ lập trình viên tại Hugging Face, hiện đang phát triển các công cụ và xây dựng nội dung xung quanh chúng để tất cả mọi người có thể tiếp cận học máy dễ dàng hơn. -**Lucile Saulnier** là một Kỹ sư Học máy tại Hugging Face, phát triển và hỗ trợ việc sử dụng các công cụ mã nguồn mở. Cô cũng tích cực tham gia vào nhiều dự án nghiên cứu trong lĩnh vực Xử lý Ngôn ngữ Tự nhiên như huấn luyện cộng tác và BigScience. +[**Lucile Saulnier**](https://huggingface.co/SaulLu) là một Kỹ sư Học máy tại Hugging Face, phát triển và hỗ trợ việc sử dụng các công cụ mã nguồn mở. Cô cũng tích cực tham gia vào nhiều dự án nghiên cứu trong lĩnh vực Xử lý Ngôn ngữ Tự nhiên như huấn luyện cộng tác và BigScience. -**Lewis Tunstall** là một Kỹ sư Học máy tại Hugging Face, tập trung vào việc phát triển các công cụ mã nguồn mở và giúp chúng có thể tiếp cận được với cộng đồng rộng lớn hơn. Anh cũng là đồng tác giả của cuốn sách O’Reilly [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/). 
+[**Lewis Tunstall**](https://huggingface.co/lewtun) là một Kỹ sư Học máy tại Hugging Face, tập trung vào việc phát triển các công cụ mã nguồn mở và giúp chúng có thể tiếp cận được với cộng đồng rộng lớn hơn. Anh cũng là đồng tác giả của cuốn sách O’Reilly [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/). -**Leandro von Werra** là một Kỹ sư Học máy trong nhóm mã nguồn mở tại Hugging Face và cũng là đồng tác giả của cuốn sách O'Reilly [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/). Anh ấy có nhiều năm kinh nghiệm thực tế triển khai các dự án NLP vào sản xuất bằng cách làm việc trên toàn bộ hệ thống học máy. +[**Leandro von Werra**](https://huggingface.co/lvwerra) là một Kỹ sư Học máy trong nhóm mã nguồn mở tại Hugging Face và cũng là đồng tác giả của cuốn sách O'Reilly [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/). Anh ấy có nhiều năm kinh nghiệm thực tế triển khai các dự án NLP vào sản xuất bằng cách làm việc trên toàn bộ hệ thống học máy. Bạn đã sẵn sàng chưa? 
Trong chương này, bạn sẽ học: From 78a357611b2af5bd52030f1aa0af37669236caa2 Mon Sep 17 00:00:00 2001 From: Shawn Lee Date: Wed, 28 Dec 2022 10:08:31 +0900 Subject: [PATCH 3/5] finish first round review (#429) --- .../00_welcome-to-the-hugging-face-course.srt | 158 +++++++++--------- 1 file changed, 79 insertions(+), 79 deletions(-) diff --git a/subtitles/zh-CN/00_welcome-to-the-hugging-face-course.srt b/subtitles/zh-CN/00_welcome-to-the-hugging-face-course.srt index 0e8575089..462977fe2 100644 --- a/subtitles/zh-CN/00_welcome-to-the-hugging-face-course.srt +++ b/subtitles/zh-CN/00_welcome-to-the-hugging-face-course.srt @@ -5,7 +5,7 @@ 2 00:00:08,550 --> 00:00:10,320 -本课程旨在带你了解 +本课程旨在带您了解 This course has been designed to teach you 3 @@ -15,7 +15,7 @@ all about the Hugging Face ecosystem, 4 00:00:12,750 --> 00:00:14,700 -如何使用数据集和模型中心 +包括如何使用数据集和模型中心 how to use the dataset and model hub 5 @@ -25,37 +25,37 @@ as well as all our open-source libraries. 6 00:00:18,300 --> 00:00:19,950 -这是目录。 +这是本课程的目录。 Here is the Table of Contents. 7 00:00:19,950 --> 00:00:22,770 -如你所见,它分为三个部分 +它共分为三个部分 As you can see, it's divided in three sections 8 00:00:22,770 --> 00:00:25,110 -逐渐变得更先进。 +由浅入深地带您学习。 which become progressively more advanced. 9 00:00:25,110 --> 00:00:28,500 -现阶段,前两部分已经发布。 +到目前为止,前两部分已经发布。 At this stage, the first two sections have been released. 10 00:00:28,500 --> 00:00:30,120 -所以首先,我们会教你基础知识 +在课程的一开始,我们会教您基础知识 So first, we'll teach you the basics 11 00:00:30,120 --> 00:00:32,250 -如何使用 Transformer 模型, +包括如何使用 Transformer 模型, of how to use a Transformer model, 12 00:00:32,250 --> 00:00:34,230 -在你自己的数据集上微调 +以及如何在您自己的数据集上进行微调 fine-tune it on your own data set 13 @@ -65,22 +65,22 @@ and share the result with the community. 14 00:00:36,960 --> 00:00:39,420 -其次,我们将更深入地研究我们的图书馆 +然后,我们将带您深入了解我们的开源库 So second, we'll dive deeper into our libraries 15 00:00:39,420 --> 00:00:42,360 -并教你如何处理任何 NLP 任务。 +并教您如何处理任何 NLP 任务。 and teach you how to tackle any NLP task. 
16 00:00:42,360 --> 00:00:44,430 -我们正在积极研究最后一个 +我们正在积极研究最后一部分 We're actively working on the last one 17 00:00:44,430 --> 00:00:47,280 -并希望在 2022 年春季为你准备好它。 +并希望在 2022 年春季完成并发布。 and hope to have it ready for you for the spring of 2022. 18 @@ -90,62 +90,62 @@ The first chapter requires no technical knowledge 19 00:00:50,880 --> 00:00:52,320 -是一个很基础的介绍 +只会为您介绍一些基础知识 and is a good introduction to learn 20 00:00:52,320 --> 00:00:54,180 -关于 Transformers 模型可以做什么 +例如 Transformers 模型可以做什么 what Transformers models can do 21 00:00:54,180 --> 00:00:56,883 -以及它如何对你或你的公司有用。 +以及它如何帮助到您以及应用到您公司的业务中。 and how it could be of use to you or your company. 22 00:00:58,050 --> 00:01:01,110 -接下来的章节需要对 Python 有很好的了解 +第一部分之后的章节需要具备 Python 的相关知识 The next chapters require a good knowledge of Python 23 00:01:01,110 --> 00:01:02,130 -以及一些基础知识 +以及机器学习和深度学习的 and some basic knowledge of 24 00:01:02,130 --> 00:01:04,350 -机器学习和深度学习。 +一些基础知识。 Machine Learning and Deep Learning. 25 00:01:04,350 --> 00:01:07,110 -如果你不知道什么是训练集和验证集 +如果您不知道什么是训练集和验证集 If you don't know what a training and validation set are 26 00:01:07,110 --> 00:01:09,360 -或者梯度体面意味着什么, +或者梯度下降法意味着什么, or what gradient decent means, 27 00:01:09,360 --> 00:01:11,340 -你应该看看入门课程 +您应该看看一些 you should look at an introductory course 28 00:01:11,340 --> 00:01:14,863 -例如 deeplearning.ai 或 fast.ai 发布的那些。 +诸如 deeplearning.ai 或 fast.ai 发布的入门课程 such as the ones published by deeplearning.ai or fast.ai. 29 00:01:16,200 --> 00:01:17,910 -如果你有一些基础知识也是最好的 +如果您有一些关于某个 It's also best if you have some basics 30 00:01:17,910 --> 00:01:21,150 -在一个深度学习框架、PyTorch 或 TensorFlow 中。 +深度学习框架、PyTorch 或 TensorFlow 中的基础知识那就更好了。 in one Deep Learning Framework, PyTorch or TensorFlow. 
31 @@ -155,17 +155,17 @@ Each part of the material introduced in this course 32 00:01:23,520 --> 00:01:25,590 -在这两个框架中都有一个版本, +在 PyTorch 和 TensorFlow 中都有一个相对应的版本, has a version in both those frameworks, 33 00:01:25,590 --> 00:01:26,730 -这样你就可以选择一个 +这样您就可以选择一个 so you will be able to pick the one 34 00:01:26,730 --> 00:01:28,230 -你最舒服。 +您最熟悉的版本。 you are most comfortable with. 35 @@ -175,7 +175,7 @@ This is the team that developed this course. 36 00:01:31,740 --> 00:01:33,120 -我现在让每个发言者 +接下来每位讲师会先 I'll now let each of the speakers 37 @@ -185,7 +185,7 @@ introduce themselves briefly. 38 00:01:37,230 --> 00:01:38,880 -- 你好,我叫马修, +- Hi,我叫马修, - Hi, my name is Matthew, 39 @@ -200,22 +200,22 @@ I work on the open-source team 41 00:01:43,200 --> 00:01:45,180 -我负责特别维护 +我负责维护 and I'm responsible for maintaining particularly 42 00:01:45,180 --> 00:01:47,280 -那里的 TensorFlow 代码。 +团队内的 TensorFlow 代码。 the TensorFlow code there. 43 00:01:47,280 --> 00:01:50,130 -之前,我是 Parsley 的机器学习工程师, +在此之前,我是 Parsley 的机器学习工程师, Previously, I was a Machine Learning Engineer at Parsley, 44 00:01:50,130 --> 00:01:52,620 -最近被 Automatic 收购, +最近该公司被 Automatic 收购, who've recently been acquired by Automatic, 45 @@ -230,12 +230,12 @@ before that at Trinity College, Dublin in Ireland 47 00:01:57,000 --> 00:02:00,093 -致力于计算遗传学和视网膜疾病。 +致力于计算遗传学和视网膜疾病的研究。 working on computational genetics and retinal disease. 48 00:02:02,400 --> 00:02:03,870 -- 你好,我是莱桑德尔。 +- Hi,我是 Lysandre - Hi, I'm Lysandre. 49 @@ -245,32 +245,32 @@ I'm a Machine Learning Engineer at Hugging Face 50 00:02:05,640 --> 00:02:08,700 -我特别是开源团队的一员。 +我是开源团队的一员。 and I'm specifically part of the open-source team. 
51 00:02:08,700 --> 00:02:10,890 -我已经在 Hugging Face 工作了几年 +我已经在 Hugging Face 团队和我的团队成员一起 I've been at Hugging Face for a few years now 52 00:02:10,890 --> 00:02:12,300 -和我的团队成员一起, +工作了好几年, and alongside my team members, 53 00:02:12,300 --> 00:02:13,890 -我一直在研究大多数工具 +我一直致力于研究大多数您将在 I've been working on most of the tools 54 00:02:13,890 --> 00:02:15,790 -你将在本课程中看到。 +本课程中看到的工具。 that you'll get to see in this course. 55 00:02:18,270 --> 00:02:20,130 -- 你好,我是西尔万。 +- Hi,我是 Sylvain - Hi, I'm Sylvain. 56 @@ -280,7 +280,7 @@ I'm a Research Engineer at Hugging Face 57 00:02:22,140 --> 00:02:25,830 -也是 Transformers 库的主要维护者之一。 +也是 Transformers 代码库的主要维护者之一。 and one of the main maintainers of the Transformers Library. 58 @@ -300,17 +300,17 @@ as well as the online book. 61 00:02:32,220 --> 00:02:35,340 -在那之前,我是一名数学和计算机科学老师 +在那之前,我是一名在法国的 Before that, I was a math and computer science teacher 62 00:02:35,340 --> 00:02:36,173 -在法国。 +数学和计算机科学老师。 in France. 63 00:02:38,550 --> 00:02:41,340 -- 你好,我叫 Sasha,是 Hugging Face 的一名研究员, +- Hi,我叫 Sasha,是 Hugging Face 的一名研究员, - Hi, my name is Sasha and I'm a Researcher at Hugging Face, 64 @@ -320,67 +320,67 @@ working on the ethical, 65 00:02:42,420 --> 00:02:46,230 -机器学习模型的环境和社会影响。 +机器学习模型的环境和社会影响相关的研究。 environmental and social impacts of machine learning models. 66 00:02:46,230 --> 00:02:49,020 -之前,我是 Mila 的博士后研究员, +之前,我是 Mila 蒙特利尔大学的 Previously, I was a postdoctoral researcher at Mila, 67 00:02:49,020 --> 00:02:50,400 -蒙特利尔大学 +博士后研究员 University in Montreal 68 00:02:50,400 --> 00:02:53,040 -我还担任过应用人工智能研究员 +我还为联合国全球脉搏计划担任过 and I also worked as an Applied AI Researcher 69 00:02:53,040 --> 00:02:55,140 -为联合国全球脉搏。 +应用人工智能研究员。 for the United Nations Global Pulse. 
70 00:02:55,140 --> 00:02:57,300 -参与过 CodeCarbon 等项目 +参与过 CodeCarbon 和 I've been involved in projects such as CodeCarbon 71 00:02:57,300 --> 00:02:59,790 -和机器学习影响计算器 +机器学习影响计算器等项目 and the Machine Learning Impacts Calculator 72 00:02:59,790 --> 00:03:02,390 -衡量机器学习的碳足迹。 +致力于衡量机器学习的碳足迹的研究。 to measure the carbon footprint of machine learning. 73 00:03:05,160 --> 00:03:07,650 -- 大家好,我是 Merve,我是 Hugging Face 团队的开发技术推广工程师 +- Hi,我是 Merve,我是 Hugging Face 团队的 - Hi, I'm Merve and I'm a Developer Advocate 74 00:03:07,650 --> 00:03:09,390 -- 大家好,我是 Merve,我是 Hugging Face 团队的开发技术推广工程师 +开发技术推广工程师 at Hugging Face. 75 00:03:09,390 --> 00:03:12,480 -以前,我是一名机器学习工程师 +在此之前,我是一名机器学习工程师 Previously, I was working as a Machine Learning Engineer 76 00:03:12,480 --> 00:03:15,360 -构建 NLP 工具和聊天机器人。 +负责构建 NLP 工具和聊天机器人。 building NLP tools and chat bots. 77 00:03:15,360 --> 00:03:17,670 -目前,我正在努力改进中心 +目前,我正在努力改进模型中心 Currently, I'm working to improve the hub 78 @@ -395,17 +395,17 @@ and democratize machine learning. 80 00:03:23,670 --> 00:03:27,210 -我叫 Lucile,是 Hugging Face 团队的一名机器学习工程师 +我叫 Lucile,是 Hugging Face 团队的 My name is Lucile and I'm a Machine Learning Engineer 81 00:03:27,210 --> 00:03:28,353 -我叫 Lucile,是 Hugging Face 团队的一名机器学习工程师 +一名机器学习工程师 at Hugging Face. 82 00:03:29,580 --> 00:03:32,550 -用两句话告诉你我是谁, +用两句话告诉您我是谁, To tell you in two sentences who I am, 83 @@ -415,12 +415,12 @@ I work on the development and support of open-source tools 84 00:03:36,600 --> 00:03:39,595 -我也参与了几个研究项目 +我也参与了在自然语言处理领域的 and I also participate in several research project 85 00:03:39,595 --> 00:03:41,795 -在自然语言处理领域。 +几个研究项目。 in the field of Natural Language Processing. 86 @@ -430,12 +430,12 @@ in the field of Natural Language Processing. 87 00:03:45,540 --> 00:03:47,550 -我是刘易斯,我是一名机器学习工程师 +我是 Lewis,我是 Hugging Face 开源团队中的 I'm Lewis and I'm a Machine Learning Engineer 88 00:03:47,550 --> 00:03:50,130 -在 Hugging Face 的开源团队中。 +一名机器学习工程师。 in the open-source team at Hugging Face. 
89 @@ -445,12 +445,12 @@ I'm passionate about developing tools for the NLP community 90 00:03:53,490 --> 00:03:55,050 -你能在很多 Hugging Face 对外的活动里见到我 +您可以在很多 Hugging Face and you'll see me at many of Hugging Face's 91 00:03:55,050 --> 00:03:56,910 -你能在很多 Hugging Face 对外的活动里见到我 +对外的活动中见到我 outreach activities. 92 @@ -460,56 +460,56 @@ Before joining Hugging Face, 93 00:03:58,470 --> 00:03:59,790 -我花了几年时间开发 +我花了几年时间 I spent several years developing 94 00:03:59,790 --> 00:04:01,860 -初创公司的机器学习应用程序 +为初创公司和 NLP 领域的企业 machine learning applications for startups 95 00:04:01,860 --> 00:04:04,230 -和 NLP 领域的企业, +开发机器学习应用程序, and enterprises in the domains of NLP, 96 00:04:04,230 --> 00:04:07,260 -拓扑数据分析和时间序列。 +以及拓扑数据分析和时间序列。 topological data analysis and time series. 97 00:04:07,260 --> 00:04:10,110 -前世,我是一名理论物理学家, +在此之前,我是一名理论物理学家, In a former life, I was a theoretical physicist, 98 00:04:10,110 --> 00:04:11,760 -我在哪里研究粒子碰撞 +负责在大型强子对撞机等 where I researched particle collisions 99 00:04:11,760 --> 00:04:13,560 -在大型强子对撞机等。 +研究粒子碰撞。 at the Large Hadron Collider and so. 100 00:04:15,900 --> 00:04:18,450 -- 嘿,我是 Leandro,我是一名机器学习工程师 +- Hey,我是 Leandro,我是一名 Hugging Face 开源团队中 - Hey, I'm Leandro and I'm a Machine Learning Engineer 101 00:04:18,450 --> 00:04:21,030 -在 Hugging Face 的开源团队中。 +的一名机器学习工程师 in the open-source team at Hugging Face. 102 00:04:21,030 --> 00:04:23,460 -在加入 Hugging Face 之前,我是一名数据科学家 +在加入 Hugging Face 之前,我是一名在瑞士的数据科学家 Before joining Hugging Face, I worked as a Data Scientist 103 00:04:23,460 --> 00:04:26,733 -在瑞士,并在大学教授数据科学。 +并在大学教授数据科学。 in Switzerland and have taught Data Science at University. 
From af0c221e078e8af7002dd16998b9cb5a3b412fd3 Mon Sep 17 00:00:00 2001 From: lewtun Date: Wed, 28 Dec 2022 12:41:20 +1100 Subject: [PATCH 4/5] Fix French subtitles + refactor conversion script (#431) * Fix subtitles and scripts * Fix subtitle --- subtitles/README.md | 30 ++++++-- .../00_welcome-to-the-hugging-face-course.srt | 2 +- .../00_welcome-to-the-hugging-face-course.srt | 16 ++-- subtitles/fr/03_what-is-transfer-learning.srt | 2 +- subtitles/fr/68_data-collators-a-tour.srt | 2 + .../00_welcome-to-the-hugging-face-course.srt | 4 +- utils/convert_bilingual_monolingual.py | 74 +++++++++---------- utils/generate_subtitles.py | 14 ++-- utils/validate_translation.py | 7 +- 9 files changed, 80 insertions(+), 71 deletions(-) diff --git a/subtitles/README.md b/subtitles/README.md index 53d87db37..002948954 100644 --- a/subtitles/README.md +++ b/subtitles/README.md @@ -28,15 +28,29 @@ python utils/generate_subtitles.py --language zh-CN --youtube_language_code zh-H Once you have the `.srt` files you can manually fix any translation errors and then open a pull request with the new files. -# How to convert bilingual subtitle to monolingual subtitle +# Convert bilingual subtitles to monolingual subtitles -# Logic +In some SRT files, the English caption line is conventionally placed at the last line of each subtitle block to enable easier comparison when correcting the machine translation. -The english caption line is conventionally placed at the last line of each subtitle block in srt files. So removing the last line of each subtitle block would make the bilingual subtitle a monolingual subtitle. +For example, in the `zh-CN` subtitles, each block has the following format: -# Usage -> python3 convert_bilingual_monolingual.py -i \ -o \ +``` +1 +00:00:05,850 --> 00:00:07,713 +- 欢迎来到 Hugging Face 课程。 +- Welcome to the Hugging Face Course. +``` + +To upload the SRT file to YouTube, we need the subtitle in monolingual format, i.e. 
the above block should read: + +``` +1 +00:00:05,850 --> 00:00:07,713 +- 欢迎来到 Hugging Face 课程。 +``` -**Example** -* For instance, the input file name is "test.cn.en.srt", and you name your output file as "output_test.cn.srt" * -> python3 convert_bilingual_monolingual.py -i test.cn.en.srt -o output_test.cn.srt \ No newline at end of file +To handle this, we provide a script that converts the bilingual SRT files to monolingual ones. To perform the conversion, run: + +```bash +python utils/convert_bilingual_monolingual.py --input_language_folder subtitles/LANG_ID --output_language_folder tmp-subtitles +``` \ No newline at end of file diff --git a/subtitles/en/00_welcome-to-the-hugging-face-course.srt b/subtitles/en/00_welcome-to-the-hugging-face-course.srt index ae8eb7042..e7f55a2fe 100644 --- a/subtitles/en/00_welcome-to-the-hugging-face-course.srt +++ b/subtitles/en/00_welcome-to-the-hugging-face-course.srt @@ -1,6 +1,6 @@ 1 00:00:05,850 --> 00:00:07,713 -- Welcome to the Hugging Face Course. +Welcome to the Hugging Face Course. 2 00:00:08,550 --> 00:00:10,320 diff --git a/subtitles/fr/00_welcome-to-the-hugging-face-course.srt b/subtitles/fr/00_welcome-to-the-hugging-face-course.srt index 73960fe7a..d7fdaea1e 100644 --- a/subtitles/fr/00_welcome-to-the-hugging-face-course.srt +++ b/subtitles/fr/00_welcome-to-the-hugging-face-course.srt @@ -7,7 +7,7 @@ Bienvenue au cours d'Hugging Face. Ce cours a été conçu pour vous enseigner tout ce qu'il faut savoir à propos de l'écosystème d'Hugging Face. 3 -0:00:12.559,0:00:18.080 +0:00:12.559 --> 0:00:18.080 Comment utiliser le Hub de jeux de données et de modèles ainsi que toutes nos bibliothèques open source. 4 @@ -27,7 +27,7 @@ La première vous apprendra les bases sur comment utiliser un transformer finetu La deuxième est une plongée plus profonde dans nos bibliothèques et vous apprendra à aborder n'importe quelle tâche de NLP. 
8 -0:00:42.079,0:00:48.320 +0:00:42.079 --> 0:00:48.320 Nous travaillons activement sur la dernière partie et nous espérons qu'elle sera prête pour le printemps 2022. 9 @@ -39,7 +39,7 @@ Le premier chapitre ne requiert aucune connaissance et constitue une bonne intro Les chapitres suivants nécessitent une bonne connaissance de Python et quelques notions de base de l'apprentissage automatique et de l'apprentissage profond. 11 -0:01:04.159,0:01:09.840 +0:01:04.159 --> 0:01:09.840 Si vous ne savez pas ce qu'un entraînement et une validation sont ou encore ce qu'une descente de gradient signifie, 12 @@ -59,7 +59,7 @@ Chaque partie abordée dans ce cours a une version dans ces deux frameworks. Vou Voici l'équipe qui a développé ce cours. Je vais maintenant laisser chacun des intervenants se présenter brièvement. 16 -0:01:37.119,0:01:41.000 +0:01:37.119 --> 0:01:41.000 Bonjour, je m'appelle Matthew et je suis ingénieur en apprentissage machine chez Hugging Face. 17 @@ -67,7 +67,7 @@ Bonjour, je m'appelle Matthew et je suis ingénieur en apprentissage machine che Je travaille dans l'équipe open source et je suis responsable de la maintenance en particulier des codes en TensorFlow. 18 -0:01:47.119,0:01:52.960 +0:01:47.119 --> 0:01:52.960 Auparavant, j'étais ingénieur en apprentissage automatique chez Parse.ly qui a récemment été acquis par Automattic. 19 @@ -79,7 +79,7 @@ Avant cela j'étais chercheur en post-doc au Trinity College Dublin en Irlande, Bonjour, je suis Lysandre, je suis ingénieur en apprentissage automatique chez Hugging Face et je fais spécifiquement partie de l'équipe open source. 21 -0:02:08.479,0:02:18.080 +0:02:08.479 --> 0:02:18.080 Je suis à Hugging Face depuis quelques années maintenant et aux côtés des membres de mon équipe j'ai travaillé sur la plupart des outils que vous verrez dans ce cours. 
22 @@ -87,7 +87,7 @@ Je suis à Hugging Face depuis quelques années maintenant et aux côtés des me Bonjour, je m'appelle Sylvain, je suis ingénieur de recherche chez Hugging Face et l'un des principaux mainteneurs de la bibliothèque Transformers. 23 -0:02:25.599,0:02:32.000 +0:02:25.599 --> 0:02:32.000 Auparavant, j'ai travaillé chez Fast.ai où j'ai aidé à développer la bibliothèque Fastai ainsi que le livre en ligne. 24 @@ -151,5 +151,5 @@ Dans une vie antérieure, j'étais physicien théoricien et je faisais des reche Je m'appelle Leandro et je suis ingénieur en apprentissage automatique dans le domaine de l'équipe open source d'Hugging Face. 39 -0:04:20.799,0:04:28.680 +0:04:20.799 --> 0:04:28.680 Avant de rejoindre Hugging Face, j'ai travaillé comme data scientist en Suisse et j'ai enseigné la science des données à l'université. \ No newline at end of file diff --git a/subtitles/fr/03_what-is-transfer-learning.srt b/subtitles/fr/03_what-is-transfer-learning.srt index 05849f178..50ee488b4 100644 --- a/subtitles/fr/03_what-is-transfer-learning.srt +++ b/subtitles/fr/03_what-is-transfer-learning.srt @@ -124,7 +124,7 @@ OpenAI a également étudié le biais de prédiction de son modèle GPT-3 0:03:35.840,0:03:39.519 qui a été pré-entrainé en utilisant l'objectif de deviner le mot suivant. -0:03:39.5190:03:50.000 +0:03:39.519,0:03:50.000 En changeant le genre du prompt de « Il était très » à « Elle était très », les prédictions majoritairement neutres sont devenues presque uniquement axées sur le physique. 0:03:50.000,0:03:59.640 diff --git a/subtitles/fr/68_data-collators-a-tour.srt b/subtitles/fr/68_data-collators-a-tour.srt index 9075987e8..4cdfe8aa7 100644 --- a/subtitles/fr/68_data-collators-a-tour.srt +++ b/subtitles/fr/68_data-collators-a-tour.srt @@ -226,8 +226,10 @@ Mais l'assembleur de données pour la modélisation du langage le fera pour vous 00:05:57.680 --> 00:05:59.280 Et c'est tout. 
+60 00:05:59.280 --> 00:06:02.560 Ceci couvre donc les assembleurs de données les plus couramment utilisés et les tâches pour lesquelles ils sont utilisés. +61 00:06:02.560 --> 00:06:08.720 Nous espérons que vous savez maintenant quand utiliser les assembleurs de données et lequel choisir pour votre tâche spécifique. \ No newline at end of file diff --git a/subtitles/zh-CN/00_welcome-to-the-hugging-face-course.srt b/subtitles/zh-CN/00_welcome-to-the-hugging-face-course.srt index 462977fe2..ee4ca158d 100644 --- a/subtitles/zh-CN/00_welcome-to-the-hugging-face-course.srt +++ b/subtitles/zh-CN/00_welcome-to-the-hugging-face-course.srt @@ -1,7 +1,7 @@ 1 00:00:05,850 --> 00:00:07,713 -- 欢迎来到 Hugging Face 课程。 -- Welcome to the Hugging Face Course. +欢迎来到 Hugging Face 课程。 +Welcome to the Hugging Face Course. 2 00:00:08,550 --> 00:00:10,320 diff --git a/utils/convert_bilingual_monolingual.py b/utils/convert_bilingual_monolingual.py index 4a8004cdb..c993a6516 100644 --- a/utils/convert_bilingual_monolingual.py +++ b/utils/convert_bilingual_monolingual.py @@ -1,61 +1,53 @@ -#!/usr/bin/python3 -import getopt import re -import sys +import argparse +from pathlib import Path -PATTERN_TIMESTAMP = re.compile('^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]') -PATTERN_NUM = re.compile('\\d+') +PATTERN_TIMESTAMP = re.compile( + "^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]" +) +PATTERN_NUM = re.compile("\\d+") -def main(argv): - inputfile = '' - outputfile = '' - try: - opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="]) - except getopt.GetoptError: - print('srt_worker.py -i -o ') - sys.exit(2) - for opt, arg in opts: - if opt == '-h': - print( 'Usage: convert_bilingual_monolingual.py -i -o ') - sys.exit(-2) - elif opt in ("-i", "--ifile"): - inputfile = arg - elif opt in ("-o", "--ofile"): - outputfile = arg - - if not inputfile: - print('no input file is 
specified.\nUsage: convert_bilingual_monolingual.py -i -o ') - elif not outputfile: - print('no output file is specified.\nUsage: convert_bilingual_monolingual.py -i -o ') - else: - process(inputfile, outputfile) - - -def process(input_file, output): +def convert(input_file, output_file): """ - Convert bilingual caption file to monolingual caption, supported caption file type is srt. + Convert bilingual caption file to monolingual caption. Supported caption file type is SRT. """ line_count = 0 with open(input_file) as file: - with open(output, 'a') as output: + with open(output_file, "w") as output_file: for line in file: if line_count == 0: line_count += 1 - output.write(line) + output_file.write(line) elif PATTERN_TIMESTAMP.match(line): line_count += 1 - output.write(line) - elif line == '\n': + output_file.write(line) + elif line == "\n": line_count = 0 - output.write(line) + output_file.write(line) else: if line_count == 2: - output.write(line) + output_file.write(line) line_count += 1 - output.close() - print('conversion completed!') + output_file.close() if __name__ == "__main__": - main(sys.argv[1:]) + parser = argparse.ArgumentParser() + parser.add_argument( + "--input_language_folder", type=str, help="Folder with input bilingual SRT files to be converted" + ) + parser.add_argument( + "--output_language_folder", + type=str, + default="tmp-subtitles", + help="Folder to store converted monolingual SRT files", + ) + args = parser.parse_args() + + output_path = Path(args.output_language_folder) + output_path.mkdir(parents=True, exist_ok=True) + input_files = list(Path(args.input_language_folder).glob("*.srt")) + for input_file in input_files: + convert(input_file, output_path / input_file.name) + print(f"Successfully converted {len(input_files)} files to {args.output_language_folder} folder") diff --git a/utils/generate_subtitles.py b/utils/generate_subtitles.py index f5d1d4a05..31dccbe2c 100644 --- a/utils/generate_subtitles.py +++ b/utils/generate_subtitles.py
@@ -7,14 +7,13 @@ import argparse import sys -def generate_subtitles(language: str, youtube_language_code: str=None): + +def generate_subtitles(language: str, youtube_language_code: str = None): metadata = [] formatter = SRTFormatter() path = Path(f"subtitles/{language}") path.mkdir(parents=True, exist_ok=True) - playlist_videos = Playlist.getVideos( - "https://youtube.com/playlist?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o" - ) + playlist_videos = Playlist.getVideos("https://youtube.com/playlist?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o") for idx, video in enumerate(playlist_videos["videos"]): video_id = video["id"] @@ -34,7 +33,9 @@ def generate_subtitles(language: str, youtube_language_code: str=None): # Map mismatched language codes if language not in languages: if youtube_language_code is None: - raise ValueError(f"Language code {language} not found in YouTube's list of supported language: {languages}. Please provide a value for `youtube_language_code` and try again.") + raise ValueError( + f"Language code {language} not found in YouTube's list of supported language: {languages}. Please provide a value for `youtube_language_code` and try again." + ) language_code = youtube_language_code else: language_code = language @@ -55,10 +56,11 @@ def generate_subtitles(language: str, youtube_language_code: str=None): df = pd.DataFrame(metadata) df.to_csv(f"subtitles/{language}/metadata.csv", index=False) + if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument("--language", type=str, help="Language to generate subtitles for") parser.add_argument("--youtube_language_code", type=str, help="YouTube language code") args = parser.parse_args() generate_subtitles(args.language, args.youtube_language_code) - print(f"All done! Subtitles stored at subtitles/{args.language}") \ No newline at end of file + print(f"All done! 
Subtitles stored at subtitles/{args.language}") diff --git a/utils/validate_translation.py b/utils/validate_translation.py index ef28a00fa..b3892c2cc 100644 --- a/utils/validate_translation.py +++ b/utils/validate_translation.py @@ -6,10 +6,9 @@ PATH_TO_COURSE = Path("chapters/") + def load_sections(language: str): - toc = yaml.safe_load( - open(os.path.join(PATH_TO_COURSE / language, "_toctree.yml"), "r") - ) + toc = yaml.safe_load(open(os.path.join(PATH_TO_COURSE / language, "_toctree.yml"), "r")) sections = [] for chapter in toc: for section in chapter["sections"]: @@ -35,4 +34,4 @@ def load_sections(language: str): for section in missing_sections: print(section) else: - print("✅ No missing sections - translation complete!") \ No newline at end of file + print("✅ No missing sections - translation complete!") From 1d4e07f6fc235fba2494ce254bb3051a1905dddb Mon Sep 17 00:00:00 2001 From: lewtun Date: Wed, 28 Dec 2022 16:48:15 +1100 Subject: [PATCH 5/5] Add tokenizer to MLM Trainer (#432) --- chapters/en/chapter7/3.mdx | 1 + chapters/fr/chapter7/3.mdx | 1 + chapters/ja/chapter7/3.mdx | 1 + chapters/vi/chapter7/3.mdx | 1 + chapters/zh-CN/chapter7/3.mdx | 1 + utils/generate_notebooks.py | 2 +- 6 files changed, 6 insertions(+), 1 deletion(-) diff --git a/chapters/en/chapter7/3.mdx b/chapters/en/chapter7/3.mdx index a1387158d..a31bb432c 100644 --- a/chapters/en/chapter7/3.mdx +++ b/chapters/en/chapter7/3.mdx @@ -723,6 +723,7 @@ trainer = Trainer( train_dataset=downsampled_dataset["train"], eval_dataset=downsampled_dataset["test"], data_collator=data_collator, + tokenizer=tokenizer, ) ``` diff --git a/chapters/fr/chapter7/3.mdx b/chapters/fr/chapter7/3.mdx index 675965901..11733c1b3 100644 --- a/chapters/fr/chapter7/3.mdx +++ b/chapters/fr/chapter7/3.mdx @@ -728,6 +728,7 @@ trainer = Trainer( train_dataset=downsampled_dataset["train"], eval_dataset=downsampled_dataset["test"], data_collator=data_collator, + tokenizer=tokenizer, ) ``` diff --git 
a/chapters/ja/chapter7/3.mdx b/chapters/ja/chapter7/3.mdx index b550203cc..7090ced4d 100644 --- a/chapters/ja/chapter7/3.mdx +++ b/chapters/ja/chapter7/3.mdx @@ -738,6 +738,7 @@ trainer = Trainer( train_dataset=downsampled_dataset["train"], eval_dataset=downsampled_dataset["test"], data_collator=data_collator, + tokenizer=tokenizer, ) ``` diff --git a/chapters/vi/chapter7/3.mdx b/chapters/vi/chapter7/3.mdx index 96d819c08..0cc470d6c 100644 --- a/chapters/vi/chapter7/3.mdx +++ b/chapters/vi/chapter7/3.mdx @@ -723,6 +723,7 @@ trainer = Trainer( train_dataset=downsampled_dataset["train"], eval_dataset=downsampled_dataset["test"], data_collator=data_collator, + tokenizer=tokenizer, ) ``` diff --git a/chapters/zh-CN/chapter7/3.mdx b/chapters/zh-CN/chapter7/3.mdx index abf328d1f..b5c410d23 100644 --- a/chapters/zh-CN/chapter7/3.mdx +++ b/chapters/zh-CN/chapter7/3.mdx @@ -724,6 +724,7 @@ trainer = Trainer( train_dataset=downsampled_dataset["train"], eval_dataset=downsampled_dataset["test"], data_collator=data_collator, + tokenizer=tokenizer, ) ``` diff --git a/utils/generate_notebooks.py b/utils/generate_notebooks.py index d7f235243..f4e77cd62 100644 --- a/utils/generate_notebooks.py +++ b/utils/generate_notebooks.py @@ -201,7 +201,7 @@ def build_notebook(fname, title, output_dir="."): installs = ["!pip install datasets evaluate transformers[sentencepiece]"] if section_name in sections_with_accelerate: installs.append("!pip install accelerate") - installs.append("# To run the training on TPU, you will need to uncomment the followin line:") + installs.append("# To run the training on TPU, you will need to uncomment the following line:") installs.append( "# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl" )