Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge test #1

Merged
merged 5 commits into from
Dec 29, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 10 additions & 9 deletions chapters/en/chapter1/1.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -37,23 +37,23 @@ After you've completed this course, we recommend checking out DeepLearning.AI's

About the authors:

**Abubakar Abid** completed his PhD at Stanford in applied machine learning. During his PhD, he founded [Gradio](https://github.com/gradio-app/gradio), an open-source Python library that has been used to build over 600,000 machine learning demos. Gradio was acquired by Hugging Face, which is where Abubakar now serves as a machine learning team lead.
[**Abubakar Abid**](https://huggingface.co/abidlabs) completed his PhD at Stanford in applied machine learning. During his PhD, he founded [Gradio](https://github.com/gradio-app/gradio), an open-source Python library that has been used to build over 600,000 machine learning demos. Gradio was acquired by Hugging Face, which is where Abubakar now serves as a machine learning team lead.

**Matthew Carrigan** is a Machine Learning Engineer at Hugging Face. He lives in Dublin, Ireland and previously worked as an ML engineer at Parse.ly and before that as a post-doctoral researcher at Trinity College Dublin. He does not believe we're going to get to AGI by scaling existing architectures, but has high hopes for robot immortality regardless.
[**Matthew Carrigan**](https://huggingface.co/Rocketknight1) is a Machine Learning Engineer at Hugging Face. He lives in Dublin, Ireland and previously worked as an ML engineer at Parse.ly and before that as a post-doctoral researcher at Trinity College Dublin. He does not believe we're going to get to AGI by scaling existing architectures, but has high hopes for robot immortality regardless.

**Lysandre Debut** is a Machine Learning Engineer at Hugging Face and has been working on the 🤗 Transformers library since the very early development stages. His aim is to make NLP accessible for everyone by developing tools with a very simple API.
[**Lysandre Debut**](https://huggingface.co/lysandre) is a Machine Learning Engineer at Hugging Face and has been working on the 🤗 Transformers library since the very early development stages. His aim is to make NLP accessible for everyone by developing tools with a very simple API.

**Sylvain Gugger** is a Research Engineer at Hugging Face and one of the core maintainers of the 🤗 Transformers library. Previously he was a Research Scientist at fast.ai, and he co-wrote _[Deep Learning for Coders with fastai and PyTorch](https://learning.oreilly.com/library/view/deep-learning-for/9781492045519/)_ with Jeremy Howard. The main focus of his research is on making deep learning more accessible, by designing and improving techniques that allow models to train fast on limited resources.
[**Sylvain Gugger**](https://huggingface.co/sgugger) is a Research Engineer at Hugging Face and one of the core maintainers of the 🤗 Transformers library. Previously he was a Research Scientist at fast.ai, and he co-wrote _[Deep Learning for Coders with fastai and PyTorch](https://learning.oreilly.com/library/view/deep-learning-for/9781492045519/)_ with Jeremy Howard. The main focus of his research is on making deep learning more accessible, by designing and improving techniques that allow models to train fast on limited resources.

**Dawood Khan** is a Machine Learning Engineer at Hugging Face. He's from NYC and graduated from New York University studying Computer Science. After working as an iOS Engineer for a few years, Dawood quit to start Gradio with his fellow co-founders. Gradio was eventually acquired by Hugging Face.
[**Dawood Khan**](https://huggingface.co/dawoodkhan82) is a Machine Learning Engineer at Hugging Face. He's from NYC and graduated from New York University studying Computer Science. After working as an iOS Engineer for a few years, Dawood quit to start Gradio with his fellow co-founders. Gradio was eventually acquired by Hugging Face.

**Merve Noyan** is a developer advocate at Hugging Face, working on developing tools and building content around them to democratize machine learning for everyone.
[**Merve Noyan**](https://huggingface.co/merve) is a developer advocate at Hugging Face, working on developing tools and building content around them to democratize machine learning for everyone.

**Lucile Saulnier** is a machine learning engineer at Hugging Face, developing and supporting the use of open source tools. She is also actively involved in many research projects in the field of Natural Language Processing such as collaborative training and BigScience.
[**Lucile Saulnier**](https://huggingface.co/SaulLu) is a machine learning engineer at Hugging Face, developing and supporting the use of open source tools. She is also actively involved in many research projects in the field of Natural Language Processing such as collaborative training and BigScience.

**Lewis Tunstall** is a machine learning engineer at Hugging Face, focused on developing open-source tools and making them accessible to the wider community. He is also a co-author of the O’Reilly book [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/).
[**Lewis Tunstall**](https://huggingface.co/lewtun) is a machine learning engineer at Hugging Face, focused on developing open-source tools and making them accessible to the wider community. He is also a co-author of the O’Reilly book [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/).

**Leandro von Werra** is a machine learning engineer in the open-source team at Hugging Face and also a co-author of the O’Reilly book [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/). He has several years of industry experience bringing NLP projects to production by working across the whole machine learning stack..
[**Leandro von Werra**](https://huggingface.co/lvwerra) is a machine learning engineer in the open-source team at Hugging Face and also a co-author of the O’Reilly book [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/). He has several years of industry experience bringing NLP projects to production by working across the whole machine learning stack..

## FAQ[[faq]]

Expand Down Expand Up @@ -100,6 +100,7 @@ Of course! The course is released under the permissive [Apache 2 license](https:
}
```

## Let's Go
Are you ready to roll? In this chapter, you will learn:

* How to use the `pipeline()` function to solve NLP tasks such as text generation and classification
Expand Down
2 changes: 2 additions & 0 deletions chapters/en/chapter1/4.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,8 @@ Imagine if each time a research team, a student organization, or a company wante

This is why sharing language models is paramount: sharing the trained weights and building on top of already trained weights reduces the overall compute cost and carbon footprint of the community.

By the way, you can evaluate the carbon footprint of your models' training through several tools. For example [ML CO2 Impact](https://mlco2.github.io/impact/) or [Code Carbon]( https://codecarbon.io/) which is integrated in 🤗 Transformers. To learn more about this, you can read this [blog post](https://huggingface.co/blog/carbon-emissions-on-the-hub) which will show you how to generate an `emissions.csv` file with an estimate of the footprint of your training, as well as the [documentation](https://huggingface.co/docs/hub/model-cards-co2) of 🤗 Transformers addressing this topic.


## Transfer Learning[[transfer-learning]]

Expand Down
1 change: 1 addition & 0 deletions chapters/en/chapter7/3.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -723,6 +723,7 @@ trainer = Trainer(
train_dataset=downsampled_dataset["train"],
eval_dataset=downsampled_dataset["test"],
data_collator=data_collator,
tokenizer=tokenizer,
)
```

Expand Down
2 changes: 1 addition & 1 deletion chapters/en/chapter7/6.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ This is actually showcasing the model that was trained and uploaded to the Hub u

## Gathering the data[[gathering-the-data]]

Python code is abundantly available from code repositories such as GitHub, which we can use to create a dataset by scraping for every Python repository. This was the approach taken in the [Transformers textbook](https://learning.oreilly.com/library/view/natural-language-processing/9781098103231/) to pretrain a large GPT-2 model. Using a GitHub dump of about 180 GB containing roughly 20 million Python files called `codeparrot`, the authors built a dataset that they then shared on the [Hugging Face Hub](https://huggingface.co/datasets/transformersbook/codeparrot).
Python code is abundantly available from code repositories such as GitHub, which we can use to create a dataset by scraping for every Python repository. This was the approach taken in the [Transformers textbook](https://learning.oreilly.com/library/view/natural-language-processing/9781098136789/) to pretrain a large GPT-2 model. Using a GitHub dump of about 180 GB containing roughly 20 million Python files called `codeparrot`, the authors built a dataset that they then shared on the [Hugging Face Hub](https://huggingface.co/datasets/transformersbook/codeparrot).

However, training on the full corpus is time- and compute-consuming, and we only need the subset of the dataset concerned with the Python data science stack. So, let's start by filtering the `codeparrot` dataset for all files that include any of the libraries in this stack. Because of the dataset's size, we want to avoid downloading it; instead, we'll use the streaming feature to filter it on the fly. To help us filter the code samples using the libraries we mentioned earlier, we'll use the following function:

Expand Down
1 change: 1 addition & 0 deletions chapters/fr/chapter7/3.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -728,6 +728,7 @@ trainer = Trainer(
train_dataset=downsampled_dataset["train"],
eval_dataset=downsampled_dataset["test"],
data_collator=data_collator,
tokenizer=tokenizer,
)
```

Expand Down
2 changes: 1 addition & 1 deletion chapters/fr/chapter7/6.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@ Il s'agit d'une présentation du modèle qui a été entraîné à l'aide du cod

## Collecte des données

On peut trouver du code Python en abondance dans les dépôts de code tels que GitHub, que nous pouvons utiliser pour créer un jeu de données en récupérant chaque dépôt Python. C'est l'approche adoptée dans le [livre *Natural Language Processing with Transformers*](https://learning.oreilly.com/library/view/natural-language-processing/9781098103231/) pour pré-entraîner un grand GPT-2. En utilisant un dépôt GitHub d'environ 180 Go contenant approximativement 20 millions de fichiers Python, les auteurs du livre ont construit un jeu de données appelé `codeparrot` qu'ils ont ensuite partagé sur le [*Hub*](https://huggingface.co/datasets/transformersbook/codeparrot).
On peut trouver du code Python en abondance dans les dépôts de code tels que GitHub, que nous pouvons utiliser pour créer un jeu de données en récupérant chaque dépôt Python. C'est l'approche adoptée dans le [livre *Natural Language Processing with Transformers*](https://learning.oreilly.com/library/view/natural-language-processing/9781098136789/) pour pré-entraîner un grand GPT-2. En utilisant un dépôt GitHub d'environ 180 Go contenant approximativement 20 millions de fichiers Python, les auteurs du livre ont construit un jeu de données appelé `codeparrot` qu'ils ont ensuite partagé sur le [*Hub*](https://huggingface.co/datasets/transformersbook/codeparrot).

Cependant, entraîner sur l'ensemble du corpus prend beaucoup de temps et demande beaucoup de ressources de calculs. Dans notre cas, nous n'avons besoin que du sous-ensemble du jeu de données qui est relatif aux codes portant sur la science des données. Commençons donc par filtrer le jeu de données `codeparrot` en ne gardant que les fichiers incluant l'une des bibliothèques de science des données énumérées précédemment. En raison de la taille du jeu de données, nous voulons éviter de le télécharger. Nous utiliserons donc la fonctionnalité de *streaming* de 🤗 *Datasets* afin de le filtrer à la volée. Pour nous aider à filtrer les échantillons de code utilisant les bibliothèques que nous avons mentionnées précédemment, nous utilisons la fonction suivante :

Expand Down
1 change: 1 addition & 0 deletions chapters/ja/chapter7/3.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -738,6 +738,7 @@ trainer = Trainer(
train_dataset=downsampled_dataset["train"],
eval_dataset=downsampled_dataset["test"],
data_collator=data_collator,
tokenizer=tokenizer,
)
```

Expand Down
Loading