
Merge branch 'huggingface:main' into main
gstdl authored Oct 28, 2022
2 parents 31ff225 + f5d8039 commit f6bb091
Showing 13 changed files with 776 additions and 54 deletions.
4 changes: 2 additions & 2 deletions chapters/en/chapter7/2.mdx
@@ -371,7 +371,7 @@ As we can see, the second set of labels has been padded to the length of the fir

{:else}

Our data collator is ready to go! Now let's use it to make a `tf.data.Dataset` with the `to_tf_dataset()` method.
Our data collator is ready to go! Now let's use it to make a `tf.data.Dataset` with the `to_tf_dataset()` method. You can also use `model.prepare_tf_dataset()` to do this with a bit less boilerplate code - you'll see this in some of the other sections of this chapter.

```py
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
@@ -616,7 +616,7 @@ import numpy as np
all_predictions = []
all_labels = []
for batch in tf_eval_dataset:
logits = model.predict(batch)["logits"]
logits = model.predict_on_batch(batch)["logits"]
labels = batch["labels"]
predictions = np.argmax(logits, axis=-1)
for prediction, label in zip(predictions, labels):
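
To see how the two changes to this file fit together, here is a rough sketch of the updated pipeline. It is not part of the diff; it assumes the `model`, `tokenized_datasets`, and `data_collator` objects defined earlier in chapter7/2, and the batch sizes are illustrative.

```python
# Sketch only (not from the diff): assumes `model`, `tokenized_datasets`, and
# `data_collator` are defined as earlier in chapter7/2.
import numpy as np

# prepare_tf_dataset() reads the model's input signature, so there is no need
# to list the columns by hand as with to_tf_dataset().
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=16,
)
tf_eval_dataset = model.prepare_tf_dataset(
    tokenized_datasets["validation"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=16,
)

all_predictions = []
all_labels = []
for batch in tf_eval_dataset:
    # predict_on_batch() runs a single batch directly, avoiding the overhead of
    # predict() setting up a fresh input pipeline on every loop iteration.
    logits = model.predict_on_batch(batch)["logits"]
    labels = batch["labels"]
    predictions = np.argmax(logits, axis=-1)
    for prediction, label in zip(predictions, labels):
        pass  # filter out the -100 labels and collect the strings as in the section
```
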
12 changes: 6 additions & 6 deletions chapters/en/chapter7/3.mdx
@@ -96,7 +96,6 @@ model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)
We can see how many parameters this model has by calling the `summary()` method:

```python
model(model.dummy_inputs) # Build the model
model.summary()
```

@@ -636,18 +635,18 @@ in your favorite terminal and log in there.

{#if fw === 'tf'}

Once we're logged in, we can create our `tf.data` datasets. We'll just use the standard data collator here, but you can also try the whole word masking collator and compare the results as an exercise:
Once we're logged in, we can create our `tf.data` datasets. To do so, we'll use the `prepare_tf_dataset()` method, which uses our model to automatically infer which columns should go into the dataset. If you want to control exactly which columns to use, you can use the `Dataset.to_tf_dataset()` method instead. To keep things simple, we'll just use the standard data collator here, but you can also try the whole word masking collator and compare the results as an exercise:

```python
tf_train_dataset = downsampled_dataset["train"].to_tf_dataset(
columns=["input_ids", "attention_mask", "labels"],
tf_train_dataset = model.prepare_tf_dataset(
downsampled_dataset["train"],
collate_fn=data_collator,
shuffle=True,
batch_size=32,
)

tf_eval_dataset = downsampled_dataset["test"].to_tf_dataset(
columns=["input_ids", "attention_mask", "labels"],
tf_eval_dataset = model.prepare_tf_dataset(
downsampled_dataset["test"],
collate_fn=data_collator,
shuffle=False,
batch_size=32,
@@ -675,6 +674,7 @@ model.compile(optimizer=optimizer)
# Train in mixed-precision float16
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model_name = model_checkpoint.split("/")[-1]
callback = PushToHubCallback(
output_dir=f"{model_name}-finetuned-imdb", tokenizer=tokenizer
)
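
The whole word masking collator suggested as an exercise above is not shown in this hunk. One way to experiment with it is sketched below; this is not part of the diff, it assumes the `tokenizer`, `model`, and `downsampled_dataset` objects from chapter7/3, and it uses the generic `DataCollatorForWholeWordMask` class from 🤗 Transformers rather than a hand-written collator.

```python
# Sketch only (not from the diff): swap the standard collator for a
# whole-word-masking one and rebuild the training dataset.
from transformers import DataCollatorForWholeWordMask

wwm_data_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf"
)

tf_train_dataset = model.prepare_tf_dataset(
    downsampled_dataset["train"],
    collate_fn=wwm_data_collator,  # whole words are masked together
    shuffle=True,
    batch_size=32,
)
```
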
48 changes: 31 additions & 17 deletions chapters/en/chapter7/4.mdx
@@ -378,14 +378,14 @@ We will pass this `data_collator` along to the `Seq2SeqTrainer`. Next, let's hav
We can now use this `data_collator` to convert each of our datasets to a `tf.data.Dataset`, ready for training:

```python
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
columns=["input_ids", "attention_mask", "labels"],
tf_train_dataset = model.prepare_tf_dataset(
tokenized_datasets["train"],
collate_fn=data_collator,
shuffle=True,
batch_size=32,
)
tf_eval_dataset = tokenized_datasets["validation"].to_tf_dataset(
columns=["input_ids", "attention_mask", "labels"],
tf_eval_dataset = model.prepare_tf_dataset(
tokenized_datasets["validation"],
collate_fn=data_collator,
shuffle=False,
batch_size=16,
@@ -495,28 +495,42 @@ The score can go from 0 to 100, and higher is better.

{#if fw === 'tf'}

To get from the model outputs to texts the metric can use, we will use the `tokenizer.batch_decode()` method. We just have to clean up all the `-100`s in the labels; the tokenizer will automatically do the same for the padding token. Let's define a function that takes our model and a dataset and computes metrics on it. Because generation of long sequences can be slow, we subsample the validation set to make sure this doesn't take forever:
To get from the model outputs to texts the metric can use, we will use the `tokenizer.batch_decode()` method. We just have to clean up all the `-100`s in the labels; the tokenizer will automatically do the same for the padding token. Let's define a function that takes our model and a dataset and computes metrics on it. We're also going to use a trick that dramatically increases performance: compiling our generation code with [XLA](https://www.tensorflow.org/xla), TensorFlow's accelerated linear algebra compiler. XLA applies various optimizations to the model's computation graph, resulting in significant improvements to speed and memory usage. As described in the Hugging Face [blog](https://huggingface.co/blog/tf-xla-generate), XLA works best when our input shapes don't vary too much. To handle this, we'll pad our inputs to multiples of 128 and make a new dataset with the padding collator, then apply the `@tf.function(jit_compile=True)` decorator to our generation function, which marks the whole function for compilation with XLA.

```py
import numpy as np
import tensorflow as tf
from tqdm import tqdm

generation_data_collator = DataCollatorForSeq2Seq(
tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=128
)

tf_generate_dataset = model.prepare_tf_dataset(
tokenized_datasets["validation"],
collate_fn=generation_data_collator,
shuffle=False,
batch_size=8,
)


@tf.function(jit_compile=True)
def generate_with_xla(batch):
return model.generate(
input_ids=batch["input_ids"],
attention_mask=batch["attention_mask"],
max_new_tokens=128,
)


def compute_metrics():
all_preds = []
all_labels = []
sampled_dataset = tokenized_datasets["validation"].shuffle().select(range(200))
tf_generate_dataset = sampled_dataset.to_tf_dataset(
columns=["input_ids", "attention_mask", "labels"],
collate_fn=data_collator,
shuffle=False,
batch_size=4,
)
for batch in tf_generate_dataset:
predictions = model.generate(
input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
)

for batch, labels in tqdm(tf_generate_dataset):
predictions = generate_with_xla(batch)
decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
labels = batch["labels"].numpy()
labels = labels.numpy()
labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
decoded_preds = [pred.strip() for pred in decoded_preds]
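
The hunk is cut off before the metric call, but the decoded strings ultimately feed the SacreBLEU metric used earlier in the chapter. A self-contained sketch of that final step with 🤗 Evaluate is shown below; it is not part of the diff and the example strings are made up.

```python
# Sketch only (not from the diff): feeding decoded predictions and references
# to SacreBLEU via the 🤗 Evaluate library.
import evaluate

metric = evaluate.load("sacrebleu")

decoded_preds = ["the cat sat on the mat"]
# SacreBLEU expects a list of reference translations for every prediction.
decoded_labels = [["the cat is sitting on the mat"]]

result = metric.compute(predictions=decoded_preds, references=decoded_labels)
print(round(result["score"], 2))  # a BLEU-style score between 0 and 100
```
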
43 changes: 33 additions & 10 deletions chapters/en/chapter7/5.mdx
@@ -289,9 +289,10 @@ def preprocess_function(examples):
max_length=max_input_length,
truncation=True,
)
labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)
labels = tokenizer(
examples["review_title"], max_length=max_target_length, truncation=True
)
model_inputs["labels"] = labels["input_ids"]
model_inputs["labels_mask"] = labels["attention_mask"]
return model_inputs
```

@@ -673,14 +674,14 @@ To wrap up this section, let's take a look at how we can also fine-tune mT5 usin
We're almost ready to train! We just need to convert our datasets to `tf.data.Dataset`s using the data collator we defined above, and then `compile()` and `fit()` the model. First, the datasets:

```python
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
columns=["input_ids", "attention_mask", "labels"],
tf_train_dataset = model.prepare_tf_dataset(
tokenized_datasets["train"],
collate_fn=data_collator,
shuffle=True,
batch_size=8,
)
tf_eval_dataset = tokenized_datasets["validation"].to_tf_dataset(
columns=["input_ids", "attention_mask", "labels"],
tf_eval_dataset = model.prepare_tf_dataset(
tokenized_datasets["validation"],
collate_fn=data_collator,
shuffle=False,
batch_size=8,
@@ -727,18 +728,40 @@ model.fit(
)
```

We got some loss values during training, but really we'd like to see the ROUGE metrics we computed earlier. To get those metrics, we'll need to generate outputs from the model and convert them to strings. Let's build some lists of labels and predictions for the ROUGE metric to compare (note that if you get import errors for this section, you may need to `!pip install tqdm`):
We got some loss values during training, but really we'd like to see the ROUGE metrics we computed earlier. To get those metrics, we'll need to generate outputs from the model and convert them to strings. Let's build some lists of labels and predictions for the ROUGE metric to compare (note that if you get import errors for this section, you may need to `!pip install tqdm`). We're also going to use a trick that dramatically increases performance: compiling our generation code with [XLA](https://www.tensorflow.org/xla), TensorFlow's accelerated linear algebra compiler. XLA applies various optimizations to the model's computation graph, resulting in significant improvements to speed and memory usage. As described in the Hugging Face [blog](https://huggingface.co/blog/tf-xla-generate), XLA works best when our input shapes don't vary too much. To handle this, we'll pad our inputs to multiples of 320 and make a new dataset with the padding collator, then apply the `@tf.function(jit_compile=True)` decorator to our generation function, which marks the whole function for compilation with XLA.

```python
from tqdm import tqdm
import numpy as np

generation_data_collator = DataCollatorForSeq2Seq(
tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=320
)

tf_generate_dataset = model.prepare_tf_dataset(
tokenized_datasets["validation"],
collate_fn=generation_data_collator,
shuffle=False,
batch_size=8,
drop_remainder=True,
)


@tf.function(jit_compile=True)
def generate_with_xla(batch):
return model.generate(
input_ids=batch["input_ids"],
attention_mask=batch["attention_mask"],
max_new_tokens=32,
)


all_preds = []
all_labels = []
for batch in tqdm(tf_eval_dataset):
predictions = model.generate(**batch)
for batch, labels in tqdm(tf_generate_dataset):
predictions = generate_with_xla(batch)
decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
labels = batch["labels"].numpy()
labels = labels.numpy()
labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
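
Similarly, once `all_preds` and `all_labels` have been collected, they go to the ROUGE metric used earlier in the chapter. A self-contained sketch with 🤗 Evaluate follows; it is not part of the diff and the example strings are made up.

```python
# Sketch only (not from the diff): computing ROUGE scores from collected
# predictions and references with the 🤗 Evaluate library.
import evaluate

rouge_score = evaluate.load("rouge")

all_preds = ["the weather today is lovely"]
all_labels = ["today the weather is lovely"]

result = rouge_score.compute(predictions=all_preds, references=all_labels)
print(result)  # rouge1 / rouge2 / rougeL / rougeLsum scores
```
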
12 changes: 6 additions & 6 deletions chapters/en/chapter7/6.mdx
@@ -379,17 +379,17 @@ We can see that the examples have been stacked and all the tensors have the same

{#if fw === 'tf'}

Now we can use the `to_tf_dataset()` method to convert our datasets to TensorFlow datasets with the data collator we created above:
Now we can use the `prepare_tf_dataset()` method to convert our datasets to TensorFlow datasets with the data collator we created above:

```python
tf_train_dataset = tokenized_dataset["train"].to_tf_dataset(
columns=["input_ids", "attention_mask", "labels"],
tf_train_dataset = model.prepare_tf_dataset(
tokenized_dataset["train"],
collate_fn=data_collator,
shuffle=True,
batch_size=32,
)
tf_eval_dataset = tokenized_dataset["valid"].to_tf_dataset(
columns=["input_ids", "attention_mask", "labels"],
tf_eval_dataset = model.prepare_tf_dataset(
tokenized_dataset["valid"],
collate_fn=data_collator,
shuffle=False,
batch_size=32,
@@ -515,7 +515,7 @@ model.fit(tf_train_dataset, validation_data=tf_eval_dataset, callbacks=[callback

{:else}

💡 If you have access to a machine with multiple GPUs, you can try using a `MirroredStrategy` context to substantially speed up training. You'll need to create a `tf.distribute.MirroredStrategy` object, and make sure that the `to_tf_dataset` commands as well as model creation and the call to `fit()` are all run in its `scope()` context. You can see documentation on this [here](https://www.tensorflow.org/guide/distributed_training#use_tfdistributestrategy_with_keras_modelfit).
💡 If you have access to a machine with multiple GPUs, you can try using a `MirroredStrategy` context to substantially speed up training. You'll need to create a `tf.distribute.MirroredStrategy` object, and make sure that any `to_tf_dataset()` or `prepare_tf_dataset()` calls, as well as model creation and the call to `fit()`, are all run in its `scope()` context. You can see documentation on this [here](https://www.tensorflow.org/guide/distributed_training#use_tfdistributestrategy_with_keras_modelfit).

{/if}

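
The `MirroredStrategy` tip above has no accompanying code in the chapter. A rough sketch of what it describes is shown below; it is not part of the diff, and `model_init()` is an illustrative stand-in for whatever builds and compiles your model.

```python
# Sketch only (not from the diff): dataset creation, model creation, and
# training all run inside the MirroredStrategy scope.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print(f"Number of replicas: {strategy.num_replicas_in_sync}")

with strategy.scope():
    model = model_init()  # hypothetical helper that builds and compiles the model
    tf_train_dataset = model.prepare_tf_dataset(
        tokenized_dataset["train"],
        collate_fn=data_collator,
        shuffle=True,
        batch_size=32,
    )
    model.fit(tf_train_dataset, epochs=1)
```
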
14 changes: 4 additions & 10 deletions chapters/en/chapter7/7.mdx
@@ -862,20 +862,14 @@ data_collator = DefaultDataCollator(return_tensors="tf")
And now we create the datasets as usual.

```python
tf_train_dataset = train_dataset.to_tf_dataset(
columns=[
"input_ids",
"start_positions",
"end_positions",
"attention_mask",
"token_type_ids",
],
tf_train_dataset = model.prepare_tf_dataset(
train_dataset,
collate_fn=data_collator,
shuffle=True,
batch_size=16,
)
tf_eval_dataset = validation_dataset.to_tf_dataset(
columns=["input_ids", "attention_mask", "token_type_ids"],
tf_eval_dataset = model.prepare_tf_dataset(
validation_dataset,
collate_fn=data_collator,
shuffle=False,
batch_size=16,
2 changes: 2 additions & 0 deletions chapters/es/_toctree.yml
@@ -33,6 +33,8 @@
title: Tokenizadores
- local: chapter2/5
title: Manejando Secuencias Múltiples
- local: chapter2/6
title: Poniendo todo junto

- title: 3. Ajuste (fine-tuning) de un modelo preentrenado
sections: