From 0bc7dc00abdba0110b79dacd657d15a43d3c9dda Mon Sep 17 00:00:00 2001
From: jybarnes21
Date: Fri, 17 Mar 2023 18:52:40 +0900
Subject: [PATCH] Fix typo (#532)

---
 chapters/en/chapter6/3.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/en/chapter6/3.mdx b/chapters/en/chapter6/3.mdx
index 62d143dcc..88250f6df 100644
--- a/chapters/en/chapter6/3.mdx
+++ b/chapters/en/chapter6/3.mdx
@@ -109,7 +109,7 @@ We can see that the tokenizer's special tokens `[CLS]` and `[SEP]` are mapped to
 
-The notion of what a word is is complicated. For instance, does "I'll" (a contraction of "I will") count as one or two words? It actually depends on the tokenizer and the pre-tokenization operation it applies. Some tokenizers just split on spaces, so they will consider this as one word. Others use punctuation on top of spaces, so will consider it two words.
+The notion of a word is complicated. For instance, does "I'll" (a contraction of "I will") count as one or two words? It actually depends on the tokenizer and the pre-tokenization operation it applies. Some tokenizers just split on spaces, so they will consider this as one word. Others use punctuation on top of spaces, so will consider it two words.
 
 ✏️ **Try it out!** Create a tokenizer from the `bert-base-cased` and `roberta-base` checkpoints and tokenize "81s" with them. What do you observe? What are the word IDs?
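The context lines in this hunk describe why "word" is ambiguous: a whitespace-only pre-tokenizer keeps "I'll" together, while one that also splits on punctuation breaks it apart. The following is a minimal, self-contained sketch of that distinction using plain regular expressions; it is an illustration only, not the actual pre-tokenizers used by the `bert-base-cased` or `roberta-base` checkpoints, and the function names are hypothetical.

```python
import re


def whitespace_pretokenize(text):
    # Whitespace-only splitting: "I'll" stays a single word.
    return text.split()


def whitespace_punct_pretokenize(text):
    # Split on whitespace AND treat each punctuation character as its
    # own token, so "I'll" is broken into several pieces.
    return re.findall(r"\w+|[^\w\s]", text)


print(whitespace_pretokenize("I'll tokenize this."))
# -> ["I'll", 'tokenize', 'this.']
print(whitespace_punct_pretokenize("I'll tokenize this."))
# -> ['I', "'", 'll', 'tokenize', 'this', '.']
```

Under the first scheme the sentence has three "words"; under the second it has six, which is exactly why word counts (and word IDs) differ between tokenizers.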