I am encountering an issue with the tokenize function in the "Tokenizing the Whole Dataset" sub-section of Chapter 2.
Information
The problem arises in chapter:
[ ] Introduction
[x] Text Classification
[ ] Transformer Anatomy
[ ] Multilingual Named Entity Recognition
[ ] Text Generation
[ ] Summarization
[ ] Question Answering
[ ] Making Transformers Efficient in Production
[ ] Dealing with Few to No Labels
[ ] Training Transformers from Scratch
[ ] Future Directions
Describe the bug
When calling the tokenize function on the emotions dataset (which has "text" and "label" columns), the resulting emotions_encoded dataset drops the "text" and "label" columns and only has the "input_ids" and "attention_mask" columns. The book shows that emotions_encoded has all 4 columns after calling the tokenize function: ['attention_mask', 'input_ids', 'label', 'text'].
In addition, after calling the tokenize function, the resulting emotions_encoded["train"] dataset has only 151 rows, which does not match the original emotions["train"] dataset of 16,000 rows.
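As a sanity check, the original dataset can be inspected before mapping. This is a minimal sketch assuming the book's Chapter 2 setup, i.e. load_dataset("emotion") and a DistilBERT tokenizer; the exact checkpoint name here is an assumption:
from datasets import load_dataset
from transformers import AutoTokenizer

# Reload the dataset and tokenizer as in the book's setup (assumed)
emotions = load_dataset("emotion")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# The starting point should be 16,000 training rows with "text" and "label"
print(emotions["train"].num_rows)      # expected: 16000
print(emotions["train"].column_names)  # expected: ['text', 'label']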
To Reproduce
Steps to reproduce the behavior:
Define the tokenize function
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)
Call the tokenize function
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)
Print column names
print(emotions_encoded["train"].column_names)
output: ['input_ids', 'attention_mask']
Print emotions_encoded
print(emotions_encoded)
output: DatasetDict({
train: Dataset({
features: ['input_ids', 'attention_mask'],
num_rows: 151
})
validation: Dataset({
features: ['input_ids', 'attention_mask'],
num_rows: 144
})
test: Dataset({
features: ['input_ids', 'attention_mask'],
num_rows: 152
})
})
Expected behavior: the tokenized emotions_encoded["train"] dataset should keep the "text" and "label" columns and have the same 16,000 rows as the original emotions["train"] dataset, but it only has 151 rows.
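For comparison, the default behavior of Dataset.map that I expect is shown below. This is a minimal self-contained sketch with a toy dataset (not the book's code): map keeps the existing columns and the original number of rows, and only adds the new columns returned by the mapped function, unless remove_columns is passed explicitly.
from datasets import Dataset

# Toy dataset standing in for emotions["train"] (hypothetical data)
toy = Dataset.from_dict({"text": ["i feel great", "i feel sad"], "label": [1, 0]})

def add_length(batch):
    # Returns one new column; existing "text" and "label" should be preserved
    return {"length": [len(t) for t in batch["text"]]}

toy_mapped = toy.map(add_length, batched=True, batch_size=None)
print(toy_mapped.column_names)  # ['text', 'label', 'length']
print(toy_mapped.num_rows)      # 2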