
Chapter 2: Tokenizing the whole dataset drops original columns and produces mismatched row counts #134

Open
rshokeen opened this issue Mar 21, 2024 · 1 comment


@rshokeen

Information

While working through the "Tokenizing the Whole Dataset" sub-section of Chapter 2, I am encountering an issue when calling the tokenize function.

The problem arises in chapter:

  • [ ] Introduction
  • [x] Text Classification
  • [ ] Transformer Anatomy
  • [ ] Multilingual Named Entity Recognition
  • [ ] Text Generation
  • [ ] Summarization
  • [ ] Question Answering
  • [ ] Making Transformers Efficient in Production
  • [ ] Dealing with Few to No Labels
  • [ ] Training Transformers from Scratch
  • [ ] Future Directions

Describe the bug

  1. When calling the tokenize function on the emotions dataset (which has "text" and "label" columns), the resulting emotions_encoded dataset drops the "text" and "label" columns and only has the "input_ids" and "attention_mask" columns. The book shows that emotions_encoded should have all 4 columns after calling the tokenize function: ['attention_mask', 'input_ids', 'label', 'text'].

  2. After calling the tokenize function, the resulting emotions_encoded["train"] dataset has only 151 rows, which does not match the original emotions["train"] dataset of 16,000 rows. (A minimal sketch of the expected behavior is shown after this list.)
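For reference, here is a minimal sketch of the expected behavior, assuming the "emotion" dataset and the distilbert-base-uncased checkpoint used in the chapter. Dataset.map() keeps the existing columns unless a remove_columns argument is passed, so all four features should survive tokenization:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# assumes the "emotion" dataset and the checkpoint used in Chapter 2
emotions = load_dataset("emotion")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # pad/truncate each batch to a common length
    return tokenizer(batch["text"], padding=True, truncation=True)

# map() keeps the original columns by default (no remove_columns argument),
# so the result should contain text, label, input_ids and attention_mask
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)

print(emotions_encoded["train"].column_names)
# expected: ['text', 'label', 'input_ids', 'attention_mask']
print(emotions_encoded["train"].num_rows)  # expected: 16000
```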

To Reproduce

Steps to reproduce the behavior:

  1. Define the tokenize function
    def tokenize(batch):
        return tokenizer(batch["text"], padding=True, truncation=True)

  2. Call the tokenize function
    emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)

  3. Print column names
    print(emotions_encoded["train"].column_names)

    output: ['input_ids', 'attention_mask']

    **Expected behavior:** the output should be ['attention_mask', 'input_ids', 'label', 'text']
    
  4. Print emotions_encoded
    print(emotions_encoded)

    output:
    DatasetDict({
        train: Dataset({
            features: ['input_ids', 'attention_mask'],
            num_rows: 151
        })
        validation: Dataset({
            features: ['input_ids', 'attention_mask'],
            num_rows: 144
        })
        test: Dataset({
            features: ['input_ids', 'attention_mask'],
            num_rows: 152
        })
    })

    Observed vs. expected: emotions_encoded["train"] has only 151 rows, whereas the original emotions["train"] dataset has 16,000 rows; the mapped dataset should keep all 16,000. A small diagnostic sketch follows below.
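To help narrow down where the rows are being dropped, it can be useful to compare split sizes before and after map(). With this tokenize function, map() does not change the number of rows, so if emotions itself already has fewer than 16,000 training rows, the problem lies in the dataset loading step rather than in the tokenization (a diagnostic sketch, assuming the emotions and emotions_encoded objects from the steps above):

```python
# compare row counts of each split before and after map()
for split in emotions:
    print(split, emotions[split].num_rows, "->", emotions_encoded[split].num_rows)
```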

@VitaminCplus

I got a similar error. I eventually fixed it by switching off the Accelerator in Kaggle.

Settings -> Accelerator -> None
