
Chapter 2: Tokenizing the whole dataset drops original columns and produces mismatched row counts #134

Open
rshokeen opened this issue Mar 21, 2024 · 1 comment


@rshokeen

Information

While working through the "Tokenizing the Whole Dataset" sub-section of Chapter 2, I am encountering an issue when calling the tokenize function.

The problem arises in chapter:

  • [ ] Introduction
  • [x] Text Classification
  • [ ] Transformer Anatomy
  • [ ] Multilingual Named Entity Recognition
  • [ ] Text Generation
  • [ ] Summarization
  • [ ] Question Answering
  • [ ] Making Transformers Efficient in Production
  • [ ] Dealing with Few to No Labels
  • [ ] Training Transformers from Scratch
  • [ ] Future Directions

Describe the bug

  1. When calling the tokenize function on the emotions dataset (which has "text" and "label" columns), the resulting emotions_encoded dataset drops the "text" and "label" columns and only has the "input_ids" and "attention_mask" columns. The book shows that emotions_encoded should have all 4 columns after calling the tokenize function: ['attention_mask', 'input_ids', 'label', 'text'].

  2. After calling the tokenize function, the resulting emotions_encoded["train"] dataset has only 151 rows, which does not match the original emotions["train"] dataset of 16,000 rows. (A minimal sketch of the expected behavior is shown after this list.)
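For reference, here is a minimal sketch of the expected behavior, assuming the "emotion" dataset and the distilbert-base-uncased checkpoint used in the chapter. Dataset.map() keeps the existing columns unless a remove_columns argument is passed, so all four features should survive tokenization:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# assumes the "emotion" dataset and the checkpoint used in Chapter 2
emotions = load_dataset("emotion")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # pad/truncate each batch to a common length
    return tokenizer(batch["text"], padding=True, truncation=True)

# map() keeps the original columns by default (no remove_columns argument),
# so the result should contain text, label, input_ids and attention_mask
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)

print(emotions_encoded["train"].column_names)
# expected: ['text', 'label', 'input_ids', 'attention_mask']
print(emotions_encoded["train"].num_rows)  # expected: 16000
```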

To Reproduce

Steps to reproduce the behavior:

  1. Define the tokenize function
    def tokenize(batch):
        return tokenizer(batch["text"], padding=True, truncation=True)

  2. Call the tokenize function
    emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)

  3. Print column names
    print(emotions_encoded["train"].column_names)

    output: ['input_ids', 'attention_mask']

    **Expected behavior:** the output should be ['attention_mask', 'input_ids', 'label', 'text']
    
  4. Print emotions_encoded
    print(emotions_encoded)

    output:
    DatasetDict({
        train: Dataset({
            features: ['input_ids', 'attention_mask'],
            num_rows: 151
        })
        validation: Dataset({
            features: ['input_ids', 'attention_mask'],
            num_rows: 144
        })
        test: Dataset({
            features: ['input_ids', 'attention_mask'],
            num_rows: 152
        })
    })

    Observed vs. expected: emotions_encoded["train"] has only 151 rows, whereas the original emotions["train"] dataset has 16,000 rows; the mapped dataset should keep all 16,000. A small diagnostic sketch follows below.
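To help narrow down where the rows are being dropped, it can be useful to compare split sizes before and after map(). With this tokenize function, map() does not change the number of rows, so if emotions itself already has fewer than 16,000 training rows, the problem lies in the dataset loading step rather than in the tokenization (a diagnostic sketch, assuming the emotions and emotions_encoded objects from the steps above):

```python
# compare row counts of each split before and after map()
for split in emotions:
    print(split, emotions[split].num_rows, "->", emotions_encoded[split].num_rows)
```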

@VitaminCplus

I got a similar error. I eventually fixed it by switching off the Accelerator in Kaggle.

Settings -> Accelerator -> None
