Permanent Skipping of Batches with accelerator.skip_first_batches #1451

Closed

Teoge opened this issue May 18, 2023 · 5 comments

Teoge commented May 18, 2023

System Info

- `Accelerate` version: 0.18.0
- Platform: Linux-5.4.0-144-generic-x86_64-with-glibc2.27
- Python version: 3.9.16
- Numpy version: 1.23.5
- PyTorch version (GPU?): 1.12.1 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: fp16
        - use_cpu: False
        - num_processes: 5
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: 0,1,2,3,4
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

https://github.com/huggingface/accelerate/blob/main/examples/complete_cv_example.py

Expected behavior

When using the skip_first_batches function, I discovered that it does not behave as expected. The function is intended to skip a specified number of batches only in the first resumed epoch, but it permanently skips that number of batches from the dataloader, so every subsequent epoch is affected as well. The problem arises when the dataloader is not reinitialized, leaving a permanently shorter dataloader for the remaining epochs.
The official script provided in the repository (https://github.com/huggingface/accelerate/blob/main/examples/complete_cv_example.py#L208) does not include any step to restore the original dataloader after the first epoch, which is misleading.
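For illustration, here is a minimal sketch of the problematic pattern, run on a single process; the dataset, batch counts, and variable names are made up for the example and are not taken from the script:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# 10 samples with batch_size=1 -> 10 batches per epoch.
dataset = TensorDataset(torch.arange(10, dtype=torch.float32).unsqueeze(1))
dataloader = DataLoader(dataset, batch_size=1)

accelerator = Accelerator()
dataloader = accelerator.prepare(dataloader)

# Resume mid-epoch by skipping the first 4 batches.
skipped_dataloader = accelerator.skip_first_batches(dataloader, num_batches=4)

for epoch in range(3):
    num_batches = sum(1 for _ in skipped_dataloader)
    # Reusing skipped_dataloader drops the first 4 batches in *every* epoch,
    # so this reports 6 batches each time, not only in the first resumed epoch.
    print(f"epoch {epoch}: {num_batches} batches")
```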

sgugger (Collaborator) commented May 18, 2023

Indeed, the dataloader needs to be changed only for the first epoch. Would you like to suggest a PR with a fix?

Teoge (Author) commented May 19, 2023

Thank you for your response. Since I am not familiar with the Accelerate data loading logic, the only solution I can suggest at the moment is for users to retain the original dataloader and use it for later epochs, which might require some changes to the function documentation and example scripts.
Given your expertise, do you think there is a better solution to address this issue?

sgugger (Collaborator) commented May 19, 2023

No, the dataloader should be swapped only for the first epoch, and then you should go back to the initial dataloader for all subsequent epochs. That is the correct fix and how we use this internally in the Trainer.
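A sketch of that pattern (not the Trainer's actual code; `train_dataloader`, `starting_epoch`, `resume_step`, and `num_epochs` are assumed to come from the surrounding training script and the restored checkpoint):

```python
for epoch in range(starting_epoch, num_epochs):
    if epoch == starting_epoch and resume_step > 0:
        # First resumed epoch: skip the batches already seen before the checkpoint.
        active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step)
    else:
        # All later epochs: go back to the full, original dataloader.
        active_dataloader = train_dataloader
    for batch in active_dataloader:
        ...  # usual training step
```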

Teoge (Author) commented May 22, 2023

If so, please forgive me for not knowing how to implement this.

muellerzr (Collaborator) commented

Hi @Teoge, no worries! I did it in #1466 :)

Teoge closed this as completed May 24, 2023