Permanent Skipping of Batches with accelerator.skip_first_batches #1451

Closed

Teoge opened this issue May 18, 2023 · 5 comments

Teoge commented May 18, 2023

System Info

- `Accelerate` version: 0.18.0
- Platform: Linux-5.4.0-144-generic-x86_64-with-glibc2.27
- Python version: 3.9.16
- Numpy version: 1.23.5
- PyTorch version (GPU?): 1.12.1 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: fp16
        - use_cpu: False
        - num_processes: 5
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: 0,1,2,3,4
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

https://github.com/huggingface/accelerate/blob/main/examples/complete_cv_example.py

Expected behavior

When using the skip_first_batches function, I discovered that it does not behave as expected. The function is intended to skip a specified number of batches only in the first resumed epoch, but it permanently skips that number of batches from the dataloader, so every subsequent epoch is affected as well. The problem arises when the dataloader is not reinitialized, leaving a permanently shorter dataloader for the remaining epochs.
The official script provided in the repository (https://github.com/huggingface/accelerate/blob/main/examples/complete_cv_example.py#L208) does not include any step to restore the original dataloader after the first epoch, which is misleading.
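For illustration, here is a minimal sketch of the problematic pattern, run on a single process; the dataset, batch counts, and variable names are made up for the example and are not taken from the script:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# 10 samples with batch_size=1 -> 10 batches per epoch.
dataset = TensorDataset(torch.arange(10, dtype=torch.float32).unsqueeze(1))
dataloader = DataLoader(dataset, batch_size=1)

accelerator = Accelerator()
dataloader = accelerator.prepare(dataloader)

# Resume mid-epoch by skipping the first 4 batches.
skipped_dataloader = accelerator.skip_first_batches(dataloader, num_batches=4)

for epoch in range(3):
    num_batches = sum(1 for _ in skipped_dataloader)
    # Reusing skipped_dataloader drops the first 4 batches in *every* epoch,
    # so this reports 6 batches each time, not only in the first resumed epoch.
    print(f"epoch {epoch}: {num_batches} batches")
```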

sgugger (Collaborator) commented May 18, 2023

Indeed, the dataloader needs to be changed only for the first epoch. Would you like to suggest a PR with a fix?

Teoge (Author) commented May 19, 2023

Thank you for your response. Since I am not familiar with the Accelerate data loading logic, the only solution I can suggest at the moment is for users to retain the original dataloader and use it for later epochs, which might require some changes to the function documentation and example scripts.
Given your expertise, do you think there is a better solution to address this issue?

sgugger (Collaborator) commented May 19, 2023

No, the dataloader should be swapped only for the first epoch, and then you should go back to the initial dataloader for all subsequent epochs. That is the correct fix and how we use this internally in the Trainer.
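A sketch of that pattern (not the Trainer's actual code; `train_dataloader`, `starting_epoch`, `resume_step`, and `num_epochs` are assumed to come from the surrounding training script and the restored checkpoint):

```python
for epoch in range(starting_epoch, num_epochs):
    if epoch == starting_epoch and resume_step > 0:
        # First resumed epoch: skip the batches already seen before the checkpoint.
        active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step)
    else:
        # All later epochs: go back to the full, original dataloader.
        active_dataloader = train_dataloader
    for batch in active_dataloader:
        ...  # usual training step
```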

Teoge (Author) commented May 22, 2023

If so, please forgive me for not knowing how to implement this.

muellerzr (Collaborator) commented

Hi @Teoge, no worries! I did it in #1466 :)

Teoge closed this as completed May 24, 2023