
The combination of different mode and split leads to wrong calculation for number of batches and number of epochs #264

Closed
DavdGao opened this issue Jul 27, 2022 · 1 comment
Labels
bug Something isn't working

Comments

DavdGao (Collaborator) commented Jul 27, 2022

Describe the bug
As the title says, the numbers of batches and epochs are currently calculated per split as follows:

        ...
        # Process training data
        if self.train_data is not None or self.train_loader is not None:
            # Calculate the number of update steps during training given the
            # local_update_steps
            num_train_batch, num_train_batch_last_epoch, num_train_epoch, \
                num_total_train_batch = self.pre_calculate_batch_epoch_num(
                    self.cfg.train.local_update_steps)

            self.num_train_epoch = num_train_epoch
            self.num_train_batch = num_train_batch
            self.num_train_batch_last_epoch = num_train_batch_last_epoch
            self.num_total_train_batch = num_total_train_batch

        # Process evaluation data
        for mode in ["val", "test"]:
            setattr(self, "num_{}_epoch".format(mode), 1)
            if self.get("{}_data".format(mode)) is not None or self.get(
                    "{}_loader".format(mode)) is not None:
                setattr(
                    self, "num_{}_batch".format(mode),
                    getattr(self, "num_{}_data".format(mode)) //
                    self.cfg.data.batch_size +
                    int(not self.cfg.data.drop_last and bool(
                        getattr(self, "num_{}_data".format(mode)) %
                        self.cfg.data.batch_size)))
            ...
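
As a quick check of the formula used for the eval splits above (the numbers below are made up purely for illustration):

    # Illustrative only: 130 validation samples with batch_size 64.
    num_val_data, batch_size, drop_last = 130, 64, False
    num_val_batch = num_val_data // batch_size + int(
        not drop_last and bool(num_val_data % batch_size))
    assert num_val_batch == 3  # 2 full batches + 1 partial batch
    # With drop_last=True, the partial batch is discarded and the count is 2.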

The finetune and training routines stop at

    def _run_routine(self, ...):
            ...
            # Break in the final epoch
            if self.ctx.cur_mode == 'train' and epoch_i == \
                    self.ctx.num_train_epoch - 1:
                if batch_i >= self.ctx.num_train_batch_last_epoch - 1:
                    break
            ...

The problems are:

  • If we choose the test or validation split for the training routine, num_train_batch_last_epoch and num_train_epoch are both wrong, since they are calculated for the training split (see the illustrative numbers after this list).
  • If we set different parameters (say, local update steps) for finetuning and training, they should have different num_train_batch_last_epoch and num_train_epoch.
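
To make the first point concrete, consider an illustrative (made-up) configuration:

    # Illustrative numbers only: the train split has 1000 samples, the test
    # split has 100, batch_size is 50, drop_last is False.
    batch_size = 50
    num_train_data, num_test_data = 1000, 100
    num_train_batch = num_train_data // batch_size  # 20 batches per epoch
    num_test_batch = num_test_data // batch_size    # only 2 batches per epoch
    # If the routine iterates over the test split but breaks on counts derived
    # from the train split (e.g. num_train_batch_last_epoch), the stopping
    # point no longer matches the split that is actually being used.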

Expected behavior
The numbers of batches and epochs should be calculated according to the combination of mode and split; a rough sketch of that behavior follows.
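
A minimal sketch, assuming local update steps are counted in batches; the helper name calc_batch_epoch_num, the per-mode update_steps arguments, and the numbers are illustrative assumptions, not the actual FederatedScope API:

    import math

    def calc_batch_epoch_num(num_data, batch_size, drop_last, update_steps):
        # Batches available in one pass over the chosen split.
        batches_per_epoch = num_data // batch_size + int(
            not drop_last and bool(num_data % batch_size))
        # Epochs needed to run `update_steps` batch updates, and how many
        # batches of the last epoch are actually consumed.
        num_epoch = math.ceil(update_steps / batches_per_epoch)
        batch_last_epoch = update_steps - (num_epoch - 1) * batches_per_epoch
        return batches_per_epoch, batch_last_epoch, num_epoch, update_steps

    # One call per (mode, split) combination, each with its own split size and
    # its own update steps (e.g. training vs. finetuning can differ in both).
    train_counts = calc_batch_epoch_num(num_data=1000, batch_size=50,
                                        drop_last=False, update_steps=30)
    finetune_counts = calc_batch_epoch_num(num_data=100, batch_size=50,
                                           drop_last=False, update_steps=5)
    # train_counts    == (20, 10, 2, 30)
    # finetune_counts == (2, 1, 3, 5)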

@DavdGao DavdGao added the bug Something isn't working label Jul 27, 2022
@DavdGao DavdGao changed the title The actual number of batches within the finetune and training routines is wrong. The combination of different mode and split leads to wrong calculation for number of batches and number of epochs Jul 27, 2022
rayrayraykk (Collaborator) commented:

Fixed in #415.
