
The combination of different mode and split leads to wrong calculation for number of batches and number of epochs #264

Closed
DavdGao opened this issue Jul 27, 2022 · 1 comment
Labels
bug Something isn't working

Comments

DavdGao (Collaborator) commented Jul 27, 2022

Describe the bug
As the title says, the numbers of batches and epochs are currently calculated per split as follows:

        ...
        # Process training data
        if self.train_data is not None or self.train_loader is not None:
            # Calculate the number of update steps during training given the
            # local_update_steps
            num_train_batch, num_train_batch_last_epoch, num_train_epoch, \
                num_total_train_batch = self.pre_calculate_batch_epoch_num(
                    self.cfg.train.local_update_steps)

            self.num_train_epoch = num_train_epoch
            self.num_train_batch = num_train_batch
            self.num_train_batch_last_epoch = num_train_batch_last_epoch
            self.num_total_train_batch = num_total_train_batch

        # Process evaluation data
        for mode in ["val", "test"]:
            setattr(self, "num_{}_epoch".format(mode), 1)
            if self.get("{}_data".format(mode)) is not None or self.get(
                    "{}_loader".format(mode)) is not None:
                setattr(
                    self, "num_{}_batch".format(mode),
                    getattr(self, "num_{}_data".format(mode)) //
                    self.cfg.data.batch_size +
                    int(not self.cfg.data.drop_last and bool(
                        getattr(self, "num_{}_data".format(mode)) %
                        self.cfg.data.batch_size)))
            ...
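
As a quick check of the formula used for the eval splits above (the numbers below are made up purely for illustration):

    # Illustrative only: 130 validation samples with batch_size 64.
    num_val_data, batch_size, drop_last = 130, 64, False
    num_val_batch = num_val_data // batch_size + int(
        not drop_last and bool(num_val_data % batch_size))
    assert num_val_batch == 3  # 2 full batches + 1 partial batch
    # With drop_last=True, the partial batch is discarded and the count is 2.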

The finetune and training routines stop at

    def _run_routine(self, ...):
            ...
            # Break in the final epoch
            if self.ctx.cur_mode == 'train' and epoch_i == \
                    self.ctx.num_train_epoch - 1:
                if batch_i >= self.ctx.num_train_batch_last_epoch - 1:
                    break
            ...

The problems are:

  • If we choose the test or validation split for the training routine, num_train_batch_last_epoch and num_train_epoch are both wrong, since they are calculated for the training split (see the illustrative numbers after this list).
  • If we set different parameters (say, local update steps) for finetuning and training, they should have different num_train_batch_last_epoch and num_train_epoch.
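
To make the first point concrete, consider an illustrative (made-up) configuration:

    # Illustrative numbers only: the train split has 1000 samples, the test
    # split has 100, batch_size is 50, drop_last is False.
    batch_size = 50
    num_train_data, num_test_data = 1000, 100
    num_train_batch = num_train_data // batch_size  # 20 batches per epoch
    num_test_batch = num_test_data // batch_size    # only 2 batches per epoch
    # If the routine iterates over the test split but breaks on counts derived
    # from the train split (e.g. num_train_batch_last_epoch), the stopping
    # point no longer matches the split that is actually being used.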

Expected behavior
The numbers of batches and epochs should be calculated according to the combination of mode and split; a rough sketch of that behavior follows.
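
A minimal sketch, assuming local update steps are counted in batches; the helper name calc_batch_epoch_num, the per-mode update_steps arguments, and the numbers are illustrative assumptions, not the actual FederatedScope API:

    import math

    def calc_batch_epoch_num(num_data, batch_size, drop_last, update_steps):
        # Batches available in one pass over the chosen split.
        batches_per_epoch = num_data // batch_size + int(
            not drop_last and bool(num_data % batch_size))
        # Epochs needed to run `update_steps` batch updates, and how many
        # batches of the last epoch are actually consumed.
        num_epoch = math.ceil(update_steps / batches_per_epoch)
        batch_last_epoch = update_steps - (num_epoch - 1) * batches_per_epoch
        return batches_per_epoch, batch_last_epoch, num_epoch, update_steps

    # One call per (mode, split) combination, each with its own split size and
    # its own update steps (e.g. training vs. finetuning can differ in both).
    train_counts = calc_batch_epoch_num(num_data=1000, batch_size=50,
                                        drop_last=False, update_steps=30)
    finetune_counts = calc_batch_epoch_num(num_data=100, batch_size=50,
                                           drop_last=False, update_steps=5)
    # train_counts    == (20, 10, 2, 30)
    # finetune_counts == (2, 1, 3, 5)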

@DavdGao DavdGao added the bug Something isn't working label Jul 27, 2022
@DavdGao DavdGao changed the title The actual number of batches within the finetune and training routines is wrong. The combination of different mode and split leads to wrong calculation for number of batches and number of epochs Jul 27, 2022
rayrayraykk (Collaborator) commented:

Fixed in #415.
