
bug to save log dict #8887

Closed
qqueing opened this issue Aug 13, 2021 · 4 comments
Assignees
Labels
logging Related to the `LoggerConnector` and `log()` waiting on author Waiting on user action, correction, or update working as intended Working as intended

Comments

@qqueing
Contributor

qqueing commented Aug 13, 2021

🐛 Bug

To Reproduce

    def validation_epoch_end(self, metrics):
        total_metrics = {}
        total_metrics["val_imp_sum"] = 198487
        self.log_dict(total_metrics)

But the returned result is 'val_imp_sum': tensor(198487.0156, device='cuda:0').

The internal result class seems to track a cumulated batch size, but its values do not match what was logged.

results
Out[9]: {False, device(type='cuda', index=0), {'validation_epoch_end.val_imp_sum': ResultMetric('val_imp_sum', value=372957088.0, cumulated_batch_size=1879.0)}}

Expected behavior

The returned result should be 'val_imp_sum': tensor(198487.000, device='cuda:0').

Environment

  • PyTorch Lightning Version (e.g., 1.3.0): 1.4.2
  • PyTorch Version (e.g., 1.8) 1.9
  • Python version:
  • OS (e.g., Linux): linux
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source):
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

Version 1.3.8 works fine, but version 1.4.2 does not.

@qqueing qqueing added bug Something isn't working help wanted Open to be worked on labels Aug 13, 2021
@Borda Borda added the logging Related to the `LoggerConnector` and `log()` label Aug 13, 2021
@carmocca carmocca added working as intended Working as intended and removed bug Something isn't working help wanted Open to be worked on labels Aug 14, 2021
@carmocca
Contributor

Hi @qqueing. Can you elaborate on exactly what issue you are seeing?

The results class you are observing is entirely internal and you shouldn't need to do anything with it. The internal values look right, as you are using the mean reduction and 372957088 / 1879 ≈ 198487.

If you were to access trainer.callback_metrics right after, you would see:

{'val_imp_sum': tensor(198487.)}

which is what you expect.
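To make the arithmetic above concrete, here is a hypothetical sketch (class and attribute names assumed from the ResultMetric repr in the report, not taken from Lightning's actual implementation) of how a mean-reduced logged metric is accumulated: a running weighted sum plus a cumulated batch size, divided at the end.

```python
import torch


class ResultMetricSketch:
    """Hypothetical sketch of a mean-reduced logged metric."""

    def __init__(self):
        self.value = torch.tensor(0.0)  # running weighted sum
        self.cumulated_batch_size = torch.tensor(0.0)

    def update(self, value, batch_size=1):
        # each call adds value * batch_size to the running sum
        self.value = self.value + value * batch_size
        self.cumulated_batch_size = self.cumulated_batch_size + batch_size

    def compute(self):
        # the user-facing value is the weighted mean
        return self.value / self.cumulated_batch_size


m = ResultMetricSketch()
for _ in range(4):
    m.update(torch.tensor(198487.0), batch_size=2)
print(m.compute().item())  # 198487.0 here, since the running sum stays small
```

With large running sums, as in the report (372957088 over 1879 samples), the same division recovers the mean, up to float32 rounding.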

@carmocca carmocca added the waiting on author Waiting on user action, correction, or update label Aug 14, 2021
@qqueing
Contributor Author

qqueing commented Aug 15, 2021

I logged val_imp_sum as 198487, but the returned value is 198487.0156.
If the value really were 198487, the cumulated sum would be 372957073 (198487 × 1879), not 372957088.
I don't understand this mechanism...
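The mismatch is consistent with float32 rounding during accumulation: 198487 × 1879 = 372957073 is not representable in float32 (adjacent float32 values at that magnitude are 32 apart), so a float32 running sum must land on a nearby representable value instead. A minimal demonstration, independent of Lightning:

```python
import torch

exact = 198487 * 1879  # 372957073, exact in Python integers

# accumulate 198487.0 a total of 1879 times in float32,
# as a per-sample running sum would
s = torch.tensor(0.0)  # float32 by default
for _ in range(1879):
    s = s + torch.tensor(198487.0)

print(exact)              # 372957073
print(s.item())           # a nearby float32 value, not exactly 372957073
print((s / 1879).item())  # the recovered "mean" is no longer exactly 198487.0
```

Dividing the drifted sum by the batch count yields a value slightly off from 198487.0, which is exactly the 198487.0156 seen in the report.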

@carmocca
Contributor

Seems like it's a floating-point precision issue. Would you mind trying to reproduce it?

This does print 198487.0 on my machine.

import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        self.log("a", 198487)

    def validation_step(self, *args, **kwargs):
        self.log("b", 198487)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=3,
        limit_val_batches=3,
        num_sanity_val_steps=0,
        max_epochs=1,
        weights_summary=None,
        progress_bar_refresh_rate=0,
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)

    print({k: v.item() for k, v in trainer.callback_metrics.items()})


if __name__ == "__main__":
    run()

@tchaton
Contributor

tchaton commented Aug 27, 2021

Dear @qqueing,

Closing this issue, as it does not appear to be related to Lightning, and @carmocca showed a working example.

Best,
T.C
