
NeptuneObserver raises Neptune.api_exceptions.ChannelsValuesSendBatchError #5130

Closed
wjaskowski opened this issue Dec 14, 2020 · 11 comments · Fixed by #5510
Labels
bug (Something isn't working), help wanted (Open to be worked on), logger (Related to the Loggers)

Comments

@wjaskowski

🐛 Bug

NeptuneObserver throws

Failed to send channel value.
Traceback (most recent call last):
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.7/site-packages/neptune/internal/channels/channels_values_sender.py", line 156, in _send_values
    self._experiment._send_channels_values(channels_with_values)
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.7/site-packages/neptune/experiments.py", line 1167, in _send_channels_values
    self._backend.send_channels_values(self, channels_with_values)
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.7/site-packages/neptune/utils.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.7/site-packages/neptune/internal/backends/hosted_neptune_backend.py", line 571, in send_channels_values
    raise ChannelsValuesSendBatchError(experiment.id, batch_errors)
neptune.api_exceptions.ChannelsValuesSendBatchError: Received batch errors sending channels' values to experiment SAN-28. Cause: Error(code=400, message='X-coordinates must be strictly increasing for channel: e968f192-b466-4419-89ce-469fdc4cf86f. Invalid point: InputChannelValue(timestamp=2020-12-14T15:53:20.877Z, x=0.0, numericValue=0.0, textValue=null, image', type=None) (metricId: 'e968f192-b466-4419-89ce-469fdc4cf86f', x: 0.0) Skipping 1 values.
/home/wojciech/miniconda3/envs/ml/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:49: UserWarning: The dataloader, test dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 6 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
To Reproduce

import os
import torch
from pytorch_lightning.loggers import NeptuneLogger
from torch.utils.data import Dataset
from pytorch_lightning import Trainer, LightningModule


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self):
        """
        Testing PL Module
        Use as follows:
        - subclass
        - modify the behavior for what you want
        class TestModel(BaseTestModel):
            def training_step(...):
                # do your own thing
        or:
        model = BaseTestModel()
        model.training_epoch_end = None
        """
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def step(self, x):
        x = self.layer(x)
        out = torch.nn.functional.mse_loss(x, torch.ones_like(x))
        return out

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('train_loss', loss, on_step=True, on_epoch=True)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('val_loss', loss, on_step=False, on_epoch=True)
        return {"x": loss}

    def validation_epoch_end(self, outputs) -> None:
        torch.stack([x['x'] for x in outputs]).mean()

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"y": loss}

    def test_epoch_end(self, outputs) -> None:
        torch.stack([x["y"] for x in outputs]).mean()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]


#  NOTE: If you are using a cmd line to run your script,
#  provide the cmd line as below.
#  opt = "--max_epochs 1 --limit_train_batches 1".split(" ")
#  parser = ArgumentParser()
#  args = parser.parse_args(opt)

def run_test():

    class TestModel(BoringModel):

        def on_train_epoch_start(self) -> None:
            print('override any method to prove your bug')

    # fake data
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    test_data = torch.utils.data.DataLoader(RandomDataset(32, 64))

    neptune_logger = NeptuneLogger(
        api_key="ANONYMOUS",
        project_name="shared/pytorch-lightning-integration",
    )

    # model
    model = TestModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        max_epochs=1,
        weights_summary=None,
        logger=neptune_logger
    )
    trainer.fit(model, train_data, val_data)
    trainer.test(test_dataloaders=test_data)


if __name__ == '__main__':
    run_test()

Expected behavior

No exception.

Environment

This happens with both pytorch-lightning 1.0.7 and 1.1.0, and with neptune-client 0.4.126 and 0.4.129.

  • CUDA:
    • GPU:
      • GeForce GTX 1080 Ti
    • available: True
    • version: 11.0
  • Packages:
    • numpy: 1.18.5
    • pyTorch_debug: False
    • pyTorch_version: 1.7.1
    • pytorch-lightning: 1.1.0
    • tqdm: 4.54.1
  • System:

Additional context

This happens only when we log during validation_step with on_epoch=True, i.e.:

        self.log('val_loss', loss, on_step=False, on_epoch=True)
@wjaskowski added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Dec 14, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@wjaskowski
Author

Any ideas @jakubczakon?

@jakubczakon
Contributor

Thanks for bringing this up @wjaskowski.
@pitercl could you take a look?

@pitercl
Contributor

pitercl commented Dec 18, 2020

Hi @wjaskowski!

I bumped into a similar issue a while back and, from what I remember, PL automatically adds an epoch metric with some global_step as the x value. From what I've seen, this global_step is reset on each fit/test call, which causes the x values to be non-increasing.

To be honest, I'm not sure what the best way to approach this is. The only Neptune-specific behaviour here is that we don't accept non-monotonic x values. In general, I'm not sure resetting the step is a good idea (a client-side filtering sketch is included after this comment).

Maybe someone from the PL team can weigh in? @Borda?

PS. I investigated this some time ago, so my findings may be a bit outdated.
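
Until #5510 lands, a possible client-side workaround is to drop any point whose step does not strictly increase before it reaches Neptune. The sketch below is only an illustration, not the fix pursued in #5510: it assumes the standard Lightning logger hook log_metrics(metrics, step), and the class name MonotonicNeptuneLogger is made up here.

from pytorch_lightning.loggers import NeptuneLogger


class MonotonicNeptuneLogger(NeptuneLogger):
    """Hypothetical workaround: drop metric points whose step is not
    strictly greater than the last step already sent for that metric,
    so Neptune never sees a non-increasing x."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._last_step = {}  # metric name -> last step sent

    def log_metrics(self, metrics, step=None):
        if step is None:
            # Without an explicit step there is nothing to filter on.
            super().log_metrics(metrics, step)
            return
        # Keep only metrics whose step strictly increases.
        filtered = {
            name: value
            for name, value in metrics.items()
            if self._last_step.get(name, -1) < step
        }
        for name in filtered:
            self._last_step[name] = step
        if filtered:
            super().log_metrics(filtered, step)

Using it in the repro above just means constructing MonotonicNeptuneLogger instead of NeptuneLogger; it silently discards the offending points rather than addressing the step reset itself.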

@stale

stale bot commented Jan 17, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

The stale bot added the won't fix (This will not be worked on) label on Jan 17, 2021
@wjaskowski
Author

Well, it is still there. The problem won't disappear by itself.

The stale bot removed the won't fix (This will not be worked on) label on Jan 17, 2021
@pitercl
Contributor

pitercl commented Jan 18, 2021

We're approaching this here: #5510. Once it's merged, it will most likely fix the issue you're observing. Sorry for the wait!

@wjaskowski
Author

wjaskowski commented Jan 18, 2021 via email

@PiotrJander
Contributor

This comment is relevant: #5510 (comment)

@PiotrJander
Contributor

@awaelchli Regarding your comment #5510 (comment) ("Multiple calls to fit will not reset the global step."): the example at the start of this issue only contains a call to fit() and then to test() on the same trainer, yet the problem of a non-increasing step occurs anyway.

In particular, I was able to verify (by modifying PTL code and adding a debug statement after https://github.com/PyTorchLightning/pytorch-lightning/blob/f477c2fd2980ad128bfe79a3b859e0b81b435507/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py#L190) that when metrics are logged during trainer.test(), step is set from trainer.global_step, which is 0 at that point.

So it does seem that calling trainer.test() resets the global step.

Steps to reproduce:

  1. Run the example at the start of this issue.
  2. Observe the line linked above.
  3. Verify that trainer.global_step is 0 after trainer.test() is called.
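
A quicker check that does not require editing PTL code is to print trainer.global_step right after the fit() and test() calls in run_test() above (a small diagnostic sketch; the trailing comment just restates the observation from this comment):

    trainer.fit(model, train_data, val_data)
    print("global_step after fit :", trainer.global_step)

    trainer.test(test_dataloaders=test_data)
    print("global_step after test:", trainer.global_step)  # reported as 0 in this thread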

@PiotrJander
Contributor

Interestingly, the above is not the case for the Boring Model: https://colab.research.google.com/drive/1HvWVVTK8j2Nj52qU4Q4YCyzOm0_aLQF3?usp=sharing#scrollTo=FOma5cYzSWp7

It might be worth comparing the Boring Model and @wjaskowski's model to see what the difference could be - I'll try to do that later today.
