
NeptuneObserver raises Neptune.api_exceptions.ChannelsValuesSendBatchError #5130

Closed
wjaskowski opened this issue Dec 14, 2020 · 11 comments · Fixed by #5510
Labels
bug (Something isn't working), help wanted (Open to be worked on), logger (Related to the Loggers)

Comments

@wjaskowski

🐛 Bug

NeptuneObserver throws

Failed to send channel value.
Traceback (most recent call last):
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.7/site-packages/neptune/internal/channels/channels_values_sender.py", line 156, in _send_values
    self._experiment._send_channels_values(channels_with_values)
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.7/site-packages/neptune/experiments.py", line 1167, in _send_channels_values
    self._backend.send_channels_values(self, channels_with_values)
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.7/site-packages/neptune/utils.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/wojciech/miniconda3/envs/ml/lib/python3.7/site-packages/neptune/internal/backends/hosted_neptune_backend.py", line 571, in send_channels_values
    raise ChannelsValuesSendBatchError(experiment.id, batch_errors)
neptune.api_exceptions.ChannelsValuesSendBatchError: Received batch errors sending channels' values to experiment SAN-28. Cause: Error(code=400, message='X-coordinates must be strictly increasing for channel: e968f192-b466-4419-89ce-469fdc4cf86f. Invalid point: InputChannelValue(timestamp=2020-12-14T15:53:20.877Z, x=0.0, numericValue=0.0, textValue=null, image', type=None) (metricId: 'e968f192-b466-4419-89ce-469fdc4cf86f', x: 0.0) Skipping 1 values.
/home/wojciech/miniconda3/envs/ml/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:49: UserWarning: The dataloader, test dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 6 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
To Reproduce

import os
import torch
from pytorch_lightning.loggers import NeptuneLogger
from torch.utils.data import Dataset
from pytorch_lightning import Trainer, LightningModule


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self):
        """
        Testing PL Module
        Use as follows:
        - subclass
        - modify the behavior for what you want
        class TestModel(BaseTestModel):
            def training_step(...):
                # do your own thing
        or:
        model = BaseTestModel()
        model.training_epoch_end = None
        """
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def step(self, x):
        x = self.layer(x)
        out = torch.nn.functional.mse_loss(x, torch.ones_like(x))
        return out

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('train_loss', loss, on_step=True, on_epoch=True)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('val_loss', loss, on_step=False, on_epoch=True)
        return {"x": loss}

    def validation_epoch_end(self, outputs) -> None:
        torch.stack([x['x'] for x in outputs]).mean()

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"y": loss}

    def test_epoch_end(self, outputs) -> None:
        torch.stack([x["y"] for x in outputs]).mean()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]


#  NOTE: If you are using a cmd line to run your script,
#  provide the cmd line as below.
#  opt = "--max_epochs 1 --limit_train_batches 1".split(" ")
#  parser = ArgumentParser()
#  args = parser.parse_args(opt)

def run_test():

    class TestModel(BoringModel):

        def on_train_epoch_start(self) -> None:
            print('override any method to prove your bug')

    # fake data
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    test_data = torch.utils.data.DataLoader(RandomDataset(32, 64))

    neptune_logger = NeptuneLogger(
        api_key="ANONYMOUS",
        project_name="shared/pytorch-lightning-integration",
    )

    # model
    model = TestModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        max_epochs=1,
        weights_summary=None,
        logger=neptune_logger
    )
    trainer.fit(model, train_data, val_data)
    trainer.test(test_dataloaders=test_data)


if __name__ == '__main__':
    run_test()

Expected behavior

No exception.

Environment

This happens with both pytorch-lightning 1.0.7 and 1.1.0, and with neptune-client 0.4.126 and 0.4.129.

  • CUDA:
    • GPU:
      • GeForce GTX 1080 Ti
    • available: True
    • version: 11.0
  • Packages:
    • numpy: 1.18.5
    • pyTorch_debug: False
    • pyTorch_version: 1.7.1
    • pytorch-lightning: 1.1.0
    • tqdm: 4.54.1
  • System:

Additional context

This happens only when we log during validation_step with on_epoch=True, i.e.:

        self.log('val_loss', loss, on_step=False, on_epoch=True)
@wjaskowski added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Dec 14, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@wjaskowski
Author

Any ideas @jakubczakon?

@jakubczakon
Contributor

Thanks for bringing this up @wjaskowski.
@pitercl could you take a look?

@pitercl
Contributor

pitercl commented Dec 18, 2020

Hi @wjaskowski!

I bumped into a similar issue a while back and, from what I remember, PL automatically adds an epoch metric with some global_step as the x value. From what I've seen, this global_step is reset on each fit/test call, which causes the x values to be non-increasing.

To be honest, I'm not sure what the best way to approach this is. The only Neptune-specific behaviour here is that we don't accept non-monotonic x values. In general, I'm not sure resetting the step is a good idea (a client-side filtering sketch is included after this comment).

Maybe someone from the PL team can weigh in? @Borda?

PS. I investigated this some time ago, so my findings may be a bit outdated.
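
Until #5510 lands, a possible client-side workaround is to drop any point whose step does not strictly increase before it reaches Neptune. The sketch below is only an illustration, not the fix pursued in #5510: it assumes the standard Lightning logger hook log_metrics(metrics, step), and the class name MonotonicNeptuneLogger is made up here.

from pytorch_lightning.loggers import NeptuneLogger


class MonotonicNeptuneLogger(NeptuneLogger):
    """Hypothetical workaround: drop metric points whose step is not
    strictly greater than the last step already sent for that metric,
    so Neptune never sees a non-increasing x."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._last_step = {}  # metric name -> last step sent

    def log_metrics(self, metrics, step=None):
        if step is None:
            # Without an explicit step there is nothing to filter on.
            super().log_metrics(metrics, step)
            return
        # Keep only metrics whose step strictly increases.
        filtered = {
            name: value
            for name, value in metrics.items()
            if self._last_step.get(name, -1) < step
        }
        for name in filtered:
            self._last_step[name] = step
        if filtered:
            super().log_metrics(filtered, step)

Using it in the repro above just means constructing MonotonicNeptuneLogger instead of NeptuneLogger; it silently discards the offending points rather than addressing the step reset itself.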

@stale

stale bot commented Jan 17, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

The stale bot added the won't fix (This will not be worked on) label on Jan 17, 2021
@wjaskowski
Author

Well, it is still there. The problem won't disappear by itself.

The stale bot removed the won't fix (This will not be worked on) label on Jan 17, 2021
@pitercl
Contributor

pitercl commented Jan 18, 2021

We're approaching this here: #5510. Once it's merged, it will most likely fix the issue you're observing. Sorry for the wait!

@wjaskowski
Author

wjaskowski commented Jan 18, 2021 via email

@PiotrJander
Contributor

This comment is relevant: #5510 (comment)

@PiotrJander
Contributor

@awaelchli Regarding your comment #5510 (comment) ("Multiple calls to fit will not reset the global step."): the example at the start of this issue only contains a call to fit() and then to test() on the same trainer, yet the problem of a non-increasing step occurs anyway.

In particular, I was able to verify (by modifying PTL code and adding a debug statement after https://github.com/PyTorchLightning/pytorch-lightning/blob/f477c2fd2980ad128bfe79a3b859e0b81b435507/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py#L190) that when metrics are logged during trainer.test(), step is set from trainer.global_step, which is 0 at that point.

So it does seem that calling trainer.test() resets the global step.

Steps to reproduce:

  1. Run the example at the start of this issue.
  2. Observe the line linked above.
  3. Verify that trainer.global_step is 0 after trainer.test() is called.
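
A quicker check that does not require editing PTL code is to print trainer.global_step right after the fit() and test() calls in run_test() above (a small diagnostic sketch; the trailing comment just restates the observation from this comment):

    trainer.fit(model, train_data, val_data)
    print("global_step after fit :", trainer.global_step)

    trainer.test(test_dataloaders=test_data)
    print("global_step after test:", trainer.global_step)  # reported as 0 in this thread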

@PiotrJander
Contributor

Interestingly, the above is not the case for the Boring Model: https://colab.research.google.com/drive/1HvWVVTK8j2Nj52qU4Q4YCyzOm0_aLQF3?usp=sharing#scrollTo=FOma5cYzSWp7

It might be worth comparing the Boring Model and @wjaskowski's model to see what the difference could be - I'll try to do that later today.
