
[Bug] Improper Property name in Pytorch_Lightning integration with tune #21002

Closed
gg-aking opened this issue Dec 10, 2021 · 1 comment
Labels
bug: Something that is supposed to be working; but isn't
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Comments


gg-aking commented Dec 10, 2021

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Tune

What happened + What you expected to happen

Crashes when running tune.run with pytorch_lightning. Apparently, the most recent version of pytorch_lightning renamed the Trainer property running_sanity_check to sanity_checking, but tune.integration.pytorch_lightning (line 177) still tries to access running_sanity_check.
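A minimal sketch of the mismatch and a backward-compatible probe (attribute names taken from the traceback below; NewTrainer is a stand-in object, not a real pytorch_lightning Trainer):

```python
# PTL <= 1.4 exposed Trainer.running_sanity_check; PTL 1.5 renamed it to
# Trainer.sanity_checking, while the Tune integration still reads the old
# name and raises AttributeError on 1.5.

class NewTrainer:
    """Stand-in for a PTL 1.5 Trainer: only the new attribute exists."""
    sanity_checking = False

def in_sanity_check(trainer):
    # Probe the new name first, then fall back to the pre-1.5 spelling.
    if hasattr(trainer, "sanity_checking"):
        return trainer.sanity_checking
    return trainer.running_sanity_check

print(in_sanity_check(NewTrainer()))  # False, instead of AttributeError
```

A fallback like this would let the integration's _get_report_dict work on either side of the rename.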

Log:

(ImplicitFunc pid=1758) /databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:116: UserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
(ImplicitFunc pid=1758)   rank_zero_warn(
(ImplicitFunc pid=1752) /databricks/python/lib/python3.8/site-packages/torch/nn/modules/conv.py:294: UserWarning: Using padding='same' with even kernel lengths and odd dilation may require a zero-padded copy of the input be created (Triggered internally at  /pytorch/aten/src/ATen/native/Convolution.cpp:660.)
(ImplicitFunc pid=1752)   return F.conv1d(input, weight, bias, self.stride,
(ImplicitFunc pid=1758) /databricks/python/lib/python3.8/site-packages/torch/nn/modules/conv.py:294: UserWarning: Using padding='same' with even kernel lengths and odd dilation may require a zero-padded copy of the input be created (Triggered internally at  /pytorch/aten/src/ATen/native/Convolution.cpp:660.)
(ImplicitFunc pid=1758)   return F.conv1d(input, weight, bias, self.stride,
<IPython.core.display.HTML object>
(ImplicitFunc pid=1752) 2021-12-10 00:24:40,007	ERROR function_runner.py:268 -- Runner Thread raised error.
(ImplicitFunc pid=1752) Traceback (most recent call last):
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/ray/tune/function_runner.py", line 262, in run
(ImplicitFunc pid=1752)     self._entrypoint()
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/ray/tune/function_runner.py", line 330, in entrypoint
(ImplicitFunc pid=1752)     return self._trainable_func(self.config, self._status_reporter,
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=1752)     return method(self, *_args, **_kwargs)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/ray/tune/function_runner.py", line 597, in _trainable_func
(ImplicitFunc pid=1752)     output = fn()
(ImplicitFunc pid=1752)   File "<command-2948585277627227>", line 84, in train_cnn
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in fit
(ImplicitFunc pid=1752)     self._call_and_handle_interrupt(
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
(ImplicitFunc pid=1752)     return trainer_fn(*args, **kwargs)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
(ImplicitFunc pid=1752)     self._run(model, ckpt_path=ckpt_path)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
(ImplicitFunc pid=1752)     self._dispatch()
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1275, in _dispatch
(ImplicitFunc pid=1752)     self.training_type_plugin.start_training(self)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
(ImplicitFunc pid=1752)     self._results = trainer.run_stage()
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in run_stage
(ImplicitFunc pid=1752)     return self._run_train()
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1307, in _run_train
(ImplicitFunc pid=1752)     self._run_sanity_check(self.lightning_module)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1371, in _run_sanity_check
(ImplicitFunc pid=1752)     self._evaluation_loop.run()
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 151, in run
(ImplicitFunc pid=1752)     output = self.on_run_end()
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 140, in on_run_end
(ImplicitFunc pid=1752)     self._on_evaluation_end()
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 202, in _on_evaluation_end
(ImplicitFunc pid=1752)     self.trainer.call_hook("on_validation_end", *args, **kwargs)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1491, in call_hook
(ImplicitFunc pid=1752)     callback_fx(*args, **kwargs)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 221, in on_validation_end
(ImplicitFunc pid=1752)     callback.on_validation_end(self, self.lightning_module)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/ray/tune/integration/pytorch_lightning.py", line 118, in on_validation_end
(ImplicitFunc pid=1752)     self._handle(trainer, pl_module)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/ray/tune/integration/pytorch_lightning.py", line 200, in _handle
(ImplicitFunc pid=1752)     report_dict = self._get_report_dict(trainer, pl_module)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/ray/tune/integration/pytorch_lightning.py", line 177, in _get_report_dict
(ImplicitFunc pid=1752)     if trainer.running_sanity_check:
(ImplicitFunc pid=1752) AttributeError: 'Trainer' object has no attribute 'running_sanity_check'

Versions / Dependencies

pytorch-lightning==1.5.5
ray==1.9.0
torch==1.9.0+cpu
torchmetrics==0.6.1
torchvision==0.10.0

Reproduction script

from ray import tune
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, random_split
from torchvision.datasets import MNIST
from torchvision import datasets, transforms
import os, sys

sys.stdout.fileno = lambda: False

class MNIST_Classifier(pl.LightningModule):

    def __init__(self, hidden_size = 128):
        super().__init__()
        self.hidden_layer = torch.nn.Linear(28 * 28, hidden_size)
        self.out_layer = torch.nn.Linear(hidden_size, 10)
        
    def forward(self, x):
        batch_size, channels, width, height = x.size()
        x = x.flatten(start_dim = 1)
        x = torch.relu(self.hidden_layer(x))
        return torch.log_softmax(self.out_layer(x), dim=1) 
          
    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        y_hat = self.forward(x)
        return torch.nn.functional.nll_loss(y_hat, y)

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        y_hat = self.forward(x)
        return torch.nn.functional.nll_loss(y_hat, y)
        
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-4)
        return optimizer

transform=transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
mnist_train, mnist_test = MNIST(os.getcwd(), train=True, download=True, transform=transform), MNIST(os.getcwd(), train=False, download=True, transform=transform)
train_dataloader, val_dataloader = DataLoader(mnist_train, batch_size = 64), DataLoader(mnist_test, batch_size = 64)

from ray.tune.integration.pytorch_lightning import TuneReportCallback

def set_model_configs(config):
    model = MNIST_Classifier(hidden_size = config['hidden_size'])
    # NOTE: the traceback runs through TuneReportCallback, so the failing
    # run presumably attached it; with no metrics argument it reports all
    # logged metrics back to Tune.
    trainer = pl.Trainer(max_epochs = 10,
                         callbacks = [TuneReportCallback()])
    trainer.fit(model, train_dataloader, val_dataloader)

analysis = tune.run(
             set_model_configs,
             config= {'hidden_size' : tune.choice([64, 128, 256, 512])},
             num_samples=2)

Anything else

Occurs 9/10 times.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
gg-aking added the bug and triage labels on Dec 10, 2021
amogkam (Contributor) commented Dec 10, 2021

@gg-aking the integration only works with PTL 1.4 or lower for now. We are currently working on support for PTL 1.5!
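Until that support lands, one possible workaround (assuming the 1.4 series is otherwise compatible with this project) is to pin pytorch-lightning below 1.5, e.g. in requirements.txt:

```text
# Hypothetical pin until Tune's PTL 1.5 support ships
pytorch-lightning>=1.4,<1.5
ray==1.9.0
```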

Closing this as duplicate of #20741 and #21000

amogkam closed this as completed on Dec 10, 2021