
[Bug] Improper Property name in Pytorch_Lightning integration with tune #21002

Closed
gg-aking opened this issue Dec 10, 2021 · 1 comment
Labels
bug: Something that is supposed to be working; but isn't
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Comments


gg-aking commented Dec 10, 2021

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Tune

What happened + What you expected to happen

Crashes when running tune.run with pytorch_lightning. Apparently, the most recent version of pytorch_lightning renamed the Trainer property running_sanity_check to sanity_checking, but tune.integration.pytorch_lightning (line 177) still tries to access running_sanity_check.
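A minimal sketch of the mismatch and a backward-compatible probe (attribute names taken from the traceback below; NewTrainer is a stand-in object, not a real pytorch_lightning Trainer):

```python
# PTL <= 1.4 exposed Trainer.running_sanity_check; PTL 1.5 renamed it to
# Trainer.sanity_checking, while the Tune integration still reads the old
# name and raises AttributeError on 1.5.

class NewTrainer:
    """Stand-in for a PTL 1.5 Trainer: only the new attribute exists."""
    sanity_checking = False

def in_sanity_check(trainer):
    # Probe the new name first, then fall back to the pre-1.5 spelling.
    if hasattr(trainer, "sanity_checking"):
        return trainer.sanity_checking
    return trainer.running_sanity_check

print(in_sanity_check(NewTrainer()))  # False, instead of AttributeError
```

A fallback like this would let the integration's _get_report_dict work on either side of the rename.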

Log:

(ImplicitFunc pid=1758) /databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:116: UserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
(ImplicitFunc pid=1758)   rank_zero_warn(
(ImplicitFunc pid=1752) /databricks/python/lib/python3.8/site-packages/torch/nn/modules/conv.py:294: UserWarning: Using padding='same' with even kernel lengths and odd dilation may require a zero-padded copy of the input be created (Triggered internally at  /pytorch/aten/src/ATen/native/Convolution.cpp:660.)
(ImplicitFunc pid=1752)   return F.conv1d(input, weight, bias, self.stride,
(ImplicitFunc pid=1758) /databricks/python/lib/python3.8/site-packages/torch/nn/modules/conv.py:294: UserWarning: Using padding='same' with even kernel lengths and odd dilation may require a zero-padded copy of the input be created (Triggered internally at  /pytorch/aten/src/ATen/native/Convolution.cpp:660.)
(ImplicitFunc pid=1758)   return F.conv1d(input, weight, bias, self.stride,
<IPython.core.display.HTML object>
(ImplicitFunc pid=1752) 2021-12-10 00:24:40,007	ERROR function_runner.py:268 -- Runner Thread raised error.
(ImplicitFunc pid=1752) Traceback (most recent call last):
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/ray/tune/function_runner.py", line 262, in run
(ImplicitFunc pid=1752)     self._entrypoint()
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/ray/tune/function_runner.py", line 330, in entrypoint
(ImplicitFunc pid=1752)     return self._trainable_func(self.config, self._status_reporter,
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=1752)     return method(self, *_args, **_kwargs)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/ray/tune/function_runner.py", line 597, in _trainable_func
(ImplicitFunc pid=1752)     output = fn()
(ImplicitFunc pid=1752)   File "<command-2948585277627227>", line 84, in train_cnn
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in fit
(ImplicitFunc pid=1752)     self._call_and_handle_interrupt(
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
(ImplicitFunc pid=1752)     return trainer_fn(*args, **kwargs)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
(ImplicitFunc pid=1752)     self._run(model, ckpt_path=ckpt_path)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
(ImplicitFunc pid=1752)     self._dispatch()
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1275, in _dispatch
(ImplicitFunc pid=1752)     self.training_type_plugin.start_training(self)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
(ImplicitFunc pid=1752)     self._results = trainer.run_stage()
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in run_stage
(ImplicitFunc pid=1752)     return self._run_train()
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1307, in _run_train
(ImplicitFunc pid=1752)     self._run_sanity_check(self.lightning_module)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1371, in _run_sanity_check
(ImplicitFunc pid=1752)     self._evaluation_loop.run()
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 151, in run
(ImplicitFunc pid=1752)     output = self.on_run_end()
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 140, in on_run_end
(ImplicitFunc pid=1752)     self._on_evaluation_end()
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 202, in _on_evaluation_end
(ImplicitFunc pid=1752)     self.trainer.call_hook("on_validation_end", *args, **kwargs)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1491, in call_hook
(ImplicitFunc pid=1752)     callback_fx(*args, **kwargs)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 221, in on_validation_end
(ImplicitFunc pid=1752)     callback.on_validation_end(self, self.lightning_module)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/ray/tune/integration/pytorch_lightning.py", line 118, in on_validation_end
(ImplicitFunc pid=1752)     self._handle(trainer, pl_module)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/ray/tune/integration/pytorch_lightning.py", line 200, in _handle
(ImplicitFunc pid=1752)     report_dict = self._get_report_dict(trainer, pl_module)
(ImplicitFunc pid=1752)   File "/databricks/python/lib/python3.8/site-packages/ray/tune/integration/pytorch_lightning.py", line 177, in _get_report_dict
(ImplicitFunc pid=1752)     if trainer.running_sanity_check:
(ImplicitFunc pid=1752) AttributeError: 'Trainer' object has no attribute 'running_sanity_check'

Versions / Dependencies

pytorch-lightning==1.5.5
ray==1.9.0
torch==1.9.0+cpu
torchmetrics==0.6.1
torchvision==0.10.0

Reproduction script

from ray import tune
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, random_split
from torchvision.datasets import MNIST
from torchvision import datasets, transforms
import os, sys

sys.stdout.fileno = lambda: False

class MNIST_Classifier(pl.LightningModule):

    def __init__(self, hidden_size = 128):
        super().__init__()
        self.hidden_layer = torch.nn.Linear(28 * 28, hidden_size)
        self.out_layer = torch.nn.Linear(hidden_size, 10)
        
    def forward(self, x):
        batch_size, channels, width, height = x.size()
        x = x.flatten(start_dim = 1)
        x = torch.relu(self.hidden_layer(x))
        return torch.log_softmax(self.out_layer(x), dim=1) 
          
    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        y_hat = self.forward(x)
        return torch.nn.functional.nll_loss(y_hat, y)

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        y_hat = self.forward(x)
        return torch.nn.functional.nll_loss(y_hat, y)
        
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-4)
        return optimizer

transform=transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
mnist_train, mnist_test = MNIST(os.getcwd(), train=True, download=True, transform=transform), MNIST(os.getcwd(), train=False, download=True, transform=transform)
train_dataloader, val_dataloader = DataLoader(mnist_train, batch_size = 64), DataLoader(mnist_test, batch_size = 64)

from ray.tune.integration.pytorch_lightning import TuneReportCallback

def set_model_configs(config):
    model = MNIST_Classifier(hidden_size = config['hidden_size'])
    # NOTE: the traceback runs through TuneReportCallback, so the failing
    # run presumably attached it; with no metrics argument it reports all
    # logged metrics back to Tune.
    trainer = pl.Trainer(max_epochs = 10,
                         callbacks = [TuneReportCallback()])
    trainer.fit(model, train_dataloader, val_dataloader)

analysis = tune.run(
             set_model_configs,
             config= {'hidden_size' : tune.choice([64, 128, 256, 512])},
             num_samples=2)

Anything else

Occurs 9/10 times.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
gg-aking added the bug and triage labels on Dec 10, 2021
amogkam (Contributor) commented Dec 10, 2021

@gg-aking the integration only works with PTL 1.4 or lower for now. We are currently working on support for PTL 1.5!
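Until that support lands, one possible workaround (assuming the 1.4 series is otherwise compatible with this project) is to pin pytorch-lightning below 1.5, e.g. in requirements.txt:

```text
# Hypothetical pin until Tune's PTL 1.5 support ships
pytorch-lightning>=1.4,<1.5
ray==1.9.0
```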

Closing this as duplicate of #20741 and #21000

amogkam closed this as completed on Dec 10, 2021