Allow creation of pseudo devices for testing purposes #61654
It sounds like the

That's neat. Wasn't aware of this. Thanks for your help! @bdhirsh could you explain what the purpose of the
The meta API lets you run dtype and shape inference. For example:
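A minimal sketch of that meta-device inference, matching the type promotion and broadcasting described just below (names here are illustrative, not from the thread):

```python
import torch

# Two meta tensors: no storage is allocated and no arithmetic runs.
x = torch.ones(1, 2, device="meta")                    # float32
y = torch.ones(2, dtype=torch.float64, device="meta")  # float64

# Only shape/dtype inference happens here: (1, 2) + (2,) broadcasts to (1, 2),
# and float32 + float64 promotes to float64.
z = x + y
print(z.shape, z.dtype, z.device)
# torch.Size([1, 2]) torch.float64 meta
```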
The above involved type promotion (promoting the output dtype to float64) and broadcasting (broadcasting the output shape to (1, 2)), but didn't allocate any storage and didn't actually "add" anything. Tensors on the meta device don't have any storage, and calling operators on meta tensors doesn't involve any compute - only input error checking plus running the "meta" computation. The original RFC is here if you're interested: https://github.com/pytorch/rfcs/blob/rfc-0005/RFC-0005-structured-kernel-definitions.md#how-to-get-involved-with-structured-kernels. "Structured kernels" are a new way of authoring kernels internally that requires you to separate your "meta" computation from your core kernel logic, which is what allows us to create a meta API.
Thanks @bdhirsh, this indeed looks promising, but being able to test the
I think there are some slight semantic differences that would make meta more useful for your use case here, so definitely OK to keep this open.
For people waiting for this feature on macOS: you can use the "mps" device (Metal Performance Shaders). It turns out to be available even on Macs without M1+ hardware; e.g., I have an Intel MacBook with an AMD GPU.
When I tried it, I found mps doesn't support float64, though.
This would be very helpful, since tests running in CI (e.g. GitHub Actions) usually don't have a GPU, which allows bugs related to mismatched tensor devices to slip through into production. The
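One workaround along these lines, assuming a recent PyTorch: run the model on the meta device in CI. Combining a meta tensor with a non-scalar CPU tensor raises a device-mismatch error, much like a real GPU would:

```python
import torch

a = torch.ones(2, device="meta")  # stand-in for the "real" accelerator tensor
b = torch.ones(2)                 # bug: this one was never moved off the CPU

try:
    a + b  # mixing devices in a binary op raises
except RuntimeError as e:
    print("caught device mismatch:", e)
```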
It might be relatively simple to do this with privateuse1. The general idea would be to look at the sample code that currently tests privateuse1 for how to add a new custom device type, but then just implement a single fallback kernel for everything that redispatches to CPU.
cc @albanD |
Ho yes we can definitely do that.
I would agree with Ed that we can make this a fully fledged extension that falls back to the CPU implementation, if we need a version you can run actual compute tests against.
Is it possible to share an example of how to use it? I tried the following, based on the PyTorch test shared by @albanD, but it gives errors:

```python
import torch
from typing import Union


class DummyModule:
    @staticmethod
    def device_count() -> int:
        return 1

    @staticmethod
    def get_rng_state(device: Union[int, str, torch.device] = "foo") -> torch.Tensor:
        # Create a tensor using our custom device object.
        return torch.empty(4, 4, device="foo")

    @staticmethod
    def set_rng_state(
        new_state: torch.Tensor, device: Union[int, str, torch.device] = "foo"
    ) -> None:
        pass

    @staticmethod
    def is_available():
        return True

    @staticmethod
    def current_device():
        return 0


torch.utils.rename_privateuse1_backend("foo")
torch._register_device_module("foo", DummyModule)

x = torch.empty(4, 4, device="foo")
```

This returns the error:
I've tried registering the actual
🚀 Feature
Add a pseudo device cpu_testing for testing purposes.
Motivation
For machines with only a CPU there is currently no straightforward way to test whether `torch.device` is set properly for all tensors, since `cpu` is the default. Having a dummy device, e.g. named `cpu_testing`, which internally falls back to the default device but throws if two tensors with different devices are combined, would allow the creation of unit tests.

Pitch

It happens that some machines used purely for CI don't have any fancy devices other than a CPU. Since CPU is the default device, errors frequently occur in production where the model failed to move tensors to, e.g., the GPU, and this case cannot be covered by a unit test. Creating a pseudo device which also works for machines with only CPUs could help fix this defect.
cc @ezyang @bhosmer @smessmer @ljk53 @bdhirsh