No TPU devices were found in a TPU pod env. #6692

Closed
jiasenwu opened this issue Mar 26, 2021 · 4 comments · Fixed by #6719
Labels
accelerator: tpu (Tensor Processing Unit), bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)

Comments

@jiasenwu

🐛 Bug

To Reproduce

  • Run on a GCP instance group of size 4 plus a TPU v2-32.
  • Add tpu_cores=8 (and precision=16) to the boring model, as shown in the diff and the sketch below.
(py36) jiasen@instance-group-1-cntd:~$ diff bug_report_model.py the_boringmodel.py 
148a149,150
>         precision=16,
>         tpu_cores=8,
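
For reference, a minimal sketch of what the modified Trainer construction presumably looks like after this diff (everything except the two added arguments is an assumption, since the rest of the script is not shown here):

import pytorch_lightning as pl

trainer = pl.Trainer(
    max_epochs=1,      # assumed; not part of the diff
    precision=16,      # added line from the diff
    tpu_cores=8,       # added line from the diff
)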

Command to run:

python -m torch_xla.distributed.xla_dist --tpu=pod --docker-image=gcr.io/tpu-pytorch/xla:r1.8 \
    --docker-run-flag=--rm=true \
    --docker-run-flag=--shm-size=16GB \
    --docker-run-flag=-v \
    --docker-run-flag=/home/jiasen:/app \
    --docker-run-flag=-w \
    --docker-run-flag=/app \
    --env=XLA_USE_BF16=1 \
    -- bash -c "pip install pytorch_lightning && python the_boringmodel.py"

The exception occurs immediately after pytorch_lightning is installed. It repeats on every instance, since the same failure happens on each of them; I copy only one occurrence here.

2021-03-26 20:38:10 10.164.0.42 [2] Traceback (most recent call last):
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device.py", line 31, in inner_f
2021-03-26 20:38:10 10.164.0.42 [2]     queue.put(func(*args, **kwargs))
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device.py", line 83, in _is_device_tpu
2021-03-26 20:38:10 10.164.0.42 [2]     device = xm.xla_device()
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 231, in xla_device
2021-03-26 20:38:10 10.164.0.42 [2]     devkind=devkind if devkind is not None else None)
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 136, in get_xla_supported_devices
2021-03-26 20:38:10 10.164.0.42 [2]     xla_devices = _DEVICES.value
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/utils/utils.py", line 32, in value
2021-03-26 20:38:10 10.164.0.42 [2]     self._value = self._gen_fn()
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 18, in <lambda>
2021-03-26 20:38:10 10.164.0.42 [2]     _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
2021-03-26 20:38:10 10.164.0.42 [2] RuntimeError: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:258 : Check failed: default_device_target != options_.global_device_map.end() 
2021-03-26 20:38:10 10.164.0.42 [2] *** Begin stack trace ***
2021-03-26 20:38:10 10.164.0.42 [2] 	tensorflow::CurrentStackTrace()
2021-03-26 20:38:10 10.164.0.42 [2] 	xla::XrtComputationClient::XrtComputationClient(xla::XrtComputationClient::Options, std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >, xla::XrtLocalService*)
2021-03-26 20:38:10 10.164.0.42 [2] 	xla::ComputationClient::Create()
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	xla::ComputationClient::Get()
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyCFunction_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyObject_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_GenericGetAttrWithDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyObject_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyObject_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyFunction_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_Call_Prepend
2021-03-26 20:38:10 10.164.0.42 [2] 	PyObject_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCode
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyCFunction_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyFunction_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_CallMethodIdObjArgs
2021-03-26 20:38:10 10.164.0.42 [2] 	PyImport_ImportModuleLevelObject
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCode
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyCFunction_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyFunction_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_CallMethodIdObjArgs
2021-03-26 20:38:10 10.164.0.42 [2] 	PyImport_ImportModuleLevelObject
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCode
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyCFunction_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] *** End stack trace ***
2021-03-26 20:38:10 10.164.0.42 [2] 
0it [00:00, ?it/s]0 10.164.0.42 [2] 
2021-03-26 20:38:10 10.164.0.42 [2] Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /root/anaconda3/envs/pytorch/lib/python3.6/site-packages/Datasets/MNIST/raw/train-images-idx3-ubyte.gz
[interleaved ASCII-art banner from the download output omitted; the traceback lines are reassembled below]
2021-03-26 20:38:10 10.164.0.42 [2] Traceback (most recent call last):
2021-03-26 20:38:10 10.164.0.42 [2]   File "the_boringmodel.py", line 153, in <module>
2021-03-26 20:38:10 10.164.0.42 [2]     test_run()
2021-03-26 20:38:10 10.164.0.42 [2]   File "the_boringmodel.py", line 145, in test_run
2021-03-26 20:38:10 10.164.0.42 [2]     tpu_cores=8,
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
2021-03-26 20:38:10 10.164.0.42 [2]     return fn(self, **kwargs)
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 321, in __init__
2021-03-26 20:38:10 10.164.0.42 [2]     replace_sampler_ddp, deterministic, precision, amp_backend, amp_level, plugins
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 91, in __init__
2021-03-26 20:38:10 10.164.0.42 [2]     self.tpu_cores = device_parser.parse_tpu_cores(tpu_cores)
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 113, in parse_tpu_cores
2021-03-26 20:38:10 10.164.0.42 [2]     raise MisconfigurationException('No TPU devices were found.')
2021-03-26 20:38:10 10.164.0.42 [2] pytorch_lightning.utilities.exceptions.MisconfigurationException: No TPU devices were found.

Expected behavior

TPU devices are definitely available; training should start instead of raising this exception.

Environment

  • PyTorch Lightning version: 1.2.5 (installed via pip in the run command)
  • OS: Ubuntu 20.04.2 LTS
  • Python version: 3.6
  • docker image: gcr.io/tpu-pytorch/xla:r1.8
  • xla: r1.8
  • How you installed PyTorch: provided in gcr.io/tpu-pytorch/xla:r1.8

Additional context

I have tried a simple workaround: setting _TPU_AVAILABLE = True in https://github.com/PyTorchLightning/pytorch-lightning/blob/0e45220263f4e2045dfe7f68e3e0eaac0b2033d5/pytorch_lightning/utilities/__init__.py#L52, and it works: no more exceptions, and the model trains perfectly!
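
For illustration, the local edit described above amounts to roughly the following (the original right-hand side is paraphrased, so it may differ slightly in the linked revision):

# pytorch_lightning/utilities/__init__.py -- local workaround, not a proper fix
# before (roughly): _TPU_AVAILABLE = XLADeviceUtils.tpu_device_exists()
_TPU_AVAILABLE = True  # force-enable TPU support so parse_tpu_cores() does not raise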

I think the logic of TPU detection in a pod environment is wrong or outdated with respect to the current xla (note that it works with a single TPU device). The official xla code uses xmp.spawn to spawn a process that probes for the TPU device, as sketched below.
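
A minimal sketch of that pattern, assuming a standard torch_xla setup (illustrative only, not the eventual Lightning fix):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _probe(index):
    # Inside a process launched by xmp.spawn, xla_device() resolves the
    # local TPU core even in a pod environment.
    print(index, xm.xla_device())

if __name__ == "__main__":
    # nprocs=1 is enough to check whether any TPU core is reachable.
    xmp.spawn(_probe, args=(), nprocs=1)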

Besides, I think most of the places that check _TPU_AVAILABLE at the top level (to guard importing XLA) could instead check _XLA_AVAILABLE.
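
A sketch of that guard pattern (assuming _XLA_AVAILABLE is the package-importability flag exposed by pytorch_lightning.utilities):

from pytorch_lightning.utilities import _XLA_AVAILABLE

if _XLA_AVAILABLE:
    # Guard on "is torch_xla importable?" rather than on a runtime device
    # probe, which is what currently misfires in a TPU pod environment.
    import torch_xla.core.xla_model as xm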

@jiasenwu jiasenwu added bug Something isn't working help wanted Open to be worked on labels Mar 26, 2021
@awaelchli
Contributor

I think the logic of TPU detection in a pod environment is wrong or out-dated w.r.t the current xla (note it works with single TPU device). I see the official xla code uses xmp.spawn to spawn a process to get the potential TPU device.

Hey! Thanks for the suggestion. Could you point us to the place where they check the TPU device?
And are you interested in sending a PR with a fix?

@awaelchli awaelchli added the accelerator: tpu Tensor Processing Unit label Mar 28, 2021
@awaelchli awaelchli added this to the 1.2.x milestone Mar 28, 2021
@tchaton
Contributor

tchaton commented Mar 29, 2021

Dear @jiasenwu,

You are right!
I think using xmp.spawn would be more reliable.

Best,
T.C

@tchaton tchaton added the priority: 0 High priority task label Mar 29, 2021
@jiasenwu
Author

jiasenwu commented Mar 29, 2021

I was referring to code pieces like this one: https://github.com/pytorch/xla/blob/08ae1044c2a7e314895f9946104cbe399e096515/test/test_train_mp_mnist.py#L114

where xm.xla_device() is called in a process spawned by xmp.spawn. I ran this example on a v2-32 and it works.

I will prepare a PR sometime soon (trying to make it this week). For now I only apply the simple workaround of setting the flag to true for some experiments; that still needs to be extended into a proper fix.

@kaushikb11
Contributor

This issue has been resolved in #6767. Closing it.
