No TPU devices were found in a TPU pod env. #6692

Closed
jiasenwu opened this issue Mar 26, 2021 · 4 comments · Fixed by #6719
Labels
accelerator: tpu (Tensor Processing Unit), bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)

Comments

@jiasenwu

🐛 Bug

To Reproduce

  • Run on a GCP instance group of size 4 plus a TPU v2-32.
  • Add tpu_cores=8 (and precision=16) to the boring model, as shown in the diff and the sketch below.
(py36) jiasen@instance-group-1-cntd:~$ diff bug_report_model.py the_boringmodel.py 
148a149,150
>         precision=16,
>         tpu_cores=8,
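
For reference, a minimal sketch of what the modified Trainer construction presumably looks like after this diff (everything except the two added arguments is an assumption, since the rest of the script is not shown here):

import pytorch_lightning as pl

trainer = pl.Trainer(
    max_epochs=1,      # assumed; not part of the diff
    precision=16,      # added line from the diff
    tpu_cores=8,       # added line from the diff
)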

Command to run:

python -m torch_xla.distributed.xla_dist --tpu=pod --docker-image=gcr.io/tpu-pytorch/xla:r1.8 \
    --docker-run-flag=--rm=true \
    --docker-run-flag=--shm-size=16GB \
    --docker-run-flag=-v \
    --docker-run-flag=/home/jiasen:/app \
    --docker-run-flag=-w \
    --docker-run-flag=/app \
    --env=XLA_USE_BF16=1 \
    -- bash -c "pip install pytorch_lightning && python the_boringmodel.py"

The exception occurs immediately after pytorch_lightning is installed. It repeats on every instance, since the same failure happens on each of them; I copy only one occurrence here.

2021-03-26 20:38:10 10.164.0.42 [2] Traceback (most recent call last):
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device.py", line 31, in inner_f
2021-03-26 20:38:10 10.164.0.42 [2]     queue.put(func(*args, **kwargs))
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device.py", line 83, in _is_device_tpu
2021-03-26 20:38:10 10.164.0.42 [2]     device = xm.xla_device()
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 231, in xla_device
2021-03-26 20:38:10 10.164.0.42 [2]     devkind=devkind if devkind is not None else None)
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 136, in get_xla_supported_devices
2021-03-26 20:38:10 10.164.0.42 [2]     xla_devices = _DEVICES.value
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/utils/utils.py", line 32, in value
2021-03-26 20:38:10 10.164.0.42 [2]     self._value = self._gen_fn()
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 18, in <lambda>
2021-03-26 20:38:10 10.164.0.42 [2]     _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
2021-03-26 20:38:10 10.164.0.42 [2] RuntimeError: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:258 : Check failed: default_device_target != options_.global_device_map.end() 
2021-03-26 20:38:10 10.164.0.42 [2] *** Begin stack trace ***
2021-03-26 20:38:10 10.164.0.42 [2] 	tensorflow::CurrentStackTrace()
2021-03-26 20:38:10 10.164.0.42 [2] 	xla::XrtComputationClient::XrtComputationClient(xla::XrtComputationClient::Options, std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >, xla::XrtLocalService*)
2021-03-26 20:38:10 10.164.0.42 [2] 	xla::ComputationClient::Create()
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	xla::ComputationClient::Get()
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyCFunction_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyObject_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_GenericGetAttrWithDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyObject_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyObject_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyFunction_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_Call_Prepend
2021-03-26 20:38:10 10.164.0.42 [2] 	PyObject_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCode
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyCFunction_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyFunction_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_CallMethodIdObjArgs
2021-03-26 20:38:10 10.164.0.42 [2] 	PyImport_ImportModuleLevelObject
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCode
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyCFunction_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyFunction_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_CallMethodIdObjArgs
2021-03-26 20:38:10 10.164.0.42 [2] 	PyImport_ImportModuleLevelObject
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCode
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyCFunction_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] *** End stack trace ***
2021-03-26 20:38:10 10.164.0.42 [2] 
0it [00:00, ?it/s]0 10.164.0.42 [2] 
2021-03-26 20:38:10 10.164.0.42 [2] Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /root/anaconda3/envs/pytorch/lib/python3.6/site-packages/Datasets/MNIST/raw/train-images-idx3-ubyte.gz
[interleaved ASCII-art banner from the download output omitted; the traceback lines are reassembled below]
2021-03-26 20:38:10 10.164.0.42 [2] Traceback (most recent call last):
2021-03-26 20:38:10 10.164.0.42 [2]   File "the_boringmodel.py", line 153, in <module>
2021-03-26 20:38:10 10.164.0.42 [2]     test_run()
2021-03-26 20:38:10 10.164.0.42 [2]   File "the_boringmodel.py", line 145, in test_run
2021-03-26 20:38:10 10.164.0.42 [2]     tpu_cores=8,
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
2021-03-26 20:38:10 10.164.0.42 [2]     return fn(self, **kwargs)
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 321, in __init__
2021-03-26 20:38:10 10.164.0.42 [2]     replace_sampler_ddp, deterministic, precision, amp_backend, amp_level, plugins
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 91, in __init__
2021-03-26 20:38:10 10.164.0.42 [2]     self.tpu_cores = device_parser.parse_tpu_cores(tpu_cores)
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 113, in parse_tpu_cores
2021-03-26 20:38:10 10.164.0.42 [2]     raise MisconfigurationException('No TPU devices were found.')
2021-03-26 20:38:10 10.164.0.42 [2] pytorch_lightning.utilities.exceptions.MisconfigurationException: No TPU devices were found.

Expected behavior

TPU devices are definitely available; training should start instead of raising this exception.

Environment

  • PyTorch Lightning version: 1.2.5 (installed via pip in the run command)
  • OS: Ubuntu 20.04.2 LTS
  • Python version: 3.6
  • docker image: gcr.io/tpu-pytorch/xla:r1.8
  • xla: r1.8
  • How you installed PyTorch: provided in gcr.io/tpu-pytorch/xla:r1.8

Additional context

I have tried a simple workaround: setting _TPU_AVAILABLE = True in https://github.com/PyTorchLightning/pytorch-lightning/blob/0e45220263f4e2045dfe7f68e3e0eaac0b2033d5/pytorch_lightning/utilities/__init__.py#L52, and it works: no more exceptions, and the model trains perfectly!
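
For illustration, the local edit described above amounts to roughly the following (the original right-hand side is paraphrased, so it may differ slightly in the linked revision):

# pytorch_lightning/utilities/__init__.py -- local workaround, not a proper fix
# before (roughly): _TPU_AVAILABLE = XLADeviceUtils.tpu_device_exists()
_TPU_AVAILABLE = True  # force-enable TPU support so parse_tpu_cores() does not raise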

I think the logic of TPU detection in a pod environment is wrong or outdated with respect to the current xla (note that it works with a single TPU device). The official xla code uses xmp.spawn to spawn a process that probes for the TPU device, as sketched below.
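
A minimal sketch of that pattern, assuming a standard torch_xla setup (illustrative only, not the eventual Lightning fix):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _probe(index):
    # Inside a process launched by xmp.spawn, xla_device() resolves the
    # local TPU core even in a pod environment.
    print(index, xm.xla_device())

if __name__ == "__main__":
    # nprocs=1 is enough to check whether any TPU core is reachable.
    xmp.spawn(_probe, args=(), nprocs=1)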

Besides, I think most of the places that check _TPU_AVAILABLE at the top level (to guard importing XLA) could instead check _XLA_AVAILABLE.
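
A sketch of that guard pattern (assuming _XLA_AVAILABLE is the package-importability flag exposed by pytorch_lightning.utilities):

from pytorch_lightning.utilities import _XLA_AVAILABLE

if _XLA_AVAILABLE:
    # Guard on "is torch_xla importable?" rather than on a runtime device
    # probe, which is what currently misfires in a TPU pod environment.
    import torch_xla.core.xla_model as xm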

@jiasenwu jiasenwu added bug Something isn't working help wanted Open to be worked on labels Mar 26, 2021
@awaelchli
Contributor

I think the logic of TPU detection in a pod environment is wrong or out-dated w.r.t the current xla (note it works with single TPU device). I see the official xla code uses xmp.spawn to spawn a process to get the potential TPU device.

Hey! Thanks for the suggestion. Could you point us to the place where they check the TPU device?
And are you interested in sending a PR with a fix?

@awaelchli awaelchli added the accelerator: tpu Tensor Processing Unit label Mar 28, 2021
@awaelchli awaelchli added this to the 1.2.x milestone Mar 28, 2021
@tchaton
Contributor

tchaton commented Mar 29, 2021

Dear @jiasenwu,

You are right!
I think using xmp.spawn would be more reliable.

Best,
T.C

@tchaton tchaton added the priority: 0 High priority task label Mar 29, 2021
@jiasenwu
Author

jiasenwu commented Mar 29, 2021

I was referring to code pieces like this one: https://github.com/pytorch/xla/blob/08ae1044c2a7e314895f9946104cbe399e096515/test/test_train_mp_mnist.py#L114

where xm.xla_device() is called in a process spawned by xmp.spawn. I ran this example on a v2-32 and it works.

I will prepare a PR sometime soon (trying to make it this week). For now I only apply the simple workaround of setting the flag to true for some experiments; that still needs to be extended into a proper fix.

@kaushikb11
Contributor

This issue has been resolved in #6767. Closing it.
