
"tf.distribute.cluster_resolver.TPUClusterResolver()" is not working #4699

Closed
PlutoSejin opened this issue Jul 16, 2024 · 11 comments


PlutoSejin commented Jul 16, 2024

Three months ago, I built a model using the following notebook. After the runtime changed from TPU (deprecated) to TPU v2, I get an error at the tf.distribute.cluster_resolver.TPUClusterResolver() call, which does not return the TPU address. The specific code is below.

try:
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
  print('Running on TPU ', tpu.cluster_spec().as_dict())
  TPU_ADDRESS = tpu.get_master()
  print('Running on TPU:', TPU_ADDRESS)
except ValueError:
  raise BaseException(
    'ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

The output of the above code is:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-bc480ed05132> in <cell line: 5>()
      5 try:
----> 6   tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
      7   print('Running on TPU ', tpu.cluster_spec().as_dict())

2 frames
/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/cluster_resolver/tpu/tpu_cluster_resolver.py in __init__(self, tpu, zone, project, job_name, coordinator_name, coordinator_address, credentials, service, discovery_url)
    234       # Default Cloud environment
--> 235       self._cloud_tpu_client = client.Client(
    236           tpu=tpu,

/usr/local/lib/python3.10/dist-packages/cloud_tpu_client/client.py in __init__(self, tpu, zone, project, credentials, service, discovery_url)
    138     if tpu is None:
--> 139       raise ValueError('Please provide a TPU Name to connect to.')
    140 

ValueError: Please provide a TPU Name to connect to.

During handling of the above exception, another exception occurred:

BaseException                             Traceback (most recent call last)
<ipython-input-35-bc480ed05132> in <cell line: 5>()
      9   print('Running on TPU:', TPU_ADDRESS)
     10 except ValueError:
---> 11   raise BaseException(
     12     'ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')
     13 #tf.config.experimental_connect_to_host(TPU_ADDRESS)

BaseException: ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!

I also tried adding tpu='local' to the tf.distribute.cluster_resolver.TPUClusterResolver() call.

try:
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')  # TPU detection
  print('Running on TPU ', tpu.cluster_spec().as_dict())
  TPU_ADDRESS = tpu.get_master()
  print('Running on TPU:', TPU_ADDRESS)
except ValueError:
  raise BaseException(
    'ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

But the output is below:

Running on TPU  {}
Running on TPU: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-37-bbb33ea8b6de> in <cell line: 18>()
     16 auth.authenticate_user()
     17 tf.enable_eager_execution()
---> 18 tf.config.experimental_connect_to_host(TPU_ADDRESS)
     19 tensorflow_gcs_config.configure_gcs_from_colab_auth()
     20 

/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/remote.py in connect_to_remote_host(remote_host, job_name)
     65   """
     66   if not remote_host:
---> 67     raise ValueError("Must provide at least one remote_host")
     68 
     69   remote_hosts = nest.flatten(remote_host)

ValueError: Must provide at least one remote_host

I can't get the TPU address. Also, when I use tpu='local', the return value of tpu.cluster_spec().as_dict() is empty. When I use tpu.cluster_spec().as_dict()['worker'], the error below occurs.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-38-a02f5f80852d> in <cell line: 5>()
      5 try:
      6   tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')  # TPU detection
----> 7   print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
      8   TPU_ADDRESS = tpu.get_master()
      9   print('Running on TPU:', TPU_ADDRESS)

KeyError: 'worker'

Do I have to use the Cloud TPU API to get the TPU address, or is there another way to get it?

PlutoSejin added the bug label Jul 16, 2024
@mayankmalik-colab
Contributor

Similar issue - #4686. Check the comments from one of our team members.

@PlutoSejin
Author

I already ran

!pip install https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/tensorflow/tf-2.15.0/tensorflow-2.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

in my notebook, as mentioned in #4686, but it still doesn't work.

@EvanWiederspan

Tracking internally as b/353976964

@sagelywizard
Member

The "TPU v2" runtimes are no longer on the "TPU Node" architecture. This means the notebook VM has direct access to the TPU, rather than the TPU residing on a remote worker machine. You're not seeing any workers because there's no worker VM on the new TPU VM architecture.

You can see the TPUs attached to your VM with tpu.num_accelerators().
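For example, here is a minimal sketch of local TPU detection on the new runtime (my assumption of the intended flow, assuming TensorFlow 2.15 on a Colab TPU v2 VM):

import tensorflow as tf

# On a TPU VM the TPU is local to this machine, so resolve it with tpu='local'.
tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)

# num_accelerators() returns a dict of accelerator type to count, e.g. {'TPU': 8}.
print('Accelerators:', tpu.num_accelerators())
print('TPU devices:', tf.config.list_logical_devices('TPU'))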

@sagelywizard
Member

You can find more information about the TPU Node and TPU VM architecture differences here: https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#tpu_architectures

@PlutoSejin
Author

I checked tpu.num_accelerators() and it returns 8.

Then how do I get the TPU device name?

import tensorflow_gcs_config
import tensorflow.compat.v1 as tf
TPU_TOPOLOGY = "2x2"
try:
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
  tf.config.experimental_connect_to_cluster(tpu)
  #tf.tpu.experimental.initialize_tpu_system(tpu)
  TPU_ADDRESS = tpu.get_master()
  print('Running on TPU:', TPU_ADDRESS)
except ValueError:
  raise BaseException(
    'ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

tf.config.experimental_connect_to_host(TPU_ADDRESS)
tensorflow_gcs_config.configure_gcs_from_colab_auth()

I used the above code, but the result and error are below.

Running on TPU: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-98ea436009aa> in <cell line: 23>()
     21 #tf.tpu.experimental.initialize_tpu_system(tpu)
     22 #strategy = tf.distribute.experimental.TPUStrategy(tpu)
---> 23 tf.config.experimental_connect_to_host(TPU_ADDRESS)
     24 #with strategy.scope():
     25 tensorflow_gcs_config.configure_gcs_from_colab_auth()

/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/remote.py in connect_to_remote_host(remote_host, job_name)
     65   """
     66   if not remote_host:
---> 67     raise ValueError("Must provide at least one remote_host")
     68 
     69   remote_hosts = nest.flatten(remote_host)

ValueError: Must provide at least one remote_host

tpu.get_master() returns nothing.
I also saw another issue that was blocked by the TensorFlow version, but my tensorflow, tensorflow-gcs-config, and tensorflow-text versions are all 2.15.0.

@sagelywizard
Member

Then how do I get the TPU device name?

I think you're referring to the TPU network address. The new TPUs on the new TPU VMs are not attached to the network, so they don't have a network address. That's why tpu.get_master() returns '' (i.e. there's no network address).

Can you delete tf.config.experimental_connect_to_host(TPU_ADDRESS) and all the references to TPU_ADDRESS?
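Something like this minimal sketch (my assumption of how the cell should look on the TPU VM runtime, with TPUStrategy added in case you need a distribution strategy later):

import tensorflow as tf

# TPU is local on the TPU VM, so no network address is needed anywhere.
tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.TPUStrategy(tpu)  # replaces any use of TPU_ADDRESS / connect_to_host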

@PlutoSejin
Author

I deleted all the references to TPU_ADDRESS; the resulting code is below.

import tensorflow_gcs_config
import tensorflow.compat.v1 as tf
TPU_TOPOLOGY = "2x2"
try:
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')  # TPU detection
  tf.config.experimental_connect_to_cluster(tpu)
  tf.tpu.experimental.initialize_tpu_system(tpu)
except ValueError:
  raise BaseException(
    'ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

tensorflow_gcs_config.configure_gcs_from_colab_auth()

Then the error below occurred.

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-4-01c48cbb3b56> in <cell line: 22>()
     21 strategy = tf.distribute.TPUStrategy(tpu)
     22 with strategy.scope():
---> 23   tensorflow_gcs_config.configure_gcs_from_colab_auth()
     24 
     25 tf.disable_v2_behavior()

11 frames
/usr/local/lib/python3.10/dist-packages/tensorflow_gcs_config/__init__.py in configure_gcs_from_colab_auth(device)
    128   with open(adc_filename) as f:
    129     data = json.load(f)
--> 130   return configure_gcs(credentials=data, device=device)
    131 

/usr/local/lib/python3.10/dist-packages/tensorflow_gcs_config/__init__.py in configure_gcs(credentials, block_cache, device)
    116   if device:
    117     with ops.device(device):
--> 118       return configure(credentials, block_cache)
    119   return configure(credentials, block_cache)
    120 

/usr/local/lib/python3.10/dist-packages/tensorflow_gcs_config/__init__.py in configure(credentials, block_cache)
    100       if isinstance(credentials, dict):
    101         credentials = json.dumps(credentials)
--> 102       creds = gcs_configure_credentials(credentials)
    103     else:
    104       creds = tf.constant(0)

<string> in gcs_configure_credentials(json, name)

<string> in gcs_configure_credentials_eager_fallback(json, name, ctx)

/usr/local/lib/python3.10/dist-packages/tensorflow/python/profiler/trace.py in wrapped(*args, **kwargs)
    181         with Trace(trace_name, **trace_kwargs):
    182           return func(*args, **kwargs)
--> 183       return func(*args, **kwargs)
    184 
    185     return wrapped

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
    694   # TODO(b/142518781): Fix all call-sites and remove redundant arg
    695   preferred_dtype = preferred_dtype or dtype_hint
--> 696   return tensor_conversion_registry.convert(
    697       value, dtype, name, as_ref, preferred_dtype, accepted_result_types
    698   )

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/tensor_conversion_registry.py in convert(value, dtype, name, as_ref, preferred_dtype, accepted_result_types)
    232 
    233     if ret is None:
--> 234       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
    235 
    236     if ret is NotImplemented:

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    333                                          as_ref=False):
    334   _ = as_ref
--> 335   return constant(v, dtype=dtype, name=name)
    336 
    337 # Register the conversion function for the "unconvertible" types

/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/weak_tensor_ops.py in wrapper(*args, **kwargs)
    140   def wrapper(*args, **kwargs):
    141     if not ops.is_auto_dtype_conversion_enabled():
--> 142       return op(*args, **kwargs)
    143     bound_arguments = signature.bind(*args, **kwargs)
    144     bound_arguments.apply_defaults()

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
    269     ValueError: if called on a symbolic tensor.
    270   """
--> 271   return _constant_impl(value, dtype, shape, name, verify_shape=False,
    272                         allow_broadcast=True)
    273 

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    282       with trace.Trace("tf.constant"):
    283         return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
--> 284     return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    285 
    286   const_tensor = ops._create_graph_constant(  # pylint: disable=protected-access

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py in _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    294 ) -> ops._EagerTensorBase:
    295   """Creates a constant on the current device."""
--> 296   t = convert_to_eager_tensor(value, ctx, dtype)
    297   if shape is None:
    298     return t

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
    101       dtype = dtypes.as_dtype(dtype).as_datatype_enum
    102   ctx.ensure_initialized()
--> 103   return ops.EagerTensor(value, ctx.device_name, dtype)
    104 
    105 

InvalidArgumentError: /job:worker/replica:0/task:0/device:CPU:0 unknown device.

The error occurred at tensorflow_gcs_config.configure_gcs_from_colab_auth().

@PlutoSejin
Author

Isn't there any solution? tensorflow_gcs_config.configure_gcs_from_colab_auth() gives the same error even though I use the TPU without TPU_ADDRESS.

@sagelywizard
Member

We don't own the tensorflow_gcs_config library, so I'd recommend contacting the library owners (I believe that's the TensorFlow team) and asking them for help.

But from looking at the code, it looks like there's a default kwarg on configure_gcs_from_colab_auth, device="/job:worker/replica:0/task:0/device:CPU:0". That looks incorrect to me, and I suspect you'll want to pass in a different device name. I'm not sure whether it wants a physical or logical device, but you can list the physical devices on the system with tf.config.list_physical_devices() and the logical devices with tf.config.list_logical_devices().
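A hedged sketch of what I mean (the localhost device string below is an assumption; verify it against the lists these calls print):

import tensorflow as tf
import tensorflow_gcs_config

print(tf.config.list_physical_devices())  # physical devices on this VM
print(tf.config.list_logical_devices())   # logical devices (after TPU init)

# Assumption: on a TPU VM the local CPU is typically
# '/job:localhost/replica:0/task:0/device:CPU:0' rather than the
# '/job:worker/...' default baked into configure_gcs_from_colab_auth.
tensorflow_gcs_config.configure_gcs_from_colab_auth(
    device='/job:localhost/replica:0/task:0/device:CPU:0')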

Hope that's helpful!

@PlutoSejin
Author

Thank you for your comment. I solved it by referring to your advice.
