
"tf.distribute.cluster_resolver.TPUClusterResolver()" is not working #4699

Closed
PlutoSejin opened this issue Jul 16, 2024 · 11 comments


PlutoSejin commented Jul 16, 2024

Three months ago, I built a model using the following notebook. After the runtime changed from TPU (deprecated) to TPU v2, I get an error at the tf.distribute.cluster_resolver.TPUClusterResolver() call, which does not return the TPU address. The specific code is below.

try:
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
  print('Running on TPU ', tpu.cluster_spec().as_dict())
  TPU_ADDRESS = tpu.get_master()
  print('Running on TPU:', TPU_ADDRESS)
except ValueError:
  raise BaseException(
    'ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

The output of the above code is:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-bc480ed05132> in <cell line: 5>()
      5 try:
----> 6   tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
      7   print('Running on TPU ', tpu.cluster_spec().as_dict())

2 frames
/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/cluster_resolver/tpu/tpu_cluster_resolver.py in __init__(self, tpu, zone, project, job_name, coordinator_name, coordinator_address, credentials, service, discovery_url)
    234       # Default Cloud environment
--> 235       self._cloud_tpu_client = client.Client(
    236           tpu=tpu,

/usr/local/lib/python3.10/dist-packages/cloud_tpu_client/client.py in __init__(self, tpu, zone, project, credentials, service, discovery_url)
    138     if tpu is None:
--> 139       raise ValueError('Please provide a TPU Name to connect to.')
    140 

ValueError: Please provide a TPU Name to connect to.

During handling of the above exception, another exception occurred:

BaseException                             Traceback (most recent call last)
<ipython-input-35-bc480ed05132> in <cell line: 5>()
      9   print('Running on TPU:', TPU_ADDRESS)
     10 except ValueError:
---> 11   raise BaseException(
     12     'ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')
     13 #tf.config.experimental_connect_to_host(TPU_ADDRESS)

BaseException: ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!

I also tried adding tpu='local' to the tf.distribute.cluster_resolver.TPUClusterResolver() call.

try:
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')  # TPU detection
  print('Running on TPU ', tpu.cluster_spec().as_dict())
  TPU_ADDRESS = tpu.get_master()
  print('Running on TPU:', TPU_ADDRESS)
except ValueError:
  raise BaseException(
    'ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

But the output is below:

Running on TPU  {}
Running on TPU: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-37-bbb33ea8b6de> in <cell line: 18>()
     16 auth.authenticate_user()
     17 tf.enable_eager_execution()
---> 18 tf.config.experimental_connect_to_host(TPU_ADDRESS)
     19 tensorflow_gcs_config.configure_gcs_from_colab_auth()
     20 

/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/remote.py in connect_to_remote_host(remote_host, job_name)
     65   """
     66   if not remote_host:
---> 67     raise ValueError("Must provide at least one remote_host")
     68 
     69   remote_hosts = nest.flatten(remote_host)

ValueError: Must provide at least one remote_host

I can't get the TPU address. Also, when I use tpu='local', the return value of tpu.cluster_spec().as_dict() is empty. When I use tpu.cluster_spec().as_dict()['worker'], the error below occurs.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-38-a02f5f80852d> in <cell line: 5>()
      5 try:
      6   tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')  # TPU detection
----> 7   print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
      8   TPU_ADDRESS = tpu.get_master()
      9   print('Running on TPU:', TPU_ADDRESS)

KeyError: 'worker'

Do I have to use the Cloud TPU API to get the TPU address, or is there another way to get it?

PlutoSejin added the bug label Jul 16, 2024
@mayankmalik-colab
Contributor

Similar issue - #4686. Check the comments from one of our team members.

@PlutoSejin
Author

I already ran

!pip install https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/tensorflow/tf-2.15.0/tensorflow-2.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

in my notebook, as mentioned in #4686, but it still doesn't work.

@EvanWiederspan

Tracking internally as b/353976964

@sagelywizard
Member

The "TPU v2" runtimes are no longer on the "TPU Node" architecture. This means the notebook VM has direct access to the TPU, rather than the TPU residing on a remote worker machine. You're not seeing any workers because there's no worker VM on the new TPU VM architecture.

You can see the TPUs attached to your VM with tpu.num_accelerators().
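For example, here is a minimal sketch of local TPU detection on the new runtime (my assumption of the intended flow, assuming TensorFlow 2.15 on a Colab TPU v2 VM):

import tensorflow as tf

# On a TPU VM the TPU is local to this machine, so resolve it with tpu='local'.
tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)

# num_accelerators() returns a dict of accelerator type to count, e.g. {'TPU': 8}.
print('Accelerators:', tpu.num_accelerators())
print('TPU devices:', tf.config.list_logical_devices('TPU'))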

@sagelywizard
Member

You can find more information about the TPU Node and TPU VM architecture differences here: https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#tpu_architectures

@PlutoSejin
Author

I checked tpu.num_accelerators() and it returns 8.

Then how do I get the TPU device name?

import tensorflow_gcs_config
import tensorflow.compat.v1 as tf
TPU_TOPOLOGY = "2x2"
try:
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
  tf.config.experimental_connect_to_cluster(tpu)
  #tf.tpu.experimental.initialize_tpu_system(tpu)
  TPU_ADDRESS = tpu.get_master()
  print('Running on TPU:', TPU_ADDRESS)
except ValueError:
  raise BaseException(
    'ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

tf.config.experimental_connect_to_host(TPU_ADDRESS)
tensorflow_gcs_config.configure_gcs_from_colab_auth()

I used the above code, but the result and error are below.

Running on TPU: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-98ea436009aa> in <cell line: 23>()
     21 #tf.tpu.experimental.initialize_tpu_system(tpu)
     22 #strategy = tf.distribute.experimental.TPUStrategy(tpu)
---> 23 tf.config.experimental_connect_to_host(TPU_ADDRESS)
     24 #with strategy.scope():
     25 tensorflow_gcs_config.configure_gcs_from_colab_auth()

/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/remote.py in connect_to_remote_host(remote_host, job_name)
     65   """
     66   if not remote_host:
---> 67     raise ValueError("Must provide at least one remote_host")
     68 
     69   remote_hosts = nest.flatten(remote_host)

ValueError: Must provide at least one remote_host

tpu.get_master() returns nothing.
I also saw another issue that was blocked by the TensorFlow version, but my tensorflow, tensorflow-gcs-config, and tensorflow-text versions are all 2.15.0.

@sagelywizard
Member

Then how do I get the TPU device name?

I think you're referring to the TPU network address. The new TPUs on the new TPU VMs are not attached to the network, so they don't have a network address. That's why tpu.get_master() returns '' (i.e. there's no network address).

Can you delete tf.config.experimental_connect_to_host(TPU_ADDRESS) and all the references to TPU_ADDRESS?
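Something like this minimal sketch (my assumption of how the cell should look on the TPU VM runtime, with TPUStrategy added in case you need a distribution strategy later):

import tensorflow as tf

# TPU is local on the TPU VM, so no network address is needed anywhere.
tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.TPUStrategy(tpu)  # replaces any use of TPU_ADDRESS / connect_to_host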

@PlutoSejin
Author

I deleted all the references to TPU_ADDRESS; the resulting code is below.

import tensorflow_gcs_config
import tensorflow.compat.v1 as tf
TPU_TOPOLOGY = "2x2"
try:
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')  # TPU detection
  tf.config.experimental_connect_to_cluster(tpu)
  tf.tpu.experimental.initialize_tpu_system(tpu)
except ValueError:
  raise BaseException(
    'ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

tensorflow_gcs_config.configure_gcs_from_colab_auth()

Then the error below occurred.

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-4-01c48cbb3b56> in <cell line: 22>()
     21 strategy = tf.distribute.TPUStrategy(tpu)
     22 with strategy.scope():
---> 23   tensorflow_gcs_config.configure_gcs_from_colab_auth()
     24 
     25 tf.disable_v2_behavior()

11 frames
/usr/local/lib/python3.10/dist-packages/tensorflow_gcs_config/__init__.py in configure_gcs_from_colab_auth(device)
    128   with open(adc_filename) as f:
    129     data = json.load(f)
--> 130   return configure_gcs(credentials=data, device=device)
    131 

/usr/local/lib/python3.10/dist-packages/tensorflow_gcs_config/__init__.py in configure_gcs(credentials, block_cache, device)
    116   if device:
    117     with ops.device(device):
--> 118       return configure(credentials, block_cache)
    119   return configure(credentials, block_cache)
    120 

/usr/local/lib/python3.10/dist-packages/tensorflow_gcs_config/__init__.py in configure(credentials, block_cache)
    100       if isinstance(credentials, dict):
    101         credentials = json.dumps(credentials)
--> 102       creds = gcs_configure_credentials(credentials)
    103     else:
    104       creds = tf.constant(0)

<string> in gcs_configure_credentials(json, name)

<string> in gcs_configure_credentials_eager_fallback(json, name, ctx)

/usr/local/lib/python3.10/dist-packages/tensorflow/python/profiler/trace.py in wrapped(*args, **kwargs)
    181         with Trace(trace_name, **trace_kwargs):
    182           return func(*args, **kwargs)
--> 183       return func(*args, **kwargs)
    184 
    185     return wrapped

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
    694   # TODO(b/142518781): Fix all call-sites and remove redundant arg
    695   preferred_dtype = preferred_dtype or dtype_hint
--> 696   return tensor_conversion_registry.convert(
    697       value, dtype, name, as_ref, preferred_dtype, accepted_result_types
    698   )

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/tensor_conversion_registry.py in convert(value, dtype, name, as_ref, preferred_dtype, accepted_result_types)
    232 
    233     if ret is None:
--> 234       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
    235 
    236     if ret is NotImplemented:

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    333                                          as_ref=False):
    334   _ = as_ref
--> 335   return constant(v, dtype=dtype, name=name)
    336 
    337 # Register the conversion function for the "unconvertible" types

/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/weak_tensor_ops.py in wrapper(*args, **kwargs)
    140   def wrapper(*args, **kwargs):
    141     if not ops.is_auto_dtype_conversion_enabled():
--> 142       return op(*args, **kwargs)
    143     bound_arguments = signature.bind(*args, **kwargs)
    144     bound_arguments.apply_defaults()

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
    269     ValueError: if called on a symbolic tensor.
    270   """
--> 271   return _constant_impl(value, dtype, shape, name, verify_shape=False,
    272                         allow_broadcast=True)
    273 

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    282       with trace.Trace("tf.constant"):
    283         return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
--> 284     return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    285 
    286   const_tensor = ops._create_graph_constant(  # pylint: disable=protected-access

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py in _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    294 ) -> ops._EagerTensorBase:
    295   """Creates a constant on the current device."""
--> 296   t = convert_to_eager_tensor(value, ctx, dtype)
    297   if shape is None:
    298     return t

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
    101       dtype = dtypes.as_dtype(dtype).as_datatype_enum
    102   ctx.ensure_initialized()
--> 103   return ops.EagerTensor(value, ctx.device_name, dtype)
    104 
    105 

InvalidArgumentError: /job:worker/replica:0/task:0/device:CPU:0 unknown device.

The error occurred at tensorflow_gcs_config.configure_gcs_from_colab_auth().

@PlutoSejin
Author

Isn't there any solution? tensorflow_gcs_config.configure_gcs_from_colab_auth() gives the same error even though I use the TPU without TPU_ADDRESS.

@sagelywizard
Member

We don't own the tensorflow_gcs_config library, so I'd recommend contacting the library owners (I believe that's the TensorFlow team) and asking them for help.

But from looking at the code, it looks like there's a default kwarg on configure_gcs_from_colab_auth, device="/job:worker/replica:0/task:0/device:CPU:0". That looks incorrect to me, and I suspect you'll want to pass in a different device name. I'm not sure whether it wants a physical or logical device, but you can list the physical devices on the system with tf.config.list_physical_devices() and the logical devices with tf.config.list_logical_devices().
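A hedged sketch of what I mean (the localhost device string below is an assumption; verify it against the lists these calls print):

import tensorflow as tf
import tensorflow_gcs_config

print(tf.config.list_physical_devices())  # physical devices on this VM
print(tf.config.list_logical_devices())   # logical devices (after TPU init)

# Assumption: on a TPU VM the local CPU is typically
# '/job:localhost/replica:0/task:0/device:CPU:0' rather than the
# '/job:worker/...' default baked into configure_gcs_from_colab_auth.
tensorflow_gcs_config.configure_gcs_from_colab_auth(
    device='/job:localhost/replica:0/task:0/device:CPU:0')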

Hope that's helpful!

@PlutoSejin
Author

Thank you for your comment. I solved it by referring to your advice.
