TPU VM smoke test failed on latest master branch #2469

Closed
cblmemo opened this issue Aug 27, 2023 · 2 comments · Fixed by #2471

@cblmemo
Collaborator

cblmemo commented Aug 27, 2023

As the title says, the TPU VM smoke test (test_tpu_vm) failed on the latest master branch. Relevant logs:

I 08-27 03:10:34 log_lib.py:425] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.128.0.72']
(tpuvm_mnist, pid=11949) 2023-08-27 03:10:27.095717: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(tpuvm_mnist, pid=11949) Traceback (most recent call last):
(tpuvm_mnist, pid=11949)   File "main.py", line 29, in <module>
(tpuvm_mnist, pid=11949)     import train
(tpuvm_mnist, pid=11949)   File "/home/gcpuser/sky_workdir/flax/examples/mnist/train.py", line 25, in <module>
(tpuvm_mnist, pid=11949)     from flax import linen as nn
(tpuvm_mnist, pid=11949)   File "/home/gcpuser/sky_workdir/flax/flax/__init__.py", line 25, in <module>
(tpuvm_mnist, pid=11949)     from . import linen
(tpuvm_mnist, pid=11949)   File "/home/gcpuser/sky_workdir/flax/flax/linen/__init__.py", line 34, in <module>
(tpuvm_mnist, pid=11949)     from .activation import (
(tpuvm_mnist, pid=11949)   File "/home/gcpuser/sky_workdir/flax/flax/linen/activation.py", line 21, in <module>
(tpuvm_mnist, pid=11949)     from flax.linen.module import compact
(tpuvm_mnist, pid=11949)   File "/home/gcpuser/sky_workdir/flax/flax/linen/module.py", line 68, in <module>
(tpuvm_mnist, pid=11949)     from flax.linen import kw_only_dataclasses
(tpuvm_mnist, pid=11949)   File "/home/gcpuser/sky_workdir/flax/flax/linen/kw_only_dataclasses.py", line 126, in <module>
(tpuvm_mnist, pid=11949)     def _process_class(cls: type[M], extra_fields=None, **kwargs):
(tpuvm_mnist, pid=11949) TypeError: 'type' object is not subscriptable
ERROR: Job 1 failed with return code list: [1]
Shared connection to 35.226.224.20 closed.
Tailing logs of job 1 on cluster 't-tpu-vm-402b-84'...
+ sky logs t-tpu-vm-402b-84 1 --status
Getting job status...
Job 1: FAILED

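This looks like a compatibility issue: the import of flax fails with a TypeError on the type[M] annotation in flax/linen/kw_only_dataclasses.py.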
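For readers unfamiliar with this error: subscripting the built-in `type` (as in `type[M]`) only works at runtime on Python 3.9 and later (PEP 585). The sketch below reproduces the check in isolation. It assumes the interpreter on the TPU VM is older than 3.9, which the log does not state explicitly; `supports_builtin_generics` and `process_class` are hypothetical names for illustration and are not part of flax or SkyPilot.

```python
import sys
from typing import Type, TypeVar

M = TypeVar("M")


def supports_builtin_generics() -> bool:
    """Return True if `type[M]` can be evaluated at runtime (Python 3.9+)."""
    try:
        type[M]  # on Python 3.8 and earlier: TypeError: 'type' object is not subscriptable
        return True
    except TypeError:
        return False


# typing.Type[M] is accepted on both older and newer interpreters,
# so an annotation written this way does not fail at import time.
def process_class(cls: Type[M]) -> Type[M]:
    return cls


if __name__ == "__main__":
    print(sys.version_info[:2], supports_builtin_generics())
```

Annotations on a `def` are evaluated when the function is defined, which is why the failure in the traceback happens at import time rather than when the function is called.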

@cblmemo
Collaborator Author

cblmemo commented Aug 27, 2023

Tested on 285f4f50; test_tpu_vm_pod also failed, for a similar reason.

@infwinston
Member

Ah, I just tried it. It's due to the upgrade of the flax library. If I downgrade to version 0.6.11, it works. Let me pin the version.
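As a rough illustration of what pinning guards against, the sketch below compares installed package versions with a table of pins and reports any mismatch. It is not the change made in the fix PR; `PINNED` and `check_pins` are hypothetical names, and the only value taken from this thread is flax 0.6.11.

```python
from importlib import metadata

# Version reported as working in this thread; the table itself is an assumption.
PINNED = {"flax": "0.6.11"}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (wanted, found)} for every package that does not match its pin.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in check_pins(PINNED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

The `get_version` parameter exists only so the check can be exercised without the package installed; in normal use the default `importlib.metadata.version` lookup applies.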
