Requested backend tpu_driver, but it failed to initialize: DEADLINE_EXCEEDED using 0.1 drivers since 10/02/2023 #3405
Comments
I'm encountering the same issue when loading GPT-J. It was working fine until approximately 24 hours ago.
Our code is based on MTJ, which the original GPT-J runs on top of, and the failure happens prior to loading the model; both use the older V1 implementation of the model.
Mine is trying to connect to grpc://10.63.28.250:8470 and errors out, so it's pretty much everywhere. Furthermore, it's also taking unusually long to collect pathy, uvicorn, etc.
I have pinpointed the issue to the driver version the projects use. It looks like the older ones are no longer working.
I can't find a list of all available drivers, but I collected three from bug reports and other colabs. tpu_driver0.1_dev20210607 is used by us and produces the error. tpu_driver0.1-dev20211030 is newer and used by some examples where people recommend not to use the nightly; this also produces the error. tpu_driver_20221011 is being used by some Stable Diffusion colabs, and that one works in my example above, but unfortunately does not work with our MTJ notebook. If someone knows a list of long-term-supported drivers, I could test more of them and see if this fixes the issue for MTJ. Otherwise I'd like to politely request that the commonly used older drivers are restored to functionality. GPT-J and MTJ are still widely used but rely on older driver versions. Update: this seems to affect all the 0.1 drivers.
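For context, this is roughly how notebooks of that era request a specific TPU driver version before initializing JAX. It follows the common JAX/Colab setup pattern rather than quoting our notebook's exact code, so treat it as a sketch; swapping the version string is how the drivers listed above were tested.

```python
import os
import requests
from jax.config import config

driver_version = "tpu_driver0.1_dev20210607"  # or tpu_driver_20221011, etc.
tpu_addr = os.environ["COLAB_TPU_ADDR"]       # e.g. "10.63.28.250:8470"

# Ask the Colab TPU to load the requested driver build.
url = f"http://{tpu_addr.split(':')[0]}:8475/requestversion/{driver_version}"
requests.post(url)

# Point JAX at the remote TPU driver backend.
config.FLAGS.jax_xla_backend = "tpu_driver"
config.FLAGS.jax_backend_target = "grpc://" + tpu_addr
```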
GPT-J doesn't work with tpu_driver_20221011 either.
GPT-J indeed won't work with that, but it does make the difference between connecting to the TPU and getting the deadline errors.
Thank you @henk717. I heavily use GPT-J every day for work, so I'll need it running from Monday morning. I hope this is a temporary issue.
I hope so too, but breaking the entire 0.1 driver ecosystem does not sound like something they did on purpose, or would be uninterested in fixing before this gets rolled out to things like Kaggle and Google Compute. My theory is that the TPUv2 firmware update that causes this either has already been spread everywhere and the TPUv3 is unaffected, or they used Colab as a testing ground to see if people would run into issues, and we are the first to notice because we rely on a dependency from the 2021 TPU era.
Is there a way to run GPT-J inference from a Jupyter notebook on a GCP TPU machine?
Tagging @ultrons since he is the project manager for the TPUs; he may be able to get this to the right person. Thousands depend on MTJ for inference, since it can be used to automatically load some Hugging Face PyTorch models on the TPU. But this is a failure to initialize the TPU at a very basic level: with the 0.1 driver resulting in a broken, unresponsive TPU, I expect this affects more Colab users than just the ones depending on MTJ. And if this same firmware bug spreads outside of Colab, more TPU customers could be affected across the entire Google Cloud.
I subscribe to Pro for the TPU. If it stays uninitializable, it's of no use.
Same error here, trying to run Colab on TPU. GPU alternatives are practically unusable for the stuff I'm doing, so I really need that TPU up and running. Otherwise, my Pro sub ain't worth much of anything.
Can confirm this problem with the GPT models I use; I can't run them because of it.
@henk717 Thanks for reporting the issue and thanks for using Colab. I can confirm that specifying the 0.1 dev drivers does not work, but taking the default or specifying the 0.2 drivers does work. Tracking internally at b/269607171.
You mean, the full driver path would be:
?
This is indeed correct; the 0.2 drivers and newer (including the ones that just use a 2022 version number without the other versioning) load fine. If your notebook is compatible with the newer drivers, this can solve the issue for you. Unfortunately, a lot of the notebooks that directly call for a 0.1 driver will break when this is attempted, because of incompatibilities. You can find a sample notebook here: https://colab.research.google.com/drive/1YDcZJ4EMOd3f_kuk0RnD5AJBEpUhMl2I#revisionId=0B7OnP7aLuFgXMXFiZU9sNDZnWmNpVmVzaWc1YlhYaEF6ZnAwPQ
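As a quick sanity check (a sketch, not the linked notebook's code): after requesting a 0.2 or 2022-dated driver and pointing JAX at the Colab TPU, a successful initialization should expose all eight cores instead of raising DEADLINE_EXCEEDED.

```python
import jax

devices = jax.devices()
print(devices)  # expect 8 TpuDevice entries on a v2-8
assert jax.device_count() == 8, "TPU backend did not initialize correctly"
```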
@metrizable Thank you for taking care of this issue. Initializing the default and the 0.2 drivers is possible, but it causes a crash when creating the network of GPT-J, and probably of its derivatives. So unfortunately I don't think these can serve as a temporary remedy.
When will this be implemented on henk.tech/ckds? I'm looking to use it with the Colab Kobold deployment script, as I currently have no way of converting my Mesh Transformer JAX weights to HF.
You are commenting on a Google issue, not a Kobold issue. If it were as simple as changing versions I would have done so, but it is not the ckds script that decides this. Mesh Transformer JAX itself requests the broken driver, and does so because it is not compatible with anything newer. Since it's a very low-level dependency issue, I am unable to resolve it myself, as it requires deep knowledge of the TPU itself.
Our operations team uses MTJ on a daily basis and hasn't been able to since the TPUs went down. Really hoping this gets resolved.
@wingk10 I don't want to make this post lengthy, but the same goes for me. Kaggle is affected, and now the queue for TPUs has over 70 users, which means we are likely to wait for 6 hours. We are even losing the alternatives.
Yes, we're running into the same issue on Kaggle. 64 users in the queue right now, and it seems to have gotten worse suddenly over the past few days. Obviously, I can't do anything but post and say "hey, it's important to me too", with no alternatives (extra useless here). I hope we hear more soon.
I decided to pay $10/h and tried to connect Vertex AI to a Cloud TPU, but there was no available TPU in my bucket region. So there are really no alternatives.
Nightly is less desirable than any other newer driver since it's always the newest one. It will be broken as well; back then, nightly was a 0.1 driver.
@mosmos6 I will also give you a bit of a recap so you can understand why I need the driver to be fixed, but why for you some alternatives might exist. Mesh Transformer JAX (MTJ) was the framework used to create GPT-J, so GPT-J in its original form runs on top of MTJ. It has been ported to other platforms, so you can also run it on a GPU using Hugging Face Transformers, for example, and that is how our own community runs GPT-J based models on Colab now, with a more limited context.

For us the issue is RAM. The affordable Colab GPUs for our AI hobby have 16GB of VRAM, while the TPU has 64GB of RAM. So while GPT-J-6B can run on a GPU, we cannot fit as much context as the TPU version could. In the past year VE-Forbryderne ported various formats, so his version of MTJ can run GPT-J, but also XGLM, OPT and even NeoX based models. Not just that, it can load those models from PyTorch files without requiring conversion. This allowed us AI hobbyists to use models up to 20B very affordably on Google Colab, which is why the TPU was so desirable for us. $10 a month (or limited free usage) is much better than having to pay $1 per hour on GPU rentals, which is not affordable for open-source hobbyists who wish to use the models.

If all you want is GPT-J-6B inference, I suggest you switch your usage to Hugging Face, since you will be able to enjoy much better and more reliable support on Colab and beyond for the same price. It's when you want the higher model sizes or training that the TPU becomes necessary, and the only platform with that kind of cost effectiveness is Colab combined with a modified MTJ. Unfortunately the original 0.1 driver removal happened one month after VE's disappearance, so our dependency is completely unmaintained. If someone in this topic does want the challenge of porting MTJ to a newer JAX version, I highly recommend forking https://github.com/VE-FORBRYDERNE/mesh-transformer-jax since it is much more feature complete than the original MTJ, and also more efficient. It has even been used to train 20B NeoX models on TRC.
@henk717 Thank you for recapping. Sorry, I didn't know MTJ meant Mesh Transformer JAX; then it's exactly GPT-J.
@mosmos6 In your specific use case it's worth checking whether conversion scripts like https://github.com/VE-FORBRYDERNE/mesh-transformer-jax/blob/all/to_hf_weights.py are still functional (for example with CPU dependencies), so that you can get your model off this platform and future-proof it. If you are unable to, you are stuck with the rest of the thousands of users that have no substitute for MTJ. As for the dependencies, I lack the ability to do this myself, but if others can, that would be very welcome.
@henk717 Thank you for the suggestion. I've been persistent with the TPU because I think running on TPU versus GPU makes a significant difference in output quality for some reason, but I'll give it a try.
3/22/2023: Connection error still persists.
Hello everyone, I'm attempting to update MTJ so that it runs on TPU_driver0.2. However, the code gets stuck at line 265 of transformer_shard.py.
Even though xmap doesn't show any clear error, it appears to be stuck around the out_axes of init_xmap.
I've been doing intensive research over the past few days but I can't find a solution, so I thought it was time to ask for everyone's wisdom.
I think my statement about xmap is logical. It doesn't even visibly error, so I really don't know what is wrong in the code. The only thing I can think of is that maps.Mesh doesn't pass all the info from the devices to the ResourceEnv on JAX 0.3.5. So, to make my question more specific: how do I pass all the information from the devices via maps.Mesh?
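For reference, here is a minimal sketch of the Mesh/xmap pattern in question on JAX 0.3.x. The function and axis names are placeholders, not MTJ's actual init_xmap code, and it assumes an 8-core TPU runtime.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.experimental import maps

# Build a 2D device mesh over the 8 TPU cores (assumes a v2-8 / v3-8).
devices = np.array(jax.devices()).reshape(1, 8)
mesh = maps.Mesh(devices, ("dp", "mp"))

# Stand-in for the network initializer; MTJ's real init is far larger.
def init(x):
    return x * 2

init_xmap = maps.xmap(
    init,
    in_axes=["shard", ...],          # first input axis is named "shard"
    out_axes=["shard", ...],         # output keeps the "shard" axis
    axis_resources={"shard": "mp"},  # run the named axis over the mesh's "mp" axis
)

with mesh:
    out = init_xmap(jnp.ones((8, 4)))
print(out.shape)  # (8, 4)
```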
Hello @metrizable @cperry-goog. This issue started when GPT-J and its derivatives could no longer connect with TPU_driver0.1.
Hence, I updated my code so that it runs on JAX 0.3.5, which is compatible with TPU_driver0.2. However, the very same code errors at a particular point on Colab, so I would like you to take a look.
Before it runs down to the out_axes, the code gets stuck at line 46, which seems to be a wrapper part of xmap. On a TPU VM, this process finishes within a minute. I also tried pjit, but it errors as well. As advised, I updated my code for the newer JAX and TPU driver, and it runs perfectly on a TPU VM, but it gets stuck on Colab.
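For completeness, this is the kind of pjit call the comment above refers to, written against the JAX 0.3.x API. It is an illustrative sketch under the same 8-core assumption, not the code that actually errored.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.experimental import maps, PartitionSpec
from jax.experimental.pjit import pjit

P = PartitionSpec

# Same mesh shape as in the xmap sketch above (assumes 8 TPU cores).
devices = np.array(jax.devices()).reshape(1, 8)
mesh = maps.Mesh(devices, ("dp", "mp"))

# Shard the matmul's contraction dimension over the "mp" mesh axis;
# the output is replicated.
matmul = pjit(
    lambda x, w: x @ w,
    in_axis_resources=(P(None, "mp"), P("mp", None)),
    out_axis_resources=None,
)

with mesh:
    y = matmul(jnp.ones((16, 8)), jnp.ones((8, 4)))
print(y.shape)  # (16, 4)
```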
I thought I'd post what I got when trying to run the TPU models, as it was a different result:
AttributeError: module 'jax' has no attribute 'Array' is a new error related to a breaking change in chex; I fixed this by pinning a suitable version in our requirements files. Now the error is back to the one reported in this issue tracker.
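For anyone hitting the same AttributeError, one plausible pin in requirements-file form is shown below. This is an assumption on my part (the exact version used in the KoboldAI requirements is not stated in this thread); the idea is simply to hold chex back to a release that predates its use of jax.Array.

```
# Hypothetical pin; later chex releases reference jax.Array, which the
# older JAX versions used here do not provide.
chex == 0.1.5
```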
Hello @metrizable @cperry-goog. I upgraded my model to run on JAX 0.3.25, and Colab managed to load the model for the first time in two weeks. However, when I try to run inference with this model, the same issue as in my previous comment occurs again: the code doesn't show an error message but gets stuck at a certain point (related to xmap), which is the same operation as in the previous comment. It's at infer() > generate() > fun_mapped() > bind() > map_bind() > process() > process_call() > xmap_impl() > wrapper() > call(). This makes no sense, because the same code runs well on a TPU VM v3-8 (v2-alpha), and the same operation was processed fine when the model was loaded. I would like you to take a look. At this moment, the very same code cannot initialize TPU_driver0.2. Thank you for your attention.
Hello @metrizable @cperry-goog. I resolved it, and the discussed model now runs on Colab with TPU_driver0.2.
@mosmos6 Can you share your changes? There is still an entire ecosystem broken.
@henk717 Of course. Give me a few minutes; I'm on my way to set up a repository, as the changes span multiple files. My test code needs cleanup after a month of experiments.
For us the challenge will be getting this one running: https://github.com/VE-FORBRYDERNE/mesh-transformer-jax/tree/ck. It is a heavily modified version with a lot more additions and enhancements, but the developer went missing.
Could you please share the working Colab notebook if you have one?
@henk717 At a casual glance, I suppose you only need to update lines 383 to 419 of transformer_shard.py if you use the new Colab demo for inference, followed by updating for the breaking changes of JAX.
@somsomers Yes. Please let me clean up the mess before sharing.
Thank you.
Hello. First of all, I must apologize: this works only on the high-memory TPU runtime, so you'll need a Pro or Pro+ subscription to Colab.
I fixed your AI. She's waiting for you to pick up in the garage (https://github.com/mosmos6/Large-MTJ). Same as my GPT-J, it's adapted to JAX 0.3.25, so it runs on Colab with TPU_driver0.2. Basically this should now be immune to JAX upgrades, except for breaking changes. Sorry for the dorky name; I didn't know her name. I tested this only with my slim weights for GPT-J. If you run into an error with other types of weights, please post an issue. Important notes:
Enjoy
@mosmos6 I tried applying the modifications to my test account here, but the end result is gibberish. To test, you can take this notebook and replace the version field with https://github.com/henk7171/koboldai.
I saw your
Due to the Python upgrade of Colab (3.9 -> 3.10), I further modified two of my modified MTJ models; the requirements.txt and util.py of each are updated.
So I'm not sure what happened, but it had been working for a week; then, just when I tried to use it tonight, some of the models ended up with the error again.
If you are a Kobold user, it's because we implemented 2.0 support. TPUs have always been a bit unreliable, and usually running the notebook again is enough. Are there people left who still depend on 0.1? Otherwise it no longer makes sense to keep this open.
This issue is obsolete because the TPU runtimes are deprecated and were removed.
Describe the current behavior
When running an older version of JAX, connecting to the TPU fails with the following error:
Traceback (most recent call last):
File "aiserver.py", line 10214, in
load_model(initial_load=True)
File "aiserver.py", line 2806, in load_model
tpu_mtj_backend.load_model(vars.custmodpth, hf_checkpoint=vars.model not in ("TPUMeshTransformerGPTJ", "TPUMeshTransformerGPTNeoX") and vars.use_colab_tpu, **vars.modelconfig)
File "/content/KoboldAI-Client/tpu_mtj_backend.py", line 1194, in load_model
devices = np.array(jax.devices()[:cores_per_replica]).reshape(mesh_shape)
File "/usr/local/lib/python3.8/dist-packages/jax/_src/lib/xla_bridge.py", line 314, in devices
return get_backend(backend).devices()
File "/usr/local/lib/python3.8/dist-packages/jax/_src/lib/xla_bridge.py", line 258, in get_backend
return _get_backend_uncached(platform)
File "/usr/local/lib/python3.8/dist-packages/jax/_src/lib/xla_bridge.py", line 248, in _get_backend_uncached
raise RuntimeError(f"Requested backend {platform}, but it failed "
RuntimeError: Requested backend tpu_driver, but it failed to initialize: DEADLINE_EXCEEDED: Failed to connect to remote server at address: grpc://10.106.231.74:8470. Error from gRPC: Deadline Exceeded. Details:
This happens for all users of the notebook on Colab, while Kaggle is still working as intended.
Describe the expected behavior
JAX is correctly able to connect to the TPU and can then proceed with loading the user-defined model.
What web browser you are using
This issue does not depend on a browser, but for completeness, I am using an up-to-date Microsoft Edge.
Additional context
Here is an example of an affected notebook:
The relevant backend code can be found here: https://github.com/KoboldAI/KoboldAI-Client/blob/main/tpu_mtj_backend.py
This also makes use of a heavily modified MTJ with the following relevant dependencies:
jax == 0.2.21
jaxlib >= 0.1.69, <= 0.3.7
git+https://github.com/VE-FORBRYDERNE/mesh-transformer-jax@ck
MTJ uses tpu_driver0.1_dev20210607