Hi Ayaka,
We are currently using your Llama-70B implementation for generation on Cloud TPUs and have run into a few challenges we hope you can help with. Converting the model to JAX format on the Cloud TPUs ran out of host memory; we worked around this by adding 400 GB of swap on an attached SSD. We now attach a disk containing the pre-converted model to all hosts in the TPU v3-32 slice in read-only mode.
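Incidentally, we suspect part of the problem is that the conversion materializes the whole checkpoint in host RAM at once. Here is a minimal sketch of the kind of per-shard conversion we have in mind (the paths and the .pth/.npy layout are hypothetical placeholders; they would need to be adapted to the real checkpoint format):

```python
import gc
import os

import numpy as np
import torch

ckpt_dir = "/mnt/llama-70b"    # hypothetical path to the PyTorch shards
out_dir = "/mnt/llama-70b-np"  # hypothetical output directory
os.makedirs(out_dir, exist_ok=True)

# Convert one shard at a time so peak host memory stays near one shard's size.
for shard in sorted(f for f in os.listdir(ckpt_dir) if f.endswith(".pth")):
    state = torch.load(os.path.join(ckpt_dir, shard), map_location="cpu")
    for name, tensor in state.items():
        # Save each tensor separately; float16 halves the footprint and
        # avoids bfloat16, which NumPy cannot represent natively.
        np.save(os.path.join(out_dir, f"{name}.npy"),
                tensor.to(torch.float16).numpy())
    del state
    gc.collect()  # release the shard before loading the next one
```

Something along these lines might have let us avoid the 400 GB swap workaround entirely.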
When we tried to shard the 70B model across the TPUs, we ran out of TPU HBM. We also noticed that with smaller models such as Llama-13B, all four hosts in the TPU v3-32 slice generated the same response redundantly. We would greatly appreciate any guidance on generating with Llama-70B on a TPU v3-32, or on alternative methods for generation on a single-host TPU v3-8.
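For reference, our understanding is that every host in a multi-host slice runs the same program, so host-side output has to be guarded by jax.process_index(), and parameters have to be sharded across the full device mesh rather than replicated per host. A rough sketch of both points (the mesh axis name, the sharding rule, and the params/generate symbols are hypothetical placeholders, not your repository's actual API):

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One 1-D mesh over all 32 devices in the v3-32 slice (4 hosts x 8 chips each).
mesh = Mesh(np.array(jax.devices()), axis_names=("mp",))

def shard_param(x):
    """Shard 2-D weights along their last axis; replicate everything else."""
    if x.ndim == 2 and x.shape[-1] % mesh.devices.size == 0:
        spec = P(None, "mp")
    else:
        spec = P()  # small params (norms, biases) stay replicated
    sharding = NamedSharding(mesh, spec)
    # Build a global array from the host-local copy, shard by shard.
    return jax.make_array_from_callback(x.shape, sharding, lambda idx: x[idx])

# Hypothetical usage:
# params = jax.tree_util.tree_map(shard_param, params)
# tokens = generate(params, prompt_ids)

# All four hosts run this same script, so unguarded printing produces
# four copies of every response; decode/print on process 0 only.
if jax.process_index() == 0:
    print("host 0 output only")  # e.g. tokenizer.decode(tokens)
```

Is this the kind of setup you would recommend, or does the repository expect a different sharding layout?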
We would also like to thank you for this exceptional repository; it has significantly accelerated our research. Generation on these TPUs with your implementation is remarkably fast compared to GPUs! We plan to acknowledge your valuable contribution in our upcoming paper. Thank you once again for your outstanding work.
Hi @divyapatel4, I am busy with other matters in January, so I may have little time to look into this issue. Have you tried the new Llama JAX implementation in the Hugging Face transformers library, and does that work for you?
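For reference, a minimal sketch of what I mean (the checkpoint name is only an example, and class availability depends on your transformers version):

```python
from transformers import AutoTokenizer, FlaxLlamaForCausalLM

model_id = "openlm-research/open_llama_3b"  # example checkpoint only
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = FlaxLlamaForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="np")
out = model.generate(inputs["input_ids"], max_length=32)
print(tokenizer.decode(out.sequences[0], skip_special_tokens=True))
```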