Thank you for publishing the model on physionet.org. I downloaded the Me-LLaMA-13b-chat model to my machine, which has a 24GB GPU, and tried to run inference on it, but I could not even load the model. The documentation says it can be fine-tuned on a GPU with at least 24GB, which confuses me: if the model does not even load on a 24GB GPU, how is fine-tuning possible? I also tried reducing the model to 16-bit precision (that attempt is sketched further below), but it still would not fit in 24GB. Is there any other way to load the model on a 24GB GPU without quantization?

I tried to run inference on the model with the following code:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

torch.cuda.empty_cache()

model_file = "./physionet.org/files/me-llama/1.0.0/MeLLaMA-13B-chat"
prompt = "I am suffering from flu, give me home remedies?"

# Check if a GPU is available and set the device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and model from the local model directory.
# Note: with no torch_dtype argument, the weights load in fp32 by default.
tokenizer = AutoTokenizer.from_pretrained(model_file)
model = AutoModelForCausalLM.from_pretrained(model_file).to(device)
```
This throws an out-of-memory error while loading the model, and my GPU memory is fully used:

```
OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacity of 23.69 GiB of which 81.69 MiB is free. Including non-PyTorch memory, this process has 23.61 GiB memory in use. Of the allocated memory 23.36 GiB is allocated by PyTorch, and 1.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
The following image shows the GPU memory usage statistics:
Note: on CPU, the model takes around 49GB of RAM on my PC.
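The 16-bit attempt I mentioned above looked roughly like this (a minimal sketch of what I ran; `torch_dtype=torch.float16` is the half-precision loading option I used):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_file = "./physionet.org/files/me-llama/1.0.0/MeLLaMA-13B-chat"

tokenizer = AutoTokenizer.from_pretrained(model_file)
# Load the weights directly in half precision instead of the fp32 default,
# halving the memory needed for the weights.
model = AutoModelForCausalLM.from_pretrained(
    model_file,
    torch_dtype=torch.float16,
).to("cuda")
```

Even in half precision, 13B parameters at 2 bytes each come to roughly 26GB of weights, which is already more than the 24GB of VRAM, so this load also runs out of memory.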
Please let me know if you have any idea how to run the model efficiently on a 24GB GPU.
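One idea I am considering, in case it counts as "another way" without quantization, is letting Hugging Face accelerate split the model between GPU and CPU with `device_map="auto"` (a sketch only; it assumes `accelerate` is installed and that CPU offloading is acceptable despite slower inference):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_file = "./physionet.org/files/me-llama/1.0.0/MeLLaMA-13B-chat"

tokenizer = AutoTokenizer.from_pretrained(model_file)
# device_map="auto" places as many layers as fit on the GPU and
# offloads the remaining layers to CPU RAM (requires accelerate).
model = AutoModelForCausalLM.from_pretrained(
    model_file,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "I am suffering from flu, give me home remedies?"
# Inputs go to the GPU, where the first layers execute.
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Would this be a reasonable way to run the model on a 24GB GPU, or is there a better approach? Thank you.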