Container instantly crashes when trying to load GGUF #45
I am currently running the container on Unraid. I have used the Docker Compose file as well as manually creating the container and changing the storage mounts. I am able to download the models from HF, and when I select the GGUF model from the drop-down it selects the llama.cpp loader. I have tried many different variations of settings but no combination works. This is also true of ctransformers. As soon as I click load, the container crashes with no logs. I am passing in my GTX 1070 with 8 GB of VRAM and it is visible from within the container by running nvidia-smi. I have tried the DEFAULT, NVIDIA, and even snapshots from 2023. I am not sure what I am doing wrong.

Comments
Hi @ErrickVDW - GGUF is for CPU inference, so you probably don't want that if you are planning on using your GPU. There should definitely be logs - unfortunately, without them or a stack trace there's nothing I can offer in the way of help. However, the cause of the crash should definitely be recorded somewhere - try to find it! I suggest troubleshooting it interactively with the help of ChatGPT / Gemini / Claude / etc. to try and track down the issue. Perhaps somebody else who has run into the same problem may be able to offer some insight!
Thanks so much for the quick reply! Unfortunately it looks as though the container crashes before anything is written to the logs. I have also used tools such as netdata to hopefully get some insight but can't seem to find anything.
Sorry about that - you are indeed correct! I don't do much with hybrid model loading - I usually stick to either CPU or GPU. Another user got the container working on Unraid and you can read about it in #27 - it might be helpful even though it's ROCm. There should be a log from the Docker daemon or other service that is managing the container itself, even if the container does not produce any logs.
Amazing, thanks for the reference! I will dig through the issue you mentioned and see if I can spot anything that may help me. I will also look into the Docker logging you mentioned and see if I can find anything from the crashes. Will update as soon as I can! Thank you again.
Hi @Atinoda, apologies for the late reply. I'm not sure how much help it is, but I was able to find an additional error in the Docker logs that was not visible in the log terminal:

{"log":"/scripts/docker-entrypoint.sh: line 69: 90 Illegal instruction "${LAUNCHER[@]}"\n","stream":"stderr","time":"2024-03-13T17:15:10.659958918Z"}

This was a fresh, default container and I went straight to loading the model using llama.cpp. Please let me know if this is of any help.
Good job on finding the logs! That does help. Please try removing the quotes around your launch arguments.
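For anyone else hitting the same confusion, here is a minimal sketch of why stray quotes inside the launch-argument setting end up being passed through literally - this is an illustration only, not the project's actual entrypoint, and `EXTRA_ARGS` is just a stand-in name:

```sh
#!/usr/bin/env sh
# Sketch: quote characters stored inside an env var are passed to the program
# as literal text - they are not re-parsed as shell syntax.
EXTRA_ARGS='--n-gpu-layers "10"'

# Word splitting keeps the quote characters attached to the value:
printf '<%s> ' $EXTRA_ARGS; echo   # -> <--n-gpu-layers> <"10">

# Without the inner quotes the value arrives clean:
EXTRA_ARGS='--n-gpu-layers 10'
printf '<%s> ' $EXTRA_ARGS; echo   # -> <--n-gpu-layers> <10>
```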
Thanks! I completely redid the config to make sure everything was correct. Here it is with the updated args. Unfortunately it is still crashing, but I'm seeing some new errors, which feels like a good sign!
I have exactly the same problem.
I have removed all the folders on the host and let the container recreate them. Freshly downloaded the models and I'm not seeing the majority of those errors anymore. Just these:
Unfortunately your recommendations didn't help :(
@wififun - thank you for also reporting your issue. If we can fix it for both of you then hopefully it's a good fix! The root problem is with how the script is failing to parse the launch arguments. I have been meaning to revisit that bit of the script because it has caused problems elsewhere... for now, it should be possible to get it up and running. Are you also including the launch arguments in your config?

@Steel-skull has posted a working Unraid template in issue #5 - does that help at all? The other thing to try is to leave the launch arguments at their default.

EDIT: Fixed word salad.
I have a fully updated version of the template that "should" work - it works for me, but everyone's config is different. (I'll post it soon.) I've noticed it doesn't always pull a version of the CUDA toolkit that matches the Unraid server when loading (causing dependency hell and nothing working), but as long as you keep your Unraid server at driver v545.29.06 it looks to work fine. Also, if you have a GGML or GPTQ that uses more VRAM than you have, it will crash the Docker container. EXL2 doesn't seem to have this issue.
Here is the updated version:
This is the exact version I use on a 2x 3090 Ti server, so it should work with multiple cards. Use driver v545.29.06.
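For readers who are not on Unraid, a roughly equivalent plain `docker run` invocation is sketched below - the image tag, ports, and host paths are illustrative assumptions rather than values taken from the template, so adjust them to your own setup:

```sh
# Sketch only: substitute your own image tag, ports, and host paths.
docker run -d \
  --gpus all \
  -p 7860:7860 -p 5005:5000 \
  -v /path/on/host/config:/app/config \
  -e EXTRA_LAUNCH_ARGS="--listen --verbose" \
  --name text-generation-webui \
  atinoda/text-generation-webui:default-nvidia   # check the project's tag list for the exact variant
```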
Hi there @Steel-skull, I have created a new container using that exact config, only changing the API port to 5005 because mine is currently occupied, and I have still run into the same issue. I downloaded TheBloke/Llama-2-7B-Chat-GGUF, set llama.cpp to 10 GPU layers, and it unfortunately crashed immediately again.
That looks like a different problem now - the fact that it got all the way to running the server means that the launcher arguments are being parsed, one way or the other. You should have a Python stack trace in the Docker logs or the webui itself when the model loading causes the crash. Did it crash instantly when you tried to load, or after a small delay? These are quite different events!
Looks to be loading, but if it's crashing immediately, that might be a driver issue. Your output is not indicating an overall issue, though. Also, what are your settings for the GGUF? You "should" be able to fully load it into VRAM, as the Q3_K_S only needs 5.45 GB. Also, try EXL2, as it's a fully Nvidia solution: Turboderp/Llama2-7B-exl2:4.65bpw

Finally: go to the console on the Docker app and run a couple of checks, then let me know the answers.
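As a sketch of what those checks might look like - assuming the goal is to confirm that the GPU driver and a CUDA-enabled PyTorch build are both visible from inside the container - something along these lines can be run from the container console:

```sh
# Confirm the driver version and that the GPU is visible inside the container
nvidia-smi

# Confirm the installed PyTorch build has CUDA support and can see the device
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```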
I have two suggestions and one request. Try running a CPU-only model to see if that works, and try cranking the GPU layers to whatever the max of the slider is so it's definitely not splitting across CPU and GPU. Can you please post the stack trace from the crash when loading the model to GPU?
Hi @Atinoda. I was able to get a CPU model up and running pretty easily:

{"log":"18:58:58-347156 INFO Loading "le-vh_tinyllama-4bit-cpu" \n","stream":"stdout","time":"2024-03-14T18:58:58.348707113Z"}

Although this could be expected behaviour with CPU models and llama.cpp. I also tried loading the GGUF model with the maximum GPU layers (256) like you suggested, and it crashed with similar logs:

{"log":"19:02:01-154413 INFO Loading "llama-2-7b-chat.Q3_K_S.gguf" \n","stream":"stdout","time":"2024-03-14T19:02:01.155943934Z"}

I would also like to note that I was able to successfully load and use a GPTQ model using ExLlamav2_HF. These issues all seem to point to something with the llama.cpp loader.
Scratch that - I see the same error with ctransformers with max GPU layers:

{"log":"19:08:34-830622 INFO Loading "llama-2-7b-chat.Q3_K_S.gguf" \n","stream":"stdout","time":"2024-03-14T19:08:34.832135355Z"}
I'm starting to suspect an issue with either your system or Unraid. However, given that other users are having success with Unraid, the former seems more likely. Another aspect of this is that your GPU is quite old, and perhaps that is a factor.

Unfortunately, what you shared is not a stack trace - it is just log output. You can see an example of a stack trace in the first post of #44 - it contains line numbers, modules, etc. If you can find and post that when it crashes, then I might be able to identify the problem. I may have a fix in mind if I can see what the issue is...
I checked, and the Tesla P40s I have for inferencing use the same Pascal architecture as your 1070. I haven't used them in a while but I will fire them up and test - if I encounter the same problem as you, then it could be an issue with old hardware.
Unfortunately I only know how to retrieve the logs from Docker, which are like the files I've provided in earlier posts. I've looked around and I'm not entirely sure how to get the stack trace. Could someone possibly advise on where I could get this in Unraid?
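On any Docker host, Unraid included, the container's stdout/stderr - which is where a Python traceback would normally appear - can be read back from the daemon itself. Assuming the container is named `text-generation-webui` (substitute your own container name):

```sh
# Show the most recent container output, including any traceback printed
# just before the crash
docker logs --tail 200 text-generation-webui

# Locate the raw JSON log file the daemon keeps for this container
docker inspect --format '{{.LogPath}}' text-generation-webui
```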
That's amazing! Thank you so much!
Could you maybe tell me what specific settings you used to run it, aside from the GPU layers? It's entirely possible that I've just misconfigured something along the way as well. Still digging around for the stack trace.
Hi there @Atinoda, I have now discovered that it is actually my CPU causing this error. My machine has a Xeon E5-1660 v2, which is quite old and therefore only supports the AVX instruction set, not the newer AVX2 instructions that the standard llama.cpp builds require. If I had properly described my machine specs we would've found this earlier, so again I apologize, and thank you all for the quick and helpful guidance.
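For anyone wanting to check their own hardware before going down the same road, the supported vector instruction sets are listed in the CPU flags reported by the kernel - for example:

```sh
# List the AVX-family flags; an AVX-only CPU such as the Xeon E5-1660 v2
# will show "avx" but no "avx2"
grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u
```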
Hi @ErrickVDW - thank you for sharing your results - I appreciate it and it might also help other people! I had originally considered AVX-only CPUs to probably be an edge case... but perhaps that's not how it is. Did you manage to get it working? It is possible to build a version without the need for those instructions - I can help you with that, if you like. Seems like there is not a pre-baked version for nvidia without AVX2, but it might be possible to put one together. Another alternative is to spin up two containers - one CPU-only without AVX2, and one normal nvidia container but avoiding models that use the CPU.
Hey @Atinoda, sorry for the silence on my end. After trying to figure out how to rebuild ooba for AVX and coming up short, I was hoping to ask for your guidance and assistance in creating an AVX-only nvidia container for these GGUF and GGML models. Any help would be greatly appreciated.
Hi @ErrickVDW, no problem - we all get busy! Glad that you're back to the LLMs. Oobabooga has released a set of requirements for no AVX2 - I have built an image for you to try out. Please try pulling the new image and let me know how it goes.
Wow @Atinoda, I've pulled the image and have instantly been able to run the GGUF model that I had the first issues with! I can even split between GPU and CPU! Very eager to get some new models going. I can't thank you enough for breathing new life into my old hardware!
You're very welcome, and I'm glad that it worked for you! I'm a big supporter of keeping computers going - they've basically been crazy powerful for over a decade now and I've got plenty of mature gear still in operation myself. One of my inferencing rigs is P40-based, and although it struggles with newer quant methods - it's great value. Thank you for testing it, and I'll probably add the variant to the project later - but I'll wait for an upstream release including it first. Here is the Dockerfile:

####################
### BUILD IMAGES ###
####################
# COMMON
FROM ubuntu:22.04 AS app_base
# Pre-reqs
RUN apt-get update && apt-get install --no-install-recommends -y \
git vim build-essential python3-dev python3-venv python3-pip
# Instantiate venv and pre-activate
RUN pip3 install virtualenv
RUN virtualenv /venv
# Credit, Itamar Turner-Trauring: https://pythonspeed.com/articles/activate-virtualenv-dockerfile/
ENV VIRTUAL_ENV=/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN pip3 install --upgrade pip setuptools
# Copy and enable all scripts
COPY ./scripts /scripts
RUN chmod +x /scripts/*
### DEVELOPERS/ADVANCED USERS ###
# Clone oobabooga/text-generation-webui
RUN git clone https://github.com/oobabooga/text-generation-webui /src
# Use script to check out specific version
ARG VERSION_TAG
ENV VERSION_TAG=${VERSION_TAG}
RUN . /scripts/checkout_src_version.sh
# To use local source: comment out the git clone command then set the build arg `LCL_SRC_DIR`
#ARG LCL_SRC_DIR="text-generation-webui"
#COPY ${LCL_SRC_DIR} /src
#################################
# Copy source to app
RUN cp -ar /src /app
# NVIDIA-CUDA
# Base No AVX2
FROM app_base AS app_nvidia_avx
# Install pytorch for CUDA 12.1
RUN pip3 install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 \
--index-url https://download.pytorch.org/whl/cu121
# Install oobabooga/text-generation-webui
RUN ls /app
RUN pip3 install -r /app/requirements_noavx2.txt
# Extended No AVX2
FROM app_nvidia_avx AS app_nvidia_avx_x
# Install extensions
RUN chmod +x /scripts/build_extensions.sh && \
. /scripts/build_extensions.sh
######################
### RUNTIME IMAGES ###
######################
# COMMON
FROM ubuntu:22.04 AS run_base
# Runtime pre-reqs
RUN apt-get update && apt-get install --no-install-recommends -y \
python3-venv python3-dev git
# Copy app and src
COPY --from=app_base /app /app
COPY --from=app_base /src /src
# Instantiate venv and pre-activate
ENV VIRTUAL_ENV=/venv
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
# Finalise app setup
WORKDIR /app
EXPOSE 7860
EXPOSE 5000
EXPOSE 5005
# Required for Python print statements to appear in logs
ENV PYTHONUNBUFFERED=1
# Force variant layers to sync cache by setting --build-arg BUILD_DATE
ARG BUILD_DATE
ENV BUILD_DATE=$BUILD_DATE
RUN echo "$BUILD_DATE" > /build_date.txt
ARG VERSION_TAG
ENV VERSION_TAG=$VERSION_TAG
RUN echo "$VERSION_TAG" > /version_tag.txt
# Copy and enable all scripts
COPY ./scripts /scripts
RUN chmod +x /scripts/*
# Run
ENTRYPOINT ["/scripts/docker-entrypoint.sh"]
# Extended without AVX2
FROM run_base AS default-nvidia-avx
# Copy venv
COPY --from=app_nvidia_avx_x $VIRTUAL_ENV $VIRTUAL_ENV
# Variant parameters
RUN echo "Nvidia Extended (No AVX2)" > /variant.txt
ENV EXTRA_LAUNCH_ARGS=""
CMD ["python3", "/app/server.py"]

And this is an example command you could use to build it:

docker build \
--build-arg BUILD_DATE="Now" \
--build-arg VERSION_TAG="nightly" \
--target default-nvidia-avx -t text-generation-webui:default-nvidia-avx \
--progress=plain .
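Once the build finishes, the tag produced by the command above can be run in the usual way - the port and model mount here are examples rather than required values:

```sh
# Run the locally built no-AVX2 image (host path and port are examples)
docker run -d \
  --gpus all \
  -p 7860:7860 \
  -v /path/on/host/models:/app/models \
  --name tgw-noavx2 \
  text-generation-webui:default-nvidia-avx
```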