
Can't run 3B models #4177

Closed
tk-master opened this issue Nov 22, 2023 · 12 comments · Fixed by #4156

@tk-master

I can't seem to run 3B models (no issue with 7B and 13B). I'm getting the following error:

GGML_ASSERT: I:\llama-cpp\llama.cpp\ggml-cuda.cu:6709: ne00 == n_dims && "ne00 != n_dims is not implemented for CUDA yet"

I tried rocket-3b.Q5_K_M and akins-3b.Q6_K from TheBloke.

Tested on Windows 10 with:
main.exe -ngl 35 -m rocket-3b.Q5_K_M.gguf -p Hello

@slaren slaren linked a pull request Nov 22, 2023 that will close this issue
@maddes8cht
Contributor

maddes8cht commented Nov 22, 2023

That's the reason why I removed my akins model that I had already converted.
Try some 3B models from
https://huggingface.co/collections/maddes8cht/stablelm-gguf-65552101dcf410fd072c6c87
or from
https://huggingface.co/collections/maddes8cht/open-llama-gguf-6554be6495885c0796c026a6

These should work with current llama.cpp versions.

@maddes8cht
Contributor

Personally, I think https://huggingface.co/maddes8cht/NousResearch-Nous-Capybara-3B-V1.9-gguf seems to be one of the best StableLMs out there, but some others are rather good too in some fields.

@slaren
Collaborator

slaren commented Nov 22, 2023

Offloading up to 34 layers should already work. The linked PR solves the issue for all stablelm models.
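
In the meantime, a workaround sketch based on the comment above: run the original command with -ngl lowered to 34, so the layer hitting the unimplemented CUDA path (the assert itself says ne00 != n_dims is not implemented for CUDA yet) stays on the CPU:

main.exe -ngl 34 -m rocket-3b.Q5_K_M.gguf -p Hello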

@maddes8cht
Contributor

Actually, I think the akins model didn't work for me either, but I'm currently not sure about the error it caused and cannot test it right now.
The other models in my collection have been tested and work with up to 34 layers of CUDA offloading.

@maddes8cht
Contributor

I found that the models akins, rocket and hermes could be converted using the convert-hf-to-gguf.py script, but running them resulted in an error:

llm_load_print_meta: BOS token = 0 '<|endoftext|>'
llm_load_print_meta: EOS token = 0 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<|endoftext|>'
llm_load_print_meta: LF token  = 32 '?'
llm_load_tensors: ggml ctx size =    0.00 MiB
llm_load_tensors: using CUDA for GPU acceleration
error loading model: create_tensor: tensor 'token_embd.weight' not found
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'E:\hf\cxllin-StableHermes-3b-gguf\cxllin-StableHermes-3b-Q6_K.gguf'
main: error: unable to load model

This error has nothing to do with the 34-layer offloading limit; it occurs with -ngl 0 as well.

You can find these non-working models in this collection:
https://huggingface.co/collections/maddes8cht/not-functional-models-655ee9a96821269b27f528e5

A converted Hermes is missing; I converted it but didn't upload it.
The original model is
https://huggingface.co/cxllin/StableHermes-3b
while
https://huggingface.co/cxllin/StableMed-3b
available as GGUF at
https://huggingface.co/maddes8cht/cxllin-StableMed-3b-gguf
seems to work fine.
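
For reference, the conversion and quantization steps mentioned above presumably looked roughly like the following; the model path, output file names, and quant type here are illustrative, not taken from the thread:

python convert-hf-to-gguf.py E:\hf\StableHermes-3b --outtype f16 --outfile cxllin-StableHermes-3b-f16.gguf
quantize.exe cxllin-StableHermes-3b-f16.gguf cxllin-StableHermes-3b-Q6_K.gguf Q6_K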

@slaren
Collaborator

slaren commented Nov 23, 2023

It may be better to open a different issue about that. I tested TheBloke's akins and rocket GGUFs and they work.

@tk-master
Author

It does indeed work with 34 layers, but why? The CPU is doing some of the work that way, which is not ideal.

@maddes8cht
Contributor

Okay, maybe I just need to re-convert with a recent build. I converted these models quite early, when stablelm support first became available.

@KerfuffleV2
Collaborator

KerfuffleV2 commented Nov 24, 2023

I converted the Rocket 3B yesterday and still can't offload the last KV cache layer.

I know there are some models where the necessary support for offloading all layers (especially non-repeating layers) just isn't there. For example, it's the same with Persimmon - you can't offload the last non-repeating KV layer. These models are different architectures - Persimmon is its own (as far as I know), Rocket is Bloom (at least according to the metadata).

@slaren
Collaborator

slaren commented Nov 24, 2023

With #4156 it should be possible to offload everything, but it still needs to be merged.

@maddes8cht
Contributor

I reconverted the mentioned models with a current release, and now they work; I'm currently uploading them.
Now I'm not sure:
Might there be a quality loss in the other models I already converted and that worked?
Should I reconvert those models too?

@KerfuffleV2
Collaborator

Should I reconvert those models too?

I think the only change was to inference, not conversion. So it wasn't an issue with making the models, just with running them, which seems to be fixed now that #4156 has been merged.
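
To pick up the fix, pulling and rebuilding llama.cpp should allow offloading all 35 layers again. A minimal sketch, assuming a make-based CUDA build of that era (Windows builds would typically use CMake with -DLLAMA_CUBLAS=ON instead):

git pull
make clean && make LLAMA_CUBLAS=1
./main -ngl 35 -m rocket-3b.Q5_K_M.gguf -p Hello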
