
Can't run 3B models #4177

Closed
tk-master opened this issue Nov 22, 2023 · 12 comments · Fixed by #4156

@tk-master

I can't seem to run 3B models (no issue with 7B and 13B). I'm getting the following error:

GGML_ASSERT: I:\llama-cpp\llama.cpp\ggml-cuda.cu:6709: ne00 == n_dims && "ne00 != n_dims is not implemented for CUDA yet"

I tried rocket-3b.Q5_K_M and akins-3b.Q6_K from TheBloke.

Tested on Windows 10 with:
main.exe -ngl 35 -m rocket-3b.Q5_K_M.gguf -p Hello

@slaren slaren linked a pull request Nov 22, 2023 that will close this issue
@maddes8cht
Contributor

maddes8cht commented Nov 22, 2023

That's the reason why I removed my akins model that I had already converted.
Try some 3B models from
https://huggingface.co/collections/maddes8cht/stablelm-gguf-65552101dcf410fd072c6c87
or from
https://huggingface.co/collections/maddes8cht/open-llama-gguf-6554be6495885c0796c026a6

These should work with current llama.cpp versions.

@maddes8cht
Contributor

Personally, I think https://huggingface.co/maddes8cht/NousResearch-Nous-Capybara-3B-V1.9-gguf seems to be one of the best StableLMs out there, but some others are rather good too in some fields.

@slaren
Collaborator

slaren commented Nov 22, 2023

Offloading up to 34 layers should already work. The linked PR solves the issue for all stablelm models.
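
In the meantime, a workaround sketch based on the comment above: run the original command with -ngl lowered to 34, so the layer hitting the unimplemented CUDA path (the assert itself says ne00 != n_dims is not implemented for CUDA yet) stays on the CPU:

main.exe -ngl 34 -m rocket-3b.Q5_K_M.gguf -p Hello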

@maddes8cht
Contributor

Actually, I think the akins model didn't work for me either, but I'm currently not sure about the error it caused and cannot test it right now.
The other models in my collection have been tested and work with up to 34 layers of CUDA offloading.

@maddes8cht
Contributor

I found that the models akins, rocket and hermes could be converted using the convert-hf-to-gguf.py script, but running them resulted in an error:

llm_load_print_meta: BOS token = 0 '<|endoftext|>'
llm_load_print_meta: EOS token = 0 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<|endoftext|>'
llm_load_print_meta: LF token  = 32 '?'
llm_load_tensors: ggml ctx size =    0.00 MiB
llm_load_tensors: using CUDA for GPU acceleration
error loading model: create_tensor: tensor 'token_embd.weight' not found
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'E:\hf\cxllin-StableHermes-3b-gguf\cxllin-StableHermes-3b-Q6_K.gguf'
main: error: unable to load model

This error has nothing to do with the 34-layer offloading limit; it occurs with -ngl 0 as well.

You can find these non-working models in this collection:
https://huggingface.co/collections/maddes8cht/not-functional-models-655ee9a96821269b27f528e5

A converted Hermes is missing; I converted it but didn't upload it.
The original model is
https://huggingface.co/cxllin/StableHermes-3b
while
https://huggingface.co/cxllin/StableMed-3b
available as GGUF at
https://huggingface.co/maddes8cht/cxllin-StableMed-3b-gguf
seems to work fine.
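
For reference, the conversion and quantization steps mentioned above presumably looked roughly like the following; the model path, output file names, and quant type here are illustrative, not taken from the thread:

python convert-hf-to-gguf.py E:\hf\StableHermes-3b --outtype f16 --outfile cxllin-StableHermes-3b-f16.gguf
quantize.exe cxllin-StableHermes-3b-f16.gguf cxllin-StableHermes-3b-Q6_K.gguf Q6_K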

@slaren
Collaborator

slaren commented Nov 23, 2023

It may be better to open a different issue about that. I tested TheBloke's akins and rocket GGUFs and they work.

@tk-master
Author

It does indeed work with 34 layers, but why? The CPU is doing some of the work that way, which is not ideal.

@maddes8cht
Contributor

Okay, maybe I just need to re-convert with a recent build. I converted these models quite early, when stablelm support first became available.

@KerfuffleV2
Collaborator

KerfuffleV2 commented Nov 24, 2023

I converted the Rocket 3B yesterday and still can't offload the last KV cache layer.

I know there are some models where the necessary support for offloading all layers (especially non-repeating layers) just isn't there. For example, it's the same with Persimmon - you can't offload the last non-repeating KV layer. These models are different architectures - Persimmon is its own (as far as I know), Rocket is Bloom (at least according to the metadata).

@slaren
Collaborator

slaren commented Nov 24, 2023

With #4156 it should be possible to offload everything, but it still needs to be merged.

@maddes8cht
Contributor

I reconverted the mentioned models with a current release, and now they work; I'm currently uploading them.
Now I'm not sure:
Might there be a quality loss in the other models I already converted and that worked?
Should I reconvert those models too?

@KerfuffleV2
Collaborator

Should I reconvert those models too?

I think the only change was to inference, not conversion. So it wasn't an issue with making the models, just with running them, which seems to be fixed now that #4156 has been merged.
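
To pick up the fix, pulling and rebuilding llama.cpp should allow offloading all 35 layers again. A minimal sketch, assuming a make-based CUDA build of that era (Windows builds would typically use CMake with -DLLAMA_CUBLAS=ON instead):

git pull
make clean && make LLAMA_CUBLAS=1
./main -ngl 35 -m rocket-3b.Q5_K_M.gguf -p Hello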
