
Cannot convert llama3 8b model to gguf #7021

Closed
Bedoshady opened this issue May 1, 2024 · 19 comments

Comments

@Bedoshady

Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.

I downloaded the model from Meta using the steps provided, and I have a 14 GB .pth file. When I try to convert it using convert.py, it fails with RuntimeError: Internal: could not parse ModelProto from H:\Downloads\llama3-main\Meta-Llama-3-8B\tokenizer.model. When I add --vocab-type bpe, it instead gives FileNotFoundError: Could not find a tokenizer matching any of ['bpe'].

If the bug concerns the server, please try to reproduce it first using the server test scenario framework.

@Galunid
Collaborator

Galunid commented May 1, 2024

convert.py doesn't support llama 3 yet. You can use convert-hf-to-gguf.py with llama 3 downloaded from huggingface
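For reference, a minimal sketch of that workflow (the repo id, paths, and output settings are illustrative, the gated repo requires an accepted license plus an access token, and the converter flags may change, so check --help):

from huggingface_hub import snapshot_download

# Pull the HF-format (safetensors) release of the model locally.
local_dir = snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B",   # illustrative; any HF-format llama 3 repo
    local_dir="Meta-Llama-3-8B-hf",
    token="hf_...",                         # your HF access token (the official repo is gated)
)

# Then, from a llama.cpp checkout, run something like:
#   python convert-hf-to-gguf.py Meta-Llama-3-8B-hf --outtype f16 --outfile llama3-8b-f16.gguf
print("Downloaded to", local_dir)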

@Galunid Galunid closed this as completed May 1, 2024
@Bedoshady
Author

Is there anything that supports llama 3? I just want to run the model.

@oldgithubman

convert.py doesn't support llama 3 yet. You can use convert-hf-to-gguf.py with llama 3 downloaded from huggingface

Thanks for telling us. I gotta say, it's getting real annoying wasting endless hours chasing these things down because the devs can't be bothered to update the relevant info in the main readme (which, BTW, makes no mention of "convert-hf-to-gguf.py" that I'm aware of). Seriously, I can't be the only one who is infuriated by this pattern of behavior in this community.

Devs: documentation matters. What would take you, what, five minutes to update, would save the community probably hundreds if not thousands of cumulative hours. We appreciate what you do (well, I do, anyway), but this is just dumb and lazy. How many botched ggufs are being proliferated because of this?

@teleprint-me
Contributor

@oldmanjk You're welcome to contribute.

@oldgithubman

@oldmanjk You're welcome to contribute.

And how am I supposed to do that if I don't know what's going on? Way to miss the point

@teleprint-me
Contributor

@oldmanjk I understand the point perfectly fine. You can figure it out and then add it to the docs. If there aren't any docs, then create them. It's a fairly simple thought process. Complaining about it to people who are literally donating their time isn't productive or helpful. I have nothing else to say on the matter. Best of luck.

@oldgithubman

@oldmanjk I understand the point perfectly fine. You can figure it out and then add it to the docs. If there aren't any docs, then create them. It's a fairly simple thought process. Complaining about it to people who are literally donating their time isn't productive or helpful. I have nothing else to say on the matter. Best of luck.

Clearly you don't understand. Development is a continuous process and things change quickly here. If you want users to keep up with development and keep the documentation updated, you've skipped CSci 101, where you would have been taught documentation is one of the most important things for a developer to do well. Since when do users write manuals? You'd basically have to become a dev to be able to do that. I don't understand how this is so hard to comprehend. You've also mistaken constructive criticism for complaining. I'm trying to help you devs understand the user perspective. My tone is intentional to convey the frustration many users feel but are too afraid to voice. If you don't see how this is immensely helpful, well, I guess I should have expected that. I don't really care what you think about me. If you want this project to thrive, you need better documentation. Telling the users to create it "isn't productive or helpful."

@ProjectAtlantis-dev

ProjectAtlantis-dev commented May 2, 2024

convert.py doesn't support llama 3 yet. You can use convert-hf-to-gguf.py with llama 3 downloaded from huggingface

There's also the recently created convert-hf-to-gguf-update.py, but I think you must include your HF access token on the command line or else it will report a bunch of failures, presumably when trying to pull from HF. To get an HF access token, log into HF, go to your profile, and then Settings ... Access Tokens
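(If you want to check the token before running the update script, a quick sketch, assuming the huggingface_hub package is installed; the token string is a placeholder:)

from huggingface_hub import HfApi

token = "hf_..."  # placeholder - paste the token from Settings ... Access Tokens
# Should print your HF username; an invalid token raises an authentication error.
print(HfApi(token=token).whoami()["name"])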

To recap, if you are reading this, you probably ended up here seeking the llama-bpe stuff in an effort to get rid of the strange error:

llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             

That means your gguf files will kinda work, but the quality is crap compared to the bpe version. Hence, folks are trying to go back to the original safetensors and re-convert, because most of the stuff uploaded to HF is sub-par. Note that you must run this whole convert process in a Python 3.11 venv, because attempting the convert in 3.12 just throws errors about distutils, etc. You also need a ton of memory unless you also add the temp file stuff, which appears to be in the current convert-hf script but not in this update.py.
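If you want to check whether a gguf you already converted actually carries the pre-tokenizer metadata (instead of waiting for the warning at load time), the gguf Python package that ships with llama.cpp (gguf-py) can read the header. A rough sketch; the file path is illustrative and the field name is what current converters write, so it may change:

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("llama3-8b-f16.gguf")  # illustrative path
# Print the tokenizer-related metadata keys stored in the file.
for name in reader.fields:
    if name.startswith("tokenizer."):
        print(name)
# A good llama 3 conversion should include a "tokenizer.ggml.pre" entry.
if "tokenizer.ggml.pre" not in reader.fields:
    print("missing pre-tokenizer type - expect the GENERATION QUALITY warning above")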

There are a bunch of convert scripts and it would be nice if there was an easy way to sort them by last updated in github so it was obvious which are most relevant for llama3.

Most of the llama3 convert saga discussion can be found here
#6745

@oldgithubman

oldgithubman commented May 2, 2024

There are a bunch of convert scripts and it would be nice if there was an easy way to sort them by last updated in github so it was obvious which are most relevant for llama3.

Maybe you can't sort them by last updated, but it does at least say when they were last updated. Unfortunately, "convert.py - llama : support Llama 3 HF conversion" (which is what it currently says) is apparently false. Even worse, using convert.py on llama 3 does work (for me, at least), so people like me assume it was a good conversion and build things off of it, not knowing something might be broken. Then we find out maybe it was a bad conversion and we have to unravel days (if not worse) of work. It's a cancerous mess and hopefully the devs will do better going forward. Let me reiterate, I appreciate what the devs are doing. This isn't meant to be a "complaint," but constructive criticism. And it's not directed at any one dev in particular, or even just the devs of this project. Bad documentation practices appear to be largely endemic in the open-source community, and that needs to change (good documentation is even more important for open-source). My CSci profs beat it into our heads that good documentation was practically rule number 1. That was 20 years ago, though. Maybe things have changed (for the worse). Driving fast can get you to your destination quicker, sure, but what good is it if all your passengers fall out on the way? (Most of this comment wasn't really a reply to you, BTW - sorry to piggyback)

@Bedoshady
Author

I have the original .pth file from meta not the safetensor files from huggingface

@oldgithubman

I have the original .pth file from meta not the safetensor files from huggingface

At the moment, as far as I know, you need the safetensor files. I gave up on getting the .pth file converted and just deleted it and got the safetensor files instead

@ProjectAtlantis-dev

I don't think anyone uses the pth files anymore due to security risks
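(For context: .pth checkpoints are pickle-based, so loading one can in principle execute arbitrary code, while safetensors files are just tensor data plus a JSON header. A small illustration, with made-up file names; torch.load's weights_only flag restricts unpickling to tensors and plain containers:)

import torch
from safetensors.torch import load_file

# Pickle-based checkpoint: weights_only=True avoids arbitrary object unpickling.
state = torch.load("consolidated.00.pth", map_location="cpu", weights_only=True)

# safetensors: no code-execution path by design.
state_st = load_file("model-00001-of-00004.safetensors")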

@Bedoshady
Author

OK, I converted the .pth file using huggingface transformers, but I don't know how to run it or whether the conversion was correct.
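(A quick way to sanity-check an HF-format conversion before going further, assuming the transformers package; the directory name is whatever you used as the conversion output dir, and loading the full 8B model this way needs a lot of RAM:)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "Meta-Llama-3-8B-hf"  # illustrative: your conversion output directory
tok = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto")

ids = tok("The capital of France is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))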

@oldgithubman

I don't think anyone uses the pth files anymore due to security risks

You mean except for meta, who just released llama 3 in pth?

@ProjectAtlantis-dev

Yeah I just saw that lol

@ProjectAtlantis-dev

ProjectAtlantis-dev commented May 5, 2024

If you converted successfully, you should at least have a ...f16 gguf file that you may need to then run quantize to get it down further (8 bit at a min will cut it in half again). But every time you do that, quality suffers. Anything below 4 bit is pretty busted but I suspect 3 bit 70b is still better than 8 bit 8b. Quant sizes that are multiples of 2 tend to be faster inference, any odd number will be slower

If you have multiple filenames because you downloaded split files straight from HF, you can just supply the first file in the "1 of N" and it will load the rest of the series
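(To actually run the resulting gguf from Python, one option is the llama-cpp-python bindings; a minimal sketch with an illustrative file name:)

from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(model_path="llama3-8b-f16.gguf", n_ctx=8192)  # path is illustrative
out = llm("The capital of France is", max_tokens=16)
print(out["choices"][0]["text"])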

@oldgithubman

If you converted successfully, you should at least have a ...f16 gguf file that you may need to then run quantize to get it down further (8 bit at a min will cut it in half again). But every time you do that, quality suffers. Anything below 4 bit is pretty busted but I suspect 3 bit 70b is still better than 8 bit 8b. Quant sizes that are multiples of 2 tend to be faster inference, any odd number will be slower

If you have multiple filenames because you downloaded split files straight from HF, you can just supply the first file in the "1 of N" and it will load the rest of the series

I think you want to convert b16 to f32. It sounds like going from b16 to f16 might create significant losses. Then quant straight from f32 to keep things as lossless as possible (minus the quant, of course). Otherwise, you're going to get generational losses. For llama-3-70b-instruct, I went b16 -> f32 -> imatrix -> IQ2_XXS (which fits on a 4090 with full context) and the results seem subjectively decent. I haven't made any objective comparison to native llama-3-8b-instruct yet, which I really should do
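(Driven from Python, that pipeline looks roughly like the sketch below. Paths are illustrative, the calibration text is whatever you choose, and the tool names and flags are as of the time of writing, so check each tool's --help:)

import subprocess

src = "Meta-Llama-3-70B-Instruct-hf"        # illustrative paths throughout
f32 = "llama3-70b-instruct-f32.gguf"
imx = "llama3-70b-instruct-imatrix.dat"
dst = "llama3-70b-instruct-IQ2_XXS.gguf"

# 1) bf16 safetensors -> f32 gguf (skips a lossy bf16 -> f16 intermediate)
subprocess.run(["python", "convert-hf-to-gguf.py", src, "--outtype", "f32", "--outfile", f32], check=True)
# 2) build an importance matrix from calibration text
subprocess.run(["./imatrix", "-m", f32, "-f", "calibration.txt", "-o", imx], check=True)
# 3) quantize straight from f32 using the imatrix
subprocess.run(["./quantize", "--imatrix", imx, f32, dst, "IQ2_XXS"], check=True)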

@ProjectAtlantis-dev

ProjectAtlantis-dev commented May 6, 2024

you mean there's more than bfloat16 on hf?

@oldgithubman

oldgithubman commented May 6, 2024

you mean there's more than bfloat16 on hf?

Yes. As an example:
FP16 - https://huggingface.co/openai/whisper-large-v3
