
Cannot convert llama3 8b model to gguf #7021

Closed
Bedoshady opened this issue May 1, 2024 · 19 comments

Comments

@Bedoshady

Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.

I downloaded the model from Meta using the steps provided, and I have a 14 GB .pth file. When I try to convert it using convert.py, it fails with RuntimeError: Internal: could not parse ModelProto from H:\Downloads\llama3-main\Meta-Llama-3-8B\tokenizer.model. When I add --vocab-type bpe, it instead gives FileNotFoundError: Could not find a tokenizer matching any of ['bpe'].

If the bug concerns the server, please try to reproduce it first using the server test scenario framework.

@Galunid
Collaborator

Galunid commented May 1, 2024

convert.py doesn't support llama 3 yet. You can use convert-hf-to-gguf.py with llama 3 downloaded from huggingface
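For reference, a minimal sketch of that workflow (the repo id, paths, and output settings are illustrative, the gated repo requires an accepted license plus an access token, and the converter flags may change, so check --help):

from huggingface_hub import snapshot_download

# Pull the HF-format (safetensors) release of the model locally.
local_dir = snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B",   # illustrative; any HF-format llama 3 repo
    local_dir="Meta-Llama-3-8B-hf",
    token="hf_...",                         # your HF access token (the official repo is gated)
)

# Then, from a llama.cpp checkout, run something like:
#   python convert-hf-to-gguf.py Meta-Llama-3-8B-hf --outtype f16 --outfile llama3-8b-f16.gguf
print("Downloaded to", local_dir)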

@Galunid Galunid closed this as completed May 1, 2024
@Bedoshady
Author

Is there anything that supports llama 3? I just want to run the model.

@oldgithubman

convert.py doesn't support llama 3 yet. You can use convert-hf-to-gguf.py with llama 3 downloaded from huggingface

Thanks for telling us. I gotta say, it's getting real annoying wasting endless hours chasing these things down because the devs can't be bothered to update the relevant info in the main readme (which, BTW, makes no mention of "convert-hf-to-gguf.py" that I'm aware of). Seriously, I can't be the only one who is infuriated by this pattern of behavior in this community.

Devs: documentation matters. What would take you, what, five minutes to update, would save the community probably hundreds if not thousands of cumulative hours. We appreciate what you do (well, I do, anyway), but this is just dumb and lazy. How many botched ggufs are being proliferated because of this?

@teleprint-me
Contributor

@oldmanjk You're welcome to contribute.

@oldgithubman

@oldmanjk You're welcome to contribute.

And how am I supposed to do that if I don't know what's going on? Way to miss the point

@teleprint-me
Contributor

@oldmanjk I understand the point perfectly fine. You can figure it out and then add it to the docs. If there aren't any docs, then create them. It's a fairly simple thought process. Complaining about it to people who are literally donating their time isn't productive or helpful. I have nothing else to say on the matter. Best of luck.

@oldgithubman

@oldmanjk I understand the point perfectly fine. You can figure it out and then add it to the docs. If there aren't any docs, then create them. It's a fairly simple thought process. Complaining about it to people who are literally donating their time isn't productive or helpful. I have nothing else to say on the matter. Best of luck.

Clearly you don't understand. Development is a continuous process and things change quickly here. If you want users to keep up with development and keep the documentation updated, you've skipped CSci 101, where you would have been taught documentation is one of the most important things for a developer to do well. Since when do users write manuals? You'd basically have to become a dev to be able to do that. I don't understand how this is so hard to comprehend. You've also mistaken constructive criticism for complaining. I'm trying to help you devs understand the user perspective. My tone is intentional to convey the frustration many users feel but are too afraid to voice. If you don't see how this is immensely helpful, well, I guess I should have expected that. I don't really care what you think about me. If you want this project to thrive, you need better documentation. Telling the users to create it "isn't productive or helpful."

@ProjectAtlantis-dev

ProjectAtlantis-dev commented May 2, 2024

convert.py doesn't support llama 3 yet. You can use convert-hf-to-gguf.py with llama 3 downloaded from huggingface

There's also the recently created convert-hf-to-gguf-update.py, but I think you must include your HF access token on the command line or else it will report a bunch of failures, presumably when trying to pull from HF. To get an HF access token, log into HF, go to your profile, and then Settings ... Access Tokens
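(If you want to check the token before running the update script, a quick sketch, assuming the huggingface_hub package is installed; the token string is a placeholder:)

from huggingface_hub import HfApi

token = "hf_..."  # placeholder - paste the token from Settings ... Access Tokens
# Should print your HF username; an invalid token raises an authentication error.
print(HfApi(token=token).whoami()["name"])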

To recap, if you are reading this, you probably ended up here seeking the llama-bpe stuff in an effort to get rid of the strange error:

llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             

That means your gguf files will kinda work, but the quality is crap compared to the bpe version. Hence, folks are trying to go back to the original safetensors and re-convert, because most of the stuff uploaded to HF is sub-par. Note that you must run this whole convert process in a Python 3.11 venv, because attempting the convert in 3.12 just throws errors about distutils, etc. You also need a ton of memory unless you also add the temp file stuff, which appears to be in the current convert-hf script but not in this update.py.
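If you want to check whether a gguf you already converted actually carries the pre-tokenizer metadata (instead of waiting for the warning at load time), the gguf Python package that ships with llama.cpp (gguf-py) can read the header. A rough sketch; the file path is illustrative and the field name is what current converters write, so it may change:

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("llama3-8b-f16.gguf")  # illustrative path
# Print the tokenizer-related metadata keys stored in the file.
for name in reader.fields:
    if name.startswith("tokenizer."):
        print(name)
# A good llama 3 conversion should include a "tokenizer.ggml.pre" entry.
if "tokenizer.ggml.pre" not in reader.fields:
    print("missing pre-tokenizer type - expect the GENERATION QUALITY warning above")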

There are a bunch of convert scripts and it would be nice if there was an easy way to sort them by last updated in github so it was obvious which are most relevant for llama3.

Most of the llama3 convert saga discussion can be found here
#6745

@oldgithubman

oldgithubman commented May 2, 2024

There are a bunch of convert scripts and it would be nice if there was an easy way to sort them by last updated in github so it was obvious which are most relevant for llama3.

Maybe you can't sort them by last updated, but it does at least say when they were last updated. Unfortunately, "convert.py - llama : support Llama 3 HF conversion" (which is what it currently says) is apparently false. Even worse, using convert.py on llama 3 does work (for me, at least), so people like me assume it was a good conversion and build things off of it, not knowing something might be broken. Then we find out maybe it was a bad conversion and we have to unravel days (if not worse) of work. It's a cancerous mess and hopefully the devs will do better going forward. Let me reiterate, I appreciate what the devs are doing. This isn't meant to be a "complaint," but constructive criticism. And it's not directed at any one dev in particular, or even just the devs of this project. Bad documentation practices appear to be largely endemic in the open-source community, and that needs to change (good documentation is even more important for open-source). My CSci profs beat it into our heads that good documentation was practically rule number 1. That was 20 years ago, though. Maybe things have changed (for the worse). Driving fast can get you to your destination quicker, sure, but what good is it if all your passengers fall out on the way? (Most of this comment wasn't really a reply to you, BTW - sorry to piggyback)

@Bedoshady
Author

I have the original .pth file from meta not the safetensor files from huggingface

@oldgithubman

I have the original .pth file from meta not the safetensor files from huggingface

At the moment, as far as I know, you need the safetensor files. I gave up on getting the .pth file converted and just deleted it and got the safetensor files instead

@ProjectAtlantis-dev

I don't think anyone uses the pth files anymore due to security risks
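(For context: .pth checkpoints are pickle-based, so loading one can in principle execute arbitrary code, while safetensors files are just tensor data plus a JSON header. A small illustration, with made-up file names; torch.load's weights_only flag restricts unpickling to tensors and plain containers:)

import torch
from safetensors.torch import load_file

# Pickle-based checkpoint: weights_only=True avoids arbitrary object unpickling.
state = torch.load("consolidated.00.pth", map_location="cpu", weights_only=True)

# safetensors: no code-execution path by design.
state_st = load_file("model-00001-of-00004.safetensors")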

@Bedoshady
Author

OK, I converted the .pth file using huggingface transformers, but I don't know how to run it or whether the conversion was correct.
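(A quick way to sanity-check an HF-format conversion before going further, assuming the transformers package; the directory name is whatever you used as the conversion output dir, and loading the full 8B model this way needs a lot of RAM:)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "Meta-Llama-3-8B-hf"  # illustrative: your conversion output directory
tok = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto")

ids = tok("The capital of France is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))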

@oldgithubman

I don't think anyone uses the pth files anymore due to security risks

You mean except for meta, who just released llama 3 in pth?

@ProjectAtlantis-dev

Yeah I just saw that lol

@ProjectAtlantis-dev

ProjectAtlantis-dev commented May 5, 2024

If you converted successfully, you should at least have a ...f16 gguf file that you may need to then run quantize to get it down further (8 bit at a min will cut it in half again). But every time you do that, quality suffers. Anything below 4 bit is pretty busted but I suspect 3 bit 70b is still better than 8 bit 8b. Quant sizes that are multiples of 2 tend to be faster inference, any odd number will be slower

If you have multiple filenames because you downloaded split files straight from HF, you can just supply the first file in the "1 of N" and it will load the rest of the series
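(To actually run the resulting gguf from Python, one option is the llama-cpp-python bindings; a minimal sketch with an illustrative file name:)

from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(model_path="llama3-8b-f16.gguf", n_ctx=8192)  # path is illustrative
out = llm("The capital of France is", max_tokens=16)
print(out["choices"][0]["text"])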

@oldgithubman

If you converted successfully, you should at least have a ...f16 gguf file that you may need to then run quantize to get it down further (8 bit at a min will cut it in half again). But every time you do that, quality suffers. Anything below 4 bit is pretty busted but I suspect 3 bit 70b is still better than 8 bit 8b. Quant sizes that are multiples of 2 tend to be faster inference, any odd number will be slower

If you have multiple filenames because you downloaded split files straight from HF, you can just supply the first file in the "1 of N" and it will load the rest of the series

I think you want to convert b16 to f32. It sounds like going from b16 to f16 might create significant losses. Then quant straight from f32 to keep things as lossless as possible (minus the quant, of course). Otherwise, you're going to get generational losses. For llama-3-70b-instruct, I went b16 -> f32 -> imatrix -> IQ2_XXS (which fits on a 4090 with full context) and the results seem subjectively decent. I haven't made any objective comparison to native llama-3-8b-instruct yet, which I really should do
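(Driven from Python, that pipeline looks roughly like the sketch below. Paths are illustrative, the calibration text is whatever you choose, and the tool names and flags are as of the time of writing, so check each tool's --help:)

import subprocess

src = "Meta-Llama-3-70B-Instruct-hf"        # illustrative paths throughout
f32 = "llama3-70b-instruct-f32.gguf"
imx = "llama3-70b-instruct-imatrix.dat"
dst = "llama3-70b-instruct-IQ2_XXS.gguf"

# 1) bf16 safetensors -> f32 gguf (skips a lossy bf16 -> f16 intermediate)
subprocess.run(["python", "convert-hf-to-gguf.py", src, "--outtype", "f32", "--outfile", f32], check=True)
# 2) build an importance matrix from calibration text
subprocess.run(["./imatrix", "-m", f32, "-f", "calibration.txt", "-o", imx], check=True)
# 3) quantize straight from f32 using the imatrix
subprocess.run(["./quantize", "--imatrix", imx, f32, dst, "IQ2_XXS"], check=True)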

@ProjectAtlantis-dev

ProjectAtlantis-dev commented May 6, 2024

you mean there's more than bfloat16 on hf?

@oldgithubman

oldgithubman commented May 6, 2024

you mean there's more than bfloat16 on hf?

Yes. As an example:
FP16 - https://huggingface.co/openai/whisper-large-v3
