Cannot convert llama3 8b model to gguf #7021
Comments
Is there anything that supports Llama 3? I just want to run the model.
Thanks for telling us. I gotta say, it's getting real annoying wasting endless hours chasing these things down because the devs can't be bothered to update the relevant info in the main readme (which, BTW, makes no mention of "convert-hf-to-gguf.py" that I'm aware of). Seriously, I can't be the only one who is infuriated by this pattern of behavior in this community. Devs: documentation matters. What would take you, what, five minutes to update, would save the community probably hundreds if not thousands of cumulative hours. We appreciate what you do (well, I do, anyway), but this is just dumb and lazy. How many botched ggufs are being proliferated because of this?
@oldmanjk You're welcome to contribute.
And how am I supposed to do that if I don't know what's going on? Way to miss the point.
@oldmanjk I understand the point perfectly fine. You can figure it out and then add it to the docs. If there aren't any docs, then create them. It's a fairly simple thought process. Complaining about it to people who are literally donating their time isn't productive or helpful. I have nothing else to say on the matter. Best of luck.
Clearly you don't understand. Development is a continuous process and things change quickly here. If you expect users to keep up with development and keep the documentation updated themselves, you've skipped CSci 101, where you would have been taught that documentation is one of the most important things for a developer to do well. Since when do users write the manuals? You'd basically have to become a dev to be able to do that. I don't understand how this is so hard to comprehend. You've also mistaken constructive criticism for complaining. I'm trying to help you devs understand the user perspective. My tone is intentional, to convey the frustration many users feel but are too afraid to voice. If you don't see how this is immensely helpful, well, I guess I should have expected that. I don't really care what you think about me. If you want this project to thrive, you need better documentation. Telling the users to create it "isn't productive or helpful."
There's also the recently created convert-hf-to-gguf-update.py, but I think you must include your HF access token on the command line, or else it will report a bunch of failures, presumably when trying to pull from HF. To get an HF access token, log into HF, go to your profile, then Settings -> Access Tokens. To recap: if you are reading this, you probably ended up here seeking the llama-bpe stuff in an effort to get rid of the strange error:
That means your GGUF files will kinda work, but the quality is crap compared to the BPE version. Hence, folks are trying to go back to the original safetensors and re-convert, because most of the stuff uploaded to HF is sub-par. Note that you must run this whole convert process in a Python 3.11 venv, because attempting the convert in 3.12 just throws errors about distutils, etc. You also need a ton of memory unless you also add the temp-file stuff, which appears to be in the current convert-hf script but not this update.py thing. There are a bunch of convert scripts, and it would be nice if there was an easy way to sort them by last updated in GitHub so it was obvious which are most relevant for Llama 3. Most of the Llama 3 convert saga discussion can be found here
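For anyone else landing here, this is roughly the flow described above, as a single sketch. Treat it as an assumption rather than a recipe: script arguments have changed across llama.cpp versions, the paths are placeholders, and `hf_xxxxxxxx` stands for your own Hugging Face access token.

```bash
# Create a Python 3.11 venv (3.12 hits the distutils issue mentioned above).
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# convert-hf-to-gguf-update.py pulls tokenizer data from Hugging Face,
# so it wants an access token on the command line.
python convert-hf-to-gguf-update.py hf_xxxxxxxx

# Then convert the HF safetensors checkout to GGUF.
python convert-hf-to-gguf.py path/to/Meta-Llama-3-8B \
    --outtype f16 --outfile llama-3-8b-f16.gguf
```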
Maybe you can't sort them by last updated, but it does at least say when they were last updated. Unfortunately, "convert.py - llama : support Llama 3 HF conversion" (which is what it currently says) is apparently false. Even worse, using convert.py on Llama 3 does work (for me, at least), so people like me assume it was a good conversion and build things off of it, not knowing something might be broken. Then we find out maybe it was a bad conversion and we have to unravel days (if not worse) of work. It's a cancerous mess, and hopefully the devs will do better going forward. Let me reiterate: I appreciate what the devs are doing. This isn't meant to be a "complaint," but constructive criticism. And it's not directed at any one dev in particular, or even just the devs of this project. Bad documentation practices appear to be largely endemic in the open-source community, and that needs to change (good documentation is even more important for open source). My CSci profs beat it into our heads that good documentation was practically rule number one. That was 20 years ago, though. Maybe things have changed (for the worse). Driving fast can get you to your destination quicker, sure, but what good is it if all your passengers fall out on the way? (Most of this comment wasn't really a reply to you, BTW - sorry to piggyback)
I have the original .pth file from Meta, not the safetensor files from Hugging Face.
At the moment, as far as I know, you need the safetensor files. I gave up on getting the .pth file converted, so I just deleted it and got the safetensor files instead.
I don't think anyone uses the .pth files anymore due to security risks.
OK, I converted the .pth file using Hugging Face transformers, but I don't know how to run it or whether the conversion was correct.
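For reference, the .pth-to-Hugging-Face step being described is usually done with the conversion script that ships in the transformers repo. This is only a hedged sketch: the script path and flag names (in particular `--llama_version`) vary between transformers versions, and the directories are placeholders.

```bash
# Hypothetical invocation of the transformers Llama weight conversion script;
# check the script's --help in your transformers version before relying on it.
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir Meta-Llama-3-8B \
    --model_size 8B \
    --llama_version 3 \
    --output_dir Meta-Llama-3-8B-hf
```

The resulting HF-format directory is what convert-hf-to-gguf.py expects as input.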
You mean except for Meta, who just released Llama 3 in .pth?
Yeah, I just saw that lol
If you converted successfully, you should at least have an ...f16 GGUF file that you may then need to run quantize on to get it down further (8-bit at a minimum will cut it in half again). But every time you do that, quality suffers. Anything below 4-bit is pretty busted, but I suspect 3-bit 70B is still better than 8-bit 8B. Quant sizes that are multiples of 2 tend to be faster at inference; any odd number will be slower. If you have multiple filenames because you downloaded split files straight from HF, you can just supply the first file in the "1 of N" and it will load the rest of the series.
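As a rough illustration of that f16-then-quantize flow (not the commenter's exact commands): the quantize binary name and script options differ between llama.cpp builds, so adjust to whatever your checkout actually provides.

```bash
# Convert the HF checkout to an f16 GGUF, then quantize it down to 8-bit.
python convert-hf-to-gguf.py Meta-Llama-3-8B-hf \
    --outtype f16 --outfile llama-3-8b-f16.gguf
./quantize llama-3-8b-f16.gguf llama-3-8b-q8_0.gguf Q8_0
```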
I think you want to convert bf16 to f32. It sounds like going from bf16 to f16 might create significant losses. Then quant straight from f32 to keep things as lossless as possible (minus the quant, of course); otherwise, you're going to get generational losses. For llama-3-70b-instruct, I went bf16 -> f32 -> imatrix -> IQ2_XXS (which fits on a 4090 with full context) and the results seem subjectively decent. I haven't made any objective comparison to native llama-3-8b-instruct yet, which I really should do.
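A sketch of that bf16 -> f32 -> imatrix -> IQ2_XXS pipeline, with the caveat that this is an assumed reconstruction rather than the poster's exact commands; `calibration.txt` is a placeholder for whatever text you use to build the importance matrix, and the binary names depend on your llama.cpp build.

```bash
# Convert to f32 to avoid the bf16 -> f16 precision loss discussed above.
python convert-hf-to-gguf.py Meta-Llama-3-70B-Instruct-hf \
    --outtype f32 --outfile llama-3-70b-instruct-f32.gguf

# Build an importance matrix from some calibration text.
./imatrix -m llama-3-70b-instruct-f32.gguf -f calibration.txt -o imatrix.dat

# Quantize straight from f32 using the imatrix.
./quantize --imatrix imatrix.dat \
    llama-3-70b-instruct-f32.gguf llama-3-70b-instruct-iq2_xxs.gguf IQ2_XXS
```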
You mean there's more than bfloat16 on HF?
Yes. As an example:
Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.
I downloaded the model from Meta using the steps provided and I have a 14 GB .pth file. I try to convert the model using convert.py but it fails, giving:
```
RuntimeError: Internal: could not parse ModelProto from H:\Downloads\llama3-main\Meta-Llama-3-8B\tokenizer.model
```

but when I add --vocab-type bpe it gives:

```
FileNotFoundError: Could not find a tokenizer matching any of ['bpe']
```
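For what it's worth, a hedged reconstruction of the commands behind those two errors (the model path is taken from the error message itself):

```bash
# Plain convert.py on the Meta .pth download -> the ModelProto error above.
python convert.py H:\Downloads\llama3-main\Meta-Llama-3-8B

# Retrying with --vocab-type bpe -> the FileNotFoundError above.
python convert.py H:\Downloads\llama3-main\Meta-Llama-3-8B --vocab-type bpe
```

As discussed in the comments, the path people report success with is converting the Hugging Face safetensors release with convert-hf-to-gguf.py rather than converting the .pth files with convert.py.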
If the bug concerns the server, please try to reproduce it first using the server test scenario framework.