How to resolve this error? Trying to run cuBLAS #218
Comments
Same problem here.
@maxjust if you found a solution, please post it here. Thanks!
Hi! The issue arises in `create_gpt_params`:

```cpp
gpt_params* create_gpt_params(const std::string& fname, const std::string& lora, const std::string& lora_base) {
    gpt_params* lparams = new gpt_params;
    fprintf(stderr, "%s: loading model %s\n", __func__, fname.c_str());

    // Initialize the 'model' member with the 'fname' parameter
    lparams->model = fname;
    lparams->lora_base = lora_base;
    lparams->lora_adapter = lora;
    if (lparams->lora_adapter.empty()) {
        fprintf(stderr, "no lora, disable mmap"); // <---- ?
        lparams->use_mmap = false;
    }
    return lparams;
}
```

And yes, on the surface this doesn't make any sense. However, it is likely masking the root cause of the problem, at least on my machine, and as such it should not be treated as a solution. It might influence compiler optimizations or the memory state at a given moment. Something is probably happening around the setting of the LoRA fields in `gpt_params` and their transfer from function to function. I'm still trying to figure it out.
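If you would rather experiment from the Go side instead of patching common.cpp, below is a minimal sketch of disabling mmap at load time. It assumes your go-llama checkout exposes an mmap model option; the `llama.SetMMap` name used here is an assumption and may differ or be absent in your version, so check options.go before relying on it. Note it also disables mmap unconditionally, not only when no LoRA adapter is set as in the patch above.

```go
package main

import (
	"fmt"

	llama "github.com/go-skynet/go-llama.cpp"
)

func main() {
	// Sketch only: disable mmap when loading the model instead of hard-coding
	// it in common.cpp. llama.SetMMap is an assumed option name; verify it
	// against options.go in your go-llama version.
	model, err := llama.New(
		"./models/7B/ggml-model-q4_0.bin", // adjust to your model path
		llama.SetContext(512),
		llama.SetMMap(false),
	)
	if err != nil {
		fmt.Println("failed to load model:", err)
		return
	}
	defer model.Free()
	fmt.Println("model loaded with mmap disabled")
}
```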
I have already had such issues in the past; that's the whole point of having the patch (which I would have avoided entirely, if possible). I opened a PR upstream trying to fix this the correct way, but it was rejected due to code style: ggml-org/llama.cpp#1902. The copy-by-value used all over the code seems to trigger misalignment of structures on different combinations of toolchains, and that causes this. It looks like a combination of nvcc version + gcc + Go is needed to trigger it. I used valgrind in the past as well, carefully trying to find the culprit, but nothing actually indicates what's behind the real issue code-wise, so we are back to hacks all the way down. I'll try to reproduce this on a GPU, but it really needs time and patience to play with valgrind and the like.
@gsiehien it's not working. I edited common.cpp and even edited the patch; nothing works. [error output]
@mudler ok, thanks. Can I buy you some coffee so you can speed up your cycles? Please help fix this; I will buy you coffees for it. I've been waiting a long time, and other approaches to working with llama.cpp, e.g. gpt-llama.cpp, don't work. Please help, thanks!
See: #218 Signed-off-by: mudler <[email protected]>
@hiqsociety can you try #224? Clone the repo again, checking out that PR's branch.
The PR was merged; try again with the latest master.
@mudler - I can confirm that it works. It's nice to have the workaround in the master branch, as it simplifies the builds. Thanks!
Awesome! Thanks for confirming 👍
@mudler you are a lifesaver! Where can I buy you a coffee? I really appreciate the work on this.
@mudler please check, thanks. [error output]
@mudler if I don't use the llama.cpp downloaded automatically, but swap in my own "working" llama.cpp, I get this instead: [error output]
either https://github.com/sponsors/mudler or https://www.buymeacoffee.com/mudler works
That should be fixed now in #228.
I can run it now, but I can't seem to get the same context size as without Go. Why? With an RTX 4060 I can do 1920 max tokens using pure llama.cpp with 100% CUDA offload; on go-llama I can only do a context size of around 650 without OOM. @mudler do you know why? How do I fix this?
Probably the batch size. Have a look at the llama parameters and check what you are setting when using go-llama.
@mudler you are right, it is! Do you have any idea which settings I have to set to match exactly what llama.cpp uses?
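For reference, here is a minimal go-llama sketch of setting these parameters explicitly to mirror a llama.cpp CLI run. The option names (`SetContext`, `SetGPULayers`, `SetNBatch`, `SetTokens`, `SetThreads`) and the flag mappings in the comments are assumptions based on typical go-llama versions; verify them against options.go in your checkout.

```go
package main

import (
	"fmt"

	llama "github.com/go-skynet/go-llama.cpp"
)

func main() {
	// Sketch: set the model options to match the flags you pass to llama.cpp.
	// Option names are assumptions; check options.go in your go-llama version.
	model, err := llama.New(
		"./models/7B/ggml-model-q4_0.bin", // adjust to your model path
		llama.SetContext(1920),  // roughly -c / --ctx-size
		llama.SetGPULayers(100), // roughly -ngl / --n-gpu-layers
		llama.SetNBatch(512),    // roughly -b / --batch-size (llama.cpp defaults to 512)
	)
	if err != nil {
		fmt.Println("failed to load model:", err)
		return
	}
	defer model.Free()

	out, err := model.Predict(
		"Hello",
		llama.SetTokens(256), // roughly -n / --n-predict
		llama.SetThreads(8),  // roughly -t / --threads
	)
	if err != nil {
		fmt.Println("prediction failed:", err)
		return
	}
	fmt.Println(out)
}
```

The batch size matters for VRAM because it sizes the scratch buffers used during prompt evaluation, so if it differs between your pure llama.cpp run and go-llama, the maximum context that fits on the GPU will differ too.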
@MathiasGS, can you help with this, please?