Brave Leo AI using Ollama and Intel GPU #12248
@NikosDi you should do one thing at a time. First make sure that "ollama run <model>" runs successfully on your GPU. If it works, then it's easy to use it with Brave: run "ollama serve" in a terminal, go to Brave and add a local model, and for the server endpoint just copy the link they give you in the brackets. Then open Leo, select your model, and that's it. To automate this, i.e. to have "ollama serve" start automatically on boot, you need a Windows service; I don't know if it's possible to write one there, but on Linux I did it and it works fantastically.
@user7z thanks for your reply. I could be doing something wrong, but before I run "ollama serve" I use this script from cmd every time: python -m venv llm_env
For the model name I use "llama3.1" and the server endpoint is "http://localhost:11434/v1/chat/completions". Everything is offloaded to the GPU and I get this error: Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error) TIA
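(Editor's note: the rest of the script is not reproduced above. Purely as an illustration, here is roughly what such a launch script looks like per the IPEX-LLM Ollama quickstart; the venv name comes from the line above, while the oneAPI path, the one-time install commands, and the environment values are assumptions about that guide rather than the author's exact file.)

```bat
rem One-time setup (sketch, following the IPEX-LLM Ollama quickstart):
rem   python -m venv llm_env
rem   llm_env\Scripts\activate.bat
rem   pip install --pre --upgrade ipex-llm[cpp]
rem   init-ollama.bat   (creates the ollama symlinks in the current directory)

rem Per-session launch script (assumes the init-ollama symlinks are in this directory)
call llm_env\Scripts\activate.bat
rem oneAPI runtime; the thread below disagrees on whether this call is needed
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1
ollama serve
```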
@NikosDi, either you've been misguided or there is a Windows bug. Read the guide carefully, step by step, and see if you did something wrong. Ollama should use llama.cpp as a backend, which in turn uses ipex-llm. Please forget about Brave right now, and don't run the script: open a terminal yourself, run "ollama serve", and try to chat from another terminal. This way you can debug the problem more precisely.
Hi @NikosDi, could you please provide the full logs returned on the ollama server side? You may also follow the "install windows gpu" document and the "install ollama" document to prepare your environment.
As I wrote above, I followed a different guide, based on the PDF from Intel, which doesn't install or use a conda environment. My exact system specifications are: Windows 11 24H2 (26100.2033) - Intel ARC A380 - drivers v6079. The full log file of the Ollama server is this:
Running the command line prompt as Nikos (administrator)
Running the command line prompt as Administrator
TIA
@NikosDi you should install Visual Studio 2022 and select the Desktop development with C++ workload, as the guide suggests. Then run "ollama serve", open another terminal, and execute: ollama run llama3.2:1b and see if it runs.
What GPU are you running Leo AI on? All the guides refer to the A770; mine is an A380.
@NikosDi I think ipex-llm ollama supports the A380, you may follow our github guides to try it.
@sgwhat Number 14 says: Where should I add this -c parameter, and what is "-c xx"?
Hi @NikosDi, could you please provide the detailed log returned by "ollama serve" and the script you used to run ollama?
Hello @sgwhat. If you have already checked the text files above and they don't cover what you need, please give me some instructions on how to provide the detailed log returned by "ollama serve". In those texts above I have included all the commands I use to run Ollama and the responses from my Windows environment, but I can provide them again as a script, along with the detailed log, if you assist with instructions. Thank you.
Hi @NikosDi, I have checked your runtime log. Please follow only our github guide to run ipex-llm ollama, and you may run it again without executing setvars.bat.
Hello @sgwhat. I have already installed Ollama from the official page https://ollama.com/download on two of my PCs (one Windows 11, one Linux Ubuntu 24.04.1), and Brave Leo AI works like a charm on both of them using the CPU. The Ollama installer has built-in support for NVIDIA CUDA and AMD ROCm, but no Intel GPU support. For the Intel GPU I followed the guide from June 2024; I don't know if it's already obsolete. I have already installed Intel's AI Playground on my Win 11 system, the original Ollama CPU setup, and the above Python environment for the Intel GPU (Intel oneAPI, Python). The truth is that it would be a lot more convenient if there were a single installer like AI Playground or similar, to avoid manual installations for the Ollama Intel GPU setup, since it's not built in. I'm not in the mood to follow another guide and install more environments (miniforge, conda, etc.). If troubleshooting is impossible following Intel's guide from June 2024, then I have to stop here, because we're going in circles. Maybe it would be useful if you could answer what I asked before regarding the "Native API failed" error. I'm using Llama3.1 as the model and I would like to try adding -c xx and see the results, if you could assist with where to add this parameter. Thank you.
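(Editor's note: the -c question is never answered in the thread. If the "-c xx" flag from the guide is llama.cpp's context-size option, then to the best of my knowledge the usual Ollama-side equivalent is the num_ctx parameter, set for example in a Modelfile; the model and value below are only illustrative.)

```
# Hypothetical Modelfile: bake a larger context window into a derived model
FROM llama3.1
PARAMETER num_ctx 4096
```

It would be built with something like "ollama create llama3.1-4k -f Modelfile" and then selected in Leo under the new name.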
Hi @NikosDi, I have tested running ipex-llm ollama on a Windows 11 Arc A380 laptop and it works fine. Also, please note that you should not call setvars.bat.
@sgwhat It was just this "set OLLAMA_NUM_PARALLEL=1" parameter I had to add to the script. The utilization of the A380 is more than 90%, sometimes even 99%, and the speed of response is unbelievable compared to my Core i7 9700, many times faster. Also, the call to setvars.bat is mandatory; otherwise I get multiple missing DLL errors, like the one I posted here. Thank you very much for the effort.
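(Editor's note: a minimal sketch of how the working serve script ends up looking after this comment; the oneAPI path is an assumption, and the thread above disagrees on whether the setvars.bat call should be there at all, but it was reported as required on this machine.)

```bat
rem oneAPI runtime; reported above as needed to avoid missing-DLL errors
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
rem the addition reported above that made the A380 work
set OLLAMA_NUM_PARALLEL=1
ollama serve
```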
@sgwhat Unfortunately, even using the exact same model (Llama 3.1 - 8B) for CPU and Intel GPU on the exact same page, the results are completely different when using Leo AI to summarize the page. The A380 is ~9x faster than the Core i7 9700, but the results are sometimes almost garbage. It hallucinates a lot and always gives me very short summaries compared to the CPU version, which is perfect: extremely slow, but perfect in both size and accuracy. I don't know if there is anything I could change in the parameters to improve the Intel GPU results in terms of quality, even at the cost of speed. Thank you.
Hi @NikosDi, based on my tests, I haven't observed any noticeable difference in answer quality when running ipex-llm ollama on GPU or CPU, nor have I encountered issues with poor output on the A380. Could you provide the detailed responses from Leo AI?
Hello @sgwhat. My comparison is between the default Ollama installation, which has built-in support for CUDA and ROCm and therefore falls back to the CPU on Intel GPU hardware, and the Intel GPU using IPEX-LLM. It's default Ollama (CPU) vs IPEX (Intel GPU). Your comparison is IPEX (CPU) vs IPEX (Intel GPU). I'm interested in testing your comparison too: how could I run IPEX using my CPU? I'll post all my results here with all configurations. Thank you.
You may set OLLAMA_NUM_GPU=0 to run ipex-llm ollama on the CPU.
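(Editor's note: a minimal sketch of the two modes, assuming the Windows cmd script from earlier; the 999 value for full GPU offload is taken from the IPEX-LLM quickstart rather than from this thread.)

```bat
rem CPU-only run of ipex-llm ollama: keep every layer on the CPU
set OLLAMA_NUM_GPU=0
ollama serve

rem (for comparison, the GPU script above uses OLLAMA_NUM_GPU=999 to offload all layers)
```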
Hello @sgwhat. So, using my Windows 11 PC and Intel ARC A380, I downloaded four different LLMs, or we should call them SLMs (Small Language Models): Qwen2.5:7B, Mistral, Llama3.1:8B, Gemma2:9B. In CPU mode I can obviously run all of them (I have 32GB of RAM). Regarding speed, the results are extremely clear: IPEX CPU is definitely not on par with native Ollama CPU. The difference is huge, with Ollama CPU ~7 times faster than IPEX CPU on the Intel Core i7 9700. The Intel GPU, on the other hand, is also ~9 times faster than Ollama CPU, so the differences are huge and clear. Regarding quality, I cannot say for sure. As a personal preference, I probably prefer Mistral as my favorite model, running on the Intel GPU of course. Two questions:
1. In order to run these models, I strictly have to set OLLAMA_NUM_PARALLEL=1 again. Is it possible to change the script so that it supports Mistral and the other models using the same script?
2. I don't want to downgrade my kernel or install an older Ubuntu version. Are there newer instructions for Ubuntu 24.04 regarding IPEX-LLM installation?
Thank you.
Hi @NikosDi,
@sgwhat Maybe you could add to your research the possibility of embedding IPEX-LLM inside the official Ollama installer, just like NVIDIA CUDA and AMD ROCm. Thank you
Hi @NikosDi. Regarding your previous question 1, we tested and found that when
@sgwhat So, I found out the issue. I was changing models from inside the Leo settings to test them, and I was getting the error, but not only for Mistral. If you load a model with the "ollama serve" script, you have to stick with that model on the Intel GPU with IPEX-LLM until it unloads from VRAM. If you don't want to wait for the default 5 minutes, you can close and run the "ollama serve" script again, or unload the loaded model instantly with this command: curl http://localhost:11434/api/generate -d "{\"model\": \"$selected_model\", \"keep_alive\": 0}" When I was trying CPU models using IPEX-LLM and the "ollama serve" script with OLLAMA_NUM_GPU=0, I didn't have such issues. I don't know if the problem is due to limited VRAM vs RAM (6GB vs 32GB) or a limitation of IPEX-LLM (GPU) vs IPEX-LLM (CPU).
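(Editor's note: a small bash sketch of that unload step with quoting that survives the shell, plus a check of what is currently resident. "$selected_model" is the variable from the author's own script, and "ollama ps" assumes a reasonably recent Ollama build.)

```sh
#!/usr/bin/env bash
# Show which model (if any) is still loaded in VRAM
ollama ps

# Ask the server to unload it immediately instead of waiting out the 5-minute keep-alive
curl http://localhost:11434/api/generate \
  -d "{\"model\": \"$selected_model\", \"keep_alive\": 0}"
```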
@NikosDi, it's because the A380 only has 6GB of VRAM available.
But I'm really struggling to make these scripts run from inside another script, in order to turn that into a service (so it runs automatically). So this question is more Linux-related than Ollama- or IPEX-LLM-related: how do I make these scripts (either one or both) run from inside a script?
@NikosDi I understand what you're trying to do. I spent days on this reading the systemd man pages. I will go to my laptop where I have the service and post it for you here.
@NikosDi
#########################
[Service]
#WorkingDirectory=/var/lib/ollama
#Environment="HOME=/var/lib/ollama"
Restart=on-failure

[Install]
#########################

systemctl --user daemon-reload

You might want to change some environment variables or delete some; edit it according to your needs. I like to store my models in /var/lib/ollama so they can be used with the distribution-packaged ollama, but that causes file-permission issues and would probably need some permission adjustments, so you probably don't need the commented environment variables.
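(Editor's note: the unit above is only partially preserved. As a hypothetical reconstruction of such a user unit, with the ExecStart path, description, and targets being assumptions rather than the original file:)

```ini
# ~/.config/systemd/user/ollama.service  (hypothetical sketch)
[Unit]
Description=Ollama server (IPEX-LLM build)
After=network-online.target

[Service]
# Point ExecStart at whatever script sources oneAPI and runs "ollama serve"
ExecStart=%h/ipex-llm/start-ollama.sh
#WorkingDirectory=/var/lib/ollama
#Environment="HOME=/var/lib/ollama"
Restart=on-failure

[Install]
WantedBy=default.target
```

It would be reloaded and enabled with "systemctl --user daemon-reload" followed by "systemctl --user enable --now ollama.service" (and "loginctl enable-linger <user>" if it should start before login).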
After many hours of "chatting" with ChatGPT (the free version without a subscription, using the ChatGPT 4o mini model), it finally works, using the shell script:
Service script:
EnvironmentFile called by service script:
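(Editor's note: the actual files are not reproduced above. Purely as an illustration, here is a minimal sketch of the wrapper script such a setup typically uses; the oneAPI path, the script name start-ollama.sh, and the environment values are assumptions, not the author's real files.)

```sh
#!/usr/bin/env bash
# start-ollama.sh -- hypothetical wrapper that the systemd unit's ExecStart points at
# Load the oneAPI runtime so the IPEX-LLM build of ollama finds its SYCL libraries
source /opt/intel/oneapi/setvars.sh
# Assumes the ipex-llm ollama binary (or init-ollama symlink) is on PATH
exec ollama serve
```

The unit would then carry a line such as EnvironmentFile=/etc/ollama.env, and that file would hold the variables discussed earlier in the thread (OLLAMA_NUM_PARALLEL=1, OLLAMA_NUM_GPU=999, and so on).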
@user7z
@NikosDi you're welcome. For security reasons it's better to make it a user service.
@user7z Because I want to use other devices on the same LAN which have no dGPU inside (only extremely slow iGPUs), in order to have Ollama hardware acceleration. Have you tried using Brave Leo AI over the LAN (using the PC with the dGPU as a server)? Have you found any nice GUI app using an Ollama chatbot for the same use (client/server) over the LAN?
@NikosDi everything I did is for local use, just for my laptop, but here is a nice portable GUI I used before; it's named open-webui.
@NikosDi if that is what you're looking for and you use podman instead of docker, I have a nice config if you need it.
So, using Brave Leo AI directly from another PC on the same LAN is not possible, due to Brave's mandatory use of SSL when using a remote Ollama (it demands HTTPS for network connections to Ollama). Unfortunately, Ollama doesn't support HTTPS (only HTTP), so the only solution is an intermediate HTTPS proxy, which I'm not going to install. Regarding beautiful desktop GUI apps leveraging Ollama IPEX-LLM, I can suggest two: Those two can also be configured in client/server mode, so you can have an Ollama IPEX-LLM server with an Intel GPU on your LAN, and other client devices can access the server via
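(Editor's note: a minimal sketch of that client/server arrangement, assuming a server at the placeholder address 192.168.1.50 and the standard OLLAMA_HOST variable; as noted above, Brave Leo itself still refuses the plain-HTTP remote endpoint, so this only applies to the GUI clients mentioned.)

```sh
# On the server with the Intel GPU: listen on all interfaces instead of localhost only
export OLLAMA_HOST=0.0.0.0:11434
ollama serve

# On any client on the LAN: point the GUI's Ollama endpoint at
#   http://192.168.1.50:11434
```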
For Brave to use Ollama it doesn't need HTTPS.
Hello.
I'm trying to use Brave Leo AI with Ollama using an Intel GPU.
The instructions from Brave using local LLMs via Ollama are here:
https://brave.com/blog/byom-nightly/
The instructions from Intel using Ollama with Intel GPU are here:
https://www.intel.com/content/www/us/en/content-details/826081/running-ollama-with-open-webui-on-intel-hardware-platform.html
How could I combine those?
I want to use Brave Leo AI (not Open WebUI) running on Intel GPU via Ollama.
My system:
Windows 11/ Intel ARC A380
Thank you.