fix: #3549, #3552 - Inference on CPU is slower on Jan 0.5.3 #3602
Conversation
lgtm
Great documentation of changes - I appreciate the effort put in on this.
I have linked this PR in janhq/cortex.cpp#1156, to document our use of llama.cpp
@louis-jan Please merge and have @imtuyethan QA as part of 0.5.4 when you're ready
Lgtm!
@louis-jan Btw, let's standardize on
Oops, yeah, a period is not allowed; cortexcpp should be fine, but that comes from the cortex releases. We will update it later when cortex.cpp is released. From Jan, it should just decompress and leave the file as is.
@dan-homebrew
The new build prevents me from loading AI models.
OS: Arch Linux, Build: 624
@Castafers Could you please help share your app.log? I will investigate further. Update: This is likely a malformed request body from the GUI, not really related to the changes here. Could you please also share a screenshot of the thread settings?
Sure thing, thank you for your contribution! Note: Upon a factory reset, the program remains broken.
Sorry, the log file is not accessible. Could you please attach it here or re-upload it? Thank you for your help.
I'm sorry about that, it's fixed now.
What a great find @Castafers! I have investigated and it turns out there is a regression in the latest dev build. We will fix it there and update this PR with a fixed build. Context: Regression found in this PR. Reproduce:
You're awesome 😎 Thank you for the quick find and analysis from the logs!
fix: correct extension description
fix: vulkan should not have instructions or run mode included
fix: OS path delimiter
test: add tests
chore: should set run mode to CPU programmatically when ngl is not set
chore: shorten download URL
chore: bump cpu-instructions - 0.0.13
Force-pushed from d6b8682 to 9abed5f
Rebased
Hi @Castafers, here is the build with the latest fix from the
Thank you so much @Castafers!
The new build 625 is working nominally and performance is excellent. Thank you so much for fixing the severe performance bug! I was hardly able to use the AI at all beforehand.
Problem
In version 0.5.3, the app ships with non-AVX binaries only (previous versions shipped AVX2 binaries), which causes significant performance degradation, especially in CPU mode.
Changes in this PR
1. Fixed the app's inference performance degradation when running in CPU mode. The app now bundles 7 binaries in total, including:
Windows & Linux
macOS
This addresses the degradation across different CPU instruction sets: the app now picks the correct binary for the instructions the CPU actually supports.
When the GPU Acceleration setting is ON, the app uses the AVX2 CUDA binaries by default (CUDA 12.0 or 11.7, depending on the CUDA version installed by the user). This structure is the same as that of the llama.cpp releases and previous app releases.
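For illustration, here is a minimal sketch of how that selection could look. All names in it (the `pickBinary` function, the variant labels, and the `cpuFlags` input standing in for whatever the bundled cpu-instructions package reports) are assumptions, not the extension's actual code:

```typescript
// Hypothetical sketch only; variant names and inputs are assumptions, not the extension's real code.
type BinaryVariant =
  | 'noavx'
  | 'avx'
  | 'avx2'
  | 'avx512'
  | 'avx2-cuda-12-0'
  | 'avx2-cuda-11-7'

function pickBinary(
  cpuFlags: string[],            // e.g. ['AVX', 'AVX2'], as reported by a CPU-feature probe
  gpuAccelerationOn: boolean,
  installedCudaVersion?: string  // e.g. '12.0' or '11.7'
): BinaryVariant {
  if (gpuAccelerationOn) {
    // GPU Acceleration ON: default to the AVX2 CUDA build matching the installed CUDA toolkit.
    return installedCudaVersion?.startsWith('12') ? 'avx2-cuda-12-0' : 'avx2-cuda-11-7'
  }
  // CPU mode: pick the most capable build this CPU actually supports.
  if (cpuFlags.includes('AVX512')) return 'avx512'
  if (cpuFlags.includes('AVX2')) return 'avx2'
  if (cpuFlags.includes('AVX')) return 'avx'
  return 'noavx'
}
```

With a shape like this, the non-AVX build is only a last resort instead of the default, which is the regression this PR fixes.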
2. Fall back to CPU mode when NGL is 0
This fix adds a small enhancement: the app falls back to CPU mode when NGL is not defined or is set to 0. (See the discussion in #3549.)
This is still a bit confusing for now, since the UI only allows a minimum NGL value of 1.
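A minimal sketch of that fallback, assuming a helper along these lines (`resolveRunMode` is a hypothetical name, not the extension's real API):

```typescript
// Hypothetical sketch of the fallback described above; not the extension's actual code.
type RunMode = 'cpu' | 'gpu'

function resolveRunMode(ngl?: number): RunMode {
  // NGL (number of GPU layers) undefined or 0 means nothing is offloaded, so run on CPU.
  return ngl === undefined || ngl === 0 ? 'cpu' : 'gpu'
}
```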
3. Fixed the issue "The model can't start / The specified module could not be found"
This PR includes a fix that allows the extension to locate the engine's DLL files correctly, so users will no longer encounter the issue above.
Folder structure (the same for Linux, but with .so):
There are no longer separate win-cpu and win-cuda folders. This is aligned with the releases structure.
4. No more duplicate downloads of cortex-cpp binaries
In previous releases, we shipped cortex-cpp under different folders, e.g., win-cpu/cortex-cpp and win-cuda/cortex-cpp. Now we can introduce more binaries without duplicating cortex-cpp, and the structure is aligned with the llama.cpp releases.
For Windows and Linux (on Linux there is no .exe suffix, and .so is used instead of .dll):
For macOS it is slightly different: cortex-cpp is built for a specific architecture, so it is located under a platform-arch folder instead of bin.
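As a hedged sketch of what this shared layout could mean for path resolution (the function name and the exact folder names are assumptions based on the description above, not the extension's real code):

```typescript
import * as path from 'path'

// Illustrative sketch of resolving the single shared cortex-cpp binary per platform.
// The folder names (bin, mac-<arch>) follow the description above but are assumptions.
function cortexExecutablePath(extensionRoot: string): string {
  if (process.platform === 'darwin') {
    // macOS: cortex-cpp is built per architecture, so it lives under a platform-arch folder.
    return path.join(extensionRoot, `mac-${process.arch}`, 'cortex-cpp')
  }
  // Windows and Linux: a single copy under bin/, alongside the engine variants.
  const executable = process.platform === 'win32' ? 'cortex-cpp.exe' : 'cortex-cpp'
  return path.join(extensionRoot, 'bin', executable)
}
```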
5. Unit tests have been added to the extension.
Previously, there were likely no tests for the extension at all; there are now 30 tests.
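For example, the kind of behavior such tests might cover, in Jest style (this mirrors the hypothetical `resolveRunMode` sketch above, not the extension's actual test suite):

```typescript
// Illustrative Jest-style tests; they assume the resolveRunMode sketch above is
// importable from the extension source, and do not reflect the real test files.
describe('run mode fallback', () => {
  it('falls back to CPU mode when ngl is undefined', () => {
    expect(resolveRunMode(undefined)).toBe('cpu')
  })

  it('falls back to CPU mode when ngl is 0', () => {
    expect(resolveRunMode(0)).toBe('cpu')
  })

  it('keeps GPU mode when ngl is positive', () => {
    expect(resolveRunMode(33)).toBe('gpu')
  })
})
```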
6. Removed duplicated cortex-cpp logic between the extension and the API Server
The model-start use case in the current API server is quite confusing: the user has to start the model from the API Server page before starting the API Server, so the /model/start case is not valid here since it is not synced with the UX. If it were needed, it should be implemented in cortex-cpp. This PR removes the duplicated and confusing logic.
Fixes Issues
To test this PR
Run:
make clean
make dev
What could be improved
How should we handle CUDA binaries on CPUs that do not support the AVX2 instruction set?
The same question applies to the Vulkan binary, which is currently built with AVX2 by default for offloading.
Self Checklist