
fix: #3549, #3552 - Inference on CPU is slower on Jan 0.5.3 #3602

Merged
merged 3 commits into from
Sep 11, 2024

Conversation

Contributor

@louis-jan louis-jan commented Sep 8, 2024

Problem

In version 0.5.3, the app ships with non-AVX binaries only (previously AVX2), which causes significant performance degradation, especially in CPU mode.

Changes in this PR

1. Fixed the app's inference performance degradation when running in CPU mode. The app now bundles the following binaries (7 variants each for Windows and Linux, 2 for macOS):

Windows & Linux

  • NO-AVX
  • AVX
  • AVX2
  • AVX512
  • VULKAN
  • AVX2-CUDA-12-0
  • AVX2-CUDA-11-7

MacOS

  • ARM64
  • X64

This addresses the degradation issues across different CPU instruction sets: the app now detects the CPU's capabilities and picks the binary optimized for the supported instructions.

When the GPU Acceleration setting is ON, the app uses the AVX2-CUDA binaries by default (CUDA 12.0 or 11.7, depending on the CUDA version installed by the user). This structure is the same as that of the llama.cpp releases and of previous app releases.
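
As a rough illustration of how this selection can work (a hedged sketch with hypothetical names, not the extension's actual code):

  // Hypothetical sketch of variant selection; the types and names are illustrative only.
  type CpuInstruction = 'noavx' | 'avx' | 'avx2' | 'avx512'

  interface GpuSettings {
    enabled: boolean
    vulkan?: boolean
    cudaVersion?: '12.0' | '11.7'
  }

  function selectBinaryVariant(platform: 'win' | 'linux', cpu: CpuInstruction, gpu: GpuSettings): string {
    // GPU Acceleration ON: default to the AVX2-CUDA build matching the installed CUDA version.
    if (gpu.enabled && gpu.cudaVersion) {
      return `${platform}-cuda-${gpu.cudaVersion === '12.0' ? '12-0' : '11-7'}`
    }
    // Vulkan offloading has its own build.
    if (gpu.enabled && gpu.vulkan) {
      return `${platform}-vulkan`
    }
    // CPU mode: pick the best instruction set the CPU supports.
    return `${platform}-${cpu}`
  }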

2. CPU acceleration when NGL is 0

This PR adds a small enhancement: the app falls back to CPU mode when NGL is not defined or is set to 0 (see the #3549 discussion).
This is a bit confusing for now, though, since the UX only allows a minimum of 1.
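
A minimal sketch of that fallback, again with hypothetical names rather than the extension's actual API:

  // Hypothetical sketch: fall back to CPU run mode when NGL is unset or 0.
  interface ModelLoadSettings {
    ngl?: number
  }

  function resolveRunMode(settings: ModelLoadSettings, gpuAccelerationOn: boolean): 'cpu' | 'gpu' {
    if (!gpuAccelerationOn) return 'cpu'
    if (settings.ngl === undefined || settings.ngl === 0) return 'cpu'
    return 'gpu'
  }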

3. Fixed the issue "The model can't start / The specified module could not be found"

This PR includes a fix that allows the extension to locate the engine's DLL files correctly, so users will no longer encounter the issue above.

Folder structure (the same on Linux, with .so instead of .dll)

bin\
  cortex-cpp.exe
  [cortex-cpp].dll
  [engine-dependencies].dll
  win-avx\engines\engine.dll
  win-avx2\engines\engine.dll
  win-cuda-12-0\engines\engine.dll
  ...

The win-cpu and win-cuda folders no longer exist. This is aligned with the releases structure.
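
A minimal sketch of how the engine library for a selected variant could be resolved on Windows/Linux (illustrative only; path.join handles the OS path delimiter):

  import { join } from 'path'

  // Hypothetical sketch: locate the engine library for the selected variant folder.
  // On Windows the library is engine.dll, on Linux engine.so.
  function engineLibraryPath(binDir: string, variant: string): string {
    const lib = process.platform === 'win32' ? 'engine.dll' : 'engine.so'
    return join(binDir, variant, 'engines', lib)
  }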

4. No more duplicate downloads of cortex-cpp binaries
In previous releases, we shipped cortex-cpp under multiple folders, e.g., win-cpu/cortex-cpp and win-cuda/cortex-cpp. Now we introduce more engine binaries without duplicating cortex-cpp, and the structure is aligned with the llama.cpp releases.

For Windows and Linux (on Linux there is no .exe suffix, and .so is used instead of .dll):

bin\
  cortex-cpp.exe
  [cortex-cpp].dll
  [engine-dependencies].dll
  win-noavx\engines\engine.dll  // Built with NOAVX flag
  win-avx\engines\engine.dll  // Built with AVX flag
  win-avx2\engines\engine.dll  // Built with AVX2 flag
  win-avx512\engines\engine.dll  // Built with AVX512 flag
  win-cuda-12-0\engines\engine.dll  // Built with AVX2, GGML_CUDA flag - CUDA 12.0 build env
  win-cuda-11-7\engines\engine.dll  // Built with AVX2, GGML_CUDA flag - CUDA 11.7 build env
  win-vulkan\engines\engine.dll  // Built with GGML_VULKAN flag
  ...

For macOS, it is a little different since cortex-cpp is built for a specific architecture, so it is located under the platform-arch folder instead of directly under bin.

bin/
  mac-arm64/
    cortex-cpp   // Built for the ARM64 architecture
    engine.dylib
  mac-x64/
    cortex-cpp   // Built for the x64 architecture
    engine.dylib
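
A small sketch of the corresponding macOS selection (hypothetical, assuming the folder names above):

  // Hypothetical sketch: on macOS, the whole cortex-cpp binary is per-architecture,
  // so the folder is chosen from the process architecture.
  function macVariantFolder(): 'mac-arm64' | 'mac-x64' {
    return process.arch === 'arm64' ? 'mac-arm64' : 'mac-x64'
  }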

5. Unit tests have been added to the extension.
Previously there were likely no tests in the extension; there are now 30 tests.

6. Removed duplicated cortex-cpp logic between the extension and the API server
The model-start use case in the current API server is quite confusing: the user should start the model from the API Server page before starting the API server, so the /model/start route is not valid here since it is not synced with the UX. If it were, it should be implemented in cortex-cpp. This PR removes the duplicated and confusing logic.

Performance increased significantly

  • CPU only, with the fix: >14 t/s (i9-13900K)
  • CPU only, without the fix: ~4 t/s (i9-13900K)
  • GPU accelerated: unchanged, ~120 t/s (RTX 3090)
  • Offloaded, NGL 15: unchanged, ~23 t/s (RTX 3090)
  • Offloaded, NGL 1: unchanged, ~14 t/s (RTX 3090)
  • Vulkan: unchanged, ~54 t/s (RTX 3090)
  • macOS: unchanged, ~29 t/s (M2 Pro)
  • Linux, no AVX: (screenshot)
  • Linux, AVX2: (screenshot)

Fixes issues #3549, #3552

To test this PR

Run:

  • Remove the installed extensions under the Jan Data Folder (since a dev build will not bump the version, no extensions migration is applied)
  • make clean
  • make dev
  • Chat with a model

What could be improved

How should we deal with CUDA binaries on CPUs that do not support AVX2 instruction sets?

  • Add NoAVX-CUDA builds, which would add 2 more heavy binaries?
  • Update the UX to disallow GPU offload on unsupported instruction sets, or will it just crash without offloading?

The same question applies to the Vulkan binary, which is currently built with AVX2 by default for offloading.

Self Checklist

  • Added relevant comments, esp in complex areas
  • Updated docs (for bug fixes / features)
  • Created issues for follow-up changes or refactoring needed

@github-actions github-actions bot added the type: bug Something isn't working label Sep 8, 2024
@louis-jan louis-jan marked this pull request as draft September 8, 2024 15:26
Contributor

github-actions bot commented Sep 8, 2024

Barecheck - Code coverage report

Total: 52.61%

Your code coverage diff: -1.48% ▾

Uncovered files and lines

  • core/src/node/api/restful/helper/startStopModel.ts: lines 13, 15, 19, 31-33, 42-43, 51, 61, 70-71, 73

@louis-jan louis-jan marked this pull request as ready for review September 8, 2024 15:59
@louis-jan louis-jan changed the title fix: #3549, #3552 - Inference on CPU is slower fix: #3549, #3552 - Inference on CPU is slower since 0.5.3 Sep 8, 2024
@louis-jan louis-jan changed the title fix: #3549, #3552 - Inference on CPU is slower since 0.5.3 fix: #3549, #3552 - Inference on CPU is slower on Jan 0.5.3 Sep 8, 2024
Collaborator

@hiento09 hiento09 left a comment


lgtm

@dan-menlo
Contributor

Review in progress

Contributor

@dan-menlo dan-menlo left a comment


Great documentation of changes - I appreciate the effort put in on this.

@dan-menlo
Contributor

I have linked this PR in janhq/cortex.cpp#1156, to document our use of llama.cpp

@dan-menlo
Contributor

@louis-jan Please merge and have @imtuyethan QA as part of 0.5.4 when you're ready

Contributor

@dan-menlo dan-menlo left a comment


Lgtm!

@dan-menlo
Contributor

dan-menlo commented Sep 9, 2024

@louis-jan Btw, let's standardize on cortex.cpp or cortexcpp (if no period is allowed) instead of Cortex-cpp

@louis-jan
Contributor Author

louis-jan commented Sep 9, 2024

@louis-jan Btw, let's standardize on cortex.cpp or cortexcpp (if no period is allowed) instead of Cortex-cpp

Oops, yeah, a period is not allowed, so cortexcpp should be fine, but that name comes from the cortex releases. We will update it later when cortex.cpp is released. On the Jan side, it should just decompress and leave the files as-is.

@hiento09
Collaborator

hiento09 commented Sep 9, 2024

@louis-jan Btw, let's standardize on cortex.cpp or cortexcpp (if no period is allowed) instead of Cortex-cpp

@dan-homebrew
(screenshot of the release assets)
Can I ask whether you are suggesting we change the release name from cortex-cpp-<version>-<os>-<arch>.tar.gz to cortexcpp-<version>-<os>-<arch>.tar.gz?

@Castafers

Castafers commented Sep 9, 2024

Feature test build:

https://github.com/janhq/jan/actions/runs/10768992783

The new build prevents me from loading AI models.

20240909 08:32:19.710231 UTC 18979 ERROR Unhandled exception in /inferences/server/loadmodel, what(): Value is not convertible to Int. - HttpAppFrameworkImpl.cc:124

OS: Arch Linux
Package: AppImage
GPU: RX 7900 XT
CPU: Ryzen 7 7900X3D
RAM: 32GB

Build: 624

@Castafers

https://github.com/janhq/jan/actions/runs/10768992783

Hi @Castafers, could you please take a screenshot of the download options you're seeing? I don't think the link is accessible without permission.

(screenshot of the download options)

@louis-jan
Contributor Author

louis-jan commented Sep 9, 2024

@Castafers Could you please help share your app.log? I will investigate further.

Update: This is likely a malformed request body from the GUI, not really related to these changes. Could you please also share a screenshot of the thread settings?

@Castafers

Castafers commented Sep 9, 2024

@Castafers Could you please help share your app.log? I will investigate further.

Sure thing, thank you for your contribution!
Here are the logs.

Note: Upon a factory reset, the program remains broken.

@louis-jan
Contributor Author

louis-jan commented Sep 9, 2024

logs

Sorry, the log file is not accessible. Could you please help me attach it here or re-upload it? Thank you for your help.

@Castafers

Castafers commented Sep 9, 2024

logs

Sorry, the log file is not accessible.

I'm sorry about that, it's fixed now.
I was attempting to obscure any personal information & set a deletion time, but that didn't work :/

@louis-jan
Contributor Author

louis-jan commented Sep 9, 2024

What a great find, @Castafers. I have investigated, and it turns out there is a regression in the latest dev build. We will fix it there and update this PR with a fixed build.

Context:
With any build after #3538, entering a value in any model-load settings component results in the model failing to load.

Regression found in this PR:
#3538

Reproduce:
Entering any context length or NGL value in the model settings input box produces a request with string values instead of numbers.
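
A minimal sketch of the kind of input coercion that would avoid this (illustrative field names, not the GUI's actual code):

  // Hypothetical sketch: coerce numeric model-load settings coming from text inputs,
  // so the request body carries numbers rather than strings.
  function normalizeLoadSettings(raw: { ctx_len?: unknown; ngl?: unknown }) {
    const toNumber = (v: unknown): number | undefined =>
      v === undefined || v === '' ? undefined : Number(v)
    return { ctx_len: toNumber(raw.ctx_len), ngl: toNumber(raw.ngl) }
  }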

@Castafers

What a great find @Castafers, I have investigated and it turned out there is a regression in the latest dev build. We will fix it from there and update with a fixed build in this PR.

Regression found in this PR #3538

Reproduce: Adjusting any context length or NGL produces a request with string values instead of numbers.

You're awesome 😎 Thank you for the quick find and analysis from the logs!

fix: correct extension description

fix: vulkan should not have instructions or run mode included

fix: OS path delimiter

test: add tests

chore: should set run mode to CPU programmatically when ngl is not set

chore: shorten download URL
chore: bump cpu-instructions - 0.0.13
@louis-jan louis-jan force-pushed the fix/3549-inference-on-cpu-is-slower branch from d6b8682 to 9abed5f Compare September 9, 2024 09:47
@louis-jan
Contributor Author

Rebased dev

@louis-jan
Contributor Author

louis-jan commented Sep 9, 2024

Hi @Castafers, here is the build with the latest fix from the dev branch. If you don't mind, please try this build and factory reset once again, just to make sure there is no cache from a previous broken dev build. Thank you!
https://github.com/janhq/jan/actions/runs/10770763664

@dan-menlo
Contributor

Thank you so much @Castafers!

@Castafers

Castafers commented Sep 9, 2024

Hi @Castafers, here is the build with the latest fix from the dev branch, if you don't mind, please try this build and factory reset once again. Just to make sure there is no cache from a previous broken dev build. Thank you! https://github.com/janhq/jan/actions/runs/10770763664

The new build 625 is working nominally and performance is excellent.
Tokens gracefully fall from 102 t/s to 68 t/s, then fluctuate between 68 t/s and 80 t/s, which I presume is normal now.

Thank you so much for fixing the severe performance bug! I was hardly able to use the AI at all beforehand.
You are all awesome, from support to programmers to all contributors. This is a highly successful project & team ❤️

@louis-jan louis-jan merged commit 5217437 into dev Sep 11, 2024
19 checks passed
@louis-jan louis-jan deleted the fix/3549-inference-on-cpu-is-slower branch September 11, 2024 07:03
@github-actions github-actions bot added this to the v.0.6.0 milestone Sep 11, 2024
Labels
type: bug Something isn't working