
fix: #3549, #3552 - Inference on CPU is slower on Jan 0.5.3 #3602

Merged
merged 3 commits into from
Sep 11, 2024

Conversation

Contributor

@louis-jan louis-jan commented Sep 8, 2024

Problem

In version 0.5.3, the app ships with non-AVX binaries only (previously AVX2), which causes significant performance degradation, especially in CPU mode.

Changes in this PR

1. Fixed the app's inference performance degradation when running in CPU mode. The app now bundles the following binaries (7 variants each for Windows and Linux, 2 for macOS):

Windows & Linux

  • NO-AVX
  • AVX
  • AVX2
  • AVX512
  • VULKAN
  • AVX2-CUDA-12-0
  • AVX2-CUDA-11-7

MacOS

  • ARM64
  • X64

This addresses the degradation issues across different CPU instruction sets: the app now detects the CPU's capabilities and picks the binary optimized for the supported instructions.

When the GPU Acceleration setting is ON, the app uses the AVX2-CUDA binaries by default (CUDA 12.0 or 11.7, depending on the CUDA version installed by the user). This structure is the same as that of the llama.cpp releases and of previous app releases.
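
As a rough illustration of how this selection can work (a hedged sketch with hypothetical names, not the extension's actual code):

  // Hypothetical sketch of variant selection; the types and names are illustrative only.
  type CpuInstruction = 'noavx' | 'avx' | 'avx2' | 'avx512'

  interface GpuSettings {
    enabled: boolean
    vulkan?: boolean
    cudaVersion?: '12.0' | '11.7'
  }

  function selectBinaryVariant(platform: 'win' | 'linux', cpu: CpuInstruction, gpu: GpuSettings): string {
    // GPU Acceleration ON: default to the AVX2-CUDA build matching the installed CUDA version.
    if (gpu.enabled && gpu.cudaVersion) {
      return `${platform}-cuda-${gpu.cudaVersion === '12.0' ? '12-0' : '11-7'}`
    }
    // Vulkan offloading has its own build.
    if (gpu.enabled && gpu.vulkan) {
      return `${platform}-vulkan`
    }
    // CPU mode: pick the best instruction set the CPU supports.
    return `${platform}-${cpu}`
  }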

2. CPU acceleration when NGL is 0

This PR adds a small enhancement: the app falls back to CPU mode when NGL is not defined or is set to 0 (see the #3549 discussion).
This is a bit confusing for now, though, since the UX only allows a minimum of 1.
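
A minimal sketch of that fallback, again with hypothetical names rather than the extension's actual API:

  // Hypothetical sketch: fall back to CPU run mode when NGL is unset or 0.
  interface ModelLoadSettings {
    ngl?: number
  }

  function resolveRunMode(settings: ModelLoadSettings, gpuAccelerationOn: boolean): 'cpu' | 'gpu' {
    if (!gpuAccelerationOn) return 'cpu'
    if (settings.ngl === undefined || settings.ngl === 0) return 'cpu'
    return 'gpu'
  }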

3. Fixed the issue "The model can't start / The specified module could not be found"

This PR includes a fix that allows the extension to locate the engine's DLL files correctly, so users will no longer encounter the issue above.

Folder structure (the same on Linux, with .so instead of .dll)

bin\
  cortex-cpp.exe
  [cortex-cpp].dll
  [engine-dependencies].dll
  win-avx\engines\engine.dll
  win-avx2\engines\engine.dll
  win-cuda-12-0\engines\engine.dll
  ...

The win-cpu and win-cuda folders no longer exist. This is aligned with the releases structure.
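
A minimal sketch of how the engine library for a selected variant could be resolved on Windows/Linux (illustrative only; path.join handles the OS path delimiter):

  import { join } from 'path'

  // Hypothetical sketch: locate the engine library for the selected variant folder.
  // On Windows the library is engine.dll, on Linux engine.so.
  function engineLibraryPath(binDir: string, variant: string): string {
    const lib = process.platform === 'win32' ? 'engine.dll' : 'engine.so'
    return join(binDir, variant, 'engines', lib)
  }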

4. No more duplicate downloads of cortex-cpp binaries
In previous releases, we shipped cortex-cpp under multiple folders, e.g., win-cpu/cortex-cpp and win-cuda/cortex-cpp. Now we introduce more engine binaries without duplicating cortex-cpp, and the structure is aligned with the llama.cpp releases.

For Windows and Linux (on Linux there is no .exe suffix, and .so is used instead of .dll):

bin\
  cortex-cpp.exe
  [cortex-cpp].dll
  [engine-dependencies].dll
  win-noavx\engines\engine.dll  // Built with NOAVX flag
  win-avx\engines\engine.dll  // Built with AVX flag
  win-avx2\engines\engine.dll  // Built with AVX2 flag
  win-avx512\engines\engine.dll  // Built with AVX512 flag
  win-cuda-12-0\engines\engine.dll  // Built with AVX2, GGML_CUDA flag - CUDA 12.0 build env
  win-cuda-11-7\engines\engine.dll  // Built with AVX2, GGML_CUDA flag - CUDA 11.7 build env
  win-vulkan\engines\engine.dll  // Built with GGML_VULKAN flag
  ...

For macOS, it is a little different since cortex-cpp is built for a specific architecture, so it is located under the platform-arch folder instead of directly under bin.

bin/
  mac-arm64/
    cortex-cpp   // Built for the ARM64 architecture
    engine.dylib
  mac-x64/
    cortex-cpp   // Built for the x64 architecture
    engine.dylib
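
A small sketch of the corresponding macOS selection (hypothetical, assuming the folder names above):

  // Hypothetical sketch: on macOS, the whole cortex-cpp binary is per-architecture,
  // so the folder is chosen from the process architecture.
  function macVariantFolder(): 'mac-arm64' | 'mac-x64' {
    return process.arch === 'arm64' ? 'mac-arm64' : 'mac-x64'
  }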

5. Unit tests have been added to the extension.
Previously there were likely no tests in the extension; there are now 30 tests.

6. Removed duplicated cortex-cpp logic between the extension and the API server
The model-start use case in the current API server is quite confusing: the user should start the model from the API Server page before starting the API server, so the /model/start route is not valid here since it is not synced with the UX. If it were, it should be implemented in cortex-cpp. This PR removes the duplicated and confusing logic.

Performance increased significantly

  • CPU only, with the fix: >14 t/s (i9-13900K)
  • CPU only, without the fix: ~4 t/s (i9-13900K)
  • GPU accelerated: unchanged, ~120 t/s (RTX 3090)
  • Offloaded, NGL 15: unchanged, ~23 t/s (RTX 3090)
  • Offloaded, NGL 1: unchanged, ~14 t/s (RTX 3090)
  • Vulkan: unchanged, ~54 t/s (RTX 3090)
  • macOS: unchanged, ~29 t/s (M2 Pro)
  • Linux, no AVX: (screenshot)
  • Linux, AVX2: (screenshot)

Fixes issues #3549, #3552

To test this PR

Run:

  • Remove the installed extensions under the Jan Data Folder (since a dev build will not bump the version, no extensions migration is applied)
  • make clean
  • make dev
  • Chat with a model

What could be improved

How should we deal with CUDA binaries on CPUs that do not support AVX2 instruction sets?

  • Add NoAVX-CUDA builds, which would add 2 more heavy binaries?
  • Update the UX to disallow GPU offload on unsupported instruction sets, or will it just crash without offloading?

The same question applies to the Vulkan binary, which is currently built with AVX2 by default for offloading.

Self Checklist

  • Added relevant comments, esp in complex areas
  • Updated docs (for bug fixes / features)
  • Created issues for follow-up changes or refactoring needed

@github-actions github-actions bot added the type: bug Something isn't working label Sep 8, 2024
@louis-jan louis-jan marked this pull request as draft September 8, 2024 15:26
Contributor

github-actions bot commented Sep 8, 2024

Barecheck - Code coverage report

Total: 52.61%

Your code coverage diff: -1.48% ▾

Uncovered files and lines

  • core/src/node/api/restful/helper/startStopModel.ts: lines 13, 15, 19, 31-33, 42-43, 51, 61, 70-71, 73

@louis-jan louis-jan marked this pull request as ready for review September 8, 2024 15:59
@louis-jan louis-jan changed the title fix: #3549, #3552 - Inference on CPU is slower fix: #3549, #3552 - Inference on CPU is slower since 0.5.3 Sep 8, 2024
@louis-jan louis-jan changed the title fix: #3549, #3552 - Inference on CPU is slower since 0.5.3 fix: #3549, #3552 - Inference on CPU is slower on Jan 0.5.3 Sep 8, 2024
Collaborator

@hiento09 hiento09 left a comment


lgtm

@dan-menlo
Contributor

Review in progress

Contributor

@dan-menlo dan-menlo left a comment


Great documentation of changes - I appreciate the effort put in on this.

@dan-menlo
Contributor

I have linked this PR in janhq/cortex.cpp#1156, to document our use of llama.cpp

@dan-menlo
Contributor

@louis-jan Please merge and have @imtuyethan QA as part of 0.5.4 when you're ready

Contributor

@dan-menlo dan-menlo left a comment


Lgtm!

@dan-menlo
Contributor

dan-menlo commented Sep 9, 2024

@louis-jan Btw, let's standardize on cortex.cpp or cortexcpp (if no period is allowed) instead of Cortex-cpp

@louis-jan
Contributor Author

louis-jan commented Sep 9, 2024

@louis-jan Btw, let's standardize on cortex.cpp or cortexcpp (if no period is allowed) instead of Cortex-cpp

Oops, yeah, a period is not allowed, so cortexcpp should be fine, but that name comes from the cortex releases. We will update it later when cortex.cpp is released. On the Jan side, it should just decompress and leave the files as-is.

@hiento09
Collaborator

hiento09 commented Sep 9, 2024

@louis-jan Btw, let's standardize on cortex.cpp or cortexcpp (if no period is allowed) instead of Cortex-cpp

@dan-homebrew
(screenshot of the release assets)
Can I ask whether you are suggesting we change the release name from cortex-cpp-<version>-<os>-<arch>.tar.gz to cortexcpp-<version>-<os>-<arch>.tar.gz?

@Castafers

Castafers commented Sep 9, 2024

Feature test build:

https://github.com/janhq/jan/actions/runs/10768992783

The new build prevents me from loading AI models.

20240909 08:32:19.710231 UTC 18979 ERROR Unhandled exception in /inferences/server/loadmodel, what(): Value is not convertible to Int. - HttpAppFrameworkImpl.cc:124

OS: Arch Linux
Package: AppImage
GPU: RX 7900 XT
CPU: Ryzen 7 7900X3D
RAM: 32GB

Build: 624

@Castafers

https://github.com/janhq/jan/actions/runs/10768992783

Hi @Castafers, could you please take a screenshot of the download options you're seeing? I don't think the link is accessible without permission.

(screenshot of the download options)

@louis-jan
Contributor Author

louis-jan commented Sep 9, 2024

@Castafers Could you please help share your app.log? I will investigate further.

Update: This is likely a malformed request body from the GUI, not really related to these changes. Could you please also share a screenshot of the thread settings?

@Castafers

Castafers commented Sep 9, 2024

@Castafers Could you please help share your app.log? I will investigate further.

Sure thing, thank you for your contribution!
Here are the logs.

Note: Upon a factory reset, the program remains broken.

@louis-jan
Contributor Author

louis-jan commented Sep 9, 2024

logs

Sorry, the log file is not accessible. Could you please help me attach it here or re-upload it? Thank you for your help.

@Castafers

Castafers commented Sep 9, 2024

logs

Sorry, the log file is not accessible.

I'm sorry about that, it's fixed now.
I was attempting to obscure any personal information & set a deletion time, but that didn't work :/

@louis-jan
Contributor Author

louis-jan commented Sep 9, 2024

What a great find, @Castafers. I have investigated, and it turns out there is a regression in the latest dev build. We will fix it there and update this PR with a fixed build.

Context:
With any build after #3538, entering a value in any model-load settings component results in the model failing to load.

Regression found in this PR:
#3538

Reproduce:
Entering any context length or NGL value in the model settings input box produces a request with string values instead of numbers.
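
A minimal sketch of the kind of input coercion that would avoid this (illustrative field names, not the GUI's actual code):

  // Hypothetical sketch: coerce numeric model-load settings coming from text inputs,
  // so the request body carries numbers rather than strings.
  function normalizeLoadSettings(raw: { ctx_len?: unknown; ngl?: unknown }) {
    const toNumber = (v: unknown): number | undefined =>
      v === undefined || v === '' ? undefined : Number(v)
    return { ctx_len: toNumber(raw.ctx_len), ngl: toNumber(raw.ngl) }
  }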

@Castafers

What a great find @Castafers, I have investigated and it turned out there is a regression in the latest dev build. We will fix it from there and update with a fixed build in this PR.

Regression found in this PR #3538

Reproduce: Adjusting any context length or NGL produces a request with string values instead of numbers.

You're awesome 😎 Thank you for the quick find and analysis from the logs!

fix: correct extension description

fix: vulkan should not have instructions or run mode included

fix: OS path delimiter

test: add tests

chore: should set run mode to CPU programmatically when ngl is not set

chore: shorten download URL
chore: bump cpu-instructions - 0.0.13
@louis-jan louis-jan force-pushed the fix/3549-inference-on-cpu-is-slower branch from d6b8682 to 9abed5f Compare September 9, 2024 09:47
@louis-jan
Contributor Author

Rebased dev

@louis-jan
Contributor Author

louis-jan commented Sep 9, 2024

Hi @Castafers, here is the build with the latest fix from the dev branch. If you don't mind, please try this build and factory reset once again, just to make sure there is no cache from a previous broken dev build. Thank you!
https://github.com/janhq/jan/actions/runs/10770763664

@dan-menlo
Contributor

Thank you so much @Castafers!

@Castafers

Castafers commented Sep 9, 2024

Hi @Castafers, here is the build with the latest fix from the dev branch, if you don't mind, please try this build and factory reset once again. Just to make sure there is no cache from a previous broken dev build. Thank you! https://github.com/janhq/jan/actions/runs/10770763664

The new build 625 is working nominally and performance is excellent.
Tokens gracefully fall from 102 t/s to 68 t/s, then fluctuate between 68 t/s and 80 t/s, which I presume is normal now.

Thank you so much for fixing the severe performance bug! I was hardly able to use the AI at all beforehand.
You are all awesome, from support to programmers to all contributors. This is a highly successful project & team ❤️

@louis-jan louis-jan merged commit 5217437 into dev Sep 11, 2024
19 checks passed
@louis-jan louis-jan deleted the fix/3549-inference-on-cpu-is-slower branch September 11, 2024 07:03
@github-actions github-actions bot added this to the v.0.6.0 milestone Sep 11, 2024
Labels
type: bug Something isn't working