Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: Cortex.cpp Engine Dependencies Architecture (e.g. CUDA Toolkit) #1046

Closed
2 tasks done
namchuai opened this issue Aug 29, 2024 · 10 comments
Closed
2 tasks done
Assignees
Labels
type: planning Opening up a discussion

Comments

@namchuai
Copy link
Collaborator

namchuai commented Aug 29, 2024

Motivation

  1. Do we package the cuda toolkit to the engine?
    Yes? Then will have to do the same for llamacpp, tensorrt-llm and onnx?
    No? Will download separatedly

  2. Folder structures (e.g if user have llamacpp, tensorrt at the same time)?

Resources
Llamacpp release
Currently we are downloading toolkit dependency via https://catalog.jan.ai/dist/cuda-dependencies/<version>/<platform>/cuda.tar.gz

cc @vansangpfiev @nguyenhoangthuan99 @dan-homebrew

Update sub-tasks:

@vansangpfiev
Copy link
Contributor

vansangpfiev commented Aug 29, 2024

On my perspective, we should download CUDA toolkit separately. We support multiple engines: cortex.llamacpp and cortex.tensorrt-llm, both need CUDA toolkit to run. CUDA is backward compatible so we only need the latest CUDA toolkit version that supported by nvidia-driver version.
For example:

  • Nvidia driver: "527.41" compatible with CUDA 12.4
  • cortex.llamacpp: depends on CUDA 12.2
  • cortex.tensorrt-llm: depends on CUDA 12.4
    We only need to download CUDA 12.4 to support both engines .

Edit: I just checked the cuda matrix compatibility and it is incorrect that CUDA is always backward compatible
image

Related ticket: https://github.com/janhq/cortex/issues/1047

Edit 2: The above image is forward compatibility between cuda and nvidia-version
image

From CUDA 11 onwards, applications compiled with a CUDA Toolkit release from within a CUDA major release family can run

So yes, CUDA is backward compatible within a CUDA major release
reference: https://docs.nvidia.com/deploy/cuda-compatibility/#minor-version-compatibility

@nguyenhoangthuan99
Copy link
Contributor

I also think we need to Download CUDA toolkit separately, both tensorrt llm and llamacpp require cuda, plus:
inside tensorrt llm package (~ 1GB) doesn't include cuda toolkit lib (cublas, cuparse, ... which is very heavy ~400Mb), if we decided to pack everything in 1 package for both tensorrt-llm and llamacpp, the size will increase.
Screenshot from 2024-08-29 10-46-02

@namchuai
Copy link
Collaborator Author

I'm referring this table to check for the compatibility between driver and toolkit
https://docs.nvidia.com/deeplearning/cudnn/latest/reference/support-matrix.html#gpu-cuda-toolkit-and-cuda-driver-requirements
Screenshot 2024-08-29 at 11 27 22

@dan-menlo
Copy link
Contributor

dan-menlo commented Aug 29, 2024

Can I verify my understanding of the issue:

Decision
For Nvidia GPU users, the different engines have CUDA dependencies that are large 200-400mb downloads.

  1. Per-engine CUDA dependencies (i.e. install separately)
  2. Download 1 CUDA Toolkit for all engines

My initial thoughts

  • I think per-engine CUDA dependencies are more sustainable architecture
  • It is less efficient, but easier to manage long-term
  • i.e. llama.cpp packages its cudart files that have been verified
  • i.e. TensorRT-LLM packages its own CUDA dependencies (may change in the future)

This will be disk-space inefficient. However, the alternative seems to be dependency hell, which I think is even worse.

Folder Structure

  • My ideal outcome for Cortex is where each engine is its own submodule, and manages its own folder structure.
  • We have invested in making cortex.llama.cpp a separate module; it should ideally be independent and package with its dependencies
/cortex
    /engines
        /llama.cpp-extension
            /deps                               # CUDA dlls
        /tensorrt-llm-extension
            /deps                               # CUDA dlls

That said, am open to all ideas, especially @vansangpfiev's

@vansangpfiev
Copy link
Contributor

If disk-space inefficient is acceptable, I think we can go with option 1.
Please note that we will have some blockers for this option:

  • dynamic library search path: we will have two paths for llamacpp and tensorrt-llm, a potential issue can happen when we mix them together. For ubuntu and MacOS, I think we can solve that issue by compiling with rpath flag. For windows, I have created related issue

  • CI changes for cortex.llamacpp and cortex.tensorrt-llm to pack CUDA dependencies

@namchuai
Copy link
Collaborator Author

namchuai commented Aug 30, 2024

Thanks @vansangpfiev and @dan-homebrew

I'm confirming that we agree with:
Question 1: Packaging CUDA toolkit dependencies into corresponding engine.
Caveats:

  • CI changes for cortex.llamacpp and cortex.tensorrt-llm to pack CUDA dependencies

Question 2: Storing CUDA dependencies under corresponding engines.

/cortex
    /engines
        /cortex.llamacpp
            /deps                               # CUDA dlls
        /cortex.tensorrt-llm
            /deps                               # CUDA dlls

Caveats:

  • dynamic library search path on windows

Additional thought
@vansangpfiev , I think when we change the CI for engine, could we associated a file which contains the versions of the engine and info of its dependencies. This will help for engine list command in the future.
wdyt? cc @nguyenhoangthuan99

@imtuyethan imtuyethan moved this to Planning in Menlo Sep 2, 2024
@imtuyethan imtuyethan added the type: planning Opening up a discussion label Sep 2, 2024
@dan-menlo dan-menlo self-assigned this Sep 3, 2024
@dan-menlo dan-menlo moved this from Planning to Scheduled in Menlo Sep 3, 2024
@dan-menlo dan-menlo moved this from Scheduled to Need Investigation in Menlo Sep 3, 2024
@dan-menlo dan-menlo changed the title Discussion: Cuda toolkit and engines Discussion: Cortex Engine Dependencies Architecture (e.g. CUDA Toolkit) Sep 3, 2024
@dan-menlo dan-menlo changed the title Discussion: Cortex Engine Dependencies Architecture (e.g. CUDA Toolkit) Discussion: Cortex.cpp Engine Dependencies Architecture (e.g. CUDA Toolkit) Sep 4, 2024
@freelerobot
Copy link
Contributor

freelerobot commented Sep 4, 2024

  1. What if llamacpp vs tensorrtllm dependencies start to conflict?

  2. Do we care about engine portability. And does doing a dynamic library search path on windows affect portability.

  3. How will we do maintenance and updates? i.e.

  • cortex update requires dependency bumps
  • cortex update doesn't require dependency bump (easier)
  1. Is this a dumb idea: store CUDA dependencies in a central location, such as a separate deps directory at the project root, and then use symbolic links or environment variables to point to the engine-specific dependencies.
/.cortex
    /deps
        /cuda
            cuda-11.5 or whatever versioning
    /engines
        /cortex.llamacpp
            /bin
        /cortex.tensorrt-llm
            /bin
  1. Are there dependency mgmt tools we can use to manage this better?

@namchuai
Copy link
Collaborator Author

namchuai commented Sep 4, 2024

@0xSage , here's my thought. Please correct me if I'm wrong @nguyenhoangthuan99 @vansangpfiev

  1. What if llamacpp vs tensorrtllm dependencies start to conflict?
  • Yeah, that's why we separating dependencies for cortex.llamacpp and cortex.tensorrt-llm
    /engines
        /cortex.llamacpp
            /deps
        /cortex.tensorrt-llm
            /deps
  1. Do we care about engine portability. And does doing a dynamic library search path on windows affect portability.
  • Hmm, I might not really get what do you meant by portability. Regarding the dynamic library search path, it's because, in windows, the program will search for the DLLs at the current path (IIRC, same path as the executable). But we are about to put the dependencies under cortex.llamacpp/deps and cortex.tenssorrt-llm/deps, we need to tell the OS where to look for the DLLs.
  1. How will we do maintenance and updates?
    I think this is good question. We haven't decided on this yet. WDYT? @vansangpfiev @dan-homebrew @nguyenhoangthuan99 @hiento09
  2. Separating CUDA dependencies?
    I think this is good idea. But separates DLLs might cost us some efforts to handle it properly? I'm not sure. WDYT? @vansangpfiev @dan-homebrew @nguyenhoangthuan99
  3. Are there dependency mgmt tools we can use to manage this better?
    I think no. Currently, I think we only have cortexcpp and the default behavior of install is replace.

@vansangpfiev
Copy link
Contributor

For 3, I think we can do the maintenance and updates by versioning: generate a file (for example version.txt) for each release, which has metadata for engine version and cuda version. We will update cuda dependencies if needed.
For 4, I think it is easier for us to locate all cuda dependencies in the same folder as engine because we don't need to check which cuda version is using for which engine version

@dan-menlo
Copy link
Contributor

dan-menlo commented Sep 4, 2024

@vansangpfiev @namchuai @0xSage Quick responses:

Per-Engine Dependencies

Is this a dumb idea: store CUDA dependencies in a central location, such as a separate deps directory at the project root, and then use symbolic links or environment variables to point to the engine-specific dependencies.

  • From an architecture perspective, I would like for us to approach this from "each engine manages its own dependencies"
  • I would like to optimize for architectural simplicity at this stage of our library
  • I would like to re-use llama.cpp's bundled CUDA dependencies
  • Shared dependencies can be a subsequent optimization, that we tackle down the road as llama.cpp and TensorRT-LLM stabilize. From my POV, they are still changing very rapidly

I also agree with @vansangpfiev: let's co-locate all CUDA dependencies with the engine folder.

Simple > Complex, especially since model files are >4gb.

Updating Engines

For 3, I think we can do the maintenance and updates by versioning: generate a file (for example version.txt) for each release, which has metadata for engine version and cuda version. We will update cuda dependencies if needed.

I also think we need to think through the CLI and API commands:

cortex engines update tensorrt-llm
PUT <API URL>? 

Naming

I wonder whether it is better for us to have clearer naming for Cortex engines:

  • llamacpp-engine
  • onnx-engine
  • tensorrt-llm-engine

This articulates the concept of Cortex engines more clearly. Hopefully, with a clear API, the community can also step in to help build backends.

We would need to reason through cortex.python separately.

  • I think engine extensions can be in C++ or in Python
  • cortex.python might be more of a "Python base template", where Engine Extension dev can define Python version to bundle

@janhq janhq locked and limited conversation to collaborators Sep 5, 2024
@dan-menlo dan-menlo converted this issue into discussion #1117 Sep 5, 2024
@github-project-automation github-project-automation bot moved this from Need Investigation to Completed in Menlo Sep 5, 2024
@dan-menlo dan-menlo moved this from Completed to Discontinued in Menlo Sep 6, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
type: planning Opening up a discussion
Projects
Archived in project
Development

No branches or pull requests

6 participants