
bug: remove max context length size (currently defaults to 4096) #3796

Closed
1 of 3 tasks
thonore75 opened this issue Oct 14, 2024 · 17 comments
Labels
category: model settings (Inference params, presets, templates) · category: threads & chat (Threads & chat UI UX issues) · type: bug (Something isn't working)

Comments

@thonore75

thonore75 commented Oct 14, 2024

Jan version

0.5.6

Describe the Bug

When using a model with 128k context, it's not possible to increase the context size.

Steps to Reproduce

  • Open Jan
  • If not already present, download the model "granite-8b-code-instruct-128k-q5_k_m" or any other model with a 128k context size
  • Select this model in Jan, keeping the default Engine Settings
  • Say "Hello" => correct answer
  • In the Model tab, increase the context size to the maximum (128000)
  • Say "Hello" => issue!
  • In the Model tab, set the context size back to its initial value
  • Say "Hello" => issue: the answer is just "."

Screenshots / Logs

ScreenHunter.49.mp4

What is your OS?

  • MacOS
  • Windows
  • Linux
@thonore75 thonore75 added the type: bug Something isn't working label Oct 14, 2024
@freelerobot freelerobot added category: threads & chat Threads & chat UI UX issues P1: important Important feature / fix labels Oct 14, 2024
@freelerobot
Contributor

Model contexts are getting larger now. We should remove the max context size constraint in the UI.

@freelerobot freelerobot added the needs designs Needs designs label Oct 14, 2024
@freelerobot freelerobot changed the title bug: Issue when changing the context size bug: remove max context length size (currently defaults to 4096) Oct 14, 2024
@github-project-automation github-project-automation bot moved this to Investigating in Menlo Oct 14, 2024
@louis-menlo
Contributor

louis-menlo commented Oct 14, 2024

Hi @0xSage, it is not limited by the UI; the value is retrieved from, and restricted by, the model. As you can see, it can extend to 128K; however, the issue arises when the model cannot be loaded with such a large context size, as there may be an OOM problem.

@thonore75 Could you kindly share the log file here, along with your device specs, so we can see what the problem is?

If that is the case, it would be great to have:

  1. A clearer problem explanation (error message)
  2. A smart recommended hardware constraint (show the maximum context length the model can actually run with on the user's hardware) - see the sketch below
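
As a rough illustration of what such a recommendation could build on, here is a minimal sketch (not Jan/Cortex code) that estimates how much VRAM a llama.cpp-style fp16 KV cache needs at a given context length, and the largest context that fits in a given amount of free VRAM. The model dimensions are illustrative placeholders, not the real granite-8b values; in practice they would come from the GGUF metadata.

```python
# Illustrative sketch only: estimate KV-cache VRAM for a llama.cpp-style model.
# The dimensions used below are placeholders, not the actual granite-8b values.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """K and V caches: 2 tensors per layer, n_kv_heads * head_dim values
    per token, ctx_len tokens, fp16 (2 bytes per value) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

def max_ctx_len(free_vram_bytes: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> int:
    """Largest context whose KV cache fits in the given free VRAM
    (ignores weights, compute buffers and fragmentation)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return free_vram_bytes // per_token

# Example with made-up dimensions for an ~8B model:
print(f"{kv_cache_bytes(36, 8, 128, 128_000) / 2**30:.1f} GiB of KV cache at 128k context")
print(f"context that fits in 8 GiB free: {max_ctx_len(8 * 2**30, 36, 8, 128)}")
```

Something along these lines could drive an "estimated maximum context length for your hardware" hint in the UI.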

@thonore75
Author

ScreenHunter.52.mp4

app - CPU.log
app - GPU0_GPU1.log

Device specs:

  • Ryzen 9 5950X - 16-Core at 4.0 GHz
  • 128 GB DDR4
  • GPU0 : NVIDIA GeForce RTX 3060 12GB
  • GPU1 : NVIDIA GeForce RTX 3060 12GB

Jan (and the models) is installed on a Samsung SSD 980 PRO 2TB.

When doing my tests, I forgot to close MySQL Workbench, which uses the GPUs; with it open, the context length limit was 4096 with a correct answer. After closing it, I was able to reach a 64000 context length.

I also tried disabling hardware acceleration. In CPU mode the model always loads, whatever context length is used, but the answer is only correct at 4096; above that, the answer is just a period ".".

Even stranger: with a context length of 64000 it was working; I changed it to 96000 and it failed, but when I set it back to 64000 it no longer worked. After closing and restarting Jan, it worked again!

@louis-menlo
Contributor

Hi @thonore75, it's due to VRAM OOM: the engine couldn't allocate buffers in your remaining VRAM. The model size plus the context length has to fit within the remaining VRAM, which is why the outcome varies from run to run, especially when other applications are also using the GPU.

allocating 3823.03 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 4008738816
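
For what it's worth, a quick way to check how much VRAM is actually free on each card before picking a context length (for example while other GPU applications are running) is to query the driver. A minimal, purely illustrative sketch using the nvidia-ml-py (pynvml) bindings:

```python
# Illustrative sketch: report free VRAM per GPU (requires the nvidia-ml-py package).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older bindings return bytes
            name = name.decode()
        print(f"GPU{i} {name}: {mem.free / 2**20:.0f} MiB free of {mem.total / 2**20:.0f} MiB")
finally:
    pynvml.nvmlShutdown()
```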

@freelerobot freelerobot added the category: model settings Inference params, presets, templates label Oct 15, 2024
@thonore75
Author

Hi @louis-jan,
Thanks for your explanations!
I understand that setting the context length too high uses more memory, but why does it keep failing when I set the context length back to 4096? I have to restart Jan to get it working again.

@thonore75
Author

app.log
ScreenHunter 233

Hi @louis-jan,
I changed the memory allocation behavior in the NVIDIA Control Panel as shown in the screenshot; it's slower, but it works with the context length set to 128000.

ScreenHunter.53.mp4

@louis-menlo
Contributor

Oh wow, thanks @thonore75. I didn't know that.

@imtuyethan
Contributor

The response looks like a prompt template issue too, @louis-jan.

@imtuyethan
Contributor

Operating System: MacOS Sonoma 14.2
Processor: Apple M2
RAM: 16GB


I cannot reproduce this bug on my side (using Llama 3.2 1B Instruct Q8); maybe it's because of the model @thonore75 uses?
Could you send me the link to download your model? That would help a lot, thank you.

Screen.Recording.2024-10-17.at.7.00.25.PM.mov

@thonore75
Author

Hi @imtuyethan,

Here is the link where I downloaded the model: https://huggingface.co/maxwellb-hf/granite-8b-code-instruct-128k-Q5_K_M-GGUF

I will try the same model with a different quantization to see if there is a difference.

@thonore75
Author

thonore75 commented Oct 17, 2024

Once I hit the context length limit, even if I set it back to a value that worked before, the answer after the model restarts is always a period ".". As long as I stay in the same thread, the issue keeps occurring. If I start a new thread, the model is reloaded and the answer is correct. It's a little workaround to avoid restarting Jan every time the issue occurs.

@imtuyethan
Contributor

Operating System: MacOS Sonoma 14.2
Processor: Apple M2
RAM: 16GB


Thanks @thonore75.

I can reproduce the part where the model replies nonsense when the context length is set to the maximum:

Screen.Recording.2024-10-18.at.2.12.05.PM.mov

However, I cannot reproduce this part; it still works when I set the context length back to 4096:

Hi @louis-jan, Thanks for your explanations! I understand that setting the context length too high uses more memory, but why does it keep failing when I set the context length back to 4096? I have to restart Jan to get it working again.

Screen.Recording.2024-10-18.at.2.13.01.PM.mov

Chances are high that this is not a bug in Jan but in the model itself; the default prompt template is definitely not correct. Also, the user's device can't handle a large context length, which leads to a nonsense response.

Need @louis-jan to investigate further.

@imtuyethan imtuyethan removed P1: important Important feature / fix needs designs Needs designs labels Oct 18, 2024
@thonore75
Author

thonore75 commented Oct 18, 2024

ScreenHunter.57.mp4

app.log

@thonore75
Author

But it was working with the model "Phi-3-medium-128k-instruct-Q5_0"

@imtuyethan
Contributor

ScreenHunter.57.mp4
app.log

Hey @thonore75, from the logs it looks like an OOM (out of memory) issue:

cudaMalloc failed: out of memory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
  1. Model and Hardware Details:
  • Model: granite-8b-code-instruct-128k.Q8_0.gguf (about 8 GB)
  • GPU: NVIDIA GeForce RTX 3060 x2
  • Required CUDA buffers:
    • CUDA0 buffer: ~4202.71 MiB
    • CUDA1 buffer: ~3964.34 MiB
    • Trying to allocate 9500.00 MiB but failing
  2. The Sequence of Events:
  • Cortex tries to load the model
  • Initially attempts with ctx_len=128000 (fails)
  • Retries with reduced ctx_len=4096 (succeeds)
  • The system automatically adjusts and recovers by reducing the context length (see the sketch below)
  3. Root Cause:
    The initial attempt to load the model with a large context length (128K) requires more GPU memory than is available on the RTX 3060s. When it fails, Jan/Cortex automatically retries with a smaller context length (4K), which succeeds.

I recommend:

  1. Either continue using the model with reduced context length (4K)
  2. Or reduce the number of GPU layers using the ngl parameter

This isn't exactly a bug - it's more of a hardware limitation where the initial requested configuration exceeds available GPU memory, but the system successfully recovers by falling back to a more conservative configuration.
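
To make that recovery behaviour concrete, here is a simplified sketch of such a fallback loop. It halves the context on each OOM rather than jumping straight to 4K as the log shows, and `load_model` is a hypothetical callable standing in for whatever actually loads the model; this is not Cortex's real implementation.

```python
# Simplified sketch of a "retry with a smaller context" fallback loop.
from typing import Any, Callable, Tuple

def load_with_fallback(load_model: Callable[..., Any], model_path: str,
                       requested_ctx: int, min_ctx: int = 4096) -> Tuple[Any, int]:
    """Try the requested context length, shrinking it whenever loading OOMs."""
    ctx = requested_ctx
    while True:
        try:
            model = load_model(model_path, ctx_len=ctx)
            print(f"loaded with ctx_len={ctx}")
            return model, ctx
        except MemoryError:
            if ctx <= min_ctx:
                raise RuntimeError("OOM even at the minimum context length")
            ctx = max(min_ctx, ctx // 2)  # e.g. 128000 -> 64000 -> ... -> 4096

# Toy loader that pretends anything above 4096 runs out of VRAM:
def fake_loader(path: str, ctx_len: int) -> object:
    if ctx_len > 4096:
        raise MemoryError("cudaMalloc failed: out of memory")
    return object()

load_with_fallback(fake_loader, "granite-8b-code-instruct-128k.Q8_0.gguf", 128_000)
```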


@github-project-automation github-project-automation bot moved this from Investigating to Review + QA in Menlo Nov 4, 2024
@imtuyethan imtuyethan moved this from Review + QA to Completed in Menlo Nov 4, 2024
@imtuyethan
Contributor

imtuyethan commented Nov 4, 2024

However, on the other hand, when you import your own models:

Own your model configurations, use at your own risk.
Misconfigurations may result in lower quality or unexpected outputs. 

It looks like the model you used in the video has the wrong prompt template.
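
For reference only: a corrected template would need to follow the Question/Answer chat format that the Granite code-instruct models are documented to use (worth verifying against the model card on Hugging Face). A hypothetical value, assuming Jan's {system_message} and {prompt} placeholders, might look like this:

```python
# Hypothetical example, not taken from Jan's model catalog: a prompt_template
# string in the Question/Answer style expected by Granite code-instruct models.
prompt_template = (
    "System:\n{system_message}\n\n"
    "Question:\n{prompt}\n\n"
    "Answer:\n"
)
```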

@thonore75
Author

thonore75 commented Nov 4, 2024

Hi @imtuyethan,

I understand the hardware limitation, but the message could be more explicit.
What's more, in the NVIDIA Control Panel I set the option to fall back to system RAM when there is not enough VRAM available; it works fine for many other programs, but apparently not with Jan. Using RAM instead of VRAM is slower but still works for other programs. With Jan, it seems to depend on the model used.
