
bug: remove max context length size (currently defaults to 4096) #3796

Closed
1 of 3 tasks
thonore75 opened this issue Oct 14, 2024 · 17 comments
Labels
category: model settings (Inference params, presets, templates) · category: threads & chat (Threads & chat UI UX issues) · type: bug (Something isn't working)

Comments

@thonore75

thonore75 commented Oct 14, 2024

Jan version

0.5.6

Describe the Bug

When using a model with 128k context, it's not possible to increase the context size.

Steps to Reproduce

  • Open Jan
  • If not already present, download the model "granite-8b-code-instruct-128k-q5_k_m" or any other model with a 128k context size
  • Select this model in Jan, keeping the default Engine Settings
  • Say "Hello" => correct answer
  • In the Model tab, increase the context size to the maximum (128000)
  • Say "Hello" => issue!
  • In the Model tab, set the context size back to its initial value
  • Say "Hello" => issue: the answer is just "."

Screenshots / Logs

ScreenHunter.49.mp4

What is your OS?

  • MacOS
  • Windows
  • Linux
@thonore75 thonore75 added the type: bug Something isn't working label Oct 14, 2024
@freelerobot freelerobot added category: threads & chat Threads & chat UI UX issues P1: important Important feature / fix labels Oct 14, 2024
@freelerobot
Contributor

Model contexts are getting larger now. We should remove the max context size constraint in the UI.

@freelerobot freelerobot added the needs designs Needs designs label Oct 14, 2024
@freelerobot freelerobot changed the title bug: Issue when changing the context size bug: remove max context length size (currently defaults to 4096) Oct 14, 2024
@github-project-automation github-project-automation bot moved this to Investigating in Menlo Oct 14, 2024
@louis-menlo
Contributor

louis-menlo commented Oct 14, 2024

Hi @0xSage, it is not limited by the UI; the value is retrieved from, and restricted by, the model. As you can see, it can extend to 128K; however, the issue arises when the model cannot be loaded with such a large context size, as there may be an OOM problem.

@thonore75 Could you kindly share the log file here, along with your device specs, so we can see what the problem is?

If that is the case, it would be great to have:

  1. A clearer problem explanation (error message)
  2. A smart recommended hardware constraint (show the maximum context length the model can actually run with on the user's hardware) - see the sketch below
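
As a rough illustration of what such a recommendation could build on, here is a minimal sketch (not Jan/Cortex code) that estimates how much VRAM a llama.cpp-style fp16 KV cache needs at a given context length, and the largest context that fits in a given amount of free VRAM. The model dimensions are illustrative placeholders, not the real granite-8b values; in practice they would come from the GGUF metadata.

```python
# Illustrative sketch only: estimate KV-cache VRAM for a llama.cpp-style model.
# The dimensions used below are placeholders, not the actual granite-8b values.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """K and V caches: 2 tensors per layer, n_kv_heads * head_dim values
    per token, ctx_len tokens, fp16 (2 bytes per value) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

def max_ctx_len(free_vram_bytes: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> int:
    """Largest context whose KV cache fits in the given free VRAM
    (ignores weights, compute buffers and fragmentation)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return free_vram_bytes // per_token

# Example with made-up dimensions for an ~8B model:
print(f"{kv_cache_bytes(36, 8, 128, 128_000) / 2**30:.1f} GiB of KV cache at 128k context")
print(f"context that fits in 8 GiB free: {max_ctx_len(8 * 2**30, 36, 8, 128)}")
```

Something along these lines could drive an "estimated maximum context length for your hardware" hint in the UI.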

@thonore75
Author

ScreenHunter.52.mp4

app - CPU.log
app - GPU0_GPU1.log

Device specs:

  • Ryzen 9 5950X - 16-Core at 4.0 GHz
  • 128 GB DDR4
  • GPU0 : NVIDIA GeForce RTX 3060 12GB
  • GPU1 : NVIDIA GeForce RTX 3060 12GB

Jan (and the models) is installed on a Samsung SSD 980 PRO 2TB.

When doing my tests, I forgot to close MySQL Workbench, which uses the GPUs; with it open, the context length limit was 4096 with a correct answer. After closing it, I was able to reach a 64000 context length.

I also tried disabling hardware acceleration. In CPU mode the model always loads, whatever context length is used, but the answer is only correct at 4096; above that, the answer is just a period ".".

Even stranger: with a context length of 64000 it was working; I changed it to 96000 and it failed, but when I set it back to 64000 it no longer worked. After closing and restarting Jan, it worked again!

@louis-menlo
Contributor

Hi @thonore75, it's due to VRAM OOM: the engine couldn't allocate buffers in your remaining VRAM. The model size plus the context length has to fit within the remaining VRAM, which is why the outcome varies from run to run, especially when other applications are also using the GPU.

allocating 3823.03 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 4008738816
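
For what it's worth, a quick way to check how much VRAM is actually free on each card before picking a context length (for example while other GPU applications are running) is to query the driver. A minimal, purely illustrative sketch using the nvidia-ml-py (pynvml) bindings:

```python
# Illustrative sketch: report free VRAM per GPU (requires the nvidia-ml-py package).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older bindings return bytes
            name = name.decode()
        print(f"GPU{i} {name}: {mem.free / 2**20:.0f} MiB free of {mem.total / 2**20:.0f} MiB")
finally:
    pynvml.nvmlShutdown()
```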

@freelerobot freelerobot added the category: model settings Inference params, presets, templates label Oct 15, 2024
@thonore75
Author

Hi @louis-jan,
Thanks for your explanations!
I understand that setting the context length too high uses more memory, but why does it keep failing when I set the context length back to 4096? I have to restart Jan to get it working again.

@thonore75
Author

app.log
ScreenHunter 233

Hi @louis-jan,
I changed the memory allocation behavior in the NVIDIA Control Panel as shown in the screenshot; it's slower, but it works with the context length set to 128000.

ScreenHunter.53.mp4

@louis-menlo
Contributor

Oh wow, thanks @thonore75. I didn't know that.

@imtuyethan
Contributor

The response looks like a prompt template issue too, @louis-jan.

@imtuyethan
Contributor

Operating System: MacOS Sonoma 14.2
Processor: Apple M2
RAM: 16GB


I cannot reproduce this bug on my side (using Llama 3.2 1B Instruct Q8); maybe it's because of the model @thonore75 uses?
Could you send me the link to download your model? That would help a lot, thank you.

Screen.Recording.2024-10-17.at.7.00.25.PM.mov

@thonore75
Author

Hi @imtuyethan,

Here is the link where I downloaded the model: https://huggingface.co/maxwellb-hf/granite-8b-code-instruct-128k-Q5_K_M-GGUF

I will try the same model with a different quantization to see if there is a difference.

@thonore75
Author

thonore75 commented Oct 17, 2024

Once I hit the context length limit, even if I set it back to a value that worked before, the answer after the model restarts is always a period ".". As long as I stay in the same thread, the issue keeps occurring. If I start a new thread, the model is reloaded and the answer is correct. It's a little workaround to avoid restarting Jan every time the issue occurs.

@imtuyethan
Contributor

Operating System: MacOS Sonoma 14.2
Processor: Apple M2
RAM: 16GB


Thanks @thonore75.

I can reproduce the part where the model replies nonsense when the context length is set to the maximum:

Screen.Recording.2024-10-18.at.2.12.05.PM.mov

However, I cannot reproduce this part; it still works when I set the context length back to 4096:

Hi @louis-jan, Thanks for your explanations! I understand that setting the context length too high uses more memory, but why does it keep failing when I set the context length back to 4096? I have to restart Jan to get it working again.

Screen.Recording.2024-10-18.at.2.13.01.PM.mov

Chances are high that this is not a bug in Jan but in the model itself; the default prompt template is definitely not correct. Also, the user's device can't handle a large context length, which leads to a nonsense response.

Need @louis-jan to investigate further.

@imtuyethan imtuyethan removed P1: important Important feature / fix needs designs Needs designs labels Oct 18, 2024
@thonore75
Author

thonore75 commented Oct 18, 2024

ScreenHunter.57.mp4

app.log

@thonore75
Author

But it was working with the model "Phi-3-medium-128k-instruct-Q5_0"

@imtuyethan
Contributor

ScreenHunter.57.mp4
app.log

Hey @thonore75, from the logs it looks like an OOM (out of memory) issue:

cudaMalloc failed: out of memory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
  1. Model and Hardware Details:
  • Model: granite-8b-code-instruct-128k.Q8_0.gguf (about 8 GB)
  • GPU: NVIDIA GeForce RTX 3060 x2
  • Required CUDA buffers:
    • CUDA0 buffer: ~4202.71 MiB
    • CUDA1 buffer: ~3964.34 MiB
    • Trying to allocate 9500.00 MiB but failing
  2. The Sequence of Events:
  • Cortex tries to load the model
  • Initially attempts with ctx_len=128000 (fails)
  • Retries with reduced ctx_len=4096 (succeeds)
  • The system automatically adjusts and recovers by reducing the context length (see the sketch below)
  3. Root Cause:
    The initial attempt to load the model with a large context length (128K) requires more GPU memory than is available on the RTX 3060s. When it fails, Jan/Cortex automatically retries with a smaller context length (4K), which succeeds.

I recommend:

  1. Either continue using the model with reduced context length (4K)
  2. Or reduce the number of GPU layers using the ngl parameter

This isn't exactly a bug - it's more of a hardware limitation where the initial requested configuration exceeds available GPU memory, but the system successfully recovers by falling back to a more conservative configuration.
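
To make that recovery behaviour concrete, here is a simplified sketch of such a fallback loop. It halves the context on each OOM rather than jumping straight to 4K as the log shows, and `load_model` is a hypothetical callable standing in for whatever actually loads the model; this is not Cortex's real implementation.

```python
# Simplified sketch of a "retry with a smaller context" fallback loop.
from typing import Any, Callable, Tuple

def load_with_fallback(load_model: Callable[..., Any], model_path: str,
                       requested_ctx: int, min_ctx: int = 4096) -> Tuple[Any, int]:
    """Try the requested context length, shrinking it whenever loading OOMs."""
    ctx = requested_ctx
    while True:
        try:
            model = load_model(model_path, ctx_len=ctx)
            print(f"loaded with ctx_len={ctx}")
            return model, ctx
        except MemoryError:
            if ctx <= min_ctx:
                raise RuntimeError("OOM even at the minimum context length")
            ctx = max(min_ctx, ctx // 2)  # e.g. 128000 -> 64000 -> ... -> 4096

# Toy loader that pretends anything above 4096 runs out of VRAM:
def fake_loader(path: str, ctx_len: int) -> object:
    if ctx_len > 4096:
        raise MemoryError("cudaMalloc failed: out of memory")
    return object()

load_with_fallback(fake_loader, "granite-8b-code-instruct-128k.Q8_0.gguf", 128_000)
```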


@github-project-automation github-project-automation bot moved this from Investigating to Review + QA in Menlo Nov 4, 2024
@imtuyethan imtuyethan moved this from Review + QA to Completed in Menlo Nov 4, 2024
@imtuyethan
Contributor

imtuyethan commented Nov 4, 2024

However, on the other hand, when you import your own models:

Own your model configurations, use at your own risk.
Misconfigurations may result in lower quality or unexpected outputs. 

It looks like the model you used in the video has the wrong prompt template.
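
For reference only: a corrected template would need to follow the Question/Answer chat format that the Granite code-instruct models are documented to use (worth verifying against the model card on Hugging Face). A hypothetical value, assuming Jan's {system_message} and {prompt} placeholders, might look like this:

```python
# Hypothetical example, not taken from Jan's model catalog: a prompt_template
# string in the Question/Answer style expected by Granite code-instruct models.
prompt_template = (
    "System:\n{system_message}\n\n"
    "Question:\n{prompt}\n\n"
    "Answer:\n"
)
```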

@thonore75
Author

thonore75 commented Nov 4, 2024

Hi @imtuyethan,

I understand the hardware limitation, but the message could be more explicit.
What's more, in the NVIDIA Control Panel I set the option to fall back to system RAM when there is not enough VRAM available; it works fine for many other programs, but apparently not with Jan. Using RAM instead of VRAM is slower but still works for other programs. With Jan, it seems to depend on the model used.
