bug: If device layers requested exceed model layers, host layers overflow #329
Bench OOMs (while …)
Hi @polarathene! Thank you for reporting this bug. I think that the UX part of device mapping isn't great currently, and hopefully #332 will improve that. Additionally, device mapping is not really optimized, and essentially consists of loading layers on different devices and then running them there. We copy the hidden states from device to host, which is probably the bottleneck as it requires a GPU-CPU synchronization.
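For context, here is a minimal sketch of that per-layer mapping idea (the MappedLayer trait is a hypothetical stand-in, not the mistral.rs code): whenever the next layer lives on a different device, the activations have to be copied over, and each GPU-to-host hop is a synchronization point.

```rust
// Illustrative sketch only, not the mistral.rs implementation.
use candle_core::{Device, Result, Tensor};

/// Hypothetical abstraction over a transformer block that was loaded onto a
/// specific device (GPU for "device" layers, CPU for "host" layers).
trait MappedLayer {
    fn device(&self) -> &Device;
    fn forward(&self, xs: &Tensor) -> Result<Tensor>;
}

fn forward_all(layers: &[Box<dyn MappedLayer>], mut hidden: Tensor) -> Result<Tensor> {
    for layer in layers {
        // Move the hidden states to wherever this layer was loaded.
        // This is a no-op when the devices already match, but a blocking
        // device<->host transfer when they don't.
        hidden = hidden.to_device(layer.device())?;
        hidden = layer.forward(&hidden)?;
    }
    Ok(hidden)
}
```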
Can you please confirm that it prints it is loading on device and host (so we know it is loading on the GPU)? #332 should make that clearer.
I think it should be. We don't have prompt chunking yet, and that's something I want to implement. The infrastructure is there, it was implemented by #242.
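For reference, the rough idea of prompt chunking (a sketch of the concept, not how mistral.rs implements it) is to prefill the prompt in fixed-size slices so peak activation memory is bounded by the chunk size rather than the full prompt length:

```rust
// Conceptual sketch of chunked prefill; counting tokens stands in for the
// real forward pass that would append to a shared KV cache.
fn chunked_prefill(prompt_tokens: &[u32], chunk_size: usize) -> usize {
    let mut processed = 0;
    for chunk in prompt_tokens.chunks(chunk_size) {
        // forward(chunk) would go here, reusing the same KV cache.
        processed += chunk.len();
    }
    processed
}

fn main() {
    let prompt = vec![0u32; 4096];
    assert_eq!(chunked_prefill(&prompt, 512), 4096);
}
```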
I'm not sure why this is happening. If you use device mapping, it definitely uses the GPU for calculations, though. I'll take a look.
I checked the …
@polarathene, this should be fixed now.
Thanks, I'll give it a try within the next few days 👍
I did share above this log line at the end of my last response:
This will need to wait until later, but I do know that my vRAM went up notably while the bench command was running, and as per the logs it had found the GPU just fine so that was all working. It just wasn't utilizing the GPU for the benching for some reason.
It may have been something specific to the Docker or WSL2 environment, when I have the time I'll look into better reproducing and can verify with builds on WSL2 and the Windows 11 host to see if it still occurs in those environments too.
Oh ok, well if it hasn't already been tackled, a more helpful error about that might be better UX? It wasn't clear that it was referring to a single file input. If you don't need a response and consider this solved, feel free to close; otherwise I'll do that later this week.
Yes, this error can be improved. Generally, we should start to remove some of these. I'll wait until you think it's good to close this issue.
@polarathene, does this overflow error still occur?
The CLI still won't load files locally atm; I'd have to do a build again that bypasses the 401. I think I'll confirm once the CLI works without that workaround, which is probably more meaningful 😅
It still fails when the given value exceeds the layer count, whereas it should just know it has enough and clip it to the amount needed?

$ /mist/target/release/mistralrs-bench -p 512 -g 0 -r 1 -c 1 --num-device-layers 40 gguf -m . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
2024-05-31T00:27:10.222222Z INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: jeffq
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000
2024-05-31T00:27:10.259228Z INFO mistralrs_core::pipeline::gguf_tokenizer: GGUF tokenizer model is `llama`, kind: `unigram`, num tokens: 32032, num added tokens: 0, num merges: 0, num scores: 32032
Error: Expected the number of GPU (40) and host layers (0) to sum to the number of model hidden layers (32)
0: candle_core::error::Error::bt
1: mistralrs_core::device_map::DeviceMapMetadata::into_mapper
2: mistralrs_core::models::quantized_llama::ModelWeights::from_gguf
3: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
5: mistralrs_bench::main
6: std::sys_common::backtrace::__rust_begin_short_backtrace
7: std::rt::lang_start::{{closure}}
8: std::rt::lang_start_internal
9: main
10: <unknown>
11: __libc_start_main
12: _start
Stack backtrace:
0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
1: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: mistralrs_bench::main
4: std::sys_common::backtrace::__rust_begin_short_backtrace
5: std::rt::lang_start::{{closure}}
6: std::rt::lang_start_internal
7: main
8: <unknown>
9: __libc_start_main
10: _start

It also seems to complain about the missing tokenizer_config.json. Is the bench utility not in sync with the server command for local GGUF loading with the tokenizer?

$ RUST_BACKTRACE=1 target/release/mistralrs-bench -p 512 -g 0 -r 1 -c 1 --num-device-layers 40 gguf -m . -f /models/Hermes-2-Pro-Mistral-7B.Q4_K_M/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
File "tokenizer_config.json" not found at model id "."
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: mistralrs_bench::main

When providing the exact layer count it loads into vRAM, but then fails after hitting the earlier mentioned duplicate field for this model:

$ RUST_BACKTRACE=1 /mist/target/release/mistralrs-bench -p 512 -g 0 -r 1 -c 1 --num-device-layers 32 gguf -m . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
2024-05-31T00:32:56.871358Z INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: jeffq
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000
2024-05-31T00:32:56.909242Z INFO mistralrs_core::pipeline::gguf_tokenizer: GGUF tokenizer model is `llama`, kind: `unigram`, num tokens: 32032, num added tokens: 0, num merges: 0, num scores: 32032
2024-05-31T00:32:57.056662Z INFO mistralrs_core::device_map: Model has 32 repeating layers.
2024-05-31T00:32:57.056724Z INFO mistralrs_core::device_map: Using 32 repeating layers on GPU and 0 repeating layers on host.
thread 'main' panicked at mistralrs-core/src/pipeline/mod.rs:1323:80:
called `Result::unwrap()` on an `Err` value: Error("duplicate field `clean_up_tokenization_spaces`", line: 290, column: 32)
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::result::unwrap_failed
3: mistralrs_core::pipeline::get_chat_template
4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
5: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
6: mistralrs_bench::main

After editing the config to remove the duplicate field, the model loads and the bench runs:

$ /mist/target/release/mistralrs-bench -p 512 -g 0 -r 1 -c 1 --num-device-layers 32 gguf -m . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
2024-05-31T01:08:36.913971Z INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: jeffq
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000
2024-05-31T01:08:36.942180Z INFO mistralrs_core::pipeline::gguf_tokenizer: GGUF tokenizer model is `llama`, kind: `unigram`, num tokens: 32032, num added tokens: 0, num merges: 0, num scores: 32032
2024-05-31T01:08:37.062847Z INFO mistralrs_core::device_map: Model has 32 repeating layers.
2024-05-31T01:08:37.062905Z INFO mistralrs_core::device_map: Using 32 repeating layers on GPU and 0 repeating layers on host.
2024-05-31T01:08:38.552757Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "<|im_end|>", "<|im_end|>", unk_tok = <unk>
2024-05-31T01:08:38.558662Z INFO mistralrs_bench: Model loaded.
2024-05-31T01:08:38.575306Z INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2024-05-31T01:08:38.575453Z INFO mistralrs_bench: Starting warmup run.
2024-05-31T01:08:38.876483Z INFO mistralrs_bench: Finished warmup run.
2024-05-31T01:08:38.876528Z INFO mistralrs_bench: Starting benchmarks.
+-------+---------+--------+----------------+-------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+-------+---------+--------+----------------+-------------+-------------+--------------+
| . | CUDA | pp 512 | 1064.449±0.000 | 0.939±0.000 | 1 | 1064.4491 |
+-------+---------+--------+----------------+-------------+-------------+--------------+

Result: The bench went quite quickly; I'm not quite sure how to compare it against a longer llama-bench run:

llama-bench -m /tmp/model.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 -p 4096 -b 64,128,256,512,1024,2048,4096
@polarathene I just merged #367 which clamps the number of device layers to the number of model layers. I think the functionality should work fully now?
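For illustration, the clamping behaviour described there amounts to something like the following (a sketch under that assumption, not the actual #367 code):

```rust
// Never use more device layers than the model has, and derive the host
// layer count without underflowing.
fn split_layers(model_layers: usize, requested_device_layers: usize) -> (usize, usize) {
    let device_layers = requested_device_layers.min(model_layers);
    let host_layers = model_layers.saturating_sub(device_layers);
    (device_layers, host_layers)
}

fn main() {
    // A 32-layer model with `--num-device-layers 40` now maps to (32, 0).
    assert_eq!(split_layers(32, 40), (32, 0));
    assert_eq!(split_layers(32, 24), (24, 8));
}
```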
Yes, I can provide a higher number than needed and there are no failures now 👍 I did go down a rabbit hole thinking a regression was introduced due to performance being halved, but that seems to be a WSL2 quirk with weird memory management from when I tested models that my system couldn't handle 😓

NOTE:
$ target/release/mistralrs-bench -p 512 -g 0 -r 1 -c 1 --num-device-layers 40 gguf -m . -f /models/Hermes-2-Pro-Mistral-7B.Q4_K_M/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:294:58:
File "tokenizer_config.json" not found at model id "."
# File does exist:
$ ls /models/Hermes-2-Pro-Mistral-7B.Q4_K_M/
Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf added_tokens.json config.json generation_config.json special_tokens_map.json tokenizer.model tokenizer_config.json

Required:
# `-m` must point to directory of `-f`:
target/release/mistralrs-bench -p 512 -g 0 -r 1 -c 1 --num-device-layers 40 gguf -m /models/Hermes-2-Pro-Mistral-7B.Q4_K_M/ -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
# Unless you run the command from that location:
cd /models/Hermes-2-Pro-Mistral-7B.Q4_K_M
/app/target/release/mistralrs-bench -p 512 -g 0 -r 1 -c 1 --num-device-layers 40 gguf -m . -f /models/Hermes-2-Pro-Mistral-7B.Q4_K_M/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf

Probably the same for the server command.

Collapsed: Resolved temporary regression

UPDATE: While writing this response, I've no idea what changed, but bench performance is back in line with prior results.
Original response follows. Additionally, I'm not sure why, but the performance of the bench is half of what I reported previously:

$ mistralrs-bench -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf \
-r 1 -c 1 -p 512 -g 0 gguf -m . -t .
+-------+---------+--------+-----------------+-------------+-------------+--------------+
| model | backend | test | t/s | ms/t | concurrency | throughput/s |
+-------+---------+--------+-----------------+-------------+-------------+--------------+
| . | CUDA | pp 512 | 614.646±0.000 | 1.627±0.000 | 1 | 614.6458 |
+-------+---------+--------+-----------------+-------------+-------------+--------------+

$ llama-bench -m Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf \
-r 1 -p 512 -n 0 -b 512 -pg 0,0 -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
| model | size | params | backend | ngl | n_batch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ------------: | ---------------: |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CUDA | 99 | 512 | 0 | pp512 | 1313.10 ± 0.00 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CUDA | 99 | 512 | 1 | pp512 | 1344.06 ± 0.00 |

For reference, the llama-bench options:
-h, --help
-m, --model <filename> (default: models/7B/ggml-model-q4_0.gguf)
-p, --n-prompt <n> (default: 512)
-n, --n-gen <n> (default: 128)
-pg <pp,tg> (default: 512,128)
-b, --batch-size <n> (default: 2048)
-ub, --ubatch-size <n> (default: 512)
-ctk, --cache-type-k <t> (default: f16)
-ctv, --cache-type-v <t> (default: f16)
-t, --threads <n> (default: 8)
-ngl, --n-gpu-layers <n> (default: 99)
-sm, --split-mode <none|layer|row> (default: layer)
-mg, --main-gpu <i> (default: 0)
-nkvo, --no-kv-offload <0|1> (default: 0)
-fa, --flash-attn <0|1> (default: 0)
-mmp, --mmap <0|1> (default: 1)
--numa <distribute|isolate|numactl> (default: disabled)
-embd, --embeddings <0|1> (default: 0)
-ts, --tensor-split <ts0/ts1/..> (default: 0)
-r, --repetitions <n> (default: 5)
-o, --output <csv|json|md|sql> (default: md)
-v, --verbose (default: 0)

vs the current options for mistralrs-bench:

Options:
-p, --n-prompt <N_PROMPT>
Number of prompt tokens to run [default: 512]
-g, --n-gen <N_GEN>
Number of generations tokens to run [default: 128]
-c, --concurrency <CONCURRENCY>
Number of concurrent requests to run. Default is 1
-r, --repetitions <REPETITIONS>
Number of times to repeat each test [default: 5]
-n, --num-device-layers <NUM_DEVICE_LAYERS>
Number of device layers to load and run on the device. All others will be on the CPU

I tried to see if this was a regression by reverting back to an earlier commit, but prior to your commit for the GGUF tokenizer I cannot load that Hermes model with … The author of the model states they don't know how to create a tokenizer.json.
Command + Additional info

All commands output the same model config lines, so I'll omit that from the output examples:

INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: jeffq
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000

The command is the same for each, with the exception of appending the …

# Short version:
mistralrs-bench -n 32 -p 512 -g 0 -r 1 -c 1 gguf -m . -t . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
# Options expanded for clarity:
/app/target/release/mistralrs-bench \
--num-device-layers 32 \
--n-prompt 512 \
--n-gen 0 \
--repetitions 1 \
--concurrency 1 \
gguf -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf -m . -t .

Master branch (currently commit: …)
let raw_fixed = serde_json::to_vec_pretty(&tokenizer).unwrap();
std::fs::write(fixed_path, raw_fixed).unwrap();

mistral.rs/mistralrs-core/src/utils/tokenizer.rs, Lines 19 to 20 in 24b33b1:

let fixed_path = format!("{}_mistralrs_fixed", p.as_ref().display());
let fixed_path = Path::new(&fixed_path);
Seems to be applied unconditionally to satisfy a potential problem: huggingface/tokenizers#1528
Should probably be a CLI option, or even a subcommand to fix a local config if necessary?
- I see that it's not run again if the file exists already (not sure how compatible that is when a model updates though?).
- The filename is not handled properly: .json should be the extension, but presently it's tokenizer.json_mistralrs_fixed, when it should instead be more like tokenizer.patched_by_mistralrs.json, or tokenizer.json_mistralrs_fixed.json (if you must keep the tokenizer.json prefix). See the sketch below.
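Something along these lines would keep .json as the extension (a hypothetical sketch of the suggested naming, not the current mistral.rs code):

```rust
use std::path::{Path, PathBuf};

// Build e.g. `tokenizer.patched_by_mistralrs.json` next to the original file.
fn fixed_tokenizer_path(original: &Path) -> PathBuf {
    let stem = original
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("tokenizer");
    original.with_file_name(format!("{stem}.patched_by_mistralrs.json"))
}

fn main() {
    let patched = fixed_tokenizer_path(Path::new("/models/foo/tokenizer.json"));
    assert_eq!(patched, PathBuf::from("/models/foo/tokenizer.patched_by_mistralrs.json"));
}
```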
head -n 30 tokenizer.json:
{
"version": "1.0",
"truncation": null,
"padding": null,
"added_tokens": [
{
"id": 0,
"content": "<unk>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 1,
"content": "<s>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 2,
"content": "</s>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
head -n 30 tokenizer.json_mistralrs_fixed:
{
"added_tokens": [
{
"content": "<|begin_of_text|>",
"id": 128000,
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
{
"content": "<|end_of_text|>",
"id": 128001,
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
{
"content": "<|reserved_special_token_0|>",
"id": 128002,
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
{
$ wc -l tokenizer.json tokenizer.json_mistralrs_fixed
91121 tokenizer.json
410759 tokenizer.json_mistralrs_fixed
$ jq '.added_tokens[].id' tokenizer.json
0
1
2
# Original ID not present:
$ jq '.added_tokens[].id' tokenizer.json_mistralrs_fixed
128000
128001
# ...
128255
# ID of the token with content '<unk>':
$ jq '.added_tokens[] | select(.content == "<unk>") | .id' tokenizer.json
0
# References:
$ grep '<unk>' tokenizer.json
"content": "<unk>",
"unk_token": "<unk>",
"<unk>": 0,
# Token doesn't exist anymore (neither command produces output):
$ jq '.added_tokens[] | select(.content == "<unk>") | .id' tokenizer.json_mistralrs_fixed
$ grep '<unk>' tokenizer.json_mistralrs_fixed
So I'm not quite sure what this similarly named file is doing; is it intended to remove data that was there?
I tried a tokenizer.json from a llama 3 based model with an older commit, before the logic above was implemented. Even though it was not compatible with the Hermes model and resulted in the same failure, it did output these warning log lines:
WARN tokenizers::tokenizer::serialization: Warning: Token '<|begin_of_text|>' was expected to have ID '128000' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|end_of_text|>' was expected to have ID '128001' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|reserved_special_token_0|>' was expected to have ID '128002' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|reserved_special_token_1|>' was expected to have ID '128003' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|reserved_special_token_2|>' was expected to have ID '128004' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|reserved_special_token_3|>' was expected to have ID '128005' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|start_header_id|>' was expected to have ID '128006' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|end_header_id|>' was expected to have ID '128007' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|reserved_special_token_4|>' was expected to have ID '128008' but was given ID 'None'
WARN tokenizers::tokenizer::serialization: Warning: Token '<|eot_id|>' was expected to have ID '128009' but was given ID 'None'
Which makes sense for the error now I guess 😅 The tokenizer.json files I've been randomly trying from HF had the other mentioned tokens (bos_tokens / unk_token), but since im_end wasn't present the failure shouldn't be a surprise 😑
# No output
$ grep 'im_end' tokenizer.json
So on the tokenizer.json that was emitting those errors we can look up the entry for the last warning, <|eot_id|>; the id field is exactly what was expected, so I'm not sure why the warning says it got None 🤔
$ jq '.added_tokens[] | select(.content == "<|eot_id|>") | .' tokenizer.json
{
"id": 128009,
"content": "<|eot_id|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
}
EDIT: I should have read the utils/tokenizer.rs code; the mapping of vocab keys as added_tokens[].content to their respective id was missing:
mistral.rs/mistralrs-core/src/utils/tokenizer.rs, Lines 29 to 38 in 24b33b1:

for token in added_tokens {
    if !vocab.contains_key(&token.content) {
        tokenizer["model"]["vocab"]
            .as_object_mut()
            .unwrap()
            .insert(token.content, token.id.into())
            .ok_or(())
            .unwrap_err();
    }
}
I still don't think it should be needing to enforce the write as a way to cache, though?

The model in this case, perhaps due to the different layout with tokenizer.model instead of tokenizer.json, had the required tokens defined in a separate added_tokens.json file. Presumably when constructing a tokenizer.json it'd get the tokens from that?
{
"<|im_end|>": 32000,
"<|im_start|>": 32001,
// ...
}
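Something like the following could fold such a map into the vocab, mirroring what the snippet above does for in-file added_tokens (a hypothetical sketch, not mistral.rs code):

```rust
use std::collections::HashMap;
use serde_json::{json, Value};

// Merge a standalone added_tokens.json map (content -> id) into the
// tokenizer.json's `model.vocab` object.
fn merge_added_tokens(tokenizer: &mut Value, added: &HashMap<String, u64>) {
    let vocab = tokenizer["model"]["vocab"]
        .as_object_mut()
        .expect("tokenizer.model.vocab should be an object");
    for (content, id) in added {
        // Only insert tokens the vocab doesn't already contain.
        vocab.entry(content.clone()).or_insert(Value::from(*id));
    }
}

fn main() {
    let mut tokenizer = json!({ "model": { "vocab": { "<unk>": 0 } } });
    let added = HashMap::from([("<|im_end|>".to_string(), 32000u64)]);
    merge_added_tokens(&mut tokenizer, &added);
    assert_eq!(tokenizer["model"]["vocab"]["<|im_end|>"], json!(32000));
}
```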
Git history on master branch
I went back a bit further to when I would get the 401 unauth issue, and applied my single line opt-out workaround I shared at the end of this comment.
At this point I realized that I was selecting a commit that wasn't technically part of the master branch, it was interleaved with commits from other branches that were merged into master. So navigating the branch history via Github's web UI to select commits isn't that helpful 😓
So the main branch history has stuff like this (commits that are noisy in history, rather than meaningful):
EDIT: Seems like you've recently switched to using squash merge since #362 😎
Anyway, I went further back before gguf_tokenizer.rs was introduced and still had issues with tokenizer.json, so I must have found a somewhat compatible one somewhere on HF previously, but I can't recall where I got it from 🤦♂️
@polarathene sorry that it doesn't work out of the box yet! I responded to some of your comments:
Perhaps your HF token is not set? If it is set, though, this is strange and I cannot reproduce this.
No, it uses the same internal API. As documented here, you should provide a chat template with …
Hmm, can you please open an issue for that?
Ah, ok. I'll fix that.
That was indeed a bug and should be fixed now.
I did not have an HF account at the time. I'm still new to this domain, so I'd only been trying popular models that didn't require an account to login. I'm not sure why an HF token would be required when the HF repo for a model is publicly accessible? You can …

Personally that's not a great UX: if no token is present but one is required for the API, yet I have the files locally, why is HF required? I don't know if I'd ever train my own model as a learning experience, but it gives the impression that I'd have to redundantly upload to HF (or make a GGUF). I don't know the correct term, that Hermes one is apparently based off mistral, but that sort of thing I guess, where using the existing support for model architectures in …

The …

Possibly.

🎉 (I have not had time to verify this yet, but great news!)

?

It only seemed relevant for …

I'll need to run a new build again to verify, I guess 😅 If I need more than the GGUF file itself, is there a public HF repo you can refer me to for those files to go with the GGUF? Or do I need to create an HF account just to get access to such content for GGUF? (which AFAIK the local GGUF support is meant to remove the dependency on supplementary files)
Hi @polarathene! Since the bug here is fixed, I'll close this issue. Please feel free to reopen though!
We have made a few changes in that area of the code since, and that behavior is handled now.
We typically use the term "fine-tuning" to describe creating a new, better model based on an (aptly named) base model :).
To use HF Hub you need to create an account, I think. We have had fully local GGUF support for a bit now, though.
Describe the bug

If the number of device layers requested exceeds the model's layer count, the host layers to assign seem to wrap/overflow instead of being the expected 0.

NOTE: With llama-cpp you can configure a larger number of layers and the host layers will remain 0, while only the needed layers are used as device layers.
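For illustration, the wrap-around looks like what an unchecked unsigned subtraction would produce (an assumption about the mechanism, not the actual mistral.rs code):

```rust
fn main() {
    let model_layers: usize = 32;
    let device_layers: usize = 40; // e.g. `--num-device-layers 40`

    // An unchecked `model_layers - device_layers` wraps around to a huge
    // number instead of 0 when device_layers > model_layers.
    let wrapped_host_layers = model_layers.wrapping_sub(device_layers);
    println!("wrapped host layers: {wrapped_host_layers}");

    // The expected behaviour: clamp at zero.
    let expected_host_layers = model_layers.saturating_sub(device_layers);
    assert_eq!(expected_host_layers, 0);
}
```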
Context:

- Q4_KM GGUF model used, but this applies to any model where the layer count is exact.
- mistral.rs seems to enforce loading first through HuggingFace API calls; I've worked around that by allowing the 401 (Unauthorized) panic as described here. Meanwhile, unlike llama-cpp, additional config files are enforced... I sourced those from here, but it lacked a tokenizer.json file, so I gave it one from another model (this has no relevance to the error encountered).

Additional feedback

- mistral.rs output doesn't present information like layers as clearly to me as llama-cpp, and I don't know if there's some sort of inspect command to output/query the metadata?
- I had thought it was 33 layers, but looking over the llama-cpp output again I see it's 32 with an extra layer appended afterwards:
- I find this sort of information quite helpful, so if mistral.rs could communicate that better too that'd be nice 👍
- Better communicating the device/GPU like above would also be nice vs what it currently displays:
Latest commit
v0.1.8: ca9bf7d