
Container image fails to start with 'Unable to dynamically load the "cuda" shared library' #478

Closed
sammcj opened this issue Jun 24, 2024 · 9 comments
Labels: bug (Something isn't working)

Comments

@sammcj (Contributor) commented Jun 24, 2024

Describe the bug

When the mistralrs container using the official Dockerfile.cuda-all starts it crashes with:

thread 'main' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.6/src/lib.rs:98:5:
Unable to dynamically load the "cuda" shared library - searched for library names: ["cuda", "nvcuda"]. Ensure that `LD_LIBRARY_PATH` has the correct path to the installed library. If the shared library is present on the system under a different name than one of those listed above, please open a GitHub issue.
services:
  &name mistralrs:
    build:
      context: https://github.com/EricLBuehler/mistral.rs.git#master
      dockerfile: Dockerfile.cuda-all
      args:
        - FEATURES=cuda,flash-attn,cudnn
        - CUDA_COMPUTE_CAP=86
    container_name: *name
    hostname: *name
    ports:
      - 80
    volumes:
      - /mnt/llm/mistralrs/data:/data
      - /mnt/llm/models:/models
    command: gguf -m . -f /models/DeepSeek-Coder-V2-Instruct.IQ2_XXS.gguf
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidia1:/dev/nvidia1
      - /dev/nvidia2:/dev/nvidia2
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: ["compute", "utility", "graphics"]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
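
For reference, a quick way to check whether the NVIDIA runtime is actually injecting libcuda into this container (a diagnostic sketch only; it assumes the service name from the compose file above and that the image ships bash and ldconfig):

# Diagnostic: print LD_LIBRARY_PATH and any libcuda known to the dynamic linker inside the container.
docker compose run --rm --entrypoint /bin/bash mistralrs -c \
  'echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"; ldconfig -p | grep -i libcuda || echo "libcuda not found"'

If nothing matching libcuda.so shows up, the failure is in the NVIDIA container runtime setup rather than in mistral.rs itself.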

Latest commit

  • 615a10e (master as of 2024-06-25 9:32AM AEST)
sammcj added the "bug" label on Jun 24, 2024
@DerTiedemann commented:

I have a similar issue, see this comment for the details: huggingface/candle#353 (comment)

@EricLBuehler (Owner) commented:

@sammcj have you been able to reproduce this?

@sammcj (Contributor, Author) commented Nov 28, 2024

The latest mistralrs container image fails to start with the following error when using any Qwen 2.5 models, which are all I'm running at the moment (just because they're so much better than anything else):

2024-11-28T20:15:04.430523Z  INFO mistralrs_server: avx: false, neon: false, simd128: false, f16c: false
2024-11-28T20:15:04.430541Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-11-28T20:15:04.430551Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-11-28T20:15:04.430592Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-11-28T20:15:04.430635Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-11-28T20:15:04.430645Z  INFO mistralrs_core::pipeline::paths: Loading `/models/Qwen2.5-Coder-7B-Instruct-128k-Q6_K.gguf` locally at `/models/Qwen2.5-Coder-7B-Instruct-128k-Q6_K.gguf`
2024-11-28T20:15:04.430856Z  INFO mistralrs_core::pipeline::gguf: Loading model `.` on cuda[0].
Error: Unknown GGUF architecture `qwen2`

Stack backtrace:
   0: anyhow::error::<impl anyhow::Error>::msg
   1: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   3: mistralrs_server::main::{{closure}}
   4: tokio::runtime::park::CachedParkThread::block_on
   5: tokio::runtime::context::runtime::enter_runtime
   6: tokio::runtime::runtime::Runtime::block_on
   7: mistralrs_server::main
   8: std::sys::backtrace::__rust_begin_short_backtrace
   9: std::rt::lang_start::{{closure}}
  10: std::rt::lang_start_internal
  11: main
  12: <unknown>
  13: __libc_start_main
  14: _start

I'll download an older model and see if it works; I'll let you know.

@sammcj (Contributor, Author) commented Nov 28, 2024

Oh, I also use Mistral-Large - I just realised I had a GGUF for that, and it fails as well:

2024-11-28T20:16:56.476391Z  INFO mistralrs_server: avx: false, neon: false, simd128: false, f16c: false
2024-11-28T20:16:56.476411Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-11-28T20:16:56.476421Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-11-28T20:16:56.476468Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-11-28T20:16:56.476512Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-11-28T20:16:56.476520Z  INFO mistralrs_core::pipeline::paths: Loading `/models/Mistral-Large-Instruct-2411-IQ2_M.gguf` locally at `/models/Mistral-Large-Instruct-2411-IQ2_M.gguf`
2024-11-28T20:16:56.476761Z  INFO mistralrs_core::pipeline::gguf: Loading model `.` on cuda[0].
Error: path: "/models/Mistral-Large-Instruct-2411-IQ2_M.gguf" unknown dtype for tensor 21
   0: candle_core::error::Error::bt
   1: candle_core::quantized::GgmlDType::from_u32
   2: candle_core::quantized::gguf_file::Content::read
   3: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   5: mistralrs_server::main::{{closure}}
   6: tokio::runtime::park::CachedParkThread::block_on
   7: tokio::runtime::context::runtime::enter_runtime
   8: tokio::runtime::runtime::Runtime::block_on
   9: mistralrs_server::main
  10: std::sys::backtrace::__rust_begin_short_backtrace
  11: std::rt::lang_start::{{closure}}
  12: std::rt::lang_start_internal
  13: main
  14: <unknown>
  15: __libc_start_main
  16: _start


Stack backtrace:
   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
   1: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   3: mistralrs_server::main::{{closure}}
   4: tokio::runtime::park::CachedParkThread::block_on
   5: tokio::runtime::context::runtime::enter_runtime
   6: tokio::runtime::runtime::Runtime::block_on
   7: mistralrs_server::main
   8: std::sys::backtrace::__rust_begin_short_backtrace
   9: std::rt::lang_start::{{closure}}
  10: std::rt::lang_start_internal
  11: main
  12: <unknown>
  13: __libc_start_main
  14: _start

@sammcj (Contributor, Author) commented Nov 28, 2024

No go - it looks like it fails with Llama 3.2 as well:

thread 'main' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.6/src/lib.rs:98:5:
Unable to dynamically load the "cuda" shared library - searched for library names: ["cuda", "nvcuda"]. Ensure that `LD_LIBRARY_PATH` has the correct path to the installed library. If the shared library is present on the system under a different name than one of those listed above, please open a GitHub issue.
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: cudarc::panic_no_lib_found
   3: std::sys::sync::once::futex::Once::call
   4: std::sync::once_lock::OnceLock<T>::initialize
   5: cudarc::driver::safe::core::CudaDevice::new
   6: <candle_core::cuda_backend::device::CudaDevice as candle_core::backend::BackendDevice>::new
   7: candle_core::device::Device::cuda_if_available
   8: mistralrs_server::main::{{closure}}
   9: tokio::runtime::park::CachedParkThread::block_on
  10: tokio::runtime::context::runtime::enter_runtime
  11: tokio::runtime::runtime::Runtime::block_on
  12: mistralrs_server::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2024-11-28T20:19:34.592640Z  INFO mistralrs_server: avx: false, neon: false, simd128: false, f16c: false
2024-11-28T20:19:34.592673Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-11-28T20:19:34.592684Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-11-28T20:19:34.592728Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-11-28T20:19:34.592770Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-11-28T20:19:34.592779Z  INFO mistralrs_core::pipeline::paths: Loading `/models/Llama-3.2-1B-Instruct-Q8_0.gguf` locally at `/models/Llama-3.2-1B-Instruct-Q8_0.gguf`
2024-11-28T20:19:34.592994Z  INFO mistralrs_core::pipeline::gguf: Loading model `.` on cuda[0].
2024-11-28T20:19:34.729781Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.basename: Llama-3.2
general.file_type: 7
general.finetune: Instruct
general.languages: en, de, fr, it, pt, hi, es, th
general.license: llama3.2
general.name: Llama 3.2 1B Instruct
general.quantization_version: 2
general.size_label: 1B
general.tags: facebook, meta, pytorch, llama, llama-3, text-generation
general.type: model
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.key_length: 64
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.attention.value_length: 64
llama.block_count: 16
llama.context_length: 131072
llama.embedding_length: 2048
llama.feed_forward_length: 8192
llama.rope.dimension_count: 64
llama.rope.freq_base: 500000
llama.vocab_size: 128256
2024-11-28T20:19:34.862767Z  INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `gpt2`, kind: `Bpe`, num tokens: 128256, num added tokens: 0, num merges: 280147, num scores: 0
2024-11-28T20:19:34.867489Z  INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: `{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|> '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|> ' }}`
Error: cannot find tensor info for output.weight
   0: candle_core::error::Error::bt
   1: candle_core::quantized::gguf_file::Content::tensor
   2: <mistralrs_core::models::quantized_llama::ModelWeights as mistralrs_core::utils::model_config::FromGGUF>::from_gguf
   3: mistralrs_core::utils::model_config::<impl core::convert::TryFrom<mistralrs_core::utils::model_config::ModelParams<mistralrs_core::utils::model_config::ParamsGGUF>> for mistralrs_core::models::quantized_llama::ModelWeights>::try_from
   4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   5: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   6: mistralrs_server::main::{{closure}}
   7: tokio::runtime::park::CachedParkThread::block_on
   8: tokio::runtime::context::runtime::enter_runtime
   9: tokio::runtime::runtime::Runtime::block_on
  10: mistralrs_server::main
  11: std::sys::backtrace::__rust_begin_short_backtrace
  12: std::rt::lang_start::{{closure}}
  13: std::rt::lang_start_internal
  14: main
  15: <unknown>
  16: __libc_start_main
  17: _start


Stack backtrace:
   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
   1: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   3: mistralrs_server::main::{{closure}}
   4: tokio::runtime::park::CachedParkThread::block_on
   5: tokio::runtime::context::runtime::enter_runtime
   6: tokio::runtime::runtime::Runtime::block_on
   7: mistralrs_server::main
   8: std::sys::backtrace::__rust_begin_short_backtrace
   9: std::rt::lang_start::{{closure}}
  10: std::rt::lang_start_internal
  11: main
  12: <unknown>
  13: __libc_start_main
  14: _start

@EricLBuehler (Owner) commented:

@sammcj we don't support the I- quants yet (that explains mistral-large). They will be added soon with the upcoming imatrix support 😉!

Can you please update your CUDA Docker container? I just released some new images (our image for compute cap 75 is now deprecated).
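
(Since the compose file above builds from the repo rather than pulling a prebuilt image, a sketch of how to pick up the update - the service name is assumed from the compose file earlier in this thread:)

# Rebuild from the current master and recreate the container so the new image is used.
docker compose build --no-cache mistralrs
docker compose up -d --force-recreate mistralrs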

@sammcj (Contributor, Author) commented Nov 28, 2024

> Can you please update your CUDA Docker container? I just released some new images (our image for compute cap 75 is now deprecated).

Hey that fixed Llama 3.2! Nice work! 🎉

2024-11-28T20:26:35.508451Z  INFO mistralrs_server: avx: false, neon: false, simd128: false, f16c: false
2024-11-28T20:26:35.508470Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-11-28T20:26:35.508483Z  INFO mistralrs_server: Model kind is: gguf quantized from gguf (no adapters)
2024-11-28T20:26:35.508530Z  INFO candle_hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-11-28T20:26:35.508594Z  INFO candle_hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-11-28T20:26:35.508612Z  INFO mistralrs_core::pipeline::paths: Loading `/models/Llama-3.2-1B-Instruct-Q8_0.gguf` locally at `/models/Llama-3.2-1B-Instruct-Q8_0.gguf`
2024-11-28T20:26:35.508884Z  INFO mistralrs_core::pipeline::gguf: Loading model `.` on cuda[0].
2024-11-28T20:26:35.657013Z  INFO mistralrs_core::gguf::content: Model config:
general.architecture: llama
general.basename: Llama-3.2
general.file_type: 7
general.finetune: Instruct
general.languages: en, de, fr, it, pt, hi, es, th
general.license: llama3.2
general.name: Llama 3.2 1B Instruct
general.quantization_version: 2
general.size_label: 1B
general.tags: facebook, meta, pytorch, llama, llama-3, text-generation
general.type: model
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.key_length: 64
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.attention.value_length: 64
llama.block_count: 16
llama.context_length: 131072
llama.embedding_length: 2048
llama.feed_forward_length: 8192
llama.rope.dimension_count: 64
llama.rope.freq_base: 500000
llama.vocab_size: 128256
2024-11-28T20:26:35.792288Z  INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `gpt2`, kind: `Bpe`, num tokens: 128256, num added tokens: 0, num merges: 280147, num scores: 0
2024-11-28T20:26:35.797345Z  INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: `{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|> '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|> ' }}`
2024-11-28T20:26:40.515726Z  INFO mistralrs_core::paged_attention: Allocating 19069 MB for PagedAttention KV cache
2024-11-28T20:26:40.515743Z  INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 9534 GPU blocks: available context length is 305088 tokens
2024-11-28T20:26:40.532147Z  INFO mistralrs_core::pipeline::paths: Using literal chat template.
2024-11-28T20:26:40.672037Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", unk_tok = `None`
2024-11-28T20:26:40.676076Z  INFO mistralrs_server: Model loaded.
2024-11-28T20:26:40.740919Z  INFO mistralrs_core: Enabling GEMM reduced precision in BF16.
2024-11-28T20:26:40.764329Z  INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2024-11-28T20:26:40.765075Z  INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2024-11-28T20:26:40.765145Z  INFO mistralrs_core: Beginning dummy run.
2024-11-28T20:26:45.690614Z  INFO mistralrs_core: Dummy run completed in 4.925459226s.
2024-11-28T20:26:45.690872Z  INFO mistralrs_server: Serving on http://0.0.0.0:80.

FYI I'm currently running 2x 3090 (compute 86,86+PTX); still trying to decide what to do with the 2x A4000 and 2x P100 I've got sitting here that won't fit in my case 😅
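
(A minimal smoke test against the OpenAI-compatible endpoint once the server reports it is serving - a sketch only: the port assumes you publish container port 80 to the same host port, and "default" is just a placeholder model id:)

# Send a single chat completion request to the running server.
curl -s http://localhost:80/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'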

@sammcj (Contributor, Author) commented Nov 28, 2024

FYI Qwen 2.5 32B Q6_K starts to load with the updated container image, then crashes with CUDA out of memory - it looks like it's only using 1 of the 2 GPUs. I'll check whether I have any configuration issues; if I don't, I'll log a separate bug for it.

@sammcj (Contributor, Author) commented Nov 28, 2024

FYI manually specifying the number of layers to place on each GPU correctly used both:

command: -n "0:32;1:32" gguf -m . -f /models/Qwen2.5-Coder-32B-Instruct-128k-Q6_K.gguf
2024-11-28T20:38:45.931592Z  INFO mistralrs_core::device_map: Model has 64 repeating layers.
2024-11-28T20:38:45.990780Z  INFO mistralrs_core::device_map: Loading model according to the following repeating layer mappings:
2024-11-28T20:38:45.990801Z  INFO mistralrs_core::device_map: Layer 0: cuda[0]
2024-11-28T20:38:45.990803Z  INFO mistralrs_core::device_map: Layer 1: cuda[0]
2024-11-28T20:38:45.990804Z  INFO mistralrs_core::device_map: Layer 2: cuda[0]
2024-11-28T20:38:45.990804Z  INFO mistralrs_core::device_map: Layer 3: cuda[0]
2024-11-28T20:38:45.990805Z  INFO mistralrs_core::device_map: Layer 4: cuda[0]
2024-11-28T20:38:45.990806Z  INFO mistralrs_core::device_map: Layer 5: cuda[0]
2024-11-28T20:38:45.990806Z  INFO mistralrs_core::device_map: Layer 6: cuda[0]
2024-11-28T20:38:45.990806Z  INFO mistralrs_core::device_map: Layer 7: cuda[0]
2024-11-28T20:38:45.990807Z  INFO mistralrs_core::device_map: Layer 8: cuda[0]
2024-11-28T20:38:45.990808Z  INFO mistralrs_core::device_map: Layer 9: cuda[0]
2024-11-28T20:38:45.990808Z  INFO mistralrs_core::device_map: Layer 10: cuda[0]
2024-11-28T20:38:45.990809Z  INFO mistralrs_core::device_map: Layer 11: cuda[0]
2024-11-28T20:38:45.990809Z  INFO mistralrs_core::device_map: Layer 12: cuda[0]
2024-11-28T20:38:45.990810Z  INFO mistralrs_core::device_map: Layer 13: cuda[0]
2024-11-28T20:38:45.990810Z  INFO mistralrs_core::device_map: Layer 14: cuda[0]
2024-11-28T20:38:45.990811Z  INFO mistralrs_core::device_map: Layer 15: cuda[0]
2024-11-28T20:38:45.990811Z  INFO mistralrs_core::device_map: Layer 16: cuda[0]
2024-11-28T20:38:45.990811Z  INFO mistralrs_core::device_map: Layer 17: cuda[0]
2024-11-28T20:38:45.990812Z  INFO mistralrs_core::device_map: Layer 18: cuda[0]
2024-11-28T20:38:45.990812Z  INFO mistralrs_core::device_map: Layer 19: cuda[0]
2024-11-28T20:38:45.990813Z  INFO mistralrs_core::device_map: Layer 20: cuda[0]
2024-11-28T20:38:45.990813Z  INFO mistralrs_core::device_map: Layer 21: cuda[0]
2024-11-28T20:38:45.990814Z  INFO mistralrs_core::device_map: Layer 22: cuda[0]
2024-11-28T20:38:45.990814Z  INFO mistralrs_core::device_map: Layer 23: cuda[0]
2024-11-28T20:38:45.990815Z  INFO mistralrs_core::device_map: Layer 24: cuda[0]
2024-11-28T20:38:45.990815Z  INFO mistralrs_core::device_map: Layer 25: cuda[0]
2024-11-28T20:38:45.990816Z  INFO mistralrs_core::device_map: Layer 26: cuda[0]
2024-11-28T20:38:45.990816Z  INFO mistralrs_core::device_map: Layer 27: cuda[0]
2024-11-28T20:38:45.990817Z  INFO mistralrs_core::device_map: Layer 28: cuda[0]
2024-11-28T20:38:45.990817Z  INFO mistralrs_core::device_map: Layer 29: cuda[0]
2024-11-28T20:38:45.990818Z  INFO mistralrs_core::device_map: Layer 30: cuda[0]
2024-11-28T20:38:45.990818Z  INFO mistralrs_core::device_map: Layer 31: cuda[0]
2024-11-28T20:38:45.990819Z  INFO mistralrs_core::device_map: Layer 32: cuda[1]
2024-11-28T20:38:45.990819Z  INFO mistralrs_core::device_map: Layer 33: cuda[1]
2024-11-28T20:38:45.990820Z  INFO mistralrs_core::device_map: Layer 34: cuda[1]
2024-11-28T20:38:45.990820Z  INFO mistralrs_core::device_map: Layer 35: cuda[1]
2024-11-28T20:38:45.990821Z  INFO mistralrs_core::device_map: Layer 36: cuda[1]
2024-11-28T20:38:45.990821Z  INFO mistralrs_core::device_map: Layer 37: cuda[1]
2024-11-28T20:38:45.990822Z  INFO mistralrs_core::device_map: Layer 38: cuda[1]
2024-11-28T20:38:45.990822Z  INFO mistralrs_core::device_map: Layer 39: cuda[1]
2024-11-28T20:38:45.990823Z  INFO mistralrs_core::device_map: Layer 40: cuda[1]
2024-11-28T20:38:45.990823Z  INFO mistralrs_core::device_map: Layer 41: cuda[1]
2024-11-28T20:38:45.990824Z  INFO mistralrs_core::device_map: Layer 42: cuda[1]
2024-11-28T20:38:45.990824Z  INFO mistralrs_core::device_map: Layer 43: cuda[1]
2024-11-28T20:38:45.990825Z  INFO mistralrs_core::device_map: Layer 44: cuda[1]
2024-11-28T20:38:45.990825Z  INFO mistralrs_core::device_map: Layer 45: cuda[1]
2024-11-28T20:38:45.990826Z  INFO mistralrs_core::device_map: Layer 46: cuda[1]
2024-11-28T20:38:45.990827Z  INFO mistralrs_core::device_map: Layer 47: cuda[1]
2024-11-28T20:38:45.990827Z  INFO mistralrs_core::device_map: Layer 48: cuda[1]
2024-11-28T20:38:45.990828Z  INFO mistralrs_core::device_map: Layer 49: cuda[1]
2024-11-28T20:38:45.990828Z  INFO mistralrs_core::device_map: Layer 50: cuda[1]
2024-11-28T20:38:45.990829Z  INFO mistralrs_core::device_map: Layer 51: cuda[1]
2024-11-28T20:38:45.990829Z  INFO mistralrs_core::device_map: Layer 52: cuda[1]
2024-11-28T20:38:45.990830Z  INFO mistralrs_core::device_map: Layer 53: cuda[1]
2024-11-28T20:38:45.990830Z  INFO mistralrs_core::device_map: Layer 54: cuda[1]
2024-11-28T20:38:45.990831Z  INFO mistralrs_core::device_map: Layer 55: cuda[1]
2024-11-28T20:38:45.990831Z  INFO mistralrs_core::device_map: Layer 56: cuda[1]
2024-11-28T20:38:45.990832Z  INFO mistralrs_core::device_map: Layer 57: cuda[1]
2024-11-28T20:38:45.990832Z  INFO mistralrs_core::device_map: Layer 58: cuda[1]
2024-11-28T20:38:45.990833Z  INFO mistralrs_core::device_map: Layer 59: cuda[1]
2024-11-28T20:38:45.990833Z  INFO mistralrs_core::device_map: Layer 60: cuda[1]
2024-11-28T20:38:45.990834Z  INFO mistralrs_core::device_map: Layer 61: cuda[1]
2024-11-28T20:38:45.990834Z  INFO mistralrs_core::device_map: Layer 62: cuda[1]
2024-11-28T20:38:45.990835Z  INFO mistralrs_core::device_map: Layer 63: cuda[1]
2024-11-28T20:38:56.371175Z  INFO mistralrs_core::pipeline::paths: Using literal chat template.
2024-11-28T20:38:56.533169Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|endoftext|>", eos_toks = "<|im_end|>", unk_tok = `None`
2024-11-28T20:38:56.537793Z  INFO mistralrs_server: Model loaded.
2024-11-28T20:38:56.578218Z  INFO mistralrs_core: Enabling GEMM reduced precision in BF16.
2024-11-28T20:38:56.598765Z  INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2024-11-28T20:38:56.615615Z  INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2024-11-28T20:38:56.615700Z  INFO mistralrs_core: Beginning dummy run.
2024-11-28T20:39:01.392310Z ERROR mistralrs_core::engine: prompt step - Model failed with error: WithBacktrace { inner: DeviceMismatchBinaryOp { lhs: Cuda { gpu_id: 0 }, rhs: Cuda { gpu_id: 1 }, op: "slice-set" }, backtrace: Backtrace [{ fn: "candle_core::error::Error::bt" }, { fn: "candle_core::tensor_cat::<impl candle_core::tensor::Tensor>::slice_set" }, { fn: "mistralrs_core::pipeline::cache_manager::SingleCache::append" }, { fn: "mistralrs_core::pipeline::cache_manager::KvCache::append" }, { fn: "mistralrs_core::models::quantized_qwen2::ModelWeights::forward" }, { fn: "<mistralrs_core::pipeline::gguf::GGUFPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs" }, { fn: "mistralrs_core::pipeline::Pipeline::step::{{closure}}" }, { fn: "mistralrs_core::engine::Engine::run::{{closure}}" }, { fn: "tokio::runtime::runtime::Runtime::block_on" }, { fn: "std::sys::backtrace::__rust_begin_short_backtrace" }, { fn: "core::ops::function::FnOnce::call_once{{vtable.shim}}" }, { fn: "std::sys::pal::unix::thread::Thread::new::thread_start" }, { fn: "clone" }] }
2024-11-28T20:39:01.392437Z  INFO mistralrs_core: Dummy run completed in 4.776731074s.
2024-11-28T20:39:01.392736Z  INFO mistralrs_server: Serving on http://0.0.0.0:80.

But then fails:

2024-11-28T20:39:01.392310Z ERROR mistralrs_core::engine: prompt step - Model failed with error: WithBacktrace { inner: DeviceMismatchBinaryOp { lhs: Cuda { gpu_id: 0 }, rhs: Cuda { gpu_id: 1 }, op: "slice-set" }, backtrace: Backtrace [{ fn: "candle_core::error::Error::bt" }, { fn: "candle_core::tensor_cat::<impl candle_core::tensor::Tensor>::slice_set" }, { fn: "mistralrs_core::pipeline::cache_manager::SingleCache::append" }, { fn: "mistralrs_core::pipeline::cache_manager::KvCache::append" }, { fn: "mistralrs_core::models::quantized_qwen2::ModelWeights::forward" }, { fn: "<mistralrs_core::pipeline::gguf::GGUFPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs" }, { fn: "mistralrs_core::pipeline::Pipeline::step::{{closure}}" }, { fn: "mistralrs_core::engine::Engine::run::{{closure}}" }, { fn: "tokio::runtime::runtime::Runtime::block_on" }, { fn: "std::sys::backtrace::__rust_begin_short_backtrace" }, { fn: "core::ops::function::FnOnce::call_once{{vtable.shim}}" }, { fn: "std::sys::pal::unix::thread::Thread::new::thread_start" }, { fn: "clone" }] }

sammcj closed this as completed Nov 28, 2024