***
Welcome to KoboldCpp - Version 1.80.3
For command line arguments, please refer to --help
***
Auto Selected Vulkan Backend...
Initializing dynamic library: koboldcpp_vulkan_noavx2.dll
==========
Namespace(benchmark='stdout', blasbatchsize=-1, blasthreads=4, chatcompletionsadapter=None, config=None, contextsize=1024, debugmode=0, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, failsafe=False, flashattention=False, forceversion=0, foreground=True, gpulayers=5, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=False, lora=None, mmproj=None, model='', model_param='L:/AI-Models/mistral7b-erebus-v3.Q4_K_M.gguf', moeexperts=-1, multiplayer=False, multiuser=1, noavx2=True, noblas=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=True, onready='', password=None, port=5001, port_param=5001, preloadstory=None, prompt='', promptlimit=100, quantkv=0, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdclamped=0, sdclipg='', sdclipl='', sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdquant=False, sdt5xxl='', sdthreads=2, sdvae='', sdvaeauto=False, showgui=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=4, unpack='', useclblast=None, usecpu=False, usecublas=None, usemlock=True, usevulkan=[0], whispermodel='')
==========
Loading Text Model: L:\AI-Models\mistral7b-erebus-v3.Q4_K_M.gguf
The reported GGUF Arch is: llama
Arch Category: 0
---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
It means that the RoPE values written above will be replaced by the RoPE values indicated after loading.
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon HD 8600/8700M (AMD proprietary driver) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
llama_load_model_from_file: using device Vulkan0 (AMD Radeon HD 8600/8700M) - 768 MiB free
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from L:\AI-Models\mistral7b-erebus-v3.Q4_K_M.gguf (version GGUF V3 (latest))
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name = models
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 ''
llm_load_print_meta: max token length = 48
ggml_vulkan: Compiling shaders........................Done!
llm_load_tensors: relocated tensors: 246 of 291
PrefetchVirtualMemory skipped in compatibility mode.
llm_load_tensors: offloading 5 repeating layers to GPU
llm_load_tensors: offloaded 5/33 layers to GPU
llm_load_tensors: Vulkan0 model buffer size = 662.50 MiB
llm_load_tensors: CPU_Mapped model buffer size = 4165.37 MiB
................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 1024
llama_new_context_with_model: n_ctx_per_seq = 1024
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 16
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 1024, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32
llama_kv_cache_init: layer 0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: Vulkan0 KV buffer size = 20.00 MiB
llama_kv_cache_init: CPU KV buffer size = 108.00 MiB
llama_new_context_with_model: KV self size = 128.00 MiB, K (f16): 64.00 MiB, V (f16): 64.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: Vulkan0 compute buffer size = 3.13 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 3.13 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 3
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Running benchmark (Not Saved)...
Processing Prompt (924 / 924 tokens)
Generating (100 / 100 tokens)
[00:00:00] CtxLimit:1024/1024, Amt:100/100, Init:0.10s, Process:389.04s (421.0ms/T = 2.38T/s), Generate:55.03s (550.3ms/T = 1.82T/s), Total:444.07s (0.23T/s)
Benchmark Completed - v1.80.3 Results:
======
Flags: NoAVX2=True Threads=4 HighPriority=False Cublas_Args=None Tensor_Split=None BlasThreads=4 BlasBatchSize=-1 FlashAttention=False KvCache=0
Timestamp: 0000-00-00 00:00:00.000000+00:00
Backend: koboldcpp_vulkan_noavx2.dll
Layers: 5
Model: mistral7b-erebus-v3.Q4_K_M
MaxCtx: 1024
GenAmount: 100
-----
ProcessingTime: 389.041s
ProcessingSpeed: 2.38T/s
GenerationTime: 55.034s
GenerationSpeed: 1.82T/s
TotalTime: 444.075s
Output: 1 1 1 1
-----
===
Press ENTER key to exit.