***
Welcome to KoboldCpp - Version 1.80.1
For command line arguments, please refer to --help
***
Auto Selected Vulkan Backend...

Initializing dynamic library: koboldcpp_clblast.dll
==========
Namespace(benchmark='stdout', blasbatchsize=-1, blasthreads=4, chatcompletionsadapter=None, config=None, contextsize=2048, debugmode=0, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, failsafe=False, flashattention=True, forceversion=0, foreground=False, gpulayers=0, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=False, lora=None, mmproj=None, model='', model_param='L:/AI-Models/mistral7b-erebus-v3.Q4_K_M.gguf', moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=True, onready='', password=None, port=5001, port_param=5001, preloadstory=None, prompt='', promptlimit=100, quantkv=2, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdclamped=0, sdclipg='', sdclipl='', sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdquant=False, sdt5xxl='', sdthreads=2, sdvae='', sdvaeauto=False, showgui=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=4, unpack='', useclblast=[1, 0], usecpu=False, usecublas=None, usemlock=True, usevulkan=None, whispermodel='')
==========
Loading Text Model: L:\AI-Models\mistral7b-erebus-v3.Q4_K_M.gguf

The reported GGUF Arch is: llama
Arch Category: 0

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
It means that the RoPE values written above will be replaced by the RoPE values indicated after loading.
Platform:0 Device:0 - AMD Accelerated Parallel Processing with Oland
Platform:1 Device:0 - Intel(R) OpenCL with Intel(R) Core(TM) i3-3120M CPU @ 2.50GHz
Platform:1 Device:1 - Intel(R) OpenCL with Intel(R) HD Graphics 4000
ggml_opencl: selecting platform: 'Intel(R) OpenCL'
ggml_opencl: selecting device: ' Intel(R) Core(TM) i3-3120M CPU @ 2.50GHz'
ggml_opencl: warning, not a GPU: ' Intel(R) Core(TM) i3-3120M CPU @ 2.50GHz'.
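Note that the Namespace above has useclblast=[1, 0], which the OpenCL lines resolve to Platform 1 / Device 0 -- the i3-3120M CPU rather than the AMD Oland GPU on Platform 0 -- hence the "not a GPU" warning. A minimal sketch for listing the platform/device indices before choosing values for --useclblast (this assumes pyopencl is installed; it is not part of KoboldCpp):

```python
# Sketch: enumerate OpenCL platforms and devices to find the index pair
# that points at the intended GPU. Not KoboldCpp code; requires pyopencl.
import pyopencl as cl

for p_idx, platform in enumerate(cl.get_platforms()):
    for d_idx, device in enumerate(platform.get_devices()):
        print(f"Platform:{p_idx} Device:{d_idx} - {platform.name} with {device.name}")
```

On this machine the output should match the three Platform/Device lines in the log, suggesting the AMD GPU would be selected with indices 0 0 instead of 1 0.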
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from L:\AI-Models\mistral7b-erebus-v3.Q4_K_M.gguf (version GGUF V3 (latest))
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name = models
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 ''
llm_load_print_meta: max token length = 48
OpenCL GPU Offload Fallback...
llm_load_tensors: relocated tensors: 291 of 291
PrefetchVirtualMemory skipped in compatibility mode.
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 4165.37 MiB
................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
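The metadata above is internally consistent; a quick sanity-check sketch (values copied from the llm_load_print_meta lines, not KoboldCpp code) re-derives the 4.83 BPW figure and the GQA numbers:

```python
# Back-of-the-envelope check of the printed model metadata.
model_size_gib = 4.07           # "model size = 4.07 GiB"
n_params       = 7.24e9         # "model params = 7.24 B"
n_head, n_head_kv = 32, 8       # attention heads vs. KV heads
n_embd_head_k  = 128            # per-head key width

bpw = model_size_gib * 1024**3 * 8 / n_params   # bits per weight
n_gqa = n_head // n_head_kv                     # grouped-query attention factor
n_embd_k_gqa = n_embd_head_k * n_head_kv        # per-token K width in the cache

print(f"{bpw:.2f} BPW")      # ~4.83, matching "(4.83 BPW)"
print(n_gqa, n_embd_k_gqa)   # 4 and 1024, matching n_gqa and n_embd_k_gqa
```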
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 16
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'q4_0', type_v = 'q4_0', n_layer = 32
llama_kv_cache_init: layer 0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: CPU KV buffer size = 72.00 MiB
llama_new_context_with_model: KV self size = 72.00 MiB, K (q4_0): 36.00 MiB, V (q4_0): 36.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 2.63 MiB
llama_new_context_with_model: graph nodes = 903
llama_new_context_with_model: graph splits = 1
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
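The 72.00 MiB KV self size is what you would expect from quantkv=2 (q4_0 keys and values). A rough sketch, assuming q4_0's layout of 18 bytes per 32-element block (this is an illustration, not llama.cpp's actual allocation code):

```python
# Reproduce the reported KV cache size from the context/layer dimensions above.
n_ctx        = 2048
n_layer      = 32
n_embd_k_gqa = 1024             # equals n_embd_v_gqa here
bytes_per_elem = 18 / 32        # q4_0: 18-byte block holding 32 values

k_bytes   = n_ctx * n_embd_k_gqa * n_layer * bytes_per_elem
total_mib = 2 * k_bytes / 1024**2      # K cache plus V cache

print(total_mib)   # 72.0, matching "KV self size = 72.00 MiB" (36 MiB each for K and V)
```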
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/

Running benchmark (Not Saved)...

Processing Prompt (1948 / 1948 tokens)
Generating (100 / 100 tokens)
[00:00:00] CtxLimit:2048/2048, Amt:100/100, Init:0.10s, Process:905.98s (465.1ms/T = 2.15T/s), Generate:62.47s (624.7ms/T = 1.60T/s), Total:968.45s (0.10T/s)

Benchmark Completed - v1.80.1 Results:
======
Flags: NoAVX2=False Threads=4 HighPriority=False Cublas_Args=None Tensor_Split=None BlasThreads=4 BlasBatchSize=-1 FlashAttention=True KvCache=2
Timestamp: 0000-00-00 00:00:00.000000+00:00
Backend: koboldcpp_clblast.dll Layers: 0
Model: mistral7b-erebus-v3.Q4_K_M
MaxCtx: 2048
GenAmount: 100
-----
ProcessingTime: 905.981s
ProcessingSpeed: 2.15T/s
GenerationTime: 62.469s
GenerationSpeed: 1.60T/s
TotalTime: 968.450s
Output: 1 1 1 1
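The benchmark figures are internally consistent. A short sketch re-deriving the three throughput numbers from the token counts and timings in the log (nothing here is measured; the values are copied from the output above):

```python
# Re-derive the benchmark throughput figures from the logged counts and times.
prompt_tokens, gen_tokens = 1948, 100
process_s, generate_s     = 905.98, 62.47

print(prompt_tokens / process_s)               # ~2.15 T/s  -> ProcessingSpeed
print(gen_tokens / generate_s)                 # ~1.60 T/s  -> GenerationSpeed
print(gen_tokens / (process_s + generate_s))   # ~0.10 T/s  -> the Total figure
```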