koboldcpp_nocuda-new.exe --show --noavx2
***
Welcome to KoboldCpp - Version 1.80.3
For command line arguments, please refer to --help
***
Auto Selected Default Backend...
Initializing dynamic library: koboldcpp_vulkan_noavx2.dll
==========
Namespace(benchmark=None, blasbatchsize=-1, blasthreads=4, chatcompletionsadapter=None, config=None, contextsize=12288, debugmode=0, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, failsafe=False, flashattention=True, forceversion=0, foreground=True, gpulayers=1, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=False, lora=None, mmproj=None, model='', model_param='L:/AI-Models/L3-8B-Stheno-v3.2-NEO-V1-D_AU-Q4_K_M-imat13.gguf', moeexperts=-1, multiplayer=False, multiuser=1, noavx2=True, noblas=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=True, onready='', password=None, port=5001, port_param=5001, preloadstory=None, prompt='', promptlimit=100, quantkv=2, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdclamped=0, sdclipg='', sdclipl='', sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdquant=False, sdt5xxl='', sdthreads=2, sdvae='', sdvaeauto=False, showgui=True, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=4, unpack='', useclblast=None, usecpu=False, usecublas=None, usemlock=True, usevulkan=[0], whispermodel='')
==========
Loading Text Model: L:\AI-Models\L3-8B-Stheno-v3.2-NEO-V1-D_AU-Q4_K_M-imat13.gguf
The reported GGUF Arch is: llama
Arch Category: 0
---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
This means the RoPE values shown above will be replaced by the RoPE values reported after loading.
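For readability, the non-default settings in that Namespace dump amount to roughly this explicit launch command (a sketch only: the flag names follow KoboldCpp's usual CLI options, and the values are copied straight from the dump above):

    koboldcpp_nocuda-new.exe --model "L:/AI-Models/L3-8B-Stheno-v3.2-NEO-V1-D_AU-Q4_K_M-imat13.gguf" --contextsize 12288 --gpulayers 1 --threads 4 --usevulkan 0 --flashattention --quantkv 2 --noshift --usemlock --noavx2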
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon HD 8600/8700M (AMD proprietary driver) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
llama_load_model_from_file: using device Vulkan0 (AMD Radeon HD 8600/8700M) - 768 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from L:\AI-Models\L3-8B-Stheno-v3.2-NEO-V1-D_AU-Q4_K_M-imat13.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.58 GiB (4.89 BPW)
llm_load_print_meta: general.name = L3-8B-Stheno-v3.2
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_vulkan: Compiling shaders........................Done!
llm_load_tensors: relocated tensors: 282 of 291
PrefetchVirtualMemory skipped in compatibility mode.
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/33 layers to GPU
llm_load_tensors: Vulkan0 model buffer size = 117.03 MiB
llm_load_tensors: CPU_Mapped model buffer size = 4685.30 MiB
.......................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:1049812.4).
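As a quick cross-check of the metadata above, the derived values follow directly from the listed hyper-parameters. A small sketch (all numbers taken from the llm_load_print_meta lines; nothing here is KoboldCpp code):

    # Sanity-check the derived values printed by llm_load_print_meta.
    n_head, n_head_kv = 32, 8
    n_embd_head_k = 128

    n_gqa = n_head // n_head_kv               # -> 4, matches "n_gqa = 4"
    n_embd_k_gqa = n_embd_head_k * n_head_kv  # -> 1024, matches "n_embd_k_gqa = 1024"

    model_size_gib = 4.58                     # "model size = 4.58 GiB"
    model_params_b = 8.03                     # "model params = 8.03 B"
    bpw = model_size_gib * 1024**3 * 8 / (model_params_b * 1e9)
    print(n_gqa, n_embd_k_gqa, round(bpw, 2)) # 4 1024 4.9 (log reports 4.89 BPW from the exact file size)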
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 12288
llama_new_context_with_model: n_ctx_per_seq = 12288
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 16
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 1049812.4
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (12288) > n_ctx_train (8192) -- possible training context overflow
llama_kv_cache_init: kv_size = 12288, offload = 1, type_k = 'q4_0', type_v = 'q4_0', n_layer = 32
llama_kv_cache_init: layer 0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: Vulkan0 KV buffer size = 13.50 MiB
llama_kv_cache_init: CPU KV buffer size = 418.50 MiB
llama_new_context_with_model: KV self size = 432.00 MiB, K (q4_0): 216.00 MiB, V (q4_0): 216.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
Caution: pre-allocated tensor (k_cache_view-31 (copy of Kcur-31)) in a buffer (Vulkan0) that cannot run the operation (CPY)
llama_new_context_with_model: Vulkan0 compute buffer size = 2.25 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 16.75 MiB
llama_new_context_with_model: graph nodes = 903
llama_new_context_with_model: graph splits = 5
Traceback (most recent call last):
  File "koboldcpp.py", line 5027, in <module>
    main(parser.parse_args(),start_server=True)
  File "koboldcpp.py", line 4644, in main
    loadok = load_model(modelname)
  File "koboldcpp.py", line 890, in load_model
    ret = handle.load_model(inputs)
OSError: exception: access violation writing 0x0000000000001000
[3576] Failed to execute script 'koboldcpp' due to unhandled exception!
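For context on where that OSError comes from: the last traceback frame is the Python-to-native bridge into the backend DLL, and on Windows a crash inside the native library surfaces in Python as exactly this kind of access-violation OSError. A minimal sketch of that calling pattern (the structure fields and setup here are illustrative assumptions, not KoboldCpp's actual definitions; only handle.load_model(inputs) appears in the log above):

    import ctypes

    # Illustrative ctypes bridge of the kind the last traceback frame passes through.
    # Field names are hypothetical; the DLL name is the one reported in the log.
    class load_model_inputs(ctypes.Structure):
        _fields_ = [("model_filename", ctypes.c_char_p),
                    ("gpulayers", ctypes.c_int)]

    handle = ctypes.CDLL("koboldcpp_vulkan_noavx2.dll")
    handle.load_model.argtypes = [load_model_inputs]
    handle.load_model.restype = ctypes.c_bool

    inputs = load_model_inputs(model_filename=b"model.gguf", gpulayers=1)
    ret = handle.load_model(inputs)  # a fault inside the DLL is raised in Python as
                                     # "OSError: exception: access violation ..."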