Fixes for codellama #2768

Merged: 1 commit merged into master from codellama-fixes on Aug 24, 2023

Conversation

slaren (Collaborator) commented Aug 24, 2023

Changes convert.py to allow a missing vocab_size in params.json and adds an enum value for the 34B model.
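
For context, a minimal illustrative sketch (not the PR's actual diff) of the kind of fallback this enables when params.json has no usable vocab_size; the function name and the fallback source are assumptions:

```python
def resolve_vocab_size(params: dict, tensor_shapes: dict) -> int:
    """Illustrative fallback: use vocab_size from params.json when it is present
    and positive, otherwise recover it from tok_embeddings.weight, whose first
    dimension is the vocabulary size."""
    vocab_size = params.get("vocab_size", -1)
    if vocab_size > 0:
        return vocab_size
    return tensor_shapes["tok_embeddings.weight"][0]


# params.json without vocab_size, plus the 34B embedding shape quoted later in
# this thread ([32000, 8192]) -> 32000
print(resolve_vocab_size({"dim": 8192, "n_layers": 48}, {"tok_embeddings.weight": (32000, 8192)}))
```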

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6

| model | backend | n_gpu_layers | test | t/s |
| --- | --- | --- | --- | --- |
| LLaMA v2 34B mostly Q4_K - Small | CUDA | 99 | pp 512 | 530.11 ± 0.60 |
| LLaMA v2 34B mostly Q4_K - Small | CUDA | 99 | tg 128 | 18.55 ± 0.02 |

slaren (Collaborator, Author) commented Aug 24, 2023

Short perplexity test:
7b q5_k_m: [1]6.3455,[2]7.2255,[3]9.2265,[4]10.3542,[5]10.4333,[6]9.9436,[7]10.4459,[8]10.3019,[9]10.8475,[10]11.3143,[11]11.8393,[12]11.8581,
13b q5_k_m: [1]5.8744,[2]7.0650,[3]7.9541,[4]9.2436,[5]9.7071,[6]9.6262,[7]9.7747,[8]9.8889,[9]10.2961,[10]10.7754,[11]11.1154,[12]11.1430,
34b q4_k_m: [1]5.3722,[2]6.8634,[3]18.5997,[4]18.0402,[5]20.8216,[6]18.8895,[7]24.5365,[8]23.1267,[9]30.7042,[10]32.4732,[11]38.5647,[12]35.2927,

The 34B perplexity seems to be increasing a bit too much; might have to look into it.

slaren merged commit fea95c6 into master on Aug 24, 2023
slaren deleted the codellama-fixes branch on August 24, 2023 at 15:44
slaren (Collaborator, Author) commented Aug 24, 2023

Final perplexity for 34B q4_k_m is 63.1600. It might simply be due to the different dataset, but something may be wrong.
For comparison, 7B q5_k_m: 10.1548

llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                          general.file_type u32
llama_model_loader: - kv  11:                       tokenizer.ggml.model str
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  15:               general.quantization_version u32
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q4_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_print_meta: format         = GGUF V1 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 512
llm_load_print_meta: n_embd         = 8192
llm_load_print_meta: n_head         = 64
llm_load_print_meta: n_head_kv      = 8
llm_load_print_meta: n_layer        = 48
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 8
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 22016
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 34B
llm_load_print_meta: model ftype    = mostly Q4_K - Medium
llm_load_print_meta: model size     = 33.74 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.13 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  140.76 MB (+   96.00 MB per state)
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 51/51 layers to GPU
llm_load_tensors: VRAM used: 19238 MB
....................................................................................................
llama_new_context_with_model: kv self size  =   96.00 MB
llama_new_context_with_model: compute buffer total size =  119.41 MB
llama_new_context_with_model: VRAM scratch buffer: 118.00 MB

system_info: n_threads = 1 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity: tokenizing the input ..
perplexity: calculating perplexity over 655 chunks, batch_size=512
perplexity: 0.99 seconds per pass - ETA 10.77 minutes
[1]5.3722,[2]6.8634,[3]18.5997,[4]18.0402,[5]20.8216,[6]18.8895,[7]24.5365,[8]23.1267,[9]30.7042,[10]32.4732,[11]38.5647,[12]35.2927,[13]35.3977,[14]40.8779,[15]50.0018,[16]44.6572,[17]42.0392,[18]47.4379,[19]42.1442,[20]45.5599,[21]48.5124,[22]50.1700,[23]46.9413,[24]49.1561,[25]48.3779,[26]44.4299,[27]43.9957,[28]42.0548,[29]43.4902,[30]43.0095,[31]43.9184,[32]42.1812,[33]42.3804,[34]44.5481,[35]45.8786,[36]44.1231,[37]45.8371,[38]46.3058,[39]47.8326,[40]49.2811,[41]47.9219,[42]47.5123,[43]48.8565,[44]49.9723,[45]51.3837,[46]52.2847,[47]51.7251,[48]53.2990,[49]55.3860,[50]56.1291,[51]56.6182,[52]56.4147,[53]57.7867,[54]59.2960,[55]56.8689,[56]55.7473,[57]55.4897,[58]55.7443,[59]54.9167,[60]56.6449,[61]57.6218,[62]56.8220,[63]57.7998,[64]58.3952,[65]58.9115,[66]59.2383,[67]60.9043,[68]60.6064,[69]61.1762,[70]61.9318,[71]63.0806,[72]61.8627,[73]62.5991,[74]63.5537,[75]64.4164,[76]64.6967,[77]64.8567,[78]64.9029,[79]65.9337,[80]64.3252,[81]64.6137,[82]63.1367,[83]61.4556,[84]62.1606,[85]62.9551,[86]62.4851,[87]62.9987,[88]62.0748,[89]62.3586,[90]62.7458,[91]63.4891,[92]64.8459,[93]65.2256,[94]65.7529,[95]66.7645,[96]67.4958,[97]66.2010,[98]66.1251,[99]65.4783,[100]66.1518,[101]66.7187,[102]67.5872,[103]66.4947,[104]66.4975,[105]67.1665,[106]67.8499,[107]67.4641,[108]68.1799,[109]67.6239,[110]68.2992,[111]69.2309,[112]69.8823,[113]68.8550,[114]68.9047,[115]69.3004,[116]68.2032,[117]67.1597,[118]67.8393,[119]68.4106,[120]69.3725,[121]69.1116,[122]68.4353,[123]68.2800,[124]68.0477,[125]67.9936,[126]68.6714,[127]68.1732,[128]68.9529,[129]69.4866,[130]70.0720,[131]70.8555,[132]69.7161,[133]68.5270,[134]69.1271,[135]68.9255,[136]69.2516,[137]69.4818,[138]69.5431,[139]69.7582,[140]70.5190,[141]70.7511,[142]69.8141,[143]70.1926,[144]70.3150,[145]70.4458,[146]69.4104,[147]69.6069,[148]69.8050,[149]69.8345,[150]69.0615,[151]69.2670,[152]69.2654,[153]69.4935,[154]69.6644,[155]70.3173,[156]69.5862,[157]69.7767,[158]69.1744,[159]69.5966,[160]68.7871,[161]68.4688,[162]68.1816,[163]67.2025,[164]66.6679,[165]66.4302,[166]66.5904,[167]65.4620,[168]64.5073,[169]64.7642,[170]64.6514,[171]64.4600,[172]64.4797,[173]64.3071,[174]64.3380,[175]64.4337,[176]64.5575,[177]63.7445,[178]64.0980,[179]64.1851,[180]64.3830,[181]63.7599,[182]63.8480,[183]64.0105,[184]64.2083,[185]63.7754,[186]63.9489,[187]63.6941,[188]64.3531,[189]64.6876,[190]64.3726,[191]65.3649,[192]65.8998,[193]66.5370,[194]67.0823,[195]67.5497,[196]68.0094,[197]68.3190,[198]68.0144,[199]67.5266,[200]67.0679,[201]66.3224,[202]66.9096,[203]66.4625,[204]67.1994,[205]66.8987,[206]67.5093,[207]67.6564,[208]68.1417,[209]68.3621,[210]68.5502,[211]69.0086,[212]69.3362,[213]69.2351,[214]68.7124,[215]68.7493,[216]68.9633,[217]68.8705,[218]68.4066,[219]68.6391,[220]68.1637,[221]68.5218,[222]68.7122,[223]68.7446,[224]68.9553,[225]68.8326,[226]69.4308,[227]69.9229,[228]70.4930,[229]70.8096,[230]70.3801,[231]70.7470,[232]70.8915,[233]70.7748,[234]70.7758,[235]70.8871,[236]70.1911,[237]70.4651,[238]70.7678,[239]70.8104,[240]70.7656,[241]70.3186,[242]70.6932,[243]70.8081,[244]70.1327,[245]70.0990,[246]69.4259,[247]69.5657,[248]69.5496,[249]69.7259,[250]69.1864,[251]69.0940,[252]69.1617,[253]69.4336,[254]69.4729,[255]69.4173,[256]68.8389,[257]68.5638,[258]68.8591,[259]69.0085,[260]69.0906,[261]68.7520,[262]68.6570,[263]68.6046,[264]68.5303,[265]68.7943,[266]68.6624,[267]68.6287,[268]68.2216,[269]67.7220,[270]67.7484,[271]67.3327,[272]66.8540,[273]67.0361,[274]67.0683,[275]67.4446,[276]67.2734,[277]67.5035,[278]67.3129,[279]67.5298,[280]67.7587,[281]67.4256,[282]67.5814
,[283]67.8374,[284]67.9660,[285]67.5812,[286]67.7368,[287]67.6688,[288]67.2755,[289]66.8072,[290]66.2715,[291]66.3255,[292]66.4076,[293]65.8640,[294]66.0011,[295]66.0790,[296]66.2347,[297]66.3860,[298]66.1118,[299]66.2495,[300]65.6643,[301]65.2227,[302]64.7275,[303]64.6542,[304]64.5785,[305]64.3063,[306]63.8716,[307]63.9514,[308]64.0732,[309]63.8202,[310]63.9167,[311]63.9081,[312]63.5917,[313]63.2475,[314]63.5068,[315]62.9764,[316]62.7250,[317]62.2487,[318]61.6874,[319]61.4673,[320]61.7716,[321]61.9084,[322]61.5076,[323]61.1308,[324]60.8254,[325]60.7589,[326]60.8205,[327]60.4973,[328]60.7845,[329]61.0918,[330]61.3219,[331]61.5470,[332]61.5831,[333]61.8672,[334]62.0101,[335]62.1867,[336]62.1230,[337]62.2017,[338]61.9051,[339]61.9544,[340]62.1529,[341]62.3669,[342]62.0460,[343]62.2432,[344]62.3634,[345]62.0058,[346]62.1173,[347]62.2805,[348]62.3061,[349]62.4143,[350]62.5531,[351]62.6050,[352]62.4007,[353]62.4609,[354]62.7498,[355]63.1266,[356]63.4078,[357]63.4392,[358]63.7693,[359]64.1597,[360]63.7484,[361]63.4239,[362]63.2723,[363]63.6244,[364]63.6976,[365]63.8372,[366]63.8226,[367]64.0798,[368]64.0947,[369]64.2145,[370]64.3675,[371]64.0243,[372]64.1487,[373]64.2855,[374]64.2745,[375]64.3428,[376]64.5512,[377]64.1334,[378]64.3588,[379]64.5686,[380]64.5594,[381]64.6034,[382]64.7722,[383]64.8803,[384]64.7609,[385]64.9052,[386]65.0941,[387]65.2881,[388]65.5088,[389]65.4680,[390]65.2245,[391]65.3933,[392]65.3887,[393]65.1051,[394]64.9781,[395]64.7609,[396]64.4236,[397]64.6313,[398]64.3495,[399]64.5929,[400]64.8183,[401]64.5031,[402]64.8714,[403]64.5638,[404]64.8526,[405]65.0616,[406]65.1516,[407]65.1070,[408]65.0348,[409]65.3795,[410]65.6611,[411]65.9290,[412]65.7917,[413]66.0696,[414]66.2572,[415]66.4053,[416]66.3929,[417]66.6110,[418]66.7518,[419]66.8844,[420]67.1552,[421]67.3992,[422]67.6390,[423]67.4414,[424]67.6983,[425]67.5210,[426]67.7952,[427]67.5891,[428]67.8714,[429]67.8638,[430]67.7827,[431]67.8524,[432]68.0215,[433]68.0250,[434]67.7200,[435]67.8556,[436]67.6048,[437]67.7432,[438]67.8956,[439]67.8467,[440]67.9014,[441]67.9599,[442]68.0346,[443]68.1949,[444]68.3085,[445]68.4728,[446]68.5187,[447]68.7361,[448]68.6779,[449]68.7231,[450]68.4663,[451]68.6147,[452]68.3170,[453]68.3787,[454]68.1536,[455]67.8817,[456]67.9065,[457]68.1159,[458]68.1468,[459]68.3322,[460]68.1498,[461]67.8467,[462]67.5300,[463]67.6604,[464]67.3659,[465]67.1501,[466]66.8277,[467]66.7319,[468]66.5062,[469]66.2973,[470]66.3300,[471]66.0265,[472]66.0738,[473]65.7277,[474]65.4825,[475]65.6804,[476]65.8642,[477]66.0072,[478]65.9554,[479]66.2864,[480]66.4256,[481]66.3546,[482]66.4843,[483]66.3786,[484]66.6304,[485]66.6555,[486]66.3278,[487]66.2141,[488]66.3310,[489]66.0924,[490]66.3018,[491]66.0328,[492]65.7608,[493]65.5610,[494]65.3118,[495]65.3401,[496]65.1191,[497]64.8696,[498]64.6983,[499]64.4070,[500]64.4543,[501]64.2186,[502]64.0064,[503]64.1239,[504]63.9233,[505]63.6977,[506]63.4843,[507]63.6471,[508]63.7544,[509]64.0045,[510]64.0976,[511]63.9092,[512]63.7219,[513]63.8090,[514]63.6563,[515]63.8533,[516]63.7793,[517]63.8389,[518]63.8802,[519]63.9988,[520]64.0833,[521]64.0832,[522]64.0158,[523]64.1246,[524]64.2134,[525]64.3350,[526]64.1216,[527]64.2748,[528]64.0177,[529]64.0979,[530]64.0446,[531]64.1003,[532]63.9388,[533]64.1983,[534]64.0059,[535]64.0748,[536]63.7816,[537]63.9685,[538]64.0965,[539]64.3060,[540]64.1924,[541]64.3957,[542]64.1433,[543]64.1330,[544]64.2264,[545]64.2380,[546]64.2640,[547]64.3215,[548]64.2925,[549]64.3039,[550]64.2860,[551]64.3630,[552]64.5191,[553]64.6287,[554]64.3604,[555]64.0817,[556]
64.1676,[557]64.3209,[558]64.3894,[559]64.4420,[560]64.4271,[561]64.4476,[562]64.4579,[563]64.6372,[564]64.6691,[565]64.8215,[566]64.9431,[567]65.0567,[568]64.8633,[569]64.9597,[570]64.7805,[571]64.8904,[572]65.0106,[573]65.1171,[574]65.0246,[575]65.0512,[576]64.8336,[577]64.8709,[578]64.9671,[579]65.1184,[580]65.1459,[581]64.9673,[582]64.7708,[583]64.6237,[584]64.4728,[585]64.1978,[586]64.1859,[587]63.9967,[588]64.1102,[589]63.9630,[590]64.0796,[591]64.1820,[592]64.3731,[593]64.2488,[594]64.0475,[595]64.1433,[596]64.1900,[597]63.9622,[598]63.8459,[599]63.6558,[600]63.8030,[601]63.8747,[602]63.9622,[603]63.8415,[604]63.8371,[605]63.9250,[606]63.7467,[607]63.5332,[608]63.2838,[609]63.3835,[610]63.4859,[611]63.5186,[612]63.7273,[613]63.6170,[614]63.3900,[615]63.3528,[616]63.5236,[617]63.4472,[618]63.3583,[619]63.2271,[620]63.0657,[621]62.8304,[622]62.6378,[623]62.7276,[624]62.7355,[625]62.6736,[626]62.7857,[627]62.7025,[628]62.5853,[629]62.3819,[630]62.4502,[631]62.6082,[632]62.5224,[633]62.5602,[634]62.4200,[635]62.5675,[636]62.6213,[637]62.6924,[638]62.8519,[639]62.9235,[640]63.0890,[641]62.8776,[642]62.8108,[643]62.8572,[644]62.8797,[645]62.7199,[646]62.7847,[647]62.7171,[648]62.5697,[649]62.6333,[650]62.8029,[651]63.0165,[652]63.1887,[653]63.3320,[654]63.3318,[655]63.1600,

llama_print_timings:        load time =  5183.42 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 650506.37 ms / 335360 tokens (    1.94 ms per token,   515.54 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 677581.77 ms

slaren (Collaborator, Author) commented Aug 24, 2023

The issue may be due to the long-context fine-tuning. Using --rope-freq-base 1e6, the results look much better.

[image attachment]

ggerganov (Owner):

We have to add rope_theta to the convert.py script and write it into the metadata of the model.

slaren (Collaborator, Author) commented Aug 24, 2023

OK, that's definitely an issue. The final ppl for 34B q4_k_m with --rope-freq-base 1e6 is 5.7811.

llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                          general.file_type u32
llama_model_loader: - kv  11:                       tokenizer.ggml.model str
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  15:               general.quantization_version u32
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q4_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_print_meta: format         = GGUF V1 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 512
llm_load_print_meta: n_embd         = 8192
llm_load_print_meta: n_head         = 64
llm_load_print_meta: n_head_kv      = 8
llm_load_print_meta: n_layer        = 48
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 8
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 22016
llm_load_print_meta: freq_base      = 1000000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 34B
llm_load_print_meta: model ftype    = mostly Q4_K - Medium
llm_load_print_meta: model size     = 33.74 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.13 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  140.76 MB (+   96.00 MB per state)
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 51/51 layers to GPU
llm_load_tensors: VRAM used: 19238 MB
....................................................................................................
llama_new_context_with_model: kv self size  =   96.00 MB
llama_new_context_with_model: compute buffer total size =  119.41 MB
llama_new_context_with_model: VRAM scratch buffer: 118.00 MB

system_info: n_threads = 1 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity: tokenizing the input ..
perplexity: calculating perplexity over 655 chunks, batch_size=512
perplexity: 1.00 seconds per pass - ETA 10.88 minutes
[1]4.2372,[2]4.7320,[3]5.5051,[4]6.3183,[5]6.4880,[6]6.4093,[7]6.5309,[8]6.5565,[9]6.8206,[10]7.0476,[11]7.2814,[12]7.3205,[13]7.2565,[14]7.3485,[15]7.5977,[16]7.2239,[17]7.0971,[18]7.0294,[19]6.6729,[20]6.6685,[21]6.5518,[22]6.3390,[23]6.2829,[24]6.1878,[25]6.1765,[26]5.9847,[27]5.7738,[28]5.6386,[29]5.5307,[30]5.3570,[31]5.2739,[32]5.3031,[33]5.2486,[34]5.2887,[35]5.3046,[36]5.3362,[37]5.3191,[38]5.3193,[39]5.3439,[40]5.3798,[41]5.3973,[42]5.4398,[43]5.4031,[44]5.4433,[45]5.4429,[46]5.4097,[47]5.4334,[48]5.4142,[49]5.4021,[50]5.3661,[51]5.3801,[52]5.3764,[53]5.4184,[54]5.4072,[55]5.3937,[56]5.4206,[57]5.4382,[58]5.4544,[59]5.4711,[60]5.5030,[61]5.4940,[62]5.5545,[63]5.5759,[64]5.5782,[65]5.6078,[66]5.6046,[67]5.6132,[68]5.6288,[69]5.6525,[70]5.6789,[71]5.7012,[72]5.7352,[73]5.7758,[74]5.7811,[75]5.7850,[76]5.7937,[77]5.8052,[78]5.7914,[79]5.8198,[80]5.8168,[81]5.8321,[82]5.8398,[83]5.7918,[84]5.7946,[85]5.7940,[86]5.7804,[87]5.7329,[88]5.7193,[89]5.7046,[90]5.6888,[91]5.7118,[92]5.7031,[93]5.6973,[94]5.6979,[95]5.7305,[96]5.7293,[97]5.7267,[98]5.7199,[99]5.7056,[100]5.7005,[101]5.7222,[102]5.7176,[103]5.7371,[104]5.7421,[105]5.7402,[106]5.7571,[107]5.7603,[108]5.7683,[109]5.7672,[110]5.7628,[111]5.7838,[112]5.8061,[113]5.8043,[114]5.8025,[115]5.8081,[116]5.8000,[117]5.8072,[118]5.8323,[119]5.8557,[120]5.8899,[121]5.9036,[122]5.9262,[123]5.9611,[124]5.9752,[125]5.9661,[126]6.0024,[127]6.0339,[128]6.0583,[129]6.0453,[130]6.0491,[131]6.0429,[132]6.0352,[133]6.0147,[134]6.0197,[135]6.0119,[136]5.9990,[137]5.9899,[138]5.9669,[139]5.9587,[140]5.9516,[141]5.9228,[142]5.9162,[143]5.8873,[144]5.8662,[145]5.8512,[146]5.8402,[147]5.8395,[148]5.8380,[149]5.8297,[150]5.8262,[151]5.8293,[152]5.8183,[153]5.8056,[154]5.7982,[155]5.8032,[156]5.8023,[157]5.8175,[158]5.8185,[159]5.8227,[160]5.8300,[161]5.8414,[162]5.8157,[163]5.8066,[164]5.7872,[165]5.7615,[166]5.7366,[167]5.7018,[168]5.6776,[169]5.6686,[170]5.6591,[171]5.6381,[172]5.6227,[173]5.6066,[174]5.5824,[175]5.5619,[176]5.5499,[177]5.5312,[178]5.5122,[179]5.4977,[180]5.4907,[181]5.4753,[182]5.4583,[183]5.4466,[184]5.4455,[185]5.4420,[186]5.4456,[187]5.4566,[188]5.4579,[189]5.4803,[190]5.4809,[191]5.5010,[192]5.5167,[193]5.5298,[194]5.5432,[195]5.5652,[196]5.5787,[197]5.5996,[198]5.6119,[199]5.6158,[200]5.6211,[201]5.6152,[202]5.6305,[203]5.6400,[204]5.6377,[205]5.6518,[206]5.6578,[207]5.6537,[208]5.6669,[209]5.6712,[210]5.6753,[211]5.6881,[212]5.6958,[213]5.7033,[214]5.7065,[215]5.7077,[216]5.7186,[217]5.7336,[218]5.7462,[219]5.7446,[220]5.7430,[221]5.7343,[222]5.7304,[223]5.7209,[224]5.7141,[225]5.7087,[226]5.7265,[227]5.7304,[228]5.7380,[229]5.7426,[230]5.7369,[231]5.7487,[232]5.7380,[233]5.7210,[234]5.7059,[235]5.6820,[236]5.6794,[237]5.6711,[238]5.6748,[239]5.6639,[240]5.6540,[241]5.6547,[242]5.6552,[243]5.6521,[244]5.6418,[245]5.6389,[246]5.6281,[247]5.6184,[248]5.6122,[249]5.6094,[250]5.6134,[251]5.6047,[252]5.6015,[253]5.5930,[254]5.5861,[255]5.5745,[256]5.5577,[257]5.5464,[258]5.5379,[259]5.5360,[260]5.5281,[261]5.5234,[262]5.5196,[263]5.5132,[264]5.4880,[265]5.4891,[266]5.4855,[267]5.4798,[268]5.4881,[269]5.4898,[270]5.4923,[271]5.5008,[272]5.5048,[273]5.5066,[274]5.5055,[275]5.5106,[276]5.5165,[277]5.5285,[278]5.5376,[279]5.5451,[280]5.5482,[281]5.5584,[282]5.5649,[283]5.5780,[284]5.5876,[285]5.5969,[286]5.6099,[287]5.6085,[288]5.6138,[289]5.6064,[290]5.5959,[291]5.5852,[292]5.5742,[293]5.5635,[294]5.5650,[295]5.5655,[296]5.5713,[297]5.5717,[298]5.5760,[299]5.5750,[300]5.5669,[301]5.5683,[302]5.5637,[303]5.5564,[304]5.5486,[305]5.5461,[30
6]5.5347,[307]5.5373,[308]5.5376,[309]5.5261,[310]5.5231,[311]5.5192,[312]5.5215,[313]5.5185,[314]5.5199,[315]5.5058,[316]5.5025,[317]5.4878,[318]5.4697,[319]5.4834,[320]5.4943,[321]5.4977,[322]5.4925,[323]5.4887,[324]5.4899,[325]5.5017,[326]5.5030,[327]5.5049,[328]5.5085,[329]5.5123,[330]5.5162,[331]5.5278,[332]5.5243,[333]5.5315,[334]5.5268,[335]5.5211,[336]5.5234,[337]5.5223,[338]5.5222,[339]5.5185,[340]5.5132,[341]5.5180,[342]5.5199,[343]5.5236,[344]5.5245,[345]5.5255,[346]5.5233,[347]5.5262,[348]5.5283,[349]5.5314,[350]5.5305,[351]5.5303,[352]5.5297,[353]5.5242,[354]5.5238,[355]5.5298,[356]5.5350,[357]5.5323,[358]5.5420,[359]5.5458,[360]5.5433,[361]5.5427,[362]5.5507,[363]5.5611,[364]5.5675,[365]5.5717,[366]5.5741,[367]5.5817,[368]5.5797,[369]5.5816,[370]5.5846,[371]5.5795,[372]5.5846,[373]5.5890,[374]5.5876,[375]5.5875,[376]5.5952,[377]5.5923,[378]5.5950,[379]5.5998,[380]5.5939,[381]5.5917,[382]5.5877,[383]5.5872,[384]5.5880,[385]5.5881,[386]5.5874,[387]5.5896,[388]5.5870,[389]5.5845,[390]5.5792,[391]5.5740,[392]5.5725,[393]5.5746,[394]5.5791,[395]5.5775,[396]5.5718,[397]5.5804,[398]5.5867,[399]5.5953,[400]5.5946,[401]5.5957,[402]5.5981,[403]5.6007,[404]5.6063,[405]5.6021,[406]5.6002,[407]5.6029,[408]5.6044,[409]5.6164,[410]5.6269,[411]5.6376,[412]5.6538,[413]5.6655,[414]5.6725,[415]5.6791,[416]5.6871,[417]5.6974,[418]5.7001,[419]5.7059,[420]5.7138,[421]5.7247,[422]5.7293,[423]5.7369,[424]5.7471,[425]5.7567,[426]5.7647,[427]5.7691,[428]5.7773,[429]5.7807,[430]5.7896,[431]5.8027,[432]5.8053,[433]5.8037,[434]5.8000,[435]5.8028,[436]5.8056,[437]5.8156,[438]5.8245,[439]5.8218,[440]5.8220,[441]5.8191,[442]5.8181,[443]5.8192,[444]5.8205,[445]5.8189,[446]5.8215,[447]5.8235,[448]5.8271,[449]5.8256,[450]5.8267,[451]5.8229,[452]5.8228,[453]5.8164,[454]5.8121,[455]5.8147,[456]5.8194,[457]5.8227,[458]5.8217,[459]5.8218,[460]5.8309,[461]5.8296,[462]5.8304,[463]5.8341,[464]5.8333,[465]5.8324,[466]5.8270,[467]5.8310,[468]5.8332,[469]5.8364,[470]5.8370,[471]5.8349,[472]5.8407,[473]5.8359,[474]5.8398,[475]5.8379,[476]5.8406,[477]5.8360,[478]5.8367,[479]5.8466,[480]5.8527,[481]5.8555,[482]5.8522,[483]5.8507,[484]5.8540,[485]5.8545,[486]5.8508,[487]5.8521,[488]5.8505,[489]5.8475,[490]5.8483,[491]5.8478,[492]5.8454,[493]5.8431,[494]5.8416,[495]5.8417,[496]5.8396,[497]5.8369,[498]5.8371,[499]5.8337,[500]5.8258,[501]5.8212,[502]5.8232,[503]5.8241,[504]5.8169,[505]5.8203,[506]5.8212,[507]5.8141,[508]5.8091,[509]5.8083,[510]5.8091,[511]5.8131,[512]5.8154,[513]5.8173,[514]5.8225,[515]5.8179,[516]5.8173,[517]5.8182,[518]5.8179,[519]5.8203,[520]5.8210,[521]5.8219,[522]5.8243,[523]5.8248,[524]5.8310,[525]5.8341,[526]5.8345,[527]5.8370,[528]5.8325,[529]5.8331,[530]5.8275,[531]5.8264,[532]5.8323,[533]5.8348,[534]5.8334,[535]5.8360,[536]5.8321,[537]5.8301,[538]5.8352,[539]5.8361,[540]5.8379,[541]5.8401,[542]5.8395,[543]5.8411,[544]5.8416,[545]5.8399,[546]5.8406,[547]5.8373,[548]5.8312,[549]5.8311,[550]5.8289,[551]5.8265,[552]5.8247,[553]5.8215,[554]5.8193,[555]5.8165,[556]5.8158,[557]5.8202,[558]5.8175,[559]5.8172,[560]5.8147,[561]5.8152,[562]5.8122,[563]5.8122,[564]5.8171,[565]5.8188,[566]5.8196,[567]5.8192,[568]5.8199,[569]5.8186,[570]5.8210,[571]5.8222,[572]5.8217,[573]5.8220,[574]5.8184,[575]5.8171,[576]5.8169,[577]5.8154,[578]5.8140,[579]5.8136,[580]5.8075,[581]5.8045,[582]5.8050,[583]5.8072,[584]5.8078,[585]5.7998,[586]5.7930,[587]5.7930,[588]5.7967,[589]5.8017,[590]5.8037,[591]5.8058,[592]5.8042,[593]5.8020,[594]5.8026,[595]5.8006,[596]5.8040,[597]5.8015,[598]5.7991,[599]5.8010,[600]5.8000,[601]5.7994,[602]5
.8026,[603]5.8041,[604]5.8053,[605]5.8082,[606]5.8100,[607]5.8101,[608]5.8069,[609]5.8072,[610]5.8110,[611]5.8088,[612]5.8104,[613]5.8066,[614]5.8014,[615]5.7944,[616]5.7954,[617]5.7883,[618]5.7826,[619]5.7770,[620]5.7639,[621]5.7573,[622]5.7555,[623]5.7561,[624]5.7553,[625]5.7558,[626]5.7544,[627]5.7571,[628]5.7577,[629]5.7569,[630]5.7603,[631]5.7645,[632]5.7705,[633]5.7687,[634]5.7722,[635]5.7732,[636]5.7700,[637]5.7671,[638]5.7688,[639]5.7653,[640]5.7670,[641]5.7673,[642]5.7735,[643]5.7755,[644]5.7762,[645]5.7748,[646]5.7788,[647]5.7770,[648]5.7779,[649]5.7775,[650]5.7791,[651]5.7837,[652]5.7848,[653]5.7883,[654]5.7821,[655]5.7811,

llama_print_timings:        load time =  5546.65 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 649534.48 ms / 335360 tokens (    1.94 ms per token,   516.31 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 677274.58 ms

The correct value of rope_freq_base seems to be in params.json as "rope_theta": 1000000, but it looks like convert.py doesn't export this value currently. Is there a metadata key in GGUF for this parameter?

jxy (Contributor) commented Aug 24, 2023

We need to add rope-freq-base to GGUF.

ggerganov (Owner):

We don't have one yet - we should introduce it and add it to the spec: ggerganov/ggml#302

Btw, the vocab now is 32016.
The K-quants require the Y dimension to also be divisible by 256. Why is this needed? Don't we need just the X dimension (i.e. ne[0]) to be divisible by 256?

llama.cpp/llama.cpp

Lines 4533 to 4541 in ef955fb

if (new_type == GGML_TYPE_Q2_K || new_type == GGML_TYPE_Q3_K || new_type == GGML_TYPE_Q4_K ||
    new_type == GGML_TYPE_Q5_K || new_type == GGML_TYPE_Q6_K) {
    int nx = tensor->ne[0];
    int ny = tensor->ne[1];
    if (nx % QK_K != 0 || ny % QK_K != 0) {
        LLAMA_LOG_INFO("\n\nTensor sizes %d x %d are not divisible by %d, required for k-quants.\n", nx, ny, QK_K);
        convert_incompatible_tensor = true;
    }
}
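
For intuition on that question, a small sketch (assuming QK_K = 256) of why only the row length ne[0] has to divide evenly: the k-quants split each row into super-blocks independently, so the row count ne[1] does not need to be a multiple of 256:

```python
QK_K = 256  # k-quant super-block size, in elements along a row


def kquant_superblocks(ne0: int, ne1: int) -> int:
    """Each of the ne1 rows splits into ne0 // QK_K super-blocks; ne1 itself
    never needs to be a multiple of QK_K."""
    if ne0 % QK_K != 0:
        raise ValueError("row length ne[0] must be a multiple of QK_K")
    return (ne0 // QK_K) * ne1


# e.g. the 7B output tensor with the extended vocab, ne = [4096, 32016]:
# 32016 is not a multiple of 256, yet every 4096-element row splits evenly.
print(kquant_superblocks(4096, 32016))  # 512256
```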

slaren (Collaborator, Author) commented Aug 24, 2023

Btw, the vocab now is 32016.

Is this with all models? For 34B, the sizes of the tensors suggest an n_vocab of 32000:

tok_embeddings.weight                            -> token_embd.weight                        | UnquantizedDataType(name='BF16') | [32000, 8192]
norm.weight                                      -> output_norm.weight                       | UnquantizedDataType(name='BF16') | [8192]
output.weight                                    -> output.weight                            | UnquantizedDataType(name='BF16') | [32000, 8192]

jxy (Contributor) commented Aug 24, 2023

7B and 13B are tuned for infilling, which uses special tokens.

ggerganov (Owner) commented Aug 24, 2023

Here are some results on M2 Ultra:

| model | backend | size | test | t/s |
| --- | --- | --- | --- | --- |
| codellama 7B F16 | Metal | 13G | pp 512 | 663.32 ± 1.10 |
| codellama 7B mostly Q8_0 | Metal | 6.7G | pp 512 | 631.39 ± 0.30 |
| codellama 7B mostly Q6_K | Metal | 5.3G | pp 512 | 562.97 ± 0.54 |
| codellama 7B mostly Q5_K - Medium | Metal | 4.6G | pp 512 | 562.58 ± 0.22 |
| codellama 7B mostly Q4_K - Medium | Metal | 3.9G | pp 512 | 589.01 ± 0.09 |
| codellama 7B mostly Q4_1 | Metal | 3.9G | pp 512 | 635.65 ± 0.30 |
| codellama 7B mostly Q4_0 | Metal | 3.5G | pp 512 | 633.74 ± 0.30 |
| codellama 7B mostly Q3_K - Medium | Metal | 3.2G | pp 512 | 582.17 ± 0.36 |
| codellama 7B mostly Q2_K | Metal | 2.8G | pp 512 | 581.32 ± 0.99 |
| codellama 7B F16 | Metal | 13G | tg 64 | 29.61 ± 0.05 |
| codellama 7B mostly Q8_0 | Metal | 6.7G | tg 64 | 61.56 ± 0.14 |
| codellama 7B mostly Q6_K | Metal | 5.3G | tg 64 | 67.49 ± 0.03 |
| codellama 7B mostly Q5_K - Medium | Metal | 4.6G | tg 64 | 68.46 ± 0.15 |
| codellama 7B mostly Q4_K - Medium | Metal | 3.9G | tg 64 | 79.03 ± 0.03 |
| codellama 7B mostly Q4_1 | Metal | 3.9G | tg 64 | 82.60 ± 0.12 |
| codellama 7B mostly Q4_0 | Metal | 3.5G | tg 64 | 87.73 ± 0.30 |
| codellama 7B mostly Q3_K - Medium | Metal | 3.2G | tg 64 | 75.79 ± 0.08 |
| codellama 7B mostly Q2_K | Metal | 2.8G | tg 64 | 74.56 ± 0.12 |

build: 01f2224 (1053)

| model | backend | size | test | t/s |
| --- | --- | --- | --- | --- |
| codellama 13B F16 | Metal | 24G | pp 512 | 390.99 ± 0.06 |
| codellama 13B mostly Q8_0 | Metal | 13G | pp 512 | 368.56 ± 0.22 |
| codellama 13B mostly Q6_K | Metal | 10G | pp 512 | 324.54 ± 0.04 |
| codellama 13B mostly Q5_K - Medium | Metal | 8.8G | pp 512 | 321.51 ± 0.05 |
| codellama 13B mostly Q4_K - Medium | Metal | 7.5G | pp 512 | 340.60 ± 0.12 |
| codellama 13B mostly Q4_1 | Metal | 7.6G | pp 512 | 371.14 ± 0.05 |
| codellama 13B mostly Q4_0 | Metal | 6.8G | pp 512 | 369.43 ± 0.06 |
| codellama 13B mostly Q3_K - Medium | Metal | 6.1G | pp 512 | 336.34 ± 0.14 |
| codellama 13B mostly Q2_K | Metal | 5.3G | pp 512 | 336.66 ± 0.08 |
| codellama 13B F16 | Metal | 24G | tg 64 | 16.44 ± 0.02 |
| codellama 13B mostly Q8_0 | Metal | 13G | tg 64 | 36.69 ± 0.05 |
| codellama 13B mostly Q6_K | Metal | 10G | tg 64 | 41.11 ± 0.07 |
| codellama 13B mostly Q5_K - Medium | Metal | 8.8G | tg 64 | 42.46 ± 0.03 |
| codellama 13B mostly Q4_K - Medium | Metal | 7.5G | tg 64 | 48.80 ± 0.04 |
| codellama 13B mostly Q4_1 | Metal | 7.6G | tg 64 | 51.26 ± 0.05 |
| codellama 13B mostly Q4_0 | Metal | 6.8G | tg 64 | 55.35 ± 0.09 |
| codellama 13B mostly Q3_K - Medium | Metal | 6.1G | tg 64 | 46.22 ± 0.03 |
| codellama 13B mostly Q2_K | Metal | 5.3G | tg 64 | 47.41 ± 0.05 |

build: 01f2224 (1053)

| model | backend | size | test | t/s |
| --- | --- | --- | --- | --- |
| codellama 34B F16 | Metal | 63G | pp 512 | 149.52 ± 0.34 |
| codellama 34B mostly Q8_0 | Metal | 33G | pp 512 | 140.89 ± 0.03 |
| codellama 34B mostly Q6_K | Metal | 26G | pp 512 | 123.76 ± 0.04 |
| codellama 34B mostly Q5_K - Medium | Metal | 22G | pp 512 | 123.63 ± 0.01 |
| codellama 34B mostly Q4_K - Medium | Metal | 19G | pp 512 | 130.65 ± 0.01 |
| codellama 34B mostly Q4_1 | Metal | 20G | pp 512 | 142.01 ± 0.03 |
| codellama 34B mostly Q4_0 | Metal | 18G | pp 512 | 141.55 ± 0.02 |
| codellama 34B mostly Q3_K - Medium | Metal | 15G | pp 512 | 128.37 ± 0.00 |
| codellama 34B mostly Q2_K | Metal | 13G | pp 512 | 128.07 ± 0.04 |
| codellama 34B F16 | Metal | 63G | tg 64 | 7.32 ± 0.00 |
| codellama 34B mostly Q8_0 | Metal | 33G | tg 64 | 16.85 ± 0.01 |
| codellama 34B mostly Q6_K | Metal | 26G | tg 64 | 19.22 ± 0.00 |
| codellama 34B mostly Q5_K - Medium | Metal | 22G | tg 64 | 20.84 ± 0.01 |
| codellama 34B mostly Q4_K - Medium | Metal | 19G | tg 64 | 25.36 ± 0.00 |
| codellama 34B mostly Q4_1 | Metal | 20G | tg 64 | 25.75 ± 0.01 |
| codellama 34B mostly Q4_0 | Metal | 18G | tg 64 | 27.93 ± 0.01 |
| codellama 34B mostly Q3_K - Medium | Metal | 15G | tg 64 | 23.89 ± 0.00 |
| codellama 34B mostly Q2_K | Metal | 13G | tg 64 | 23.52 ± 0.04 |

build: 01f2224 (1053)

Anyone working on adding the rope base to the metadata?
If not, I'll add it in about 15 minutes.

slaren (Collaborator, Author) commented Aug 24, 2023

Anyone working on adding the rope base to the metadata?

I am not working on it; I was waiting for some input since I don't know all the details of GGUF.

ggerganov (Owner) commented Aug 24, 2023

You have to add the KV constant in gguf.py and in llama.cpp, similar to LLM_KV_ROPE_SCALE_LINEAR.
Just grep for all uses of LLM_KV_ROPE_SCALE_LINEAR and replicate it as a new KV, for example LLM_KV_ROPE_BASE.

And in convert.py in add_meta_arch() add a new call:

self.gguf.add_rope_base(params.f_rope_base)
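
For illustration, a minimal self-contained sketch of that pattern; the key name, helper name, and the conditional write are assumptions following the existing rope.scale_linear convention, not the final gguf.py API:

```python
# Stand-in sketch: the new KV would mirror the existing rope.scale_linear pair.
KEY_ROPE_FREQ_BASE = "{arch}.rope.freq_base"  # assumed key name


def add_rope_freq_base(kv_store: dict, arch: str, value: float) -> None:
    """Stand-in for a GGUFWriter.add_rope_freq_base() helper that would write
    the rope frequency base as a float32 metadata value."""
    kv_store[KEY_ROPE_FREQ_BASE.format(arch=arch)] = float(value)


# In convert.py's add_meta_arch(), the call would only be made when the model
# actually defines the parameter (CodeLlama's params.json: "rope_theta": 1000000):
metadata: dict = {}
f_rope_freq_base = 1_000_000.0
if f_rope_freq_base is not None:
    add_rope_freq_base(metadata, "llama", f_rope_freq_base)
print(metadata)  # {'llama.rope.freq_base': 1000000.0}
```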

TheBloke (Contributor):

Does this affect all the new Code Llama models or only 34B? Something I'm reading elsewhere suggests all; is that right?

slaren (Collaborator, Author) commented Aug 24, 2023

Does this affect all the new Code Llama models or only 34B? Something I'm reading elsewhere suggests all; is that right?

My plan is to only affect the CodeLlama models: the rope freq base will be added as optional metadata that will be omitted for the other models, so they won't change. But that may change after the review.

ggerganov (Owner):

All new Code Llama models are affected - without this change one would need to provide the rope base manually, which is inconvenient.

slaren (Collaborator, Author) commented Aug 24, 2023

@TheBloke the change has been merged; it should be safe to convert the models now.

ggerganov (Owner) commented Aug 24, 2023

Just a heads up - I expect to tune the quantum mixtures to some extent in the near future.
For example, currently Q4_0 does not use a high-bit output tensor (e.g. Q6_K) because the tensor's ne[1] is not a multiple of 256. Not sure why we have this restriction, but we'll probably fix it, and that would result in a new quantum model.

slaren (Collaborator, Author) commented Aug 24, 2023

Additionally, it might be a good idea to convert them with --ctx 16384, since the converter will default to 4096. Maybe we should change convert.py to use this value automatically if rope_theta is 1e6?
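
As an illustration of that heuristic (field names and the fallback values are assumptions, not the actual convert.py logic):

```python
def guess_n_ctx(params: dict, n_ctx_override: int | None = None) -> int:
    """Illustrative heuristic: honour an explicit --ctx value, otherwise assume
    the 16k training context whenever rope_theta marks a CodeLlama checkpoint,
    and fall back to the LLaMA v2 default of 4096."""
    if n_ctx_override is not None:
        return n_ctx_override
    if params.get("rope_theta") == 1_000_000:
        return 16384
    return 4096


print(guess_n_ctx({"rope_theta": 1000000}))   # 16384
print(guess_n_ctx({}, n_ctx_override=2048))   # 2048
```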

slaren (Collaborator, Author) commented Aug 24, 2023

@TheBloke I tried your codellama-7b-python.Q4_K_M.gguf and it fails with this error:

error loading model: create_tensor: tensor 'token_embd.weight' has wrong shape; expected  4096, 32016, got  4096, 32000,     1,     1

I tried converting this model myself, and it works for me, so I am not sure what went wrong there. Maybe you used a different tokenizer.model?
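
For reference, one quick way to check which tokenizer.model a conversion actually used is to count its pieces - 32016 for the extended CodeLlama vocab vs 32000 for the base one. A small sketch, assuming the sentencepiece package is installed:

```python
import sys

import sentencepiece as spm  # pip install sentencepiece


def tokenizer_vocab_size(path: str) -> int:
    """Return the number of pieces in a SentencePiece tokenizer.model."""
    sp = spm.SentencePieceProcessor()
    sp.Load(path)
    return sp.GetPieceSize()


if __name__ == "__main__":
    # e.g. python check_vocab.py /path/to/tokenizer.model
    print(tokenizer_vocab_size(sys.argv[1]))
```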

Jipok commented Aug 24, 2023

@slaren Q8 model works for me:
./main -m ~/Downloads/codellama-7b-instruct.Q8_0.gguf -e -p "<s>[INST] How does hpa work in kubernetes?[/INST]" -s 0 --temp 0 --rope-freq-base 1e6

TheBloke (Contributor) commented Aug 24, 2023

Ugh, yeah, I see what went wrong: I converted to HF first, and convert_llama_weights_to_hf reads tokenizer.model from the root directory, not the model weight dir, so I must have done them all with the same tokenizer.model.

I'm re-doing everything now.

TheBloke (Contributor) commented Aug 24, 2023

I'm confused re rope_freq_base - I have rope_theta in my config.json, but convert.py is not picking it up?

(pytorch2)  ubuntu@a10:/workspace/git/gguf-llama (master ✔) ᐅ grep rope_theta /workspace/models_codellama/7B/config.json
    "rope_theta": 1000000

(pytorch2)  ubuntu@a10:/workspace/git/gguf-llama (master ✔) ᐅ python3 ./convert.py --outtype f16 --outfile /workspace/process/codellama-7b/gguf/codellama-7b.fp16.gguf /workspace/models_codellama/7B
Loading model file /workspace/models_codellama/7B/model-00001-of-00002.safetensors
Loading model file /workspace/models_codellama/7B/model-00001-of-00002.safetensors
Loading model file /workspace/models_codellama/7B/model-00002-of-00002.safetensors
params = Params(n_vocab=32016, n_embd=4096, n_mult=5504, n_layer=32, n_ctx=16384, n_ff=11008, n_head=32, n_head_kv=32, f_norm_eps=1e-05, f_rope_freq_base=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('/workspace/models_codellama/7B'))
Loading vocab file '/workspace/models_codellama/7B/tokenizer.model', type 'spm'

f_rope_freq_base=None? And then when I do inference on this fp16:

llm_load_print_meta: format         = GGUF V1 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32016
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 16384
llm_load_print_meta: n_ctx          = 4096
llm_load_print_meta: n_embd         = 4096
llm_load_print_meta: n_head         = 32
llm_load_print_meta: n_head_kv      = 32
llm_load_print_meta: n_layer        = 32
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 11008
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 7B
llm_load_print_meta: model ftype    = mostly F16
llm_load_print_meta: model size     = 6.74 B
llm_load_print_meta: general.name   = LLaMA
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 12853.35 MB (+ 2048.00 MB per state)
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/35 layers to GPU
llm_load_tensors: VRAM used: 0 MB

freq_base is still 10,000.

Am I doing something wrong, or misunderstanding something?

TheBloke (Contributor):

Oh, I am misunderstanding - that's the section of convert.py that reads params.json, not config.json!

OK, what am I meant to do when making a model from HF format? How do I set the correct rope_freq_base then?

TheBloke (Contributor):

Maybe I should just make the models from PTH; I feel like I'm making life much harder for myself trying to go PTH -> HF -> GGUF.

slaren (Collaborator, Author) commented Aug 24, 2023

It's not supported for HF models. If you can point me to an HF model, I can try to add it, assuming that the parameter is somewhere in config.json or params.json.
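
For what it's worth, a hedged sketch of what picking that up from an HF-style config.json could look like (assuming a top-level rope_theta field, as in the config discussed just below; not the actual convert.py change):

```python
def read_rope_freq_base(config: dict) -> float | None:
    """Illustrative: return rope_theta from an HF-style config.json when present;
    None means the GGUF metadata is left unset and the 10000.0 default applies."""
    rope_theta = config.get("rope_theta")
    return float(rope_theta) if rope_theta is not None else None


# e.g. config = json.load(open("config.json"))
print(read_rope_freq_base({"rope_theta": 1000000}))  # 1000000.0
print(read_rope_freq_base({}))                       # None
```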

TheBloke (Contributor):

Yeah, I guess it's not officially in there. We don't know how HF are going to officially do this yet.

User emozilla has created custom Llama modelling code which uses rope_theta in config.json, and I'm duplicating that for my repos, e.g. as seen here: https://huggingface.co/TheBloke/CodeLlama-7B-Python-fp16/blob/main/config.json

But whether HF will stick with that I have no idea. Are you OK with supporting that temporarily, at least?

slaren (Collaborator, Author) commented Aug 24, 2023

Yeah, no problem. If everything goes well I'll open a PR in a short while.

TheBloke (Contributor):

Thanks so much!

And just to triple check I'm not screwing anything else up:

  • 7B/13B/34B = vocab 32016
  • *-Instruct = vocab 32016
  • *-Python = vocab 32000

Is that right? That seems to work fine, I just want to be extra sure.

slaren (Collaborator, Author) commented Aug 24, 2023

The 34B base model has a vocab of 32000; only 7B and 13B should have the extended vocab.

I am not sure about the Instruct and Python models yet. I can check, but it's going to take a while - I am running out of disk space.

TheBloke (Contributor):

OK, thanks! I've not looked at 34B yet, but will be shortly.

Don't worry, it's fine - I was just wanting a sanity check in case you already knew. I've done some tests converting directly from PTH and they show what I described above (at least for 7B and 13B), so I'm confident that must be correct.

akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023