
falcon: metal crashes with GGML_ASSERT: ggml-metal.m:932: n % 4 == 0 #3754

Closed
jmorganca opened this issue Oct 24, 2023 · 3 comments
Labels: bug (Something isn't working)

jmorganca (Contributor) commented Oct 24, 2023


Running a newly converted and quantized GGUF version of Falcon 7B Instruct fires an assertion:

```
./main -m ./ggml-tiiuae-falcon-7b-Q4_0.gguf -ngl 1 -p "hello"
...
GGML_ASSERT: ggml-metal.m:932: n % 4 == 0
...
```
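For context: the assertion is in the Metal backend's scale op (the fix commit later in this thread is titled "metal : handle ggml_scale for n%4 != 0"). At this build the kernel processed data as float4 vectors, 4 elements per thread, so the host code asserted that the element count n of the scaled tensor is a multiple of 4. Falcon 7B has n_head = 71; assuming the scaled tensor is the attention score matrix KQ with n_kv × n_tokens × n_head elements, an odd head count means n is a multiple of 4 only when n_kv × n_tokens is, so most single-token decode steps trip the assert. With an even head count such as 32, n is always divisible by 4, which would explain why LLaMA-style models never hit it. A toy check of that arithmetic (the KQ shape here is my assumption, not confirmed by the logs):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const int64_t n_head   = 71;   /* falcon-7b attention heads */
    const int64_t n_tokens = 1;    /* single-token decode step */
    for (int64_t n_kv = 1; n_kv <= 8; ++n_kv) {
        /* hypothetical element count of the scaled KQ tensor */
        const int64_t n = n_kv * n_tokens * n_head;
        printf("n_kv=%2lld  n=%4lld  n%%4=%lld%s\n",
               (long long) n_kv, (long long) n, (long long)(n % 4),
               n % 4 != 0 ? "  <- GGML_ASSERT would fire" : "");
    }
    return 0;
}
```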

Full logs:

```
./main -m ./ggml-tiiuae-falcon-7b-Q4_0.gguf -ngl 1 -p "hello"
Log start
main: build = 1419 (e393259)
main: built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin23.0.0
main: seed  = 1698109603
llama_model_loader: loaded meta data with 18 key-value pairs and 196 tensors from ./ggml-tiiuae-falcon-7b-Q4_0.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  4544, 65024,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor    2:             blk.0.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor    5:              blk.0.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor    6:            blk.0.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor    7:           blk.1.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor    8:             blk.1.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor    9:            blk.1.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor   10:         blk.1.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor   11:              blk.1.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor   12:            blk.1.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor   13:           blk.2.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   14:             blk.2.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   15:            blk.2.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor   16:         blk.2.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor   17:              blk.2.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor   18:            blk.2.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor   19:           blk.3.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   20:             blk.3.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   21:            blk.3.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor   22:         blk.3.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor   23:              blk.3.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor   24:            blk.3.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor   25:           blk.4.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   26:             blk.4.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   27:            blk.4.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor   28:         blk.4.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor   29:              blk.4.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor   30:            blk.4.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor   31:           blk.5.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   32:             blk.5.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   33:            blk.5.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor   34:         blk.5.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor   35:              blk.5.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor   36:            blk.5.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor   37:           blk.6.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   38:             blk.6.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   39:            blk.6.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor   40:         blk.6.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor   41:              blk.6.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor   42:            blk.6.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor   43:           blk.7.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   44:             blk.7.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   45:            blk.7.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor   46:         blk.7.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor   47:              blk.7.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor   48:            blk.7.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor   49:           blk.8.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   50:             blk.8.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   51:            blk.8.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor   52:         blk.8.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor   53:              blk.8.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor   54:            blk.8.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor   55:           blk.9.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   56:             blk.9.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   57:            blk.9.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor   58:         blk.9.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor   59:              blk.9.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor   60:            blk.9.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor   61:          blk.10.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   62:            blk.10.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   63:           blk.10.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor   64:        blk.10.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor   65:             blk.10.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor   66:           blk.10.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor   67:          blk.11.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   68:            blk.11.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   69:           blk.11.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor   70:        blk.11.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor   71:             blk.11.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor   72:           blk.11.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor   73:          blk.12.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   74:            blk.12.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   75:           blk.12.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor   76:        blk.12.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor   77:             blk.12.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor   78:           blk.12.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor   79:          blk.13.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   80:            blk.13.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   81:           blk.13.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor   82:        blk.13.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor   83:             blk.13.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor   84:           blk.13.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor   85:          blk.14.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   86:            blk.14.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   87:           blk.14.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor   88:        blk.14.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor   89:             blk.14.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor   90:           blk.14.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor   91:          blk.15.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   92:            blk.15.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   93:           blk.15.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor   94:        blk.15.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor   95:             blk.15.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor   96:           blk.15.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor   97:          blk.16.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   98:            blk.16.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor   99:           blk.16.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor  100:        blk.16.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor  101:             blk.16.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor  102:           blk.16.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor  103:          blk.17.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  104:            blk.17.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  105:           blk.17.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor  106:        blk.17.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor  107:             blk.17.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor  108:           blk.17.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor  109:          blk.18.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  110:            blk.18.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  111:           blk.18.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor  112:        blk.18.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor  113:             blk.18.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor  114:           blk.18.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor  115:          blk.19.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  116:            blk.19.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  117:           blk.19.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor  118:        blk.19.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor  119:             blk.19.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor  120:           blk.19.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor  121:          blk.20.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  122:            blk.20.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  123:           blk.20.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor  124:        blk.20.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor  125:             blk.20.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor  126:           blk.20.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor  127:          blk.21.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  128:            blk.21.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  129:           blk.21.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor  130:        blk.21.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor  131:             blk.21.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor  132:           blk.21.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor  133:          blk.22.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  134:            blk.22.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  135:           blk.22.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor  136:        blk.22.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor  137:             blk.22.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor  138:           blk.22.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor  139:          blk.23.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  140:            blk.23.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  141:           blk.23.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor  142:        blk.23.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor  143:             blk.23.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor  144:           blk.23.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor  145:          blk.24.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  146:            blk.24.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  147:           blk.24.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor  148:        blk.24.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor  149:             blk.24.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor  150:           blk.24.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor  151:          blk.25.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  152:            blk.25.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  153:           blk.25.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor  154:        blk.25.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor  155:             blk.25.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor  156:           blk.25.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor  157:          blk.26.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  158:            blk.26.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  159:           blk.26.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor  160:        blk.26.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor  161:             blk.26.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor  162:           blk.26.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor  163:          blk.27.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  164:            blk.27.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  165:           blk.27.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor  166:        blk.27.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor  167:             blk.27.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor  168:           blk.27.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor  169:          blk.28.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  170:            blk.28.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  171:           blk.28.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor  172:        blk.28.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor  173:             blk.28.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor  174:           blk.28.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor  175:          blk.29.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  176:            blk.29.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  177:           blk.29.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor  178:        blk.29.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor  179:             blk.29.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor  180:           blk.29.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor  181:          blk.30.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  182:            blk.30.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  183:           blk.30.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor  184:        blk.30.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor  185:             blk.30.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor  186:           blk.30.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor  187:          blk.31.attn_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  188:            blk.31.attn_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  189:           blk.31.attn_qkv.weight q4_0     [  4544,  4672,     1,     1 ]
llama_model_loader: - tensor  190:        blk.31.attn_output.weight q4_0     [  4544,  4544,     1,     1 ]
llama_model_loader: - tensor  191:             blk.31.ffn_up.weight q4_0     [  4544, 18176,     1,     1 ]
llama_model_loader: - tensor  192:           blk.31.ffn_down.weight q4_0     [ 18176,  4544,     1,     1 ]
llama_model_loader: - tensor  193:               output_norm.weight f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  194:                 output_norm.bias f32      [  4544,     1,     1,     1 ]
llama_model_loader: - tensor  195:                    output.weight q8_0     [  4544, 65024,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                      falcon.context_length u32     
llama_model_loader: - kv   3:                  falcon.tensor_data_layout str     
llama_model_loader: - kv   4:                    falcon.embedding_length u32     
llama_model_loader: - kv   5:                 falcon.feed_forward_length u32     
llama_model_loader: - kv   6:                         falcon.block_count u32     
llama_model_loader: - kv   7:                falcon.attention.head_count u32     
llama_model_loader: - kv   8:             falcon.attention.head_count_kv u32     
llama_model_loader: - kv   9:        falcon.attention.layer_norm_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr     
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  17:               general.quantization_version u32     
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_0:  129 tensors
llama_model_loader: - type q8_0:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 12/65024 vs 0/65024 ).
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = falcon
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 65024
llm_load_print_meta: n_merges         = 64784
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 4544
llm_load_print_meta: n_head           = 71
llm_load_print_meta: n_head_kv        = 1
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_gqa            = 71
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 18176
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 7.22 B
llm_load_print_meta: model size       = 3.92 GiB (4.66 BPW) 
llm_load_print_meta: general.name   = Falcon
llm_load_print_meta: BOS token = 11 '<|endoftext|>'
llm_load_print_meta: EOS token = 11 '<|endoftext|>'
llm_load_print_meta: LF token  = 138 'Ä'
llm_load_tensors: ggml ctx size =    0.07 MB
llm_load_tensors: mem required  = 4013.54 MB
....................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =    4.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/jmorgan/workspace/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x122f05e20 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_add_row                        0x122f071f0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul                            0x122f07720 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_row                        0x122f07fd0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_scale                          0x122f08820 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_silu                           0x122f08fd0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_relu                           0x122f09780 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_gelu                           0x122f09f30 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_soft_max                       0x122f0a460 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_soft_max_4                     0x122f0a990 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_diag_mask_inf                  0x122f0aec0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_diag_mask_inf_8                0x122f0b570 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_f32                   0x122f0baa0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_f16                   0x122f0bfd0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x122f0c500 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x122f0ca30 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q5_0                  0x122f0cf60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q5_1                  0x122f0d490 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q8_0                  0x122f0d9c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x122f0e060 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x113b06ad0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x113b06d20 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x113b07250 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x122f0e590 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rms_norm                       0x122f0ec30 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_norm                           0x122f0f160 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_f32_f32                 0x113b07660 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_f16_f32                 0x113b07f00 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_f16_f32_1row            0x122f0f570 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_f16_f32_l4              0x122f0fe90 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q4_0_f32                0x122e073b0 | th_max =  896 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q4_1_f32                0x122e07940 | th_max =  896 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q5_0_f32                0x122e07e70 | th_max =  576 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q5_1_f32                0x122e083a0 | th_max =  576 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q8_0_f32                0x122f0ddd0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q2_K_f32                0x122f10700 | th_max =  640 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q3_K_f32                0x122f10c30 | th_max =  576 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q4_K_f32                0x122f11160 | th_max =  576 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q5_K_f32                0x122f11690 | th_max =  576 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q6_K_f32                0x122f11bc0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_f32_f32                 0x122f120f0 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_f16_f32                 0x122f12620 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32                0x122f12b50 | th_max =  704 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32                0x122f13080 | th_max =  704 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q5_0_f32                0x122f135b0 | th_max =  704 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q5_1_f32                0x122f13ae0 | th_max =  704 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q8_0_f32                0x122f14010 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32                0x122f14730 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32                0x122f14c60 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32                0x122f15190 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32                0x122f156c0 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32                0x122f15bf0 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_rope_f32                       0x122f16120 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rope_f16                       0x122f16650 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_alibi_f32                      0x122f16b80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x122f170b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x122f175e0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x122f17b10 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_concat                         0x122f18040 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_sqr                            0x122f187f0 | th_max = 1024 | th_width =   32
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 21845.34 MB
ggml_metal_init: maxTransferRate               = built-in GPU
llama_new_context_with_model: compute buffer total size = 151.88 MB
llama_new_context_with_model: max tensor size =   299.39 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  4015.92 MB, ( 4016.55 / 21845.34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =     4.02 MB, ( 4020.56 / 21845.34)
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =   145.77 MB, ( 4166.33 / 21845.34)

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


hello all, i just got my droid x last week and i must say that i love it! its a great phone, but there is one thing i reallyGGML_ASSERT: ggml-metal.m:932: n % 4 == 0
GGML_ASSERT: ggml-metal.m:932: n % 4 == 0
GGML_ASSERT: ggml-metal.m:932: n % 4 == 0
zsh: abort      ./main -m ./ggml-tiiuae-falcon-7b-Q4_0.gguf -ngl 1 -p "hello"
```
jmorganca added the bug label Oct 24, 2023
jmorganca changed the title from "metal crashes with GGML_ASSERT: ggml-metal.m:932: n % 4 == 0" to "falcon: metal crashes with GGML_ASSERT: ggml-metal.m:932: n % 4 == 0" Oct 24, 2023
jmorganca (Contributor, Author) commented:
Update: it also seems to happen with StarCoder 3B models; the same assertion fires.

ggerganov (Owner) commented:
Should be fixed now - these models have 71 attention heads; I didn't expect odd numbers in Metal.
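For reference, the fix ("metal : handle ggml_scale for n%4 != 0", referenced in the commit list below) keeps the vectorized fast path and adds a fallback for leftover elements; on the GPU side this amounts to choosing between a float4 kernel and a scalar kernel at dispatch time based on n % 4. A minimal CPU-side sketch of the same pattern, not the verbatim Metal patch:

```c
#include <stdint.h>

/* Sketch of the "handle n % 4 != 0" pattern: scale the bulk of the data
 * 4 elements at a time, then finish the remainder with a scalar loop
 * instead of asserting that no remainder exists. */
void scale_any_n(float * x, int64_t n, float s) {
    int64_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* fast path: analogue of the float4 kernel */
        x[i + 0] *= s; x[i + 1] *= s; x[i + 2] *= s; x[i + 3] *= s;
    }
    for (; i < n; ++i) {           /* scalar tail: analogue of the fallback kernel */
        x[i] *= s;
    }
}
```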

jmorganca (Contributor, Author) commented:
@ggerganov thanks for the fast response 😊

mattgauf added a commit to mattgauf/llama.cpp that referenced this issue Oct 27, 2023
* master: (350 commits)
  speculative : ensure draft and target model vocab matches (ggerganov#3812)
  llama : correctly report GGUFv3 format (ggerganov#3818)
  simple : fix batch handling (ggerganov#3803)
  cuda : improve text-generation and batched decoding performance (ggerganov#3776)
  server : do not release slot on image input (ggerganov#3798)
  batched-bench : print params at start
  log : disable pid in log filenames
  server : add parameter -tb N, --threads-batch N (ggerganov#3584) (ggerganov#3768)
  server : do not block system prompt update (ggerganov#3767)
  sync : ggml (conv ops + cuda MSVC fixes) (ggerganov#3765)
  cmake : add missed dependencies (ggerganov#3763)
  cuda : add batched cuBLAS GEMM for faster attention (ggerganov#3749)
  Add more tokenizer tests (ggerganov#3742)
  metal : handle ggml_scale for n%4 != 0 (close ggerganov#3754)
  Revert "make : add optional CUDA_NATIVE_ARCH (ggerganov#2482)"
  issues : separate bug and enhancement template + no default title (ggerganov#3748)
  Update special token handling in conversion scripts for gpt2 derived tokenizers (ggerganov#3746)
  llama : remove token functions with `context` args in favor of `model` (ggerganov#3720)
  Fix baichuan convert script not detecing model (ggerganov#3739)
  make : add optional CUDA_NATIVE_ARCH (ggerganov#2482)
  ...
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this issue Nov 17, 2023
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this issue Nov 23, 2023
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this issue Nov 23, 2023
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this issue Nov 30, 2023