falcon: metal crashes with GGML_ASSERT: ggml-metal.m:932: n % 4 == 0 #3754
Labels: bug (Something isn't working)
Comments
jmorganca changed the title from "metal crashes with GGML_ASSERT: ggml-metal.m:932: n % 4 == 0" to "falcon: metal crashes with GGML_ASSERT: ggml-metal.m:932: n % 4 == 0" on Oct 24, 2023
Update: it also seems to happen with starcoder 3b models. Same assertion being fired.

ggerganov: Should be fixed now - these models have 71 attention heads, didn't expect odd numbers in Metal.

@ggerganov thanks for the fast response 😊
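The fix that closed this issue ("metal : handle ggml_scale for n%4 != 0", referenced in the commit list below) points at a Metal scale kernel that worked on four floats at a time and therefore asserted n % 4 == 0. A minimal CPU-side sketch of that vectorized-path-plus-scalar-fallback pattern, with illustrative function names (not the actual ggml-metal.m code):

```c
#include <assert.h>
#include <stdio.h>

// Fast path: processes 4 floats per step, so it only works when n % 4 == 0.
// The pre-fix code effectively required this for every scale operation.
static void scale_f32x4(float * x, float s, int n) {
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        x[i + 0] *= s; x[i + 1] *= s; x[i + 2] *= s; x[i + 3] *= s;
    }
}

// Scalar fallback: handles any n, including sizes derived from odd head
// counts such as falcon-7b's 71 attention heads.
static void scale_f32(float * x, float s, int n) {
    for (int i = 0; i < n; i++) {
        x[i] *= s;
    }
}

// Dispatch: pick the vectorized kernel when possible, and fall back to the
// scalar kernel instead of asserting.
static void scale(float * x, float s, int n) {
    if (n % 4 == 0) {
        scale_f32x4(x, s, n);
    } else {
        scale_f32(x, s, n);
    }
}

int main(void) {
    float v[71];
    for (int i = 0; i < 71; i++) v[i] = 1.0f;
    scale(v, 0.5f, 71); // 71 % 4 != 0: the old behavior would have aborted here
    printf("v[70] = %f\n", v[70]);
    return 0;
}
```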
mattgauf added a commit to mattgauf/llama.cpp that referenced this issue on Oct 27, 2023:

* master: (350 commits)
  speculative : ensure draft and target model vocab matches (ggerganov#3812)
  llama : correctly report GGUFv3 format (ggerganov#3818)
  simple : fix batch handling (ggerganov#3803)
  cuda : improve text-generation and batched decoding performance (ggerganov#3776)
  server : do not release slot on image input (ggerganov#3798)
  batched-bench : print params at start
  log : disable pid in log filenames
  server : add parameter -tb N, --threads-batch N (ggerganov#3584) (ggerganov#3768)
  server : do not block system prompt update (ggerganov#3767)
  sync : ggml (conv ops + cuda MSVC fixes) (ggerganov#3765)
  cmake : add missed dependencies (ggerganov#3763)
  cuda : add batched cuBLAS GEMM for faster attention (ggerganov#3749)
  Add more tokenizer tests (ggerganov#3742)
  metal : handle ggml_scale for n%4 != 0 (close ggerganov#3754)
  Revert "make : add optional CUDA_NATIVE_ARCH (ggerganov#2482)"
  issues : separate bug and enhancement template + no default title (ggerganov#3748)
  Update special token handling in conversion scripts for gpt2 derived tokenizers (ggerganov#3746)
  llama : remove token functions with `context` args in favor of `model` (ggerganov#3720)
  Fix baichuan convert script not detecing model (ggerganov#3739)
  make : add optional CUDA_NATIVE_ARCH (ggerganov#2482)
  ...
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this issue on Nov 17, 2023
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this issue on Nov 23, 2023
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this issue on Nov 23, 2023
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this issue on Nov 30, 2023
Original issue description:

Running a newly converted + quantized GGUF version of falcon 7b instruct results in an assertion being fired:

GGML_ASSERT: ggml-metal.m:932: n % 4 == 0

Full logs:
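For context on the error format itself: ggml reports failed assertions as `GGML_ASSERT: <file>:<line>: <condition>` and then aborts. A paraphrased sketch of that kind of macro (illustrative, not the exact ggml source):

```c
#include <stdio.h>
#include <stdlib.h>

// Paraphrased assert macro in the style of ggml's GGML_ASSERT (illustrative,
// not copied from ggml). On failure it prints the file, line, and stringified
// condition, then aborts -- which matches the reported output
// "GGML_ASSERT: ggml-metal.m:932: n % 4 == 0".
#define MY_ASSERT(x)                                        \
    do {                                                    \
        if (!(x)) {                                         \
            fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n",     \
                    __FILE__, __LINE__, #x);                \
            abort();                                        \
        }                                                   \
    } while (0)

int main(void) {
    int n = 71; // e.g. a tensor size influenced by falcon-7b's 71 heads
    MY_ASSERT(n % 4 == 0); // aborts, printing this file, line, and "n % 4 == 0"
    return 0;
}
```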