
grammars: 1.5x faster inference w/ complex grammars (vector reserves / reuses) #6609

Merged: 6 commits merged into ggerganov:master on Apr 11, 2024

Conversation

@ochafik (Collaborator) commented Apr 11, 2024

See discussion in #4218 (comment)

Here's a simple repro of the ~1.6x inference speedup (on Metal) with a nested, repetition-heavy grammar from #6555 (for the JSON schema {"items": {"type": "number"}, "maxItems": 100}):

git clone https://github.com/ochafik/llama.cpp --branch grammar-speedup3 llama.cpp-grammar && \
    cd llama.cpp-grammar && \
    git pull && \
    mkdir -p models/7B

echo '
    decimal-part ::= [0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9])?)?)?)?)?)?)?)?)?)?)?)?)?)?)?
    integral-part ::= [0-9] | [1-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9])?)?)?)?)?)?)?)?)?)?)?)?)?)?)?
    number ::= ("-"? integral-part) ("." decimal-part)? ([eE] [-+]? integral-part)? space
    root ::= "[" space number "," space number "," space number "," space number "," space number "," space number "," space number "," space number "," space number "," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)? "]" space
    space ::= " "?
' > json_numbers.grammar

hyperfine \
    --warmup 1 --runs 5 \
    -L branch grammar-speedup3,master \
    --setup 'git checkout {branch} && make clean && make -j LLAMA_CURL=1 main' \
    './main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344'
Results:
Benchmark 1: ./main -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf --grammar-file json_numbers.grammar -p "List of 20 integers starting from 0" --seed 12344 (branch = grammar-speedup3)
  Time (mean ± σ):      8.405 s ±  0.234 s    [User: 6.806 s, System: 0.412 s]
  Range (min … max):    8.179 s …  8.750 s    5 runs
 
Benchmark 2: ./main -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf --grammar-file json_numbers.grammar -p "List of 20 integers starting from 0" --seed 12344 (branch = master)
  Time (mean ± σ):     13.386 s ±  0.109 s    [User: 9.596 s, System: 2.552 s]
  Range (min … max):   13.253 s … 13.520 s    5 runs
 
Summary
  ./main -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf --grammar-file json_numbers.grammar -p "List of 20 integers starting from 0" --seed 12344 (branch = grammar-speedup3) ran
    1.59 ± 0.05 times faster than ./main -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf --grammar-file json_numbers.grammar -p "List of 20 integers starting from 0" --seed 12344 (branch = master)

cc/ @HanClinto

@github-actions bot (Contributor) commented Apr 11, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 458 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=10260.57ms p(95)=26344.27ms fails=, finish reason: stop=408 truncated=50
  • Prompt processing (pp): avg=111.91tk/s p(95)=474.61tk/s
  • Token generation (tg): avg=24.24tk/s p(95)=38.01tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=grammar-speedup3 commit=1e0f466920dbd6747852db864118266e6f256700

[Chart: llamacpp:prompt_tokens_seconds — bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 458 iterations]

[Chart: llamacpp:predicted_tokens_seconds — bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 458 iterations]

[Chart: llamacpp:kv_cache_usage_ratio — bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 458 iterations]

[Chart: llamacpp:requests_processing — bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 458 iterations]

@HanClinto (Collaborator) commented:

Similar to the integration tests, examples/gbnf-validator will eventually need to be updated to use the new API as well. That's lower priority though, and I can do that after we get through all of this.

Reading through this PR, I'm amazed that such a simple change provides such a dramatic speedup. I still just have a hard time believing it's as effective as it is. :)

@ochafik (Collaborator, Author) commented Apr 11, 2024

Similar to the integration tests, examples/gbnf-validator will eventually need to be updated to use the new API as well.

Done, thanks!

Reading through this PR, I'm amazed that such a simple change provides such a dramatic speedup. I still just have a hard time believing it's as effective as it is. :)

Yeah I tried half a dozen similar rewrites and only these lucky two struck a chord :-D (let's hope for much more dramatic speedups w/ upcoming changes #4218 (comment))

llama.cpp (outdated diff):

     for (auto it = code_points.begin(), end = code_points.end() - 1; it != end; ++it) {
-        grammar->stacks = llama_grammar_accept(grammar->rules, grammar->stacks, *it);
+        llama_grammar_accept(grammar->rules, grammar->stacks, *it, tmp_new_stacks);
+        tmp_new_stacks.swap(grammar->stacks);
@HanClinto (Collaborator) commented:

Is this better than saying grammar->stacks = tmp_new_stacks;? Because new_stacks is .clear()'d on line 11921, it seems like we don't need to save its value here, and we could save a small step (?).

Mainly though, the recursive nature of the swap here was making my eyes cross when trying to follow exactly what this change was doing and how the contents of grammar->stacks and tmp_new_stacks were ping-ponging back and forth in this loop, so getting rid of the .swap() might make it a bit easier to read as well?

@HanClinto (Collaborator) commented:

FWIW, I tried making this change (into a local grammar-speedup4 branch), and it didn't significantly improve things, but it wasn't slower, and I think the code is a bit more readable:

Benchmark 1: ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = grammar-speedup4)
  Time (mean ± σ):     12.586 s ±  0.698 s    [User: 8.488 s, System: 1.799 s]
  Range (min … max):   12.012 s … 13.726 s    5 runs

Benchmark 2: ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = grammar-speedup3)
  Time (mean ± σ):     12.904 s ±  0.854 s    [User: 8.583 s, System: 1.954 s]
  Range (min … max):   11.846 s … 13.963 s    5 runs

Summary
  ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = grammar-speedup4) ran
    1.03 ± 0.09 times faster than ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = grammar-speedup3)

@ochafik (Collaborator, Author) replied:

Heh, turns out my eyes-crossing swap wasn't even making things faster. Removed it; looks simpler now, thanks!

@HanClinto (Collaborator) commented:

FWIW, I've independently confirmed the (rather dramatic) speedup results of 1.71x on my system (wow!):

Benchmark 1: ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = grammar-speedup3)
  Time (mean ± σ):     12.302 s ±  1.341 s    [User: 8.369 s, System: 1.672 s]
  Range (min … max):   11.405 s … 14.642 s    5 runs

Benchmark 2: ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = master)
  Time (mean ± σ):     20.978 s ±  1.003 s    [User: 11.519 s, System: 6.894 s]
  Range (min … max):   19.908 s … 22.488 s    5 runs

Summary
  ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = grammar-speedup3) ran
    1.71 ± 0.20 times faster than ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = master)

MacBook Pro, Apple M1 Pro, 32 GB of RAM, and about a billion open Firefox tabs.

Really awesome work, @ochafik !

@HanClinto (Collaborator) commented:

Other than my minor suggestion re: swap(), I'm really happy with this PR, and can't wait to see it merged in!

@HanClinto (Collaborator) left a review:

This PR looks good to me!

@ochafik ochafik marked this pull request as ready for review April 11, 2024 18:46
@ochafik ochafik merged commit cbaadc9 into ggerganov:master Apr 11, 2024
47 of 50 checks passed
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Apr 11, 2024
tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024
…/ reuses) (ggerganov#6609)

* grammars: reserve rejects & next candidates

* grammars: reuse new_stacks

* grammars: fix missing sig change in llama.h

* grammars: fix test (api changed)

* grammars: update gbnf-validator.cpp

* grammars: simpler syntax (no swap)