Releases · ggerganov/llama.cpp

22 Nov 10:41

6dfcfef

b4153

ci: Update oneAPI runtime dll packaging (#10428)

These are the minimum runtime dll dependencies for oneAPI 2025.0

22 Nov 07:22

github-actions

b4151

c18610b

CANN: Support Ascend310P to accelerate F32 and F16 Model (#10216)

* CANN Support Ascend310P to accelerate F32 and F16 Model

* Add compile option soc type macro ASCEND_310P to ggml-cann lib

* Remove unused code

* Remove the hard-coded Ascend soc_type compile option in CMakeLists.txt

21 Nov 18:27

github-actions

b4150

a5e4759

cuda : optimize argmax (#10441)

* cuda : optimize argmax

* remove unused parameter

ggml-ci

* fixup : use full warps

ggml-ci

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <[email protected]>

* fix ub

* ggml : check ne00 <= INT32_MAX in argmax and argsort

---------

Co-authored-by: Johannes Gäßler <[email protected]>
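
The last item in the list above adds a bound check on the row length `ne00`. A minimal C++ sketch of the idea follows, assuming the check exists because the resulting index is stored as a 32-bit integer. The helper name `argmax_row` is hypothetical and this is not ggml's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper for illustration only, not ggml's implementation.
// Returns the index of the largest element in a row of ne00 floats.
// The index is returned as int32_t, so the row length must fit in one.
static int32_t argmax_row(const float * x, int64_t ne00) {
    // Guard in the spirit of the commit: reject rows whose indices
    // cannot be represented in a 32-bit signed integer.
    assert(ne00 > 0 && ne00 <= std::numeric_limits<int32_t>::max());

    int32_t best = 0;
    for (int64_t i = 1; i < ne00; ++i) {
        // Strict comparison keeps the first occurrence on ties.
        if (x[i] > x[best]) {
            best = static_cast<int32_t>(i); // safe: i < ne00 <= INT32_MAX
        }
    }
    return best;
}
```

Calling `argmax_row` on a `std::vector<float>` through `data()` and `size()` returns the first position of the maximum. Without the guard, a row longer than `INT32_MAX` would silently truncate the index in the cast.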

21 Nov 09:23

github-actions

b4149

1bb30bf

llama : handle KV shift for recurrent models (#10402)

ggml-ci

21 Nov 08:49

github-actions

b4148

87a533b

sync : ggml

20 Nov 15:17

github-actions

b4143

fab5d30

llama : add .clang-format file (#10415)

20 Nov 09:03

github-actions

b4142

8fd4b7f

vulkan: copy iq4_nl LUT into shared memory (#10409)

20 Nov 08:42

github-actions

b4141

1bacb9f

vulkan: further optimize mul_mat_vec using larger loads (#10387)

* vulkan: Use pipeline_robustness to disable robustness in mul_mat_vec.

Add some early returns for nonexistent rows in mul_mat_vec shaders. These
can only be hit when dispatching a 2D grid of workgroups. Fix the logic
for the 2D grid of workgroups to round up.

Enable the pipeline robustness extension if it's available, and use it to
disable robustness for these pipelines. The instructions to do the bounds
checking contend for the same ALU resources as the bit twiddling dequant
instructions.

* vulkan: Add GLSL structure aliases for quant types to allow larger loads

In Vulkan it's not possible to cast pointer types, so instead you have to
declare an aliased binding for the memory with a different type. This
commit adds aliases for the quant formats using 16b ints, and in a few
places where the struct size is a multiple of 4 also using 32b ints.
Currently only q4_k's aliases are used, but others will be used in
subsequent commits.

* vulkan: use larger loads in q5_k and q6_k shaders.

Similar to the optimization I did in q4_k recently, this vectorizes some loads
and reduces the number of bit twiddling instructions.

* vulkan: use larger K step per iteration in mul_mat_vec.

Add vec4 dequantization functions, and use them to do K=8 per iteration in
mul_mat_vec. This uses 16b loads for the quant values and 128b loads for B
which helps reduce the load on the memory system.

The K_PER_ITER==2 logic is still there, just for F16/F32, and really only
because they support unaligned sizes.

Tweak the num_iters/unrolling logic to be simpler and catch a couple missed
unrolling opportunities.
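
The first commit in the list above mentions rounding up the 2D grid of workgroups and returning early for nonexistent rows. The following is a CPU-side C++ sketch of that pattern, not the actual Vulkan backend code. The names `Grid2D`, `make_grid`, `run_grid`, and the per-dimension limit `max_x` are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical types and names; this only simulates the dispatch on the CPU.
struct Grid2D {
    uint32_t x;
    uint32_t y;
};

// Split num_rows workgroups over a 2D grid when one dimension is capped at max_x.
// The division rounds up so a partially filled last line of workgroups is kept.
static Grid2D make_grid(uint32_t num_rows, uint32_t max_x) {
    assert(max_x > 0);
    if (num_rows <= max_x) {
        return {num_rows, 1};
    }
    // 64-bit intermediate avoids overflow in the round-up addition.
    const uint64_t y = (static_cast<uint64_t>(num_rows) + max_x - 1) / max_x;
    return {max_x, static_cast<uint32_t>(y)};
}

// Visit every workgroup of the grid and count how often each row is processed.
static std::vector<int> run_grid(uint32_t num_rows, uint32_t max_x) {
    const Grid2D g = make_grid(num_rows, max_x);
    std::vector<int> visits(num_rows, 0);
    for (uint32_t wy = 0; wy < g.y; ++wy) {
        for (uint32_t wx = 0; wx < g.x; ++wx) {
            const uint64_t row = static_cast<uint64_t>(wy) * g.x + wx;
            if (row >= num_rows) {
                continue; // stands in for the shader's early return
            }
            visits[row] += 1;
        }
    }
    return visits;
}
```

With 10 rows and a limit of 4 per dimension, the grid is 4 by 3, so two of the 12 workgroups map to rows that do not exist and must exit early. Rounding down instead would give a 4 by 2 grid and leave the last two rows unprocessed.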

19 Nov 23:39

github-actions

b4139

3952a22

Fix missing file renames in Makefile due to changes in commit ae8de6d…

19 Nov 21:30

github-actions

b4138

42ae10b

add cmake rvv support (#10411)
