Releases · ggml-org/llama.cpp
b4958
b4957
ggml-cpu : update KleidiAI to v1.5.0 (#12568)
ggml-cpu : bug fix related to KleidiAI LHS packing
Signed-off-by: Dan Johansson <[email protected]>
b4956
SYCL: disable Q4_0 reorder optimization (#12560) ggml-ci
b4953
context : fix worst-case reserve outputs (#12545) ggml-ci
b4951
opencl: simplify kernel embedding logic in cmakefile (#12503) Co-authored-by: Max Krasnyansky <[email protected]>
b4948
llama-vocab : add SuperBPE pre-tokenizer (#12532)
b4947
CUDA: Fix clang warnings (#12540) Signed-off-by: Xiaodong Ye <[email protected]>
b4946
mmap : skip resource limit checks on AIX (#12541)
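A hedged sketch of the general pattern this change suggests, not the actual llama.cpp code: the use of RLIMIT_MEMLOCK and the helper name below are assumptions for illustration only. The idea is to consult the POSIX resource limit before locking or mapping memory, except on AIX, where the check is skipped entirely.

```cpp
#include <cstddef>
#include <cstdio>
#if !defined(_AIX)
#include <sys/resource.h>
#endif

// Hypothetical illustration: query a resource limit before locking memory,
// but skip the query on AIX, where this check is not applied.
static bool memlock_limit_ok(size_t bytes) {
#if defined(_AIX)
    (void) bytes;
    return true; // resource limit check skipped on AIX
#else
    struct rlimit lim;
    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
        return false;
    }
    return lim.rlim_cur == RLIM_INFINITY || bytes <= lim.rlim_cur;
#endif
}

int main() {
    printf("limit ok: %d\n", memlock_limit_ok(1 << 20));
}
```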
b4945
vulkan: fix mul_mat_vec failure in backend tests (#12529)
The OOB calculation could be wrong if the last iteration fell inside one of the unrolled loops. Adjust the unrolling counts to avoid this, and add a couple of new backend tests that hit this failure on NVIDIA GPUs. A plain C++ illustration of the failure mode follows below.
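The sketch below is a plain C++ illustration of this failure mode, not the actual Vulkan mul_mat_vec shader: when a loop is unrolled by a fixed factor, every access inside an unrolled group must stay in bounds, so the unrolled part should cover only full groups and a guarded scalar tail should handle the remainder. If the bounds check only guards the start of a group, a final partial iteration that lands inside the unrolled body reads past the end of the data.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative dot product unrolled by 4. The unrolled loop runs only over
// full groups of 4, so no access in the group can be out of bounds; the
// scalar tail handles the remaining 0-3 elements with a per-element check.
float dot_unrolled(const float * a, const float * b, size_t n) {
    float sum = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum += a[i + 0]*b[i + 0];
        sum += a[i + 1]*b[i + 1];
        sum += a[i + 2]*b[i + 2];
        sum += a[i + 3]*b[i + 3];
    }
    for (; i < n; ++i) {
        sum += a[i]*b[i];
    }
    return sum;
}

int main() {
    std::vector<float> a = {1, 2, 3, 4, 5, 6, 7};
    std::vector<float> b = {1, 1, 1, 1, 1, 1, 1};
    printf("%.1f\n", dot_unrolled(a.data(), b.data(), a.size())); // prints 28.0
}
```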
b4944
server : Add verbose output to OAI compatible chat endpoint (#12246)
Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it conform with server_task_result_cmpl_final::to_json_oaicompat_chat, as well as the other to_json methods.
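As a rough illustration of what conforming means here, the hypothetical C++ sketch below (the function names, signatures, and the "__verbose" field are assumptions, not the server's actual code) shows a streaming and a non-streaming serializer routing verbose diagnostics through one shared helper, so both response shapes carry the same extra fields.

```cpp
#include <nlohmann/json.hpp>
#include <iostream>

using json = nlohmann::json;

// Shared helper: both serializers attach the same verbose diagnostics,
// so streaming and non-streaming responses stay consistent.
static void append_verbose(json & res, bool verbose, const json & diagnostics) {
    if (verbose) {
        res["__verbose"] = diagnostics; // field name is illustrative only
    }
}

static json to_json_chat(bool verbose, const json & diagnostics) {
    json res = {{"object", "chat.completion"}, {"choices", json::array()}};
    append_verbose(res, verbose, diagnostics);
    return res;
}

static json to_json_chat_stream(bool verbose, const json & diagnostics) {
    json res = {{"object", "chat.completion.chunk"}, {"choices", json::array()}};
    append_verbose(res, verbose, diagnostics); // now matches the non-streaming path
    return res;
}

int main() {
    const json diag = {{"prompt_tokens", 12}};
    std::cout << to_json_chat_stream(true, diag).dump(2) << std::endl;
}
```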