TAGS
ggml-cpu: fix fallback for RVV kernels without zvfh (#21157)

* ggml-cpu: refactor sgemm; fix rvv checks
* ggml-cpu: refactor rvv kernels; set zvfbfwma default to off
CUDA: Add Flash Attention Support for Head Dimension 512 (#20998)

* flash attention support for head dimension 512 added
* FA D=512 - match 576 configs, limit ncols2, revert vec cap
* fix HIP tile kernel build for D=512
* fix HIP tile kernel occupancy for D=512 on AMD
* Apply suggestions from code review
* fix tile FA compilation

Co-authored-by: Johannes Gäßler <[email protected]>
llama : refactor llama_model_quantize_params to expose a pure C interface (#20346)

* Refactor llama_model_quantize_params to expose a pure C interface
* Restore comment and cleanup struct def
* Code review refactoring
* Code review refactoring

Co-authored-by: Georgi Gerganov <[email protected]>
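To illustrate what "a pure C interface" means in a params struct like the one above, here is a minimal sketch. The struct and field names are hypothetical (not the actual `llama.h` definitions): the point is that every field is a plain C type (fixed-width integers, bools, raw pointers) inside an `extern "C"` block, so the struct can cross the C ABI without C++ containers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical example of a "pure C" params struct: only C-compatible types,
// declared with C linkage so it is usable from C callers.
extern "C" {
    struct example_quantize_params {
        int32_t      nthread;         // threads to use, 0 = auto-detect
        bool         quantize_output; // whether to quantize the output tensor
        const char * imatrix_path;    // optional path, NULL if unused
        const void * tensor_types;    // opaque per-tensor overrides, NULL if unused
        size_t       n_tensor_types;  // number of entries in tensor_types
    };
}

// Conventional companion: a function returning sensible defaults, so C callers
// never depend on C++ default-member-initializers.
static struct example_quantize_params example_quantize_default_params(void) {
    struct example_quantize_params p = {}; // zero-initialize everything
    p.nthread         = 0;
    p.quantize_output = true;
    return p;
}
```

A C++ implementation can still use `std::vector` and friends internally; only this plain-data view is part of the public ABI.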
ggml webgpu: quantized buffers to u32 + wider browser/device support (#21046)

* Work towards removing bitcast
* Move rest of existing types over
* Add timeout back to wait and remove synchronous set_tensor/memset_tensor
* move to unpackf16 for wider compatibility
* cleanup
* Remove deadlock condition in free_bufs
ggml-webgpu: port all AOT operators to JIT (#20728)

* port cpy pipeline to shader lib with JIT compilation
* port glu pipeline to shader lib with JIT compilation
* port rope pipeline to shader lib with JIT compilation
* port soft_max pipeline to shader lib with JIT compilation
* removed unused functions from embed_wgsl.py which were used for old AOT template expansion
common : cleanup logs and modernize the progress bar (#21215)

```
$ build/bin/llama-server -hf unsloth/Qwen3.5-0.8B-GGUF
common_download_file_single_online: HEAD failed, status: 404
no remote preset found, skipping
Downloading mmproj-BF16.gguf ——————————————————————————————————————— 100%
Downloading Qwen3.5-0.8B-Q4_K_M.gguf ——————————————————————————————— 100%
...
```

Signed-off-by: Adrien Gallouët <[email protected]>
CANN: fix multi-thread set_tensor race conditions (#20151)

* CANN: fix multi-thread set_tensor race conditions

  When ollama calls ggml_backend_tensor_set from multiple threads (each writing a different chunk of the same tensor), the CANN backend had three concurrency issues:

  1. Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform before uploading to device. Per-chunk transforms produced corrupt data.
  2. ND-to-NZ weight conversion requires complete tensor data on device. Per-chunk conversion operated on incomplete data.
  3. The global g_nz_workspaces array had unprotected concurrent access.

  Fix by introducing a TensorSetTracker that accumulates write progress per tensor. For quantized tensors, raw data is staged in a host buffer and the transform + upload is deferred until all chunks arrive. For NZ weights, chunks are uploaded directly but conversion is deferred. The tracker and its staging buffer are released immediately after post-processing completes.

  Add per-device mutex to g_nz_workspaces to prevent data races.

* CANN: fix L2_NORM ignoring eps parameter

  The L2_NORM implementation was not using the eps parameter from op_params, causing incorrect results when eps is large (e.g. 10.0). The CPU reference computes scale = 1/fmaxf(norm, eps), so add a Clamp step to clamp the norm to at least eps before dividing.

* ggml/cann: compare op_params for POOL_2D in ACL graph cache matching

  When ACL graph mode is enabled, the graph LRU cache checks whether a cached graph matches the current computation graph. Previously, GGML_OP_POOL_2D was not included in the op_params comparison, so two POOL_2D nodes with different pooling parameters (kernel size, stride, padding) but identical tensor shapes and addresses could incorrectly reuse a cached graph, leading to wrong results or aclnn errors. Add GGML_OP_POOL_2D to the list of ops that require op_params matching in ggml_graph_node_properties::has_matching_properties().

* cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison

  The ACL graph LRU cache was incorrectly reusing cached graphs for operations with different tensor types or op_params, causing test failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD, RMS_NORM_MUL_ADD, and ADD_RMS_NORM.

  Changes:
  - Add node_type and src_type[] fields to ggml_graph_node_properties so the cache can distinguish tensors with different types but identical ne/nb (e.g. f16 and bf16 both have 2-byte elements)
  - Compare op_params unconditionally for all ops instead of only for SCALE/UNARY/GLU/ROPE/POOL_2D
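The deferred-transform idea behind the TensorSetTracker above can be sketched as follows. This is a minimal illustration, not the actual CANN backend code: each chunk write lands in a host staging buffer under a mutex, and the full-tensor post-processing (format transform + upload) runs only once the accumulated bytes cover the whole tensor.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <unordered_map>
#include <vector>

// Sketch of a per-tensor write tracker: threads write disjoint chunks, and a
// completion callback (standing in for the transform + device upload) fires
// exactly once, when the last chunk arrives.
class TensorSetTracker {
public:
    TensorSetTracker(size_t total_size,
                     std::function<void(const std::vector<char> &)> on_complete)
        : staging_(total_size), remaining_(total_size),
          on_complete_(std::move(on_complete)) {}

    // Called from multiple threads, each writing a disjoint [offset, offset+size) range.
    void write_chunk(size_t offset, const char * data, size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::copy(data, data + size, staging_.begin() + offset);
        remaining_ -= size;
        if (remaining_ == 0) {
            // All chunks arrived: the deferred full-tensor post-processing
            // (transform + upload in the real backend) can now run safely.
            on_complete_(staging_);
        }
    }

private:
    std::mutex mutex_;
    std::vector<char> staging_;
    size_t remaining_;
    std::function<void(const std::vector<char> &)> on_complete_;
};
```

In the real fix the tracker and staging buffer are released right after the callback, so the extra host memory only lives for the duration of the upload.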
server: (webui) no more gzip compression (#21073)

* webui: no more gzip
* try changing a small line
* Revert "try changing a small line"

  This reverts commit 0d7a3531593d87b724d404c8727a96becab3ab07.

* fix lint
* fix test
* rebuild
* split into html/css/js
* lint
* chore: update webui build output
* chore: Update git hooks script
* server: update webui build output
* chore: Update pre-commit hook
* refactor: Cleanup

Co-authored-by: Aleksander Grygier <[email protected]>
fix: correct misspellings in code comments (#21217)

- emdeddings → embeddings (gemma3.cpp, gemma3n-iswa.cpp, gemma-embedding.cpp)
- imlpemented → implemented (llama-adapter.cpp)
- interere → interfere (llama-graph.cpp)
- overridde → overridden (chat.cpp)
- stastistics → statistics (ngram-map.h)
- layed → laid (llama-kv-cache.h)
- worster → worst (llama-context.cpp)
- sequantial → sequential (llama-batch.h)
vendor : update BoringSSL to 0.20260327.0 (#21211) Signed-off-by: Adrien Gallouët <[email protected]>
opencl: add q4_K gemm and gemv kernels for Adreno (#20919)

* opencl: add q4_K gemm and gemv kernels for Adreno
* opencl: fix whitespace
* opencl: add workarounds for compiler bugs on older devices
* opencl: handle fp16 denorm on X Elite
* opencl: fix kernel build error
* opencl: fix whitespace
* opencl: make q4_K cvt kernels signature consistent

Co-authored-by: Li He <[email protected]>
jinja : handle empty expressions correctly (#20913)

* Reject empty computed member expressions before returning slices[0] from parse_member_expression_arguments().
* Treat empty computed member expressions with Jinja2 undefined semantics

  Treat empty computed member expressions like `a[]` as undefined instead of raising a parser error, to match Jinja2 behavior.

  - return a noop expression for empty computed member arguments
  - return undefined when a computed member key evaluates to undefined
  - add Jinja tests covering `a[]|default('fallback')` and `a[] is undefined`

* Handle undefined computed member properties

  Move undefined-property handling to the common member access path, and add a test covering `a[undefined] is undefined`.

* Use default undefined value in member access

  Initialize val and then return it when property is undefined.

* empty statement parses to blank_expression instead of noop_statement

Co-authored-by: Sigbjørn Skjæret <[email protected]>
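The Jinja2 semantics adopted above can be sketched with a toy evaluator. The names here are hypothetical (this is not minja's actual API): member access with an empty or undefined key, or a missing property, yields an "undefined" value rather than a parse error, which is what lets `a[] | default('fallback')` behave as in Jinja2.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Toy member access: std::nullopt stands in for Jinja's undefined, both as a
// key (the `a[]` case) and as a result (missing property).
static std::optional<std::string> member_access(
        const std::map<std::string, std::string> & obj,
        const std::optional<std::string> & key) {
    if (!key) {
        return std::nullopt; // empty computed member `a[]` -> undefined, not an error
    }
    auto it = obj.find(*key);
    if (it == obj.end()) {
        return std::nullopt; // missing property -> undefined
    }
    return it->second;
}

// Jinja2's `default` filter: replace an undefined value with a fallback.
static std::string default_filter(const std::optional<std::string> & v,
                                  const std::string & fallback) {
    return v ? *v : fallback;
}
```

With this shape, `a[] is undefined` is simply a check for the empty optional, and chaining through `default` recovers a usable value.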
CUDA : Fix CUB's argsort when nrows % block_size == 0 with CCCL < 3.1 (#21181)

* CUDA: Fix CUB's argsort when nrows % block_size == 0 with CCCL < 3.1

  We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`, while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we had uninitialized values in `offset_iterator[nrows]` when `nrows % block_size == 0`.

  Fixes #21162

* Reduce nrows in the test case to 256; 768 is not needed
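The off-by-one above comes from the shape of CUB's segmented sort: for nrows segments it reads nrows + 1 offsets (every segment boundary), so the grid that fills the offsets array must cover nrows + 1 elements, not nrows. A minimal sketch of the arithmetic (`ceildiv` here is a hypothetical helper, mirroring the expression in the commit message):

```cpp
#include <cassert>

// Integer ceiling division, as used to size a CUDA grid:
// ceildiv(n, block_size) blocks of block_size threads cover n elements.
static int ceildiv(int a, int b) { return (a + b - 1) / b; }

// For nrows segments the offsets array holds nrows + 1 entries. With
// nrows = 256 and block_size = 256:
//   ceildiv(nrows,     block_size) = 1 block  -> fills offsets[0..255] only,
//                                                leaving offsets[256] uninitialized
//   ceildiv(nrows + 1, block_size) = 2 blocks -> also fills offsets[256]
// Whenever nrows % block_size != 0, the two expressions agree, which is why
// the bug only shows up in the divisible case.
```

This is also why shrinking the test to nrows = 256 still exercises the bug: any multiple of the block size hits the uncovered last offset.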