sync : ggml #3319

ggerganov · 2025-07-12T13:27:23Z

No description provided.

* ggml : add version function to get lib version

  This commit adds a function `ggml_version()` to the ggml library that returns the version of the library as a string.

  The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used.

  Usage:
  ```c
  printf("GGML version: %s\n", ggml_version());
  ```
  Output:
  ```console
  GGML version: 0.0.2219
  ```

* ggml : add ggml_commit()

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* [CANN] update to aclnnGroupedMatmulV2
* Support MUL_MAT_ID on 310p
* fix editorconfig

---------

Signed-off-by: noemotiovon <757486878@qq.com>

* Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.
* Return previous callback to allow callback chaining
* style fixes

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
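The "return the previous callback" pattern described in this commit message can be illustrated with a minimal, self-contained sketch. The names `abort_callback_t`, `set_abort_callback`, `report_fatal` and `app_on_abort` are hypothetical stand-ins chosen for illustration; they are not claimed to be the ggml API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical callback type: receives the fatal error message. */
typedef void (*abort_callback_t)(const char * error_message);

/* Currently installed callback (NULL means "use the default behaviour"). */
static abort_callback_t g_abort_callback = NULL;

/* Install a new callback and return the one it replaces,
 * so that the caller can chain to it later. */
static abort_callback_t set_abort_callback(abort_callback_t callback) {
    abort_callback_t previous = g_abort_callback;
    g_abort_callback = callback;
    return previous;
}

/* What a library could do just before aborting (the abort itself is omitted). */
static void report_fatal(const char * message) {
    if (g_abort_callback != NULL) {
        g_abort_callback(message);
    } else {
        fprintf(stderr, "%s\n", message);
    }
}

/* Application side: remember the previous callback and forward to it. */
static abort_callback_t g_previous_callback = NULL;
static int g_app_calls = 0;

static void app_on_abort(const char * message) {
    assert(message != NULL);
    g_app_calls++; /* e.g. show a dialog or save user data here */
    if (g_previous_callback != NULL) {
        g_previous_callback(message); /* chain to whoever was installed before */
    }
}
```

An application would install its handler with `g_previous_callback = set_abort_callback(app_on_abort);`. Because the setter hands back the old handler, several components can each add their own handler without silently dropping the others.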

* llama : initial Mamba-2 support
* ggml : SIMD ggml_ssm_scan for Mamba-2
* ggml : improve ggml_mul speed when masking recurrent states
* llama : support running Mamba-Codestral-7B-v0.1
* llama : fix Mamba-2 conv state saving
* ggml : make the ggml_mul fast broadcast path more consistently formatted
* llama : remove unused variable
* llama : add missing break
* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

  The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.

* llama : avoid redundant state copy for Mamba 1 and 2
* metal : attempt to adapt SSM_SCAN for Mamba-2
* metal : fix SSM_SCAN pipeline scope
* metal : use log and exp instead of log1pf and expf in SSM_SCAN
* metal : remove unused arguments for SSM_SCAN

  The max index is 31, so trimming the arguments is necessary.

* metal : add back n_seqs to SSM_SCAN args

  Whoops, this is needed for the offset in the concatenated output.

* metal : fix SSM_SCAN state head offset
* metal : fix wrong number of tokens per sequence in SSM_SCAN
* ggml : remove unused fast broadcast path in GGML_MUL

  This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.

* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

  This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

  This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
* convert : fix flake8 lint
* metal : fix confusion between ; and ,
* metal : add missing args for nb references in ssm_scan_f32_group
* metal : single-user mamba2 inference works
* kv-cache : remove const_cast when setting inputs for s_copy

  And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.

* convert : avoid AutoConfig for Mamba and Mamba2 hparams
* kv-cache : allow context shift for recurrent models
* graph : fix recurrent state copies when avoiding copies

  Works, but using lambda functions might not be that clean.

* ggml : fix mamba2 ssm scan when compiled with SVE
* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
* cuda : implement ssm scan for Mamba2

  There is still room for improvement, but it works!

* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
* mamba : fix mismatched new and delete size for llm_build_mamba

  Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled -DGGML_SANITIZE_ADDRESS=ON

* cuda : graceful fallback for Mamba-1 models with weird embd size

* kv-cache : use ggml_set_rows

  ggml-ci

* graph : separate k and v indices

  ggml-ci

* cont : remove redundant ifs

  ggml-ci

* kv-cache : improve find_slot impl
* kv-cache : bounds-check when accessing slot_info indices
* kv-cache : add comments

  ggml-ci

* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

  ggml-ci
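For readers unfamiliar with a set-rows style operation, the following is a minimal sketch of the scatter semantics in plain C, with the kind of bounds check on the indices that the commit list mentions. The function name `set_rows_f32` and its signature are assumptions made for illustration; this is not the ggml implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scatter rows: row i of src is copied into row row_ids[i] of dst.
 * dst has dst_rows rows, src has n_rows rows, each row holds row_len floats. */
static void set_rows_f32(float * dst, int64_t dst_rows,
                         const float * src, int64_t n_rows,
                         int64_t row_len, const int64_t * row_ids) {
    for (int64_t i = 0; i < n_rows; i++) {
        const int64_t r = row_ids[i];
        assert(r >= 0 && r < dst_rows); /* bounds-check the destination index */
        memcpy(dst + r * row_len, src + i * row_len,
               (size_t) row_len * sizeof(float));
    }
}
```

With a cache of 4 rows and indices `{3, 1}`, the two source rows land in cache rows 3 and 1, and the other rows are left untouched. Supplying the destination slots as data rather than as a fixed offset is what lets new entries be written to arbitrary cells.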

* vulkan: allow FA split_k with smaller KV values
* vulkan: spread split_k_reduce work across more threads

  k_num can get rather large. Use the whole workgroup to reduce the M/L values. Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).

* ggml : add ggml_scale_bias
* ggml_vec_mad1_f32
* add more simd
* add CUDA
* sycl
* vulkan
* cann (placeholder)
* opencl
* will this fix cpu?
* fix cuda
* suggestions from coderabbit
* fix cann compile error
* vDSP_vsmsa
* rm __ARM_FEATURE_SVE
* use memcpy for op params
* make code looks more consistent
* use scalar for __ARM_FEATURE_SVE
* add x param to ggml_vec_mad1_f32
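As a rough illustration of what a fused scale-and-bias operation computes, here is a scalar reference sketch of `y[i] = x[i] * s + b`. The function name `vec_scale_bias_f32` is hypothetical, and the SIMD and backend variants listed in the commits above are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: y[i] = x[i] * s + b for i in [0, n).
 * x and y may point to the same buffer (in-place update). */
static void vec_scale_bias_f32(int n, float * y, const float * x, float s, float b) {
    assert(n >= 0 && y != NULL && x != NULL);
    for (int i = 0; i < n; ++i) {
        y[i] = x[i] * s + b;
    }
}
```

Computing the scale and the bias in one pass avoids a second traversal of the data, and an intermediate tensor, compared with a separate scale followed by an add.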

* SYCL: Initial set_rows kernel implementation
* Revert max_threads to 256
* Refactor set_rows and address review comments
* Deduplicate conversion function
* Remove guard before kernel launch and refactor
* Fix and add back SFINAE

**Important** LFM2 was merged into transformers (huggingface/transformers#39340), but has not yet been released. To convert into gguf, install transformers from source:

```shell
pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"
```

* vulkan: allow unclamped loads in coopmat2 mul_mat_id shader
* vulkan: increase coopmat2 mul_mat_id tile size
* vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path
* vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)

* vulkan: support SET_ROWS

  Add variants of the copy_to_quant shader that do the SET_ROWS operation. Change these shaders to spread the work across the workgroup. The memory access pattern is probably not great (one thread per quant block), but should be fine for now.

* vulkan: optimize set_rows

  Larger workgroups for non-quant types. Set "norepeat" (there is manual repeat logic). Use fastmod.

danbev and others added 30 commits July 12, 2025 16:22

vulkan : implement ggml_roll (ggml/1290)

9c20ec6

* vulkan : implement ggml_roll
* vulkan : refactor vk_op_unary_push_constants initialization

vulkan : implement bilinear interpolation for ggml_upscale/ggml_interpolate (ggml/1291)

ab813e8

* supports GGML_SCALE_MODE_BILINEAR and GGML_SCALE_FLAG_ALIGN_CORNERS

add GELU_ERF (llama/14455)

266de70

vulkan: Split large mul_mat_id to fit in shared memory (llama/14451)

c1f229a

ci : disable fast-math for Metal GHA CI (llama/14478)

4234492

* ci : disable fast-math for Metal GHA CI

  ggml-ci

* cont : remove -g flag

  ggml-ci

opencl : update upscale to support align corners (llama/14488)

7645aee

opencl : skip empty nodes on cgraph compute (llama/14491)

801367c

opencl : fix possible buffer overflow in dump_tensor (llama/14490)

50785ef

ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (llama/14435)

66e89a0

vulkan: support softmax/FA batch and broadcast (llama/14449)

5c6866f

CUDA: broadcasting for FlashAttention mask (llama/14500)

73ea436

CUDA: add softmax broadcast (llama/14475)

7366a28

* CUDA: add softmax broadcast
* Pass by const ref
* Review: Use blockDims for indexing, remove designated initializers
* Add TODO for noncontiguous input/output

CUDA: add dynamic shared mem to softmax, refactor general usage (llama/14497)

580e2ed

ggml : fix FA mask dim 2 and 3 (llama/14505)

ca2b806

* ggml : fix FA mask dim 2 and 3

  ggml-ci

* backends : unsupport batched FA in CUDA and Vulkan

  ggml-ci

* vulkan : disable FA for mask->ne[2] != 1

Fix conditional enabling following arch checks for ggml-sycl (llama/14504)

417be2a

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>

ggml: backward pass for split swiglu (llama/14483)

dfa89e7

vulkan: support mixed/deepseekR1 FA head sizes (llama/14509)

e9acb01

* vulkan: better parameterize FA by head sizes
* vulkan: support mixed/deepseekR1 FA head sizes

opencl : broadcast for soft_max (llama/14510)

dea036f

ggml : implement GEGLU_ERF and GEGLU_QUICK ops (llama/14445)

8fde826

CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (llama/14002)

fda1c64

Co-authored-by: luyuhong <luyuhong@kylinos.cn>

metal : disable fast math in all quantize kernels (llama/14528)

c1ee7c4

ggml-ci

opencl: add GELU_ERF (llama/14476)

5b2b458

vulkan: Handle updated FA dim2/3 definition (llama/14518)

3c5e104

* vulkan: Handle updated FA dim2/3 definition

  Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.

* handle null mask for gqa
* allow gqa with dim3>1
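Packing a boolean and a small integer into one 32-bit word, as this commit describes, can be sketched as follows. The bit layout used here (flag in bit 16, value in the low 16 bits) and all names are assumptions made for illustration; the commit message does not specify the layout used by the shader.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bit 16 = "mask present" flag, bits 0..15 = n_head_log2. */
#define MASK_FLAG_SHIFT  16u
#define N_HEAD_LOG2_BITS 0xFFFFu

static uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    assert(n_head_log2 <= N_HEAD_LOG2_BITS); /* value must fit in the low bits */
    return ((uint32_t) has_mask << MASK_FLAG_SHIFT) | n_head_log2;
}

static bool unpack_has_mask(uint32_t packed) {
    return ((packed >> MASK_FLAG_SHIFT) & 1u) != 0;
}

static uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & N_HEAD_LOG2_BITS;
}
```

Two fields that would otherwise occupy 8 bytes then take 4, which matters when a parameter block has a hard size budget such as the 128-byte limit mentioned above.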

vulkan: fix rms_norm+mul fusion (llama/14545)

0a8d97e

The fused operation was grabbing the epsilon value from the wrong place. Add an env var to disable fusion. Add some missing checks for supported shapes/types. Handle fused rms_norm+mul in check_results.

vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (llama/14485)

71bcb61

Commit taken from remyoudompheng's PR ggml-org/llama.cpp#12260 Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>

am17an and others added 20 commits July 12, 2025 16:22

CUDA: add bf16 and i32 to getrows (llama/14529)

602d77e

musa: fix build warnings (unused variable) (llama/14561)

f1bab24

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

CUDA: add bilinear interpolation for upscale (llama/14563)

2324247

cuda : fix rope with partial rotation and non-cont src (llama/14580)

5f4d05b

* cuda : fix rope non-cont

  ggml-ci

* cont : fix multi-rope + add test

  ggml-ci

* sycl : try fix

  ggml-ci

* cont : fix sycl + clean-up cuda

  ggml-ci

vulkan : fix rope with partial rotation and non-cont src (llama/14582)

d5684df

ggml : prevent integer overflow in gguf tensor size calculation (llama/14595)

6cfb733
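The commit title gives no detail about the fix, so the following is only a generic sketch of how a tensor byte size can be computed with overflow checks. The helper names `checked_mul_size` and `tensor_nbytes_checked` are hypothetical and do not describe the actual change in gguf.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Store a * b in *out and return true, or return false if the product
 * would not fit in size_t. */
static bool checked_mul_size(size_t a, size_t b, size_t * out) {
    assert(out != NULL);
    if (a != 0 && b > SIZE_MAX / a) {
        return false; /* a * b would overflow */
    }
    *out = a * b;
    return true;
}

/* Byte size of a 4-D tensor: type_size * ne[0] * ne[1] * ne[2] * ne[3].
 * Returns false on negative dimensions or overflow; *out is then unchanged. */
static bool tensor_nbytes_checked(const int64_t ne[4], size_t type_size, size_t * out) {
    size_t total = type_size;
    for (int i = 0; i < 4; i++) {
        if (ne[i] < 0 || (uint64_t) ne[i] > SIZE_MAX) {
            return false;
        }
        if (!checked_mul_size(total, (size_t) ne[i], &total)) {
            return false;
        }
    }
    *out = total;
    return true;
}
```

Checking each multiplication before performing it means that a file declaring absurdly large dimensions is rejected instead of producing a wrapped-around, too-small allocation size.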

cuda : support Falcon-H1 state size for SSM_SCAN (llama/14602)

c161799

opencl: add set_rows for f16 and f32 (llama/14547)

f5449b1

* opencl: add `set_rows` for `f16` and `f32`
* opencl: better choose workgroup size for `set_rows`

opencl: add tiled mul_mat_f16_f32 (llama/14535)

445bd88

* add tiled mul_mat_f16_f32
* fix trailing whitespace
* add insightful comments

HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (llama/14634)

56b0b32

sync : resolve conflicts (ggml/0)

d79d3c2

ggml-ci

sync : ggml

4bd83ef

talk-llama : sync llama.cpp

ee9f540

ggml-ci

sync : resolve conflicts (#0)

98509f6

ggml-ci

ggerganov merged commit 3775c50 into master Jul 12, 2025
63 checks passed

ggerganov deleted the sync-ggml-25-07-12 branch July 12, 2025 16:24
