sync : ggml #3329

ggerganov · 2025-07-19T14:49:10Z

No description provided.

Remove unnecessary templates from class definition and packing functions
Reduce deeply nested conditionals and if-else switching in the mnpack function
Replace repetitive code with inline functions in packing functions

2 ~ 7% improvement in Q8 Model
15 ~ 50% improvement in Q4 Model

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>

* kv-cache : prepare K/V buffers for separation ggml-ci
* batched-bench : fix oob write ggml-ci
* llama : add "virtual sequences" ggml-ci
* llama : use "stream" vs "virtual sequence" ggml-ci
* graph : fix stream splitting when KV cache is not used ggml-ci
* kv-cache : add multi-stream save/load support ggml-ci
* llama : add "--attn-streams" flag ggml-ci
* kv-cache : fix handling when find_slot fails ggml-ci
* kv-cache : restore find_slot impl ggml-ci
* kv-cache : add comments
* kv-cache : add bounds checks for sequence id ggml-ci
* cont : add n_seq_max to batch allocr ggml-ci
* kv-cache : perform stream copies lazily after llama_synchronize ggml-ci
* kv-cache : avoid throwing exceptions across the C boundary ggml-ci
* CUDA: 4D FlashAttention support (llama/14628)
* CUDA: 4D FlashAttention support
* CUDA: fix WMMA FA kernel
* llama : rename attn_streams -> kv_unified ggml-ci
* common : rename kv_split -> kv_unified ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Minimal setup of webgpu backend with dawn. Just prints out the adapter and segfaults
* Initialize webgpu device
* Making progress on setting up the backend
* Finish more boilerplate/utility functions
* Organize file and work on alloc buffer
* Add webgpu_context to prepare for actually running some shaders
* Work on memset and add shader loading
* Work on memset polyfill
* Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support it
* Implement get_tensor and buffer_clear
* Finish rest of setup
* Start work on compute graph
* Basic mat mul working
* Work on emscripten build
* Basic WebGPU backend instructions
* Use EMSCRIPTEN flag
* Work on passing ci, implement 4d tensor multiplication
* Pass thread safety test
* Implement permuting for mul_mat and cpy
* minor cleanups
* Address feedback
* Remove division by type size in cpy op
* Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends
* Fix name
* Fix macos dawn prefix path

* Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs

  Gemma3n uses Matrix-Matrix addition as part of their input processing, wrongly triggering CUDA_GRAPH disablement on NVGPUs even when batch-size of 1 is used.
* Exclude `project_per_layer_input` by matching node names

  This ensures that all other graphs which don't exhibit this pattern do not have their behavior changed.
* Revert unnecessary formatting changes
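
The name-based exclusion described in this commit message can be illustrated with a small host-side sketch. Everything here is an assumption made for illustration: the `Node` struct, the `Op` enum, and `use_cuda_graph` are hypothetical stand-ins and not the ggml CUDA backend's actual types or functions. The sketch only shows the idea that a batched ADD normally disables graph capture unless the node name matches `project_per_layer_input`.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a compute-graph node.
enum class Op { Add, MulMat, Other };

struct Node {
    Op          op;
    std::string name;
    int64_t     src1_rows;  // assumed: number of rows (batch size) of the second operand
};

// Returns false when a node looks like a batched addition, unless the
// node is explicitly excluded by name (the pattern named in the commit).
static bool use_cuda_graph(const std::vector<Node>& nodes) {
    for (const Node& n : nodes) {
        const bool batched_add = n.op == Op::Add && n.src1_rows > 1;
        const bool excluded =
            n.name.find("project_per_layer_input") != std::string::npos;
        if (batched_add && !excluded) {
            return false;  // treat as batch size > 1: disable graph capture
        }
    }
    return true;
}
```

Matching on the node name keeps the check narrow, so graphs that do not contain this pattern take the same path as before.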

am17an and others added 20 commits July 19, 2025 17:47

CUDA: add set rows for f32 and f16 (llama/14551)

03e85da

* CUDA: add set rows for f32 and f16
* Review: change kernel params, use strides from host
* Use 1-d kernel
* Review: use int64_t for blockDim.x, rename nb->s for clarity
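
As a rough illustration of what a 1-D set_rows kernel with host-provided strides does, here is a CPU emulation in C++. It is a sketch under stated assumptions and not the code from this commit: `set_rows_1d` and `ceil_div` are hypothetical names, the tensors are reduced to 2-D, and strides are assumed to be in elements rather than bytes. Each flat index `i` plays the role of one GPU thread and is decomposed into a row and a column; the source row is then written to the destination row selected by `idx`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: number of blocks needed to cover n items.
static int64_t ceil_div(int64_t n, int64_t block) {
    return (n + block - 1) / block;
}

// CPU emulation of a 1-D launch: dst[idx[row], col] = src[row, col].
// s_src_row and s_dst_row are row strides in elements, computed by the caller.
static void set_rows_1d(const std::vector<float>&   src,
                        const std::vector<int64_t>& idx,
                        std::vector<float>&         dst,
                        int64_t n_rows, int64_t n_cols,
                        int64_t s_src_row, int64_t s_dst_row,
                        int64_t block_size) {
    const int64_t total    = n_rows * n_cols;
    const int64_t n_blocks = ceil_div(total, block_size);
    for (int64_t b = 0; b < n_blocks; ++b) {          // emulates blockIdx.x
        for (int64_t t = 0; t < block_size; ++t) {    // emulates threadIdx.x
            const int64_t i = b * block_size + t;     // 64-bit flat index
            if (i >= total) {
                continue;                             // out-of-range guard
            }
            const int64_t row = i / n_cols;
            const int64_t col = i % n_cols;
            dst[idx[row] * s_dst_row + col] = src[row * s_src_row + col];
        }
    }
}
```

For example, with a 2x3 source, `idx = {2, 0}`, and a 3x3 destination, source row 0 lands in destination row 2 and source row 1 lands in destination row 0. Keeping the flat index in `int64_t` avoids overflow when the element count exceeds the 32-bit range.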

metal : Add missing unary ops Metal support (llama/14660)

a6b85bc

ggml : add build-time message to remind about ggml_set_rows (llama/14661)

8610c4c

ggml-ci

cuda : add ELU support (llama/14657)

e6509f7

cuda : add set rows for bf16 (llama/14664)

6055fb4

sycl: Batched mulmat rework for oneDNN dispatch (llama/14617)

24643c0

SYCL: use 1D kernel for set_rows (llama/14618)

e5d8efc

* SYCL: Use 1D kernel for set_rows
* Remove dangling comment
* Refactor and use ceil_div

sycl: Hotfix for non dnnl codepath (llama/14677)

72dae6b

cuda: fix build warnings in set-rows.cu (unused variable) (llama/14687)

591bc24

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

vulkan: add RTE variants for glu/add/sub/mul/div (llama/14653)

b2653a9

vulkan: fix noncontig check for mat_mul_id splitting (llama/14683)

bf13b82

* vulkan: fix noncontig check for mat_mul_id splitting

  Remove supports_op check for > 4096 (splitting fixes this)
* vulkan: fix batched matmul dequant for Q*_K

ggml : add asserts (llama/14720)

26f3de1

* ggml : add asserts ggml-ci
* cont : fix constant type

  Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>

use max work group size for device to replace the magic number (llama/14732)

9664629

CUDA: set_rows + cpy.cu refactor (llama/14712)

7df67c2

metal : fuse add, mul + add tests (llama/14596)

f476d9b

ggml-ci

sync : ggml

74b5c27

ggml-ci

ggerganov merged commit c0dc391 into master Jul 19, 2025
63 checks passed

ggerganov deleted the sync-ggml-25-07-19 branch July 19, 2025 21:23
