
Commit f6d5e1a

Merge remote-tracking branch 'origin/master' into GraniteFourPerf
* origin/master: (49 commits)
  ci : correct label refactor->refactoring (ggml-org#14832)
  CUDA: fix quantized KV cache + multiple sequences (ggml-org#14822)
  tests : add non-cont K,V FA tests
  memory : handle saving/loading null layers in recurrent memory (ggml-org#14675)
  ggml: fix loongarch quantize_row_q8_1 error (ggml-org#14827)
  CANN: weight format to NZ for Ascend310P3 (ggml-org#14407)
  CUDA: add fused rms norm (ggml-org#14800)
  ggml : model card yaml tab->2xspace (ggml-org#14819)
  vulkan: fix rms_norm_mul to handle broadcasting dim0 (ggml-org#14817)
  llama : add model type detection for rwkv7 7B&14B (ggml-org#14816)
  imatrix: add option to display importance score statistics for a given imatrix file (ggml-org#12718)
  Mtmd: add a way to select device for vision encoder (ggml-org#14236)
  cuda : implement bf16 cpy ops and enable bf16 cont (ggml-org#14763)
  opencl: remove unreachable `return` (ggml-org#14806)
  server : allow setting `--reverse-prompt` arg (ggml-org#14799)
  cuda: remove linking to cublasLt (ggml-org#14790)
  opencl: fix `im2col` when `KW!=KH` (ggml-org#14803)
  opencl: add conv2d kernel (ggml-org#14403)
  sycl: Fix im2col (ggml-org#14797)
  kleidiai: add support for get_rows (ggml-org#14676)
  ...
2 parents: e55176a + 221c0e0


74 files changed: +4958 additions, -1195 deletions

.clang-format
Lines changed: 8 additions & 5 deletions

@@ -22,8 +22,8 @@ AllowShortIfStatementsOnASingleLine: Never
 AllowShortLambdasOnASingleLine: Inline
 AllowShortLoopsOnASingleLine: false
 AlwaysBreakBeforeMultilineStrings: true
-BinPackArguments: true
-BinPackParameters: true # OnePerLine
+BinPackArguments: false
+BinPackParameters: false # OnePerLine
 BitFieldColonSpacing: Both
 BreakBeforeBraces: Custom # Attach
 BraceWrapping:
@@ -70,15 +70,18 @@ ExperimentalAutoDetectBinPacking: false
 FixNamespaceComments: true
 IncludeBlocks: Regroup
 IncludeCategories:
-  - Regex: '^<.*\.h>'
+  - Regex: '".*"'
     Priority: 1
     SortPriority: 0
-  - Regex: '^<.*'
+  - Regex: '^<.*\.h>'
     Priority: 2
     SortPriority: 0
-  - Regex: '.*'
+  - Regex: '^<.*'
     Priority: 3
     SortPriority: 0
+  - Regex: '.*'
+    Priority: 4
+    SortPriority: 0
 IncludeIsMainRegex: '([-_](test|unittest))?$'
 IncludeIsMainSourceRegex: ''
 IndentAccessModifiers: false

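The reordered IncludeCategories gives quoted includes a bucket of their own: previously a quoted include only matched the '.*' catch-all and sorted last, whereas now it sorts first, followed by angle-bracket headers ending in .h, then the remaining angle-bracket headers, with '.*' as the trailing catch-all. A minimal sketch of the grouping clang-format now produces with IncludeBlocks: Regroup (the header names are illustrative, not taken from any particular file in the repository):

    #include "llama.h"    // Priority 1: '".*"'

    #include <stdint.h>   // Priority 2: '^<.*\.h>'

    #include <string>     // Priority 3: '^<.*'
    #include <vector>

    int main() { return 0; }  // Priority 4 ('.*') catches anything not matched above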
.devops/nix/package.nix
Lines changed: 2 additions & 1 deletion

@@ -47,6 +47,7 @@ let
   inherit (lib)
     cmakeBool
     cmakeFeature
+    optionalAttrs
     optionals
     strings
     ;
@@ -197,7 +198,7 @@ effectiveStdenv.mkDerivation (finalAttrs: {
   ];

   # Environment variables needed for ROCm
-  env = optionals useRocm {
+  env = optionalAttrs useRocm {
     ROCM_PATH = "${rocmPackages.clr}";
     HIP_DEVICE_LIB_PATH = "${rocmPackages.rocm-device-libs}/amdgcn/bitcode";
   };

.github/workflows/close-issue.yml
Lines changed: 1 addition & 1 deletion

@@ -17,7 +17,7 @@ jobs:
     steps:
       - uses: actions/stale@v5
         with:
-          exempt-issue-labels: "refactor,help wanted,good first issue,research,bug,roadmap"
+          exempt-issue-labels: "refactoring,help wanted,good first issue,research,bug,roadmap"
           days-before-issue-stale: 30
           days-before-issue-close: 14
           stale-issue-label: "stale"

CODEOWNERS
Lines changed: 1 addition & 0 deletions

@@ -9,3 +9,4 @@
 /ggml/src/ggml-cuda/mmvq.* @JohannesGaessler
 /ggml/src/ggml-opt.cpp @JohannesGaessler
 /ggml/src/gguf.cpp @JohannesGaessler
+/ggml/src/ggml-vulkan/ @0cc4m

README.md
Lines changed: 2 additions & 4 deletions

@@ -270,7 +270,6 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
 | [CANN](docs/build.md#cann) | Ascend NPU |
 | [OpenCL](docs/backend/OPENCL.md) | Adreno GPU |
 | [WebGPU [In Progress]](docs/build.md#webgpu) | All |
-
 | [RPC](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) | All |

 ## Obtaining and quantizing models
@@ -436,7 +435,7 @@ To learn more about model quantization, [read this documentation](tools/quantize

 ## [`llama-perplexity`](tools/perplexity)

-#### A tool for measuring the perplexity [^1][^2] (and other quality metrics) of a model over a given text.
+#### A tool for measuring the [perplexity](tools/perplexity/README.md) [^1] (and other quality metrics) of a model over a given text.

 - <details open>
   <summary>Measure the perplexity over a text file</summary>
@@ -459,8 +458,7 @@ To learn more about model quantization, [read this documentation](tools/quantize

 </details>

-[^1]: [tools/perplexity/README.md](./tools/perplexity/README.md)
-[^2]: [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity)
+[^1]: [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity)

 ## [`llama-bench`](tools/llama-bench)

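For reference, the perplexity the tool reports follows the standard definition also used in the Hugging Face article linked by the footnote: for a tokenized text x_1, ..., x_N,

    \mathrm{PPL} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i}) \right)

i.e. the exponentiated average negative log-likelihood per token; lower is better.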
common/arg.cpp
Lines changed: 8 additions & 1 deletion

@@ -1612,7 +1612,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         [](common_params & params, const std::string & value) {
             params.antiprompt.emplace_back(value);
         }
-    ).set_examples({LLAMA_EXAMPLE_MAIN}));
+    ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER}));
     add_opt(common_arg(
         {"-sp", "--special"},
         string_format("special tokens output enabled (default: %s)", params.special ? "true" : "false"),
@@ -2655,6 +2655,13 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
             params.i_chunk = value;
         }
     ).set_examples({LLAMA_EXAMPLE_IMATRIX}));
+    add_opt(common_arg(
+        {"--show-statistics"},
+        string_format("show imatrix statistics and then exit (default: %s)", params.show_statistics ? "true" : "false"),
+        [](common_params & params) {
+            params.show_statistics = true;
+        }
+    ).set_examples({LLAMA_EXAMPLE_IMATRIX}));
     add_opt(common_arg(
         {"--parse-special"},
         string_format("prase special tokens (chat, tool, etc) (default: %s)", params.parse_special ? "true" : "false"),

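Since the new --show-statistics option is registered with .set_examples({LLAMA_EXAMPLE_IMATRIX}), it is exposed only by the imatrix tool. A plausible invocation (the --in-file option and the file name are assumptions here, not part of this diff) is: llama-imatrix --in-file imatrix.dat --show-statistics, which prints per-tensor importance statistics for an existing imatrix file and exits without running a new computation.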
common/common.cpp
Lines changed: 9 additions & 0 deletions

@@ -448,6 +448,15 @@ void string_replace_all(std::string & s, const std::string & search, const std::
 bool string_ends_with(const std::string_view & str, const std::string_view & suffix) {
     return str.size() >= suffix.size() && str.compare(str.size()-suffix.size(), suffix.size(), suffix) == 0;
 }
+
+bool string_remove_suffix(std::string & str, const std::string_view & suffix) {
+    bool has_suffix = string_ends_with(str, suffix);
+    if (has_suffix) {
+        str = str.substr(0, str.size() - suffix.size());
+    }
+    return has_suffix;
+}
+
 size_t string_find_partial_stop(const std::string_view & str, const std::string_view & stop) {
     if (!str.empty() && !stop.empty()) {
         const char text_last_char = str.back();

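The new helper tests for a suffix and strips it in place in one call, returning whether anything was removed. A self-contained usage sketch (it copies the two helpers from the hunk above so it builds standalone; the example strings are illustrative):

    #include <cassert>
    #include <string>
    #include <string_view>

    // copied from common/common.cpp above so the sketch compiles on its own
    bool string_ends_with(const std::string_view & str, const std::string_view & suffix) {
        return str.size() >= suffix.size() && str.compare(str.size()-suffix.size(), suffix.size(), suffix) == 0;
    }

    bool string_remove_suffix(std::string & str, const std::string_view & suffix) {
        bool has_suffix = string_ends_with(str, suffix);
        if (has_suffix) {
            str = str.substr(0, str.size() - suffix.size());
        }
        return has_suffix;
    }

    int main() {
        std::string s = "hello world\n";

        // suffix present: stripped in place, returns true
        assert(string_remove_suffix(s, "\n") && s == "hello world");

        // suffix absent: string left untouched, returns false
        assert(!string_remove_suffix(s, "\n") && s == "hello world");
        return 0;
    }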
common/common.h
Lines changed: 5 additions & 3 deletions

@@ -432,9 +432,10 @@ struct common_params {
     int32_t n_save_freq = 0; // save the imatrix every n_save_freq iterations
     int32_t i_chunk = 0; // start processing from this chunk

-    bool process_output = false; // collect data for the output tensor
-    bool compute_ppl    = true;  // whether to compute perplexity
-    bool parse_special  = false; // whether to parse special tokens during imatrix tokenization
+    bool process_output  = false; // collect data for the output tensor
+    bool compute_ppl     = true;  // whether to compute perplexity
+    bool show_statistics = false; // show imatrix statistics per tensor
+    bool parse_special   = false; // whether to parse special tokens during imatrix tokenization

     // cvector-generator params
     int n_pca_batch = 100;
@@ -534,6 +535,7 @@ static bool string_starts_with(const std::string & str,

 // While we wait for C++20's std::string::ends_with...
 bool string_ends_with(const std::string_view & str, const std::string_view & suffix);
+bool string_remove_suffix(std::string & str, const std::string_view & suffix);
 size_t string_find_partial_stop(const std::string_view & str, const std::string_view & stop);

 bool string_parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides);

convert_hf_to_gguf.py
Lines changed: 161 additions & 1 deletion

@@ -843,6 +843,9 @@ def get_vocab_base_pre(self, tokenizer) -> str:
         if chkhsh == "169bf0296a13c4d9b7672313f749eb36501d931022de052aad6e36f2bf34dd51":
             # ref: https://huggingface.co/LiquidAI/LFM2-Tokenizer
             res = "lfm2"
+        if chkhsh == "2085e1638f6c377a0aa4ead21b27bb4cb941bf800df86ed391011769c1758dfb":
+            # ref: https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B
+            res = "exaone4"

         if res is None:
             logger.warning("\n")
@@ -2861,7 +2864,8 @@ def set_gguf_parameters(self):
     def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
         num_heads = self.hparams["num_attention_heads"]
         num_kv_heads = self.hparams["num_key_value_heads"]
-        head_dim = self.hparams["head_dim"]
+        if (head_dim := self.hparams.get("head_dim")) is None:
+            head_dim = self.hparams["hidden_size"] // num_heads

         if "ernie." in name:
             name = name.replace("ernie.", "model.")
@@ -2894,6 +2898,93 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter
         return [(self.map_tensor_name(name), data_torch)]


+@ModelBase.register("Ernie4_5_MoeForCausalLM")
+class Ernie4_5MoeModel(Ernie4_5Model):
+    model_arch = gguf.MODEL_ARCH.ERNIE4_5_MOE
+    _experts: list[dict[str, Tensor]] | None = None
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self._experts = [{} for _ in range(self.block_count)]
+
+    def set_gguf_parameters(self):
+        super().set_gguf_parameters()
+        self.gguf_writer.add_expert_count(self.hparams["moe_num_experts"])
+        self.gguf_writer.add_expert_used_count(self.hparams["moe_k"])
+        self.gguf_writer.add_interleave_moe_layer_step(self.hparams["moe_layer_interval"])
+        self.gguf_writer.add_leading_dense_block_count(self.hparams["moe_layer_start_index"])
+        if (moe_intermediate_size := self.hparams.get("moe_intermediate_size")) is not None:
+            self.gguf_writer.add_expert_feed_forward_length(moe_intermediate_size)
+        if (shared_expert_count := self.hparams.get('moe_num_shared_experts')) is not None:
+            self.gguf_writer.add_expert_shared_count(shared_expert_count)
+            if shared_expert_count > 0 and (shared_expert_intermediate_size := self.hparams.get('intermediate_size')) is not None and (num_key_value_heads := self.hparams.get('num_key_value_heads')) is not None:
+                self.gguf_writer.add_expert_shared_feed_forward_length(shared_expert_intermediate_size // num_key_value_heads)
+
+    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
+        # Modify correction bias name as in DeepseekV2
+        if name.endswith("e_score_correction_bias"):
+            name = name.replace("e_score_correction_bias", "e_score_correction.bias")
+
+        # skip Multi-Token Prediction (MTP) layers (again, same as DeepseekV2)
+        match = re.match(r"model.mtp_block.(\d+)", name)
+        if match:
+            return []
+
+        # skip all other MTP tensors for now
+        match = re.match(r"model.mtp_emb_norm.(\d+)", name)
+        if match:
+            return []
+
+        match = re.match(r"model.mtp_hidden_norm.(\d+)", name)
+        if match:
+            return []
+
+        match = re.match(r"model.mtp_linear_proj.(\d+)", name)
+        if match:
+            return []
+
+        # process the experts separately
+        if name.find("mlp.experts") != -1:
+            n_experts = self.hparams["moe_num_experts"]
+            assert bid is not None
+
+            if self._experts is None:
+                self._experts = [{} for _ in range(self.block_count)]
+
+            self._experts[bid][name] = data_torch
+
+            if len(self._experts[bid]) >= n_experts * 3:
+                tensors: list[tuple[str, Tensor]] = []
+
+                # merge the experts into a single 3d tensor
+                for w_name in ["gate_proj", "up_proj", "down_proj"]:
+                    datas: list[Tensor] = []
+
+                    for xid in range(n_experts):
+                        ename_to_retrieve = f"model.layers.{bid}.mlp.experts.{xid}.{w_name}.weight"
+                        datas.append(self._experts[bid][ename_to_retrieve])
+                        del self._experts[bid][ename_to_retrieve]
+
+                    data_torch = torch.stack(datas, dim=0)
+                    merged_name = f"model.layers.{bid}.mlp.experts.{w_name}.weight"
+                    new_name = self.map_tensor_name(merged_name)
+                    tensors.append((new_name, data_torch))
+
+                return tensors
+            else:
+                return []
+        return [(self.map_tensor_name(name), data_torch)]
+
+    def prepare_tensors(self):
+        super().prepare_tensors()
+
+        if self._experts is not None:
+            # flatten `list[dict[str, Tensor]]` into `list[str]`
+            experts = [k for d in self._experts for k in d.keys()]
+            if len(experts) > 0:
+                raise ValueError(f"Unprocessed experts: {experts}")
+
+
 @ModelBase.register(
     "Qwen2VLModel",
     "Qwen2VLForConditionalGeneration",
@@ -6692,6 +6783,75 @@ def generate_extra_tensors(self) -> Iterable[tuple[str, Tensor]]:
         yield (self.format_tensor_name(gguf.MODEL_TENSOR.ROPE_FREQS), torch.tensor(rope_factors, dtype=torch.float32))


+@ModelBase.register("Exaone4ForCausalLM")
+class Exaone4Model(TextModel):
+    model_arch = gguf.MODEL_ARCH.EXAONE4
+
+    def set_vocab(self):
+        tokens, toktypes, tokpre = self.get_vocab_base()
+        self.gguf_writer.add_tokenizer_model("gpt2")
+        self.gguf_writer.add_tokenizer_pre(tokpre)
+        self.gguf_writer.add_token_list(tokens)
+        self.gguf_writer.add_token_types(toktypes)
+
+        special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
+        special_vocab.add_to_gguf(self.gguf_writer)
+
+    def set_gguf_parameters(self):
+        super().set_gguf_parameters()
+        hparams = self.hparams
+        self.gguf_writer.add_vocab_size(hparams["vocab_size"])
+
+        if hparams.get("sliding_window") is not None:
+            self.gguf_writer.add_sliding_window(hparams["sliding_window"])
+            if "layer_types" in hparams:
+                self.gguf_writer.add_sliding_window_pattern([t == "sliding_attention" for t in hparams["layer_types"]])
+            elif "sliding_window_pattern" in hparams:
+                sliding_window_pattern = []
+                if isinstance(hparams["sliding_window_pattern"], str):  # e.g. LLLG
+                    for i in range(hparams["num_hidden_layers"]):
+                        sliding_window_pattern.append(hparams["sliding_window_pattern"][i % len(hparams["sliding_window_pattern"])] == "L")
+                if isinstance(hparams["sliding_window_pattern"], int):  # e.g. 4
+                    for i in range(hparams["num_hidden_layers"]):
+                        sliding_window_pattern.append((i + 1) % hparams["sliding_window_pattern"] != 0)
+                if len(sliding_window_pattern) == hparams["num_hidden_layers"]:
+                    self.gguf_writer.add_sliding_window_pattern(sliding_window_pattern)
+
+        rope_scaling = self.hparams.get("rope_scaling") or {}
+        if rope_scaling.get("rope_type", rope_scaling.get("type")) == "linear" and "factor" in rope_scaling:
+            self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.LINEAR)
+            self.gguf_writer.add_rope_scaling_factor(rope_scaling["factor"])
+
+    def generate_extra_tensors(self) -> Iterable[tuple[str, Tensor]]:
+        if rope_scaling := self.find_hparam(["rope_scaling"], optional=True):
+            if rope_scaling.get("rope_type", '').lower() == "llama3":
+                base = self.hparams.get("rope_theta", 10_000.0)
+                if (dim := self.hparams.get("head_dim")) is None:
+                    dim = self.hparams["hidden_size"] // self.hparams["num_attention_heads"]
+                freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+
+                factor = rope_scaling.get("factor", 16.0)
+                low_freq_factor = rope_scaling.get("low_freq_factor", 1.0)
+                high_freq_factor = rope_scaling.get("high_freq_factor", 4.0)
+                old_context_len = self.hparams.get("original_max_position_embeddings", 8192)
+
+                low_freq_wavelen = old_context_len / low_freq_factor
+                high_freq_wavelen = old_context_len / high_freq_factor
+
+                rope_factors = []
+                for freq in freqs:
+                    wavelen = 2 * math.pi / freq
+                    if wavelen < high_freq_wavelen:
+                        rope_factors.append(1)
+                    elif wavelen > low_freq_wavelen:
+                        rope_factors.append(factor)
+                    else:
+                        smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
+                        rope_factors.append(1 / ((1 - smooth) / factor + smooth))
+
+                yield (self.format_tensor_name(gguf.MODEL_TENSOR.ROPE_FREQS), torch.tensor(rope_factors, dtype=torch.float32))
+
+
 @ModelBase.register("GraniteForCausalLM")
 class GraniteModel(LlamaModel):
     """Conversion for IBM's GraniteForCausalLM"""

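The rope_factors loop in Exaone4Model.generate_extra_tensors implements the llama3-style long-context frequency scaling. Written out with the same names as in the code, each base frequency f_i has wavelength \lambda_i = 2\pi / f_i, and the stored per-dimension factor is

    \text{factor}_i =
    \begin{cases}
        1 & \text{if } \lambda_i < L_{\text{old}} / h \\
        s & \text{if } \lambda_i > L_{\text{old}} / \ell \\
        \left( \frac{1 - \mu_i}{s} + \mu_i \right)^{-1} & \text{otherwise, with } \mu_i = \frac{L_{\text{old}} / \lambda_i - \ell}{h - \ell}
    \end{cases}

where s is factor, \ell is low_freq_factor, h is high_freq_factor, and L_old is original_max_position_embeddings; the resulting vector is emitted as the ROPE_FREQS tensor.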
convert_hf_to_gguf_update.py
Lines changed: 1 addition & 0 deletions

@@ -129,6 +129,7 @@ class TOKENIZER_TYPE(IntEnum):
     {"name": "a.x-4.0", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/skt/A.X-4.0", },
     {"name": "midm-2.0", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/K-intelligence/Midm-2.0-Base-Instruct", },
     {"name": "lfm2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/LiquidAI/LFM2-Tokenizer"},
+    {"name": "exaone4", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B", },
 ]

 # some models are known to be broken upstream, so we will skip them as exceptions
