This repository was archived by the owner on Jun 24, 2024. It is now read-only.
Behavior when missing quantization version #447
The problem is shown below. It turns out the converted file didn't include the "general.quantization_version" metadata. When llama.cpp reads a file without that key, it assumes version 2 (grep for the line gguf_set_val_u32(ctx_out, "general.quantization_version", GGML_QNT_VERSION);), so this model works with llama.cpp but fails with rustformers/llm.
model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(local_dir)
torch.save(model.state_dict(), os.path.join(local_dir, "pytorch_model.bin"))
python llm/crates/ggml/sys/llama-cpp/convert.py models/ --vocab-dir models/ --ctx 4096 --outtype q8_0
let model = llm::load(
path,
llm::TokenizerSource::Embedded,
parameters,
llm::load_progress_callback_stdout,
)
.unwrap_or_else(|err| panic!("Failed to load model: {err}"));
thread '<unnamed>' panicked at llm/inference/src/llms/local/llama2.rs:45:35:
Failed to load model: quantization version was missing, despite model containing quantized tensors
My workaround was to just comment out this whole block:
let any_quantized = gguf
.tensor_infos
.values()
.any(|t| t.element_type.is_quantized());
// if any_quantized {
// match quantization_version {
// Some(MetadataValue::UInt32(2)) => {
// // Currently supported version
// }
// Some(quantization_version) => {
// return Err(LoadError::UnsupportedQuantizationVersion {
// quantization_version: quantization_version.clone(),
// })
// }
// None => return Err(LoadError::MissingQuantizationVersion),
// }
// }
I'm unsure how you want to handle this, since it does remove a check.
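One alternative that keeps the check would be to mirror llama.cpp and assume version 2 when the key is missing, only erroring on an explicitly unsupported value. A rough sketch, untested, reusing the names from the block above (quantization_version, MetadataValue, LoadError):
if any_quantized {
    match quantization_version {
        // Currently supported version.
        Some(MetadataValue::UInt32(2)) => {}
        // Explicit but unsupported version: still an error.
        Some(quantization_version) => {
            return Err(LoadError::UnsupportedQuantizationVersion {
                quantization_version: quantization_version.clone(),
            })
        }
        // Missing key: assume version 2, matching llama.cpp's
        // GGML_QNT_VERSION default, instead of failing the load.
        None => {}
    }
}
That would keep the guard against genuinely unsupported versions while accepting files from converters that omit the key, like the convert.py run above.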