add DeepseekV3 AWQ mapping #1619
base: main
Conversation
Summary of Changes
Hello @cjackal, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces support for quantizing DeepseekV3 models using the AWQ (Activation-aware Weight Quantization) method. It adds the activation-smooth mappings required for DeepseekV3ForCausalLM within the llmcompressor framework, enabling more efficient deployment of these models.
Highlights
- DeepseekV3 AWQ Support: I've added specific AWQMapping configurations for DeepseekV3ForCausalLM to enable activation-smooth quantization. These mappings define the relationships between the normalization and projection layers (e.g., input_layernorm to q_a_proj and kv_a_proj_with_mqa, up_proj to down_proj) that are crucial for the AWQ algorithm; see the sketch after this list.
- Registry Update: I've registered the new DeepseekV3 mappings in the AWQ_MAPPING_REGISTRY to ensure DeepseekV3ForCausalLM models can correctly utilize the AWQ modifier within the llmcompressor framework.
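For reviewers unfamiliar with the mapping format, here is a minimal sketch of what such a registration could look like. The regexes, variable names, and import path are assumptions inferred from the layer pairs listed above and from how other architectures are registered; they are not copied from this PR's diff.

```python
# Hypothetical sketch -- the exact mappings in the PR may differ.
from llmcompressor.modifiers.awq.mappings import AWQMapping, AWQ_MAPPING_REGISTRY

# Smooth each norm (or preceding projection) into the layers that consume its output.
# DeepseekV3's MLA attention uses q_a/q_b and kv_a/kv_b projections instead of q/k/v.
_deepseek_v3_mappings = [
    AWQMapping("re:.*input_layernorm$", ["re:.*q_a_proj$", "re:.*kv_a_proj_with_mqa$"]),
    AWQMapping("re:.*q_a_layernorm$", ["re:.*q_b_proj$"]),
    AWQMapping("re:.*kv_a_layernorm$", ["re:.*kv_b_proj$"]),
    AWQMapping("re:.*post_attention_layernorm$", ["re:.*gate_proj$", "re:.*up_proj$"]),
    AWQMapping("re:.*up_proj$", ["re:.*down_proj$"]),
]

# Keyed by the Hugging Face architecture name so the AWQ modifier can resolve it.
AWQ_MAPPING_REGISTRY["DeepseekV3ForCausalLM"] = _deepseek_v3_mappings
```

Once registered, the AWQ modifier can look up the mappings from the model's architecture name, as described in the Registry Update item above.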
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request adds support for Activation-aware Weight Quantization (AWQ) for the DeepseekV3 model architecture. This is achieved by defining a new set of layer mappings specific to DeepseekV3 and registering them. The changes are clear and follow the existing structure for defining architecture-specific mappings. I have one suggestion to improve code maintainability.
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
It correctly counts the number of calibrations (14971, the same number as …).
It seems like GPTQ does not exploit …
Thank you @cjackal for the contribution! Please let us know how it goes with torch.compile off; I am not sure why you are hitting device-on-meta errors in this case.
Any updates on this? With the new Kimi K2 release, there is a lot of renewed interest in quantizing the DeepSeek V3 architecture.
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Hi @casper-hansen, we were talking about Kimi K2 internally today and whether it's feasible to run AWQ or our other compression algorithms on a single GPU, layer-by-layer. These mappings look correct, though I haven't had a chance to validate on DeepSeek or K2. But it uses the same … I am trying to validate @cjackal's example script to try to get this PR in soon.
@brian-dellabetta DeepSeek V3 and R1 were quantized successfully in AutoAWQ. You do need a machine with a lot of system RAM, but it works just like any other model. We also had to convert to bfloat16 before quantizing.
Sorry for the late response; it wasn't the … I am debugging it, but I'd first like to share my findings here so that the repo maintainers have a chance to give quick guidance on the fix. The error is first hit when the AWQ weight mapping has …
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Hi @cjackal, @casper-hansen, I had to push some updates to cjackal's branch to prevent GPU OOM errors, but it is working through the layers with experts now.

This is hitting about ~60GB peak GPU RAM (128 samples with 512 max sequence length). Extrapolating, this would take 8-10 hours on an H100, but I'm running on a noisy server with lots of other processes running. @casper-hansen, do you recall what memory/time requirements AutoAWQ had for DeepseekV3 with 128 samples and 512 max sequence length?

@cjackal, I made a few modifications to your script, along with merging into this branch some new changes we've recently pushed to main; that resolves the error you were hitting. My script is attached below, feel free to try it out. I may need to switch gears for the next few days, but will keep an eye on this thread and can revisit towards the end of the week.

```python
import torch
from datasets import load_dataset
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modeling import prepare_for_calibration
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor import oneshot
# Select model and load it.
# This script takes about 48 hours on 1xA100 to complete.
# Future improvements will reduce this runtime (#1561, #1558).
# For DeepSeek-R1, we require a full precision model in order to properly calibrate
# `DeepSeek-R1-0528-BF16` is a DeepSeek-V3 FP8 model which has been converted to BF16
model_id = "unsloth/DeepSeek-R1-0528-BF16"
config = AutoConfig.from_pretrained(model_id)
if hasattr(config, "quantization_config"):
    del config.quantization_config  # fp8 qconfig no longer applies to bf16 model
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", config=config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = prepare_for_calibration(model)
# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 128
MAX_SEQUENCE_LENGTH = 512
# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)
def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }
ds = ds.map(preprocess)
# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )
ds = ds.map(tokenize, remove_columns=ds.column_names)
# Configure the quantization algorithm to run.
# since the MoE gate layers are sensitive to quantization, we add them to the ignore
# list so they remain at full precision
recipe = AWQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:.*mlp.gate$"],
    offload_device=torch.device("cpu"),
)
# Apply algorithms.
# model is loaded sequentially, automatically onto GPU
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
@brian-dellabetta Let me test again with the head commit and remove the draft tag from this PR if successful.
@brian-dellabetta It took about 24 hours to quantize the full model. The first few layers go fast. I don't have memory stats.
Thanks @cjackal and @casper-hansen for the information! Feeling good that our implementation seems to be working with such a large model. @cjackal, please let me know how it goes; we are looking into K2 as well.
Branch force-pushed from b737681 to 5288bec.
….models` Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
So sequential onloading was the culprit in my test script; I shouldn't have copy-pasted the GPTQ script blindly. @brian-dellabetta With your test script I got past the 9th layer in about an hour (I also run on an H100, but mine is fully isolated), which looks rosy. Let me mark this PR as ready and wait for the job to complete. BTW, current main is incompatible with …
Thanks @cjackal, glad to hear it's working well now. Thanks for the heads-up on the transformers issue; 4.52.0 broke some other things for us, so we recommend always being on the latest version even though our pins are very loose.
One nit, otherwise this looks good. I can approve CI/CD to run.
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
…nsformers versions Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
Thank you for the contribution and tests! I will kick off CI/CD and hopefully get this in soon
```python
from transformers.models.llama4.configuration_llama4 import (
    Llama4Config,
    Llama4TextConfig,
)
```
Note: these changes allow a successful run with the older transformers versions permitted by our transformers>4 pin.
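As an illustration only (the actual change in this PR's diff is not shown here), one common way to keep such an import working on older transformers releases that do not ship the llama4 module is a guarded import:

```python
# Hypothetical compatibility guard -- not necessarily what this PR does.
try:
    from transformers.models.llama4.configuration_llama4 import (
        Llama4Config,
        Llama4TextConfig,
    )
except ImportError:  # older transformers without llama4 support
    Llama4Config = None
    Llama4TextConfig = None
```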
SUMMARY:
Add AWQ activation-smooth mapping for DeepseekV3ForCausalLM.

TEST PLAN:
examples/quantizing_moe/deepseek_r1_example.py, but with the recipe adapted to use AWQModifier instead:
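For reference, a minimal sketch of that recipe swap (the exact arguments used in the test run are not shown in this thread; the values below mirror the script posted earlier in the conversation):

```python
from llmcompressor.modifiers.awq import AWQModifier

# AWQ recipe in place of the GPTQModifier used in the original example;
# lm_head and the sensitive MoE gate layers are kept at full precision.
recipe = AWQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:.*mlp.gate$"],
)
```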