[Performance] Add memory compression and decompression pathways #301
Conversation
Review comment on map_module_to_scheme:
Is this ready for review? Still a draft.
Resolved review thread (outdated): src/compressed_tensors/compressors/model_compressors/model_compressor.py
Could we add a test that compresses a model with sparsity + quantization?
LGTM pending conflict resolution, good work!
Resolved review thread (outdated): src/compressed_tensors/compressors/model_compressors/model_compressor.py
Looks good pending verification that sparse-only models can be compressed using these changes!
Why do we need to use CompressedLinear for compression? What if we're compressing something that isn't a linear layer?
Force-pushed the …-compression-memory branch from 0e9544d to b2cad7e.
Can you share sparse + fp8 model recipes where we have non-uniform sparsity and/or quantization? cc @kylesayrs
It's beautiful, Kyle 🥇. Love the detailed summary and the charts showing the improvement.
Resolved review thread: src/compressed_tensors/compressors/model_compressors/model_compressor.py
LGTM!
LGTM!
Resolved review thread: src/compressed_tensors/compressors/model_compressors/model_compressor.py
…-project#301)
* Implement memory compression and decompression
* Perform ops on CPU, move back to module device
* Add mixed tests
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
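The second commit bullet above describes a device-management pattern: do the memory-heavy compression work on CPU, then move the result back to the owning module's device. A minimal illustrative sketch follows; the fp16 cast is a stand-in for a real compression op and `compress_param_on_cpu` is a hypothetical helper, not part of compressed-tensors.

```python
# Illustrative sketch of "perform ops on CPU, move back to module device".
# The fp16 cast stands in for a real compression op.
import torch


def compress_param_on_cpu(module: torch.nn.Module, name: str = "weight") -> torch.Tensor:
    param = getattr(module, name)
    original_device = param.device

    # Do the memory-heavy work on CPU so device memory does not spike
    compressed = param.data.to("cpu").to(torch.float16)  # stand-in op

    # Move the result back to the module's device before re-registering it
    return compressed.to(original_device)


if __name__ == "__main__":
    layer = torch.nn.Linear(8, 8)
    out = compress_param_on_cpu(layer)
    print(out.dtype, out.device)
```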
Purpose
Add in-memory compression and decompression pathways that act on a model directly, rather than on a state dict or a model on disk, reducing peak memory usage during compression and decompression.
Memory Visualization
Compression Memory Improvement (stacked memory-usage chart)
Model Compression and Decompression (stacked memory-usage chart)
Demonstration Script
Prerequisites
Changes
- Add `compress_model` and `decompress_model`, which both act on a model in memory rather than a state dict or model on disk; `compress_model` compresses each module independently (a usage sketch follows this list)
- Implement `show_progress` on `compress` methods to squelch tqdm prints for each module compression
- Implement `decompress_from_state_dict` for sparsity compressors
- Extend `get_nested_mappings_from_state_dict` to support returning unmatched params, similar to `get_nested_weight_mappings`
- Fix `decompress_from_state_dict`, where the scheme was fetched instead of the weight args
- Fix `weight_name`, which was referring to a module path, not a weight name
- Remove the `remove_suffix` util, which can be replaced with `str.removesuffix` as of Python 3.9+ (which is the minimum we support; double check with @dsikka @rahul-tuli)
- Use `get_execution_device` when initializing params for `CompressedLinear`
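As a rough usage sketch of the new pathways (not the PR's demonstration script; the `ModelCompressor.from_pretrained` loader call, the import path, and the model path are assumptions that may need adjusting to your version of compressed-tensors):

```python
# Sketch: round-trip a model through the new in-memory pathways.
# MODEL_PATH is a hypothetical checkpoint whose config carries a
# compression/quantization config; loader arguments are assumptions.
import torch
from transformers import AutoModelForCausalLM
from compressed_tensors.compressors import ModelCompressor

MODEL_PATH = "path/to/quantized-or-sparse-model"

model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float16)
compressor = ModelCompressor.from_pretrained(MODEL_PATH)

# Compress each module independently, in place, without materializing a
# full state dict or writing to disk
compressor.compress_model(model)

# Restore dense weights, again module by module and in memory
compressor.decompress_model(model)
```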
Testing
- Add `test_compress_model`, which tests that in-memory compression is equivalent to dict compression (a sketch follows this list)
- Add `test_decompress_model`, which tests that HfQuantizer decompression (from disk) is equivalent to decompression from memory
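A minimal sketch of the equivalence check that `test_compress_model` describes (the `model` and `compressor` fixtures and the `compress(model, state_dict=...)` call are assumptions, not the PR's actual test code):

```python
# Sketch of the memory-vs-dict equivalence check; `model` and `compressor`
# are assumed pytest fixtures, and the compress(...) signature is an
# assumption about the existing state-dict pathway.
import torch


def test_compress_model(model, compressor):
    # Reference: compress via the existing state-dict pathway
    reference = compressor.compress(model, state_dict=model.state_dict())

    # New pathway: compress the model in memory, module by module
    compressor.compress_model(model)
    compressed = model.state_dict()

    # The in-memory result should match the dict-compressed tensors
    assert compressed.keys() == reference.keys()
    for key in reference:
        assert torch.equal(compressed[key].cpu(), reference[key].cpu())
```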