
[Performance] Reduce compression memory requirements via structure change #301


Open · kylesayrs wants to merge 34 commits into main

Conversation

@kylesayrs (Contributor) commented Apr 16, 2025

Purpose

  • Reduce memory requirements when compressing a model

Allocating a separate compressed state dict (current):

[memory profile: fp8_gpu]

Replacing the model structure (new):

[memory profile: fp8_gpu_replace]

Prerequisites

Changes

  • Implement apply_compression_status, which converts a model's Linear layers to CompressedLinear modules
  • Implement the utility module_replace_dfs, which is likely more performant than replacing modules with the existing replace_module utility (see the sketch after this list)
  • Initialize CompressedLinear parameters on the execution device
    • Ensures compatibility when converting a model with offloaded modules
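
For illustration, here is a minimal sketch of what a depth-first, bottom-up module replacement could look like. This is a hypothetical reconstruction, not the PR's actual implementation: the real module_replace_dfs signature may differ, and apply_compression_status would pass a callback that swaps eligible Linear layers for CompressedLinear rather than the toy callback used here.

from typing import Callable

import torch

def module_replace_dfs(
    module: torch.nn.Module,
    func: Callable[[torch.nn.Module], torch.nn.Module],
) -> torch.nn.Module:
    # Visit children first so the whole tree is rewritten bottom-up
    # in a single traversal
    for name, child in list(module.named_children()):
        replaced = module_replace_dfs(child, func)
        if replaced is not child:
            setattr(module, name, replaced)
    # Finally give the callback a chance to replace this module itself
    return func(module)

# Runnable toy example: swap every Dropout for Identity
def drop_dropout(module: torch.nn.Module) -> torch.nn.Module:
    if isinstance(module, torch.nn.Dropout):
        return torch.nn.Identity()
    return module

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Dropout(0.1))
model = module_replace_dfs(model, drop_dropout)
assert isinstance(model[1], torch.nn.Identity)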

Testing

  • Added test_apply_compression_status

kylesayrs added 15 commits April 2, 2025 18:34
@kylesayrs kylesayrs changed the title [WIP]: Simplify map_module_to_scheme [WIP]: Reduce memory requirements Apr 21, 2025
@kylesayrs kylesayrs changed the base branch from main to kylesayrs/map_module_to_scheme April 22, 2025 20:22
@kylesayrs kylesayrs changed the title [WIP]: Reduce memory requirements [Performance] Reduce compression memory requirements via structure change Apr 22, 2025
@dsikka (Collaborator) left a comment:

Is this ready for review? It is still marked as a draft.

@@ -362,6 +364,29 @@ def get_unexpected_file_keys(self, model: Module) -> List[str]:

        return list(unexpected_keys)

    def apply_compression_status(self, model: Module) -> Module:
Collaborator:

Where does this get applied?

@kylesayrs (Contributor, Author) commented Apr 23, 2025:

After this lands, we'll call model_compressor.apply_compression_status(model) within the overridden save_pretrained function. In the future, this will be used by or folded into apply_compression_status.
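
For context, a rough sketch of that planned call site, using hypothetical names (model_compressor stands for a ModelCompressor instance; the real hook is the overridden save_pretrained in the consuming library, which may differ):

def save_compressed(model, model_compressor, save_directory: str):
    # Swap Linear layers for CompressedLinear in place, so no second,
    # fully materialized compressed state dict is ever allocated
    model = model_compressor.apply_compression_status(model)
    model.save_pretrained(save_directory)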

@kylesayrs kylesayrs marked this pull request as ready for review April 23, 2025 19:32
@rahul-tuli (Member) commented:

Could we add a test to compress a model with sparsity + quantization?

Base automatically changed from kylesayrs/map_module_to_scheme to main April 28, 2025 15:16
@kylesayrs kylesayrs requested a review from dsikka April 28, 2025 16:00
@rahul-tuli (Member) commented:

LGTM pending conflict resolution, good work!

def apply_compression_status(self, model: Module):
    if self.quantization_config is None:
        for module in model.modules():
            module.quantization_status = QuantizationStatus.COMPRESSED
Member:

Could we run compression here too, for cases where we only have sparsity?

@rahul-tuli (Member) left a comment:

Looks good, pending verification that sparse-only models can be compressed using these changes!

Comment on lines +316 to +317

def unwrap_module_forward_quantized(module: Module):
    delattr(module, "forward")  # revert to class implementation
Contributor:

Does this then expose the forward method on the Linear or CompressedLinear class? Why do we want to delete the attribute?

@kylesayrs (Contributor, Author):

Yes, this exposes the forward method of the class implementation.
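
For illustration, a minimal standalone example of the Python mechanics at play: assigning module.forward creates an instance attribute that shadows the class method, and delattr removes the shadow so attribute lookup falls back to the class implementation (the wrapper below is a hypothetical stand-in for the quantized forward):

import torch

linear = torch.nn.Linear(4, 4)
original = linear.forward  # bound method resolved from the class

# Wrapping: an instance attribute now shadows torch.nn.Linear.forward
linear.forward = lambda x: original(x) * 2  # stand-in for quantized forward

# Unwrapping: deleting the instance attribute restores class lookup
delattr(linear, "forward")
assert linear.forward.__func__ is torch.nn.Linear.forward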

@dsikka (Collaborator) left a comment:

Why do we need to use CompressedLinear for compression? What if we're compressing something that isn't a linear layer?

kylesayrs added 2 commits May 2, 2025 15:55