PerceptionLM #37878

Open · wants to merge 66 commits into main

Conversation


@shuminghu shuminghu commented Apr 29, 2025

This PR implements PerceptionLM released by Meta:
https://github.com/facebookresearch/perception_models

Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@github-actions github-actions bot marked this pull request as draft April 29, 2025 23:40
@shuminghu shuminghu changed the title Perception lm [WIP] Perception lm Apr 30, 2025
Member

@zucchini-nlp zucchini-nlp left a comment


Yay, super excited to get the model shipped! I know it's early to review, but I noticed the model doesn't have a modular file yet. I recommend using modular transformers to add the model; it will let you inherit from any similar model in transformers, and you won't have to rewrite the whole class.

Also, it makes the review process easier and faster, since we can see the main differences between PE and other existing models 😉

@shuminghu
Author

> Yay, super excited to get the model shipped! I know it's early to review, but I noticed the model doesn't have a modular file yet. I recommend using modular transformers to add the model; it will let you inherit from any similar model in transformers, and you won't have to rewrite the whole class.
>
> Also, it makes the review process easier and faster, since we can see the main differences between PE and other existing models 😉

I see. Let me take a look. The classes here were added automatically (first commit: plm template) by running this command:

transformers-cli add-new-model-like
What is the model you would like to duplicate? Please provide the lowercase `model_type` (e.g. roberta): llava
What is the name (with no special casing) for your new model in the paper (e.g. RoBERTa)? PerceptionLM
What identifier would you like to use for the `model_type` of this model?  [perceptionlm] perception_lm
What lowercase name would you like to use for the module (folder) of this model?  [perceptionlm] perception_lm
What prefix (camel-cased) would you like to use for the model classes of this model (e.g. Roberta)?  [PerceptionLM] 
What prefix (upper-cased) would you like to use for the constants relative to this model?  [PERCEPTIONLM] PERCEPTION_LM
What will be the name of the config class for this model?  [PerceptionLMConfig] 
Please give a checkpoint identifier (on the model Hub) for this new model (e.g. facebook/FacebookAI/roberta-base): facebook/Perception-LM-1B
Will your new model use the same processing class as llava (LlamaTokenizer, LlavaProcessor) (yes/no)? no
What will be the name of the tokenizer class for this model?  [PerceptionLMTokenizer] 
What will be the name of the processor class for this model?  [PerceptionLMProcessor] 
Should we add # Copied from statements when creating the new modeling file (yes/no)?  [yes] 
Should we add a version of your new model in all the frameworks implemented by llava (['pt']) (yes/no)?  [yes] 
The constants at the start of the new tokenizer file created needs to be manually fixed. If your new model has a tokenizer fast, you will also need to manually add the converter in the `SLOW_TO_FAST_CONVERTERS` constant of `convert_slow_tokenizer.py`.

@zucchini-nlp
Member

Yeah, that way is correct, and it usually copies an existing similar model. In this case, Llava was written without modular since it was almost the first VLM in transformers :)
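
For reference, a minimal sketch of what a modular file for this model could look like, assuming inheritance from the Llava classes (as later adopted in this PR); the file name modular_perception_lm.py, the exact class list, and the pass-through bodies are assumptions, not the PR's final contents:

    # modular_perception_lm.py (hypothetical sketch; real modular files use
    # relative imports inside the library and only override what differs)
    from transformers.models.llava.configuration_llava import LlavaConfig
    from transformers.models.llava.modeling_llava import (
        LlavaForConditionalGeneration,
        LlavaModel,
    )


    class PerceptionLMConfig(LlavaConfig):
        model_type = "perception_lm"


    class PerceptionLMModel(LlavaModel):
        pass


    class PerceptionLMForConditionalGeneration(LlavaForConditionalGeneration):
        pass

The modular converter can then expand such a file into the full configuration and modeling files, so only the differences from Llava need to be written by hand.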

Member

@zucchini-nlp zucchini-nlp left a comment

Thanks for the great work @shuminghu!

My main concern currently is making the image processing and the vision tower more aligned with transformers standards. I left some comments below on how we could load the model, lmk if that works.

Model tester for `PerceptionLMForConditionalGeneration`.
"""

all_model_classes = (PerceptionLMForConditionalGeneration,) if is_torch_available() else ()
Member

all_generative_model_classes as well; fast generation tests are needed to make sure we don't break anything in the future. We don't normally run integration tests for all PRs.
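
A minimal sketch of the requested addition, mirroring the all_model_classes line quoted above (a class-body fragment of the model tester; whether the PR ends up with exactly this line is an assumption):

    # Sketch only: listing the generative class enables the fast generation tests.
    all_model_classes = (PerceptionLMForConditionalGeneration,) if is_torch_available() else ()
    all_generative_model_classes = (PerceptionLMForConditionalGeneration,) if is_torch_available() else ()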

@zucchini-nlp
Member

@shuminghu ping me when it is ready for review, and can you mark the PR as "Ready for review" as well?

@shuminghu
Author

shuminghu commented May 27, 2025 via email

@shuminghu shuminghu changed the title [WIP] Perception lm PerceptionLM Jun 13, 2025
@shuminghu shuminghu marked this pull request as ready for review June 13, 2025 07:20
Member

@zucchini-nlp zucchini-nlp left a comment

Great job, super clean! ❤️ I left mostly comments about nits, like deleting unnecessary files and print statements.

The only major comment to address is about moving the contents of image_transform.py to the processing file, and making sure it follows the transformers format style (i.e. uses torchvision functional ops directly, without Compose).
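
A hedged illustration of that style, assuming a tensor input and made-up size/normalization values (this is not the PR's actual image processor code):

    import torch
    import torchvision.transforms.functional as F
    from torchvision.transforms import InterpolationMode

    def preprocess_one(image: torch.Tensor) -> torch.Tensor:
        # Call functional ops directly instead of building a transforms.Compose([...]) pipeline.
        image = F.resize(image, [448, 448], interpolation=InterpolationMode.BICUBIC)
        image = F.convert_image_dtype(image, torch.float32)  # uint8 [0, 255] -> float [0, 1]
        image = F.normalize(image, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
        return image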

## Overview

The PerceptionLM model was proposed in [<INSERT PAPER NAME HERE>](<INSERT PAPER LINK HERE>) by <INSERT AUTHORS HERE>.
<INSERT SHORT SUMMARY HERE>
Member

docs to be filled before merging

use_cls_token=True,
architecture="vit_pe_core_large_patch14_336",
width=1024,
img_size=[448, 448],
Member

We can't use mutables as defaults for args in Python; same for ref_feat_shape. Let's make it a tuple.

Author

I changed them from Tuple because a unit test failed:

tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_config FAILED                                                 [ 12%]
============================================================================= short test summary info ==============================================================================
FAILED tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_config - AssertionError: {'ima[1414 chars]'', 'model_type': 'perception_encoder', 'use_c[3584 chars]alse} != {'ima[1414 chars]'', 'use_cls_token': True, 'architecture': 'vi[3584 chars]...
=========================================================================== 1 failed, 1 warning in 1.19s ============================

It turns out a config serialized from JSON produces a List, which is not equal to the original Tuple.
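
A minimal standalone illustration of the failure mode (not the PR's test code):

    import json

    # JSON has no tuple type, so a serialize/deserialize round trip turns tuples
    # into lists, and equality with the original dict fails.
    original = {"img_size": (448, 448)}
    restored = json.loads(json.dumps(original))
    print(restored["img_size"])   # [448, 448]
    print(restored == original)   # False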

Member

We can initialize the default afterwards, as in img_size = img_size if img_size is not None else [448, 448].

import torch


TEST_MODEL_PATH = "shumingh/plm_1b_hf"
Member

To be updated to the actual checkpoint after the model is released and made public.

Comment on lines 167 to 155
# input_ids[
# :, self.num_image_tokens : self.num_video_tokens + self.num_image_tokens
# ] = config.video_token_id

Member

Video inference doesn't work? We would need tests for all modalities.

Author

We don't support mixing image and video in the same prompt.

Member

Then we can add a separate tester for video only, so it is tested.

Author

Added back. It doesn't matter here.

pass


TEST_MODEL_PATH = "shumingh/plm_1b_hf"
Member

Same, to be updated before merge, so we don't forget.

Member

@zucchini-nlp zucchini-nlp left a comment

Thanks for iterating! Approving as the PR looks good; I left only minor comments for readability. Let's make the CI green and ask for a last review from core maintainers.

For the failing integration tests where a timm model isn't found, cc @ydshieh. Which version of timm do we have in the runners, and can we update it? We'll also have another big release soon based on timm as a backbone.

use_cls_token=True,
architecture="vit_pe_core_large_patch14_336",
width=1024,
img_size=[448, 448],
Member

We can initialize the default afterwards, as in img_size = img_size if img_size is not None else [448, 448].

Comment on lines 274 to 278
thumbnails, _ = self.resize(stacked_images, self.tile_size, max_num_tiles=1)
images_for_tiling, (tiles_w, tiles_h) = self.resize(
    stacked_images, self.tile_size, max_num_tiles=self.max_num_tiles
)
Member

We should be able to configure self.tile_size at call time, and it is guaranteed to be in the input args. So we need to expand the signature args and use tile_size. Same for max_num_tiles.
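
A hedged sketch of the suggested signature change, following the excerpt above (a method fragment; the method name and fallback logic are assumptions, not the PR's final code):

    def _preprocess(self, stacked_images, tile_size=None, max_num_tiles=None):
        # Prefer call-time arguments; fall back to the processor's configured defaults.
        tile_size = tile_size if tile_size is not None else self.tile_size
        max_num_tiles = max_num_tiles if max_num_tiles is not None else self.max_num_tiles
        thumbnails, _ = self.resize(stacked_images, tile_size, max_num_tiles=1)
        images_for_tiling, (tiles_w, tiles_h) = self.resize(
            stacked_images, tile_size, max_num_tiles=max_num_tiles
        )
        return thumbnails, images_for_tiling, (tiles_w, tiles_h)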

@ydshieh
Collaborator

ydshieh commented Jun 17, 2025

> For the failing integration tests where a timm model isn't found, cc @ydshieh. Which version of timm do we have in the runners, and can we update it? We'll also have another big release soon based on timm as a backbone.

Hi, it doesn't seem to be just the integration tests but all tests, if we are talking about

RuntimeError: Unknown model (vit_pe_core_large_patch14_336)

In the runner, we have

timm==1.0.15

(this info can be found in the job step Show installed libraries and their versions)

However, we cannot use a model like vit_pe_core_large_patch14_336 in non-slow tests (for CircleCI, for example). Usually with HF models, we set a config with very small values for some attributes and use that config to create a very tiny model on the fly.

For integration tests, we can use it.

@zucchini-nlp
Member

Right, we need to modify the non-integration tests @shuminghu. And for the integration ones, do we need to add a decorator for the timm version so it doesn't flood nightlies with failures, or do we update timm in the runners?
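
A hedged sketch of such a version-gated skip decorator (the decorator name and the 1.0.16 threshold are assumptions based on the discussion below, not the PR's final code):

    import unittest

    from packaging import version

    try:
        import timm
        _HAS_PE_TIMM = version.parse(timm.__version__) >= version.parse("1.0.16")
    except ImportError:
        _HAS_PE_TIMM = False

    def require_timm_with_pe(test_case):
        """Skip the test unless the installed timm ships the PE (Perception Encoder) ViT models."""
        return unittest.skipUnless(_HAS_PE_TIMM, "test requires timm>=1.0.16 with vit_pe_* models")(test_case)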

@shuminghu
Author

shuminghu commented Jun 17, 2025 via email

@shuminghu
Author

> > For the failing integration tests where a timm model isn't found, cc @ydshieh. Which version of timm do we have in the runners, and can we update it? We'll also have another big release soon based on timm as a backbone.
>
> Hi, it doesn't seem to be just the integration tests but all tests, if we are talking about
>
> RuntimeError: Unknown model (vit_pe_core_large_patch14_336)
>
> In the runner, we have
>
> timm==1.0.15
>
> (this info can be found in the job step Show installed libraries and their versions)
>
> However, we cannot use a model like vit_pe_core_large_patch14_336 in non-slow tests (for CircleCI, for example). Usually with HF models, we set a config with very small values for some attributes and use that config to create a very tiny model on the fly.
>
> For integration tests, we can use it.

@ydshieh vit_pe_core_large_patch14_336 is actually not used in the non-integration tests, despite the naming.
With the latest TimmWrapperConfig support, it is specified as follows in non-integration tests, where architecture is merely a template and the model dim and depth are specified via model_args. But yeah, all of this requires timm's source code to be up to date as of about a week ago.

        vision_config={
            "architecture": "vit_pe_core_large_patch14_336",
            "model_args": {
                "embed_dim": 64,
                "img_size": (14, 14),
                "depth": 2,
                "global_pool": "",
                "use_post_transformer_norm": False,
                "init_values": 0.1,
                "ref_feat_shape": (1, 1),
            },
        },

@ydshieh
Collaborator

ydshieh commented Jun 27, 2025

Happy to upgrade timm in the Docker image if @zucchini-nlp and @qubvel confirm it is necessary.

Does this mean we also have to pin the timm version in setup.py so users will have the correct timm version?

(And is this requirement specific to this new model, or do we need it anyway in general?)

@qubvel
Member

qubvel commented Jun 27, 2025

We also need the latest timm for gemma3n, because mobilenetv5 was added just recently. But I'm not sure we should pin the requirement strictly to the latest version; maybe just add a version check with a correct error message.
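
A hedged sketch of that kind of version check, using transformers' require_version helper (the minimum version and the message here are assumptions, not the PR's final code):

    from transformers.utils.versions import require_version

    # Raise a clear error instead of timm's "Unknown model (...)" when timm is too old.
    require_version(
        "timm>=1.0.16",
        "PerceptionLM's vision tower needs a timm release that includes the PE (Perception Encoder) ViT models.",
    )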

@ydshieh
Collaborator

ydshieh commented Jun 27, 2025

We have timm==1.0.15 at this moment, and v1.0.16 was released 16 hours ago.

@shuminghu Are the failing tests caused by timm==1.0.15...?

@shuminghu
Author

shuminghu commented Jun 27, 2025

> We have timm==1.0.15 at this moment, and v1.0.16 was released 16 hours ago.
>
> @shuminghu Are the failing tests caused by timm==1.0.15...?

@ydshieh Right. v1.0.16 would be good for me for CI.

From the v1.0.16 release changelog:
...
Add EVA ViT based PE (Perceptual Encoder) impl by @rwightman in huggingface/pytorch-image-models#2487

--- update ---
Thanks @ydshieh! I just saw CI is passing. So it must have been updated.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@Cyrilvallez Cyrilvallez left a comment

Alright! Great, great work, super clean 🤗🤗 Just a few comments to polish a bit, but this is very nice already and almost ready to be shipped! 👌

Comment on lines 23 to 24

from transformers.generation.utils import GenerationMixin
Member

Let's be consistent and use relative imports everywhere!

Author

Done.

Comment on lines 43 to 46
_CONFIG_FOR_DOC = "PerceptionLMConfig"

# Base docstring
_CHECKPOINT_FOR_DOC = "facebook/Perception-LM-1B"
Member

Can be removed as it does not seem to be used (they were here before we had auto_docstring mostly)

Author

Removed.

Comment on lines 49 to 51
class AdaptiveAvgPooling(nn.Module):
    def __init__(self, pooling_ratio=2):
        super(AdaptiveAvgPooling, self).__init__()
Member

Suggested change
-class AdaptiveAvgPooling(nn.Module):
-    def __init__(self, pooling_ratio=2):
-        super(AdaptiveAvgPooling, self).__init__()
+class PerceptionLMAdaptiveAvgPooling(nn.Module):
+    def __init__(self, pooling_ratio=2):
+        super().__init__()

Author

Done.

Comment on lines 88 to 90
self.pooling = (
    AdaptiveAvgPooling(config.projector_pooling_ratio) if config.projector_pooling_ratio > 1 else nn.Identity()
)
Member

Do we have checkpoints where it's actually > 1? I see it's 1 by default in the config. If it's not used in any real checkpoint, let's remove this layer for simplicity!


Comment on lines 73 to 87
self.projector = nn.ModuleList(
    [
        nn.Linear(
            in_features=input_size,
            out_features=output_size,
            bias=True,
        ),
        nn.GELU(),
        nn.Linear(
            in_features=output_size,
            out_features=output_size,
            bias=True,
        ),
    ]
)
Member

Here, if possible, I'd rather we unwrap those layers and use them directly in the forward as well, instead of iterating, for clarity (this would require updating the converter as well).
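
A hedged sketch of the suggested refactor (the class and attribute names are assumptions, not necessarily the PR's final ones):

    import torch.nn as nn

    class PerceptionLMMultiModalProjector(nn.Module):
        def __init__(self, input_size: int, output_size: int):
            super().__init__()
            # Store the layers individually instead of in an nn.ModuleList.
            self.linear_1 = nn.Linear(input_size, output_size, bias=True)
            self.gelu = nn.GELU()
            self.linear_2 = nn.Linear(output_size, output_size, bias=True)

        def forward(self, hidden_states):
            # Call each layer explicitly rather than iterating over a list.
            hidden_states = self.linear_1(hidden_states)
            hidden_states = self.gelu(hidden_states)
            return self.linear_2(hidden_states)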

Author

Done.

Comment on lines 277 to 288
# Make modules available throught conditional class for BC
@property
def language_model(self):
    return self.model.language_model

@property
def vision_tower(self):
    return self.model.vision_tower

@property
def multi_modal_projector(self):
    return self.model.multi_modal_projector
Member

We should not need those

Author

Removed.

Author

Oh, I added this back (for BC). This was added to fix this test:

    @require_torch_sdpa
    def test_sdpa_can_dispatch_composite_models(self):
        """
        Tests if composite models dispatch correctly on SDPA/eager when requested so when loading the model.
        This tests only by looking at layer names, as usually SDPA layers are called "SDPAAttention".
        In contrast to the above test, this one checks if the "config._attn_implamentation" is a dict after the model
        is loaded, because we manually replicate requested attn implementation on each sub-config when loading.
        See https://github.com/huggingface/transformers/pull/32238 for more info
    
        The test tries to cover most general cases of composite models, VLMs with vision and text configs. Any model
        that has a different set of sub-configs has to overwrite this test.
        """
        if not self.has_attentions:
            self.skipTest(reason="Model architecture does not support attentions")
    
        if not self._is_composite:
            self.skipTest(f"{self.all_model_classes[0].__name__} does not support SDPA")
    
        for model_class in self.all_model_classes:
            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
            model = model_class(config)
    
            with tempfile.TemporaryDirectory() as tmpdirname:
                model.save_pretrained(tmpdirname)
                model_sdpa = model_class.from_pretrained(tmpdirname)
                model_sdpa = model_sdpa.eval().to(torch_device)
    
                vision_model_names = {"visual", "image_tower", "vision_tower", "vision_model"}
                language_model_names = {"language_model", "model", "text_model"}
>               vision_model_name = [name for name in vision_model_names if hasattr(model_sdpa, name)][0]
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E               IndexError: list index out of range

Author

I mentioned test_sdpa_can_dispatch_composite_models in the comment.

Comment on lines +113 to +114
@auto_docstring
class PerceptionLMModel(LlavaModel):
Member

LlavaModel has a mapping that we should not need here, and that we should overwrite to empty explicitly

Suggested change
 @auto_docstring
 class PerceptionLMModel(LlavaModel):
+    _checkpoint_conversion_mapping = {}

Author

Done.

use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
Member

Similar as before for return_dict with the decorator! Let's remove it everywhere!

Author

Removed return_dict from all methods.

Comment on lines 311 to 317
@unittest.skip("ViT PE cannot be tested with meta device")
def test_can_be_initialized_on_meta(self):
    pass

@unittest.skip("ViT PE cannot be tested with meta device")
def test_can_load_with_meta_device_context_manager(self):
    pass
Member

Is that due to timm?

Author

@shuminghu shuminghu Jul 5, 2025

Yes, it is. Mentioned timm in the reason string.

Comment on lines +275 to +289
def test_training(self):
    self.all_model_classes = (PerceptionLMForConditionalGeneration,) if is_torch_available() else ()
    super().test_training()

def test_training_gradient_checkpointing(self):
    self.all_model_classes = (PerceptionLMForConditionalGeneration,) if is_torch_available() else ()
    super().test_training_gradient_checkpointing()

def test_training_gradient_checkpointing_use_reentrant(self):
    self.all_model_classes = (PerceptionLMForConditionalGeneration,) if is_torch_available() else ()
    super().test_training_gradient_checkpointing_use_reentrant()

def test_training_gradient_checkpointing_use_reentrant_false(self):
    self.all_model_classes = (PerceptionLMForConditionalGeneration,) if is_torch_available() else ()
    super().test_training_gradient_checkpointing_use_reentrant_false()
Member

I never remember exactly how pytest runs the tests, but I think that overriding self.all_model_classes like this may pollute some other tests using the attribute 🥲 Why do we change it? It should work with the base model as well.

Author

@shuminghu shuminghu Jul 5, 2025

Oh, these are instantiated per test. I verified this earlier by printing self.all_model_classes at the beginning of each test when running them all.

The reason is that loss is not defined on the base model (PerceptionLMModel); only PerceptionLMForConditionalGeneration has lm_head, loss, and labels defined, which are used in testing. I think other models (llava_onevision, ...) chose to skip these tests. I kept them here as I think the tests are still valid for PerceptionLMForConditionalGeneration.

Contributor

github-actions bot commented Jul 5, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, perception_lm
