PerceptionLM #37878

Open · wants to merge 66 commits into main from perception_lm
Commits (66), showing changes from all commits
0a05362
plm template
shuminghu Apr 25, 2025
259a8ec
A working plm with fixed image features
shuminghu Apr 26, 2025
c18171e
hacked processor
shuminghu Apr 29, 2025
2a1169b
First version that reproduced PLM output using PE from timm.
shuminghu Apr 30, 2025
e093e20
Simplify and fix tie_word_embeddings
shuminghu May 1, 2025
31aa91b
Use PIL resize. Simplify conversion.
shuminghu May 2, 2025
91de705
First version that works with video input.
shuminghu May 7, 2025
71582cc
simplified image preprocessing (not batched)
shuminghu May 8, 2025
e1d5f8f
Minor fixes after rebasing on main.
shuminghu May 12, 2025
74db884
Video processor based on new API.
shuminghu May 13, 2025
fb5ae4b
Revert to use _preprocess for image processor.
shuminghu May 13, 2025
ee716c3
refactor with modular
shuminghu May 14, 2025
65e5231
fix tie_word_embedding
shuminghu May 14, 2025
cdbeeeb
Testing with timm PE
shuminghu May 19, 2025
303ddff
check in missed conversion from modular to model.py
shuminghu May 19, 2025
742c8e1
First working version of PLM with Eva PE. PLM-1B and 3B outputs are e…
shuminghu May 19, 2025
63be4c6
address review comments
shuminghu May 20, 2025
139f829
Fixed batching if video and image examples mixed.
shuminghu May 22, 2025
0f8663a
Simplify PE configuration.
shuminghu May 23, 2025
82ff8bd
Enable AutoModel for PerceptionEncoder.
shuminghu May 23, 2025
66a021f
Update PE config style.
shuminghu May 23, 2025
7d97732
update all headers
shuminghu May 23, 2025
70480d4
Minor fixes.
shuminghu May 30, 2025
3d65bc9
Move lm_head to PerceptionLMForConditionalGeneration.
shuminghu May 31, 2025
1137a67
Fix for testing_modeling_perception_lm.py
shuminghu Jun 11, 2025
f1338b1
Image processing refactoring to use more common parts.
shuminghu Jun 12, 2025
7a28970
Fix processor test.
shuminghu Jun 12, 2025
642d2e7
update tests to use model from hub
shuminghu Jun 13, 2025
2fadbae
More test fixes.
shuminghu Jun 13, 2025
6f2d5a3
integration test GT update after rebasing; probably due to video prep…
shuminghu Jun 13, 2025
76c9c4d
update test media path to hub
shuminghu Jun 13, 2025
14c1755
Stop tracking local scripts
shuminghu Jun 13, 2025
dbf35c1
address some review comments
shuminghu Jun 13, 2025
15b176a
refactor image processing.
shuminghu Jun 17, 2025
0337ce1
small fixes
shuminghu Jun 17, 2025
3bc3096
update documentation and minor fixes
shuminghu Jun 17, 2025
53d2f01
remove scripts
shuminghu Jun 17, 2025
350aa79
Minor fix for CI
shuminghu Jun 17, 2025
d63b4f8
Fix image processing
shuminghu Jun 17, 2025
836f546
CI and doc fix
shuminghu Jun 17, 2025
b9e5fa0
CI formatting fix
shuminghu Jun 17, 2025
6b69945
ruff fix
shuminghu Jun 17, 2025
6c82012
ruff formatting
shuminghu Jun 17, 2025
ed1dd4b
ran utils/sort_auto_mappings.py
shuminghu Jun 17, 2025
b77f53e
update docstring
shuminghu Jun 17, 2025
eebcc7a
more docstring updates
shuminghu Jun 17, 2025
2c73fc4
add vision_input_type default fallback for image processing
shuminghu Jun 17, 2025
6ceb83a
more verbose variable naming
shuminghu Jun 18, 2025
87c6ca4
test update
shuminghu Jun 18, 2025
d7b47d6
Remove PE and PEConfig use AutoModel(TimmWrapper) instead
shuminghu Jun 25, 2025
cbc1057
Minor cleanup.
shuminghu Jun 25, 2025
d62b35d
Minor Fix: remove any ref to PE. Ruff format and check.
shuminghu Jun 25, 2025
f6d095a
fix docstring
shuminghu Jun 25, 2025
508117d
Fix modular/model consistency. Improve docstring.
shuminghu Jun 25, 2025
bcedcc0
Fix PerceptionLMForConditionalGenerationModelTest
shuminghu Jun 26, 2025
f25a2ca
ruff fix
shuminghu Jun 26, 2025
e8d08e8
fix for check_repo
shuminghu Jun 27, 2025
4c05fb3
minor formatting
shuminghu Jun 27, 2025
c74d652
Add dummy size arg to fix processor test.
shuminghu Jun 27, 2025
3c46d3a
Update docstring for PerceptionLMConfig
shuminghu Jun 27, 2025
160b039
Minor fixes from review feedback.
shuminghu Jul 2, 2025
97fbfca
Revert some minor changes per reviewer feedback.
shuminghu Jul 2, 2025
99a1bf2
update base_model_prefix
shuminghu Jul 2, 2025
74b7995
Merge branch 'main' into perception_lm
Cyrilvallez Jul 4, 2025
73c919a
address reviewer feedback
shuminghu Jul 5, 2025
423af52
fix comment in modeling file
shuminghu Jul 5, 2025
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -1031,6 +1031,8 @@
title: PaliGemma
- local: model_doc/perceiver
title: Perceiver
- local: model_doc/perception_lm
title: PerceptionLM
- local: model_doc/phi4_multimodal
title: Phi4 Multimodal
- local: model_doc/pix2struct
68 changes: 68 additions & 0 deletions docs/source/en/model_doc/perception_lm.md
@@ -0,0 +1,68 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# PerceptionLM

## Overview

The PerceptionLM model was proposed in [PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding](https://ai.meta.com/research/publications/perceptionlm-open-access-data-and-models-for-detailed-visual-understanding/) by Jang Hyun Cho et al. It is a fully open, reproducible model for transparent research in image and video understanding. PLM consists of
a vision encoder paired with a small-scale (<8B parameters) LLM decoder.

The abstract from the paper is the following:

*Vision-language models are integral to computer vision research, yet many high-performing models
remain closed-source, obscuring their data, design and training recipe. The research community
has responded by using distillation from black-box models to label training data, achieving strong
benchmark results, at the cost of measurable scientific progress. However, without knowing the details
of the teacher model and its data sources, scientific progress remains difficult to measure. In this
paper, we study building a Perception Language Model (PLM) in a fully open and reproducible
framework for transparent research in image and video understanding. We analyze standard training
pipelines without distillation from proprietary models and explore large-scale synthetic data to identify
critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M
human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded
video captions. Additionally, we introduce PLM–VideoBench, a suite for evaluating challenging video
understanding tasks focusing on the ability to reason about “what”, “where”, “when”, and “how” of a
video. We make our work fully reproducible by providing data, training recipes, code & models.*


This model was contributed by [shumingh](https://huggingface.co/shumingh).
The original code can be found [here](https://github.com/facebookresearch/perception_models).
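
## Usage example

A minimal image-description sketch. The checkpoint id comes from the `PerceptionLMConfig` docstring in this PR; the chat-template message format and the sample image URL are assumptions based on similar image-text-to-text models in the library, so treat this as a sketch rather than the verified API.

```python
from transformers import AutoProcessor, PerceptionLMForConditionalGeneration

model_id = "facebook/Perception-LM-1B"  # assumed; listed in the config docstring of this PR
processor = AutoProcessor.from_pretrained(model_id)
model = PerceptionLMForConditionalGeneration.from_pretrained(model_id)

# Chat-style input with one image and one text turn (format assumed from
# similar multimodal models).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "Describe the image."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
)
generated_ids = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens.
print(processor.decode(generated_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```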


## PerceptionLMConfig

[[autodoc]] PerceptionLMConfig

## PerceptionLMProcessor

[[autodoc]] PerceptionLMProcessor

## PerceptionLMImageProcessorFast

[[autodoc]] PerceptionLMImageProcessorFast

## PerceptionLMVideoProcessor

[[autodoc]] PerceptionLMVideoProcessor

## PerceptionLMModel

[[autodoc]] PerceptionLMModel

## PerceptionLMForConditionalGeneration

[[autodoc]] PerceptionLMForConditionalGeneration
- forward
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -234,6 +234,7 @@
from .pegasus import *
from .pegasus_x import *
from .perceiver import *
from .perception_lm import *
from .persimmon import *
from .phi import *
from .phi3 import *
5 changes: 5 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -267,6 +267,8 @@
("pegasus", "PegasusConfig"),
("pegasus_x", "PegasusXConfig"),
("perceiver", "PerceiverConfig"),
("perception_encoder", "TimmWrapperConfig"),
("perception_lm", "PerceptionLMConfig"),
("persimmon", "PersimmonConfig"),
("phi", "PhiConfig"),
("phi3", "Phi3Config"),
@@ -663,6 +665,8 @@
("pegasus", "Pegasus"),
("pegasus_x", "PEGASUS-X"),
("perceiver", "Perceiver"),
("perception_encoder", "PerceptionEncoder"),
("perception_lm", "PerceptionLM"),
("persimmon", "Persimmon"),
("phi", "Phi"),
("phi3", "Phi3"),
@@ -869,6 +873,7 @@
("llama4_text", "llama4"),
("blip_2_qformer", "blip_2"),
("fastspeech2_conformer_with_hifigan", "fastspeech2_conformer"),
("perception_encoder", "perception_lm"),
]
)

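These registrations let the auto classes resolve the new model types: `perception_lm` maps to `PerceptionLMConfig`, and the nested `perception_encoder` vision backbone maps onto `TimmWrapperConfig`. A quick sketch of what this enables; the repo id is the one listed in the config docstring and is an assumption here:

```python
from transformers import AutoConfig

# "perception_lm" now resolves through the config auto mapping above.
config = AutoConfig.from_pretrained("facebook/Perception-LM-1B")
print(type(config).__name__)                 # PerceptionLMConfig
print(type(config.vision_config).__name__)   # TimmWrapperConfig
```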
2 changes: 1 addition & 1 deletion src/transformers/models/auto/image_processing_auto.py
@@ -132,6 +132,7 @@
("owlvit", ("OwlViTImageProcessor", "OwlViTImageProcessorFast")),
("paligemma", ("SiglipImageProcessor", "SiglipImageProcessorFast")),
("perceiver", ("PerceiverImageProcessor", "PerceiverImageProcessorFast")),
("perception_lm", ("PerceptionLMImageProcessorFast",)),
("phi4_multimodal", ("Phi4MultimodalImageProcessorFast",)),
("pix2struct", ("Pix2StructImageProcessor",)),
("pixtral", ("PixtralImageProcessor", "PixtralImageProcessorFast")),
@@ -597,7 +598,6 @@ def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
raise ValueError(
"This image processor cannot be instantiated. Please make sure you have `Pillow` installed."
)

raise ValueError(
f"Unrecognized image processor in {pretrained_model_name_or_path}. Should have a "
f"`image_processor_type` key in its {IMAGE_PROCESSOR_NAME} of {CONFIG_NAME}, or one of the following "
3 changes: 3 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -255,6 +255,8 @@
("pegasus", "PegasusModel"),
("pegasus_x", "PegasusXModel"),
("perceiver", "PerceiverModel"),
("perception_encoder", "PerceptionEncoder"),
("perception_lm", "PerceptionLMModel"),
("persimmon", "PersimmonModel"),
("phi", "PhiModel"),
("phi3", "Phi3Model"),
@@ -933,6 +935,7 @@
("mistral3", "Mistral3ForConditionalGeneration"),
("mllama", "MllamaForConditionalGeneration"),
("paligemma", "PaliGemmaForConditionalGeneration"),
("perception_lm", "PerceptionLMForConditionalGeneration"),
("pix2struct", "Pix2StructForConditionalGeneration"),
("pixtral", "LlavaForConditionalGeneration"),
("qwen2_5_vl", "Qwen2_5_VLForConditionalGeneration"),
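Registering `PerceptionLMForConditionalGeneration` in this second hunk (apparently the image-text-to-text mapping, judging by its neighbors) lets the corresponding auto class pick it up. A hedged sketch; both the mapping identity and the repo id are inferences, not verified here:

```python
from transformers import AutoModelForImageTextToText

# Assumes the hunk above is MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES and
# reuses the assumed repo id from the config docstring.
model = AutoModelForImageTextToText.from_pretrained("facebook/Perception-LM-1B")
print(type(model).__name__)  # PerceptionLMForConditionalGeneration
```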
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -100,6 +100,7 @@
("owlv2", "Owlv2Processor"),
("owlvit", "OwlViTProcessor"),
("paligemma", "PaliGemmaProcessor"),
("perception_lm", "PerceptionLMProcessor"),
("phi4_multimodal", "Phi4MultimodalProcessor"),
("pix2struct", "Pix2StructProcessor"),
("pixtral", "PixtralProcessor"),
29 changes: 29 additions & 0 deletions src/transformers/models/perception_lm/__init__.py
@@ -0,0 +1,29 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_perception_lm import *
    from .image_processing_perception_lm_fast import *
    from .modeling_perception_lm import *
    from .processing_perception_lm import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
88 changes: 88 additions & 0 deletions src/transformers/models/perception_lm/configuration_perception_lm.py
@@ -0,0 +1,88 @@
# coding=utf-8
# Copyright 2025 Meta Platforms, Inc. and the HuggingFace Inc. team. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PerceptionLM model configuration"""

from ...configuration_utils import PretrainedConfig
from ...utils import logging
from ..auto import CONFIG_MAPPING, AutoConfig
from ..timm_wrapper.configuration_timm_wrapper import TimmWrapperConfig


logger = logging.get_logger(__name__)


class PerceptionLMConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`PerceptionLMForConditionalGeneration`]. It is used to instantiate a
    PerceptionLM model according to the specified arguments, defining the model architecture.

    Example models:
    - [facebook/Perception-LM-1B](https://huggingface.co/facebook/Perception-LM-1B).
    - [facebook/Perception-LM-3B](https://huggingface.co/facebook/Perception-LM-3B).
    - [facebook/Perception-LM-8B](https://huggingface.co/facebook/Perception-LM-8B).

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vision_config (`Union[TimmWrapperConfig, dict]`, *optional*, defaults to `TimmWrapperConfig()`):
            The config object or dictionary of the vision backbone.
        text_config (`Union[PretrainedConfig, dict]`, *optional*, defaults to `LlamaConfig()`):
            The config object or dictionary of the text backbone.
        vision_use_cls_token (`bool`, *optional*, defaults to `True`):
            Whether the vision backbone uses a CLS token. If so, the CLS token embedding is removed from the vision output.
        projector_pooling_ratio (`int`, *optional*, defaults to 1):
            The pooling ratio used in the multimodal projector.
        image_token_id (`int`, *optional*, defaults to 128002):
            The image token index to encode the image prompt.
        video_token_id (`int`, *optional*, defaults to 128003):
            The video token index to encode the video prompt.
    """

model_type = "perception_lm"
sub_configs = {"text_config": AutoConfig, "vision_config": TimmWrapperConfig}

def __init__(
self,
vision_config=None,
text_config=None,
vision_use_cls_token=True,
projector_pooling_ratio=1,
image_token_id=128002,
video_token_id=128003,
**kwargs,
):
self.image_token_id = image_token_id
self.video_token_id = video_token_id
if isinstance(vision_config, dict):
vision_config = TimmWrapperConfig(**vision_config)
elif isinstance(vision_config, TimmWrapperConfig):
vision_config = vision_config
elif vision_config is None:
vision_config = TimmWrapperConfig()
self.vision_config = vision_config
self.vision_use_cls_token = vision_use_cls_token

if isinstance(text_config, dict):
text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
elif text_config is None:
text_config = CONFIG_MAPPING["llama"]()

self.text_config = text_config
self.projector_pooling_ratio = projector_pooling_ratio
super().__init__(**kwargs)


__all__ = ["PerceptionLMConfig"]
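
A short construction sketch under the defaults documented above; the overridden Llama sizes are illustrative values, not those of the released checkpoints:

```python
from transformers import PerceptionLMConfig

# Defaults: TimmWrapperConfig for vision, LlamaConfig for text.
config = PerceptionLMConfig()
print(config.text_config.model_type)  # llama

# Sub-configs can also be passed as dicts; sizes here are illustrative.
config = PerceptionLMConfig(
    text_config={"model_type": "llama", "hidden_size": 2048, "num_hidden_layers": 16},
    projector_pooling_ratio=2,
)
print(config.projector_pooling_ratio)  # 2
```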