D-FINE (based on v4.51.3)
A new model is added to transformers: D-FINE. It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-D-FINE-preview.
To install this version, run the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-D-FINE-preview
If fixes are needed, they will be applied to this tag, so this installation may be considered stable and improving.
As its name implies, this tag is a preview of the D-FINE model: it points at a snapshot of the main
branch and does not follow semantic versioning. The model will be included in the next minor release: v4.52.0.
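As a quick sanity check that the preview tag was installed correctly, the classes used in the usage example below should import without error (a minimal check, nothing D-FINE-specific beyond the class name):
>>> from transformers import DFineForObjectDetection, AutoImageProcessor  # fails on releases without D-FINE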
D-FINE

The D-FINE model was proposed in D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement by
Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, and Feng Wu.
The abstract from the paper is the following:
We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD).
FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: this https URL.
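To make the FDR idea concrete, here is a minimal sketch of distribution-based box regression: one box-edge offset is modeled as a softmax distribution over discrete candidate offsets and decoded as an expectation, and a deeper layer refines the distribution by predicting a residual update to the logits. The bin count, offset range, and random stand-in logits are purely illustrative, not the paper's exact parametrization:
>>> import torch
>>> num_bins = 17  # illustrative; not the paper's setting
>>> candidates = torch.linspace(-0.5, 0.5, num_bins)  # discrete candidate edge offsets
>>> logits = torch.randn(num_bins)  # stand-in for a regression head's output
>>> offset = (logits.softmax(-1) * candidates).sum()  # fine-grained readout: expectation over bins
>>> residual = 0.1 * torch.randn(num_bins)  # stand-in for a deeper layer's residual prediction
>>> refined_offset = ((logits + residual).softmax(-1) * candidates).sum()  # distribution refined, not re-predicted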
Usage example
D-FINE checkpoints can be found on the Hugging Face Hub.
>>> import torch
>>> from transformers.image_utils import load_image
>>> from transformers import DFineForObjectDetection, AutoImageProcessor
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = load_image(url)
>>> image_processor = AutoImageProcessor.from_pretrained("ustc-community/dfine_x_coco")
>>> model = DFineForObjectDetection.from_pretrained("ustc-community/dfine_x_coco")
>>> inputs = image_processor(images=image, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> results = image_processor.post_process_object_detection(outputs, target_sizes=[(image.height, image.width)], threshold=0.5)
>>> for result in results:
...     for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
...         score, label = score.item(), label_id.item()
...         box = [round(i, 2) for i in box.tolist()]
...         print(f"{model.config.id2label[label]}: {score:.2f} {box}")
cat: 0.96 [344.49, 23.4, 639.84, 374.27]
cat: 0.96 [11.71, 53.52, 316.64, 472.33]
remote: 0.95 [40.46, 73.7, 175.62, 117.57]
sofa: 0.92 [0.59, 1.88, 640.25, 474.74]
remote: 0.89 [333.48, 77.04, 370.77, 187.3]
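The same example runs unchanged on GPU; the device handling below is a generic PyTorch sketch rather than anything D-FINE-specific:
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model = model.to(device)
>>> inputs = image_processor(images=image, return_tensors="pt").to(device)
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> results = image_processor.post_process_object_detection(outputs, target_sizes=[(image.height, image.width)], threshold=0.5)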