Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models
| 🤗 HF Dataset | 📃 Paper |
We introduce MMGeoLM, a project designed to enhance the geometric problem-solving capabilities of Large Multimodal Models (LMMs). This work consists of three main components: (1) constructing a comprehensive geometric dataset, which includes image-text alignment data, image-based and text-based hard negatives, and supervised fine-tuning data; (2) training vision encoders with hard negatives to improve their perception of geometric elements; and (3) performing supervised fine-tuning of LMMs.
We release MM-Math-Align, a dataset built upon MM-Math, which is derived from real geometry questions used in middle school exams. Each sample contains the original geometric diagram, an image rendered by a Python script that approximately reconstructs the original diagram, a caption describing this positive image, 10 negative images generated from Python scripts derived from the positive one, and 10 corresponding negative captions. The dataset contains 4,021 samples in total.
The Hugging Face dataset download link for MM-Math-Align: 🤗 HF Dataset
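A minimal sketch for loading the dataset with the 🤗 datasets library; the repository ID below is a placeholder, so substitute the actual MM-Math-Align ID from the link above:

```python
from datasets import load_dataset

# Placeholder repo ID -- replace with the actual MM-Math-Align ID on the Hub.
ds = load_dataset("your-org/MM-Math-Align", split="train")

sample = ds[0]
print(sample.keys())  # positive image, positive caption, negative images/captions, ...
```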
If you want to construct your own image-text alignment data for your geometry dataset, you can follow the steps below:
- First, run the code_generation.py script. The input should be the geometry question and its corresponding answer, which will be used to generate a geometric diagram similar to the original one. We recommend using the gemini-2.5-pro-preview model, as it has strong mathematical reasoning and coding capabilities.
- Next, after generating the geometric image script, use re_verification.py to perform a secondary verification. After executing the verified script locally, you will obtain a geometric diagram similar to the original one.
- For constructing negative captions of the geometric diagrams, please refer to the prompt settings provided in prompts.py. The prompt has been tested and works reliably in practice.
- The scripts support multi-threaded execution for fast data construction (a minimal driver sketch is shown after this list). However, please be mindful of token usage, as the scripts require a model API key.
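A minimal sketch of such a multi-threaded driver, assuming generate_diagram_script is a hypothetical wrapper around the code_generation.py logic and questions is a list of (question, answer) pairs; the actual scripts may expose different entry points:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_diagram_script(question: str, answer: str) -> str:
    """Hypothetical wrapper around code_generation.py: calls the model API
    and returns the generated Python drawing script as a string."""
    raise NotImplementedError  # replace with your API call

questions = [("In triangle ABC, ...", "Answer: 30 degrees")]  # (question, answer) pairs

# Each worker issues one API request; cap max_workers to control token spend.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(generate_diagram_script, q, a): q for q, a in questions}
    for fut in as_completed(futures):
        script = fut.result()
        # ... write `script` to disk, then verify it with re_verification.py
```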
- Release the hard text negatives data
- Release the SFT geometric data
We modify the original CLIP training strategy to support an arbitrary number of negative samples, rather than restricting negatives to those within a batch. We provide two example scripts: run_negative_images.sh for image-based negatives and run_negative_texts.sh for text-based negatives.
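A minimal sketch of the idea, assuming a symmetric InfoNCE-style objective where each positive pair is scored against its own set of K hard negatives instead of only in-batch negatives; the names and shapes here are illustrative, not the repository's actual implementation:

```python
import torch
import torch.nn.functional as F

def hard_negative_infonce(anchor, positive, negatives, temperature=0.07):
    """anchor: (B, D), positive: (B, D), negatives: (B, K, D); all L2-normalized.
    For image-based negatives the anchor is the caption embedding and the
    negatives are negative-image embeddings; for text-based negatives the
    roles are reversed."""
    pos_logits = (anchor * positive).sum(-1, keepdim=True)       # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", anchor, negatives)   # (B, K)
    logits = torch.cat([pos_logits, neg_logits], dim=-1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)  # the positive sits at index 0
```

Because the negatives are attached per sample rather than drawn from the batch, K can be set independently of the batch size.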
The data format for image-based hard negatives is as follows:
```json
[
  {
    "positive_image_path": "/path/positive_image.png",
    "negative_image_path": ["/path/negative_image1.png", "/path/negative_image2.png"],
    "conversations": [
      {"from": "human", "value": "describe the image"},
      {"from": "gpt", "value": "Positive Caption."}
    ]
  }
]
```
The data format for text-based hard negatives is as follows:
```json
[
  {
    "positive_image_path": "/path/positive_image.png",
    "negative_captions": ["Negative Caption 1", "Negative Caption 2"],
    "conversations": [
      {"from": "human", "value": "describe the image"},
      {"from": "gpt", "value": "Positive Caption."}
    ]
  }
]
```
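A minimal sketch for sanity-checking entries against these two formats before training; the field names follow the examples above, and the file path is a placeholder:

```python
import json

REQUIRED = {"positive_image_path", "conversations"}

def check_entry(entry: dict) -> None:
    missing = REQUIRED - entry.keys()
    assert not missing, f"missing keys: {missing}"
    # Exactly one of the two negative fields, depending on the setting.
    assert ("negative_image_path" in entry) ^ ("negative_captions" in entry)
    roles = [turn["from"] for turn in entry["conversations"]]
    assert roles == ["human", "gpt"], f"unexpected roles: {roles}"

with open("negatives.json") as f:  # placeholder path
    for entry in json.load(f):
        check_entry(entry)
```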
We adopt the Qwen2.5-7B-Instruct model as the LLM backbone of our LMM. The Mammoth2-7B model mentioned in the paper can be trained following the LLaVA-1.5 training strategy.
First Stage: After training CLIP, we provide a training script for the MLP projector; please refer to pretrain_qwen2_5.sh. In this stage, both the vision encoder and the LLM are frozen. In the script, the conversation version must be configured with --version qwen_1_5; otherwise, errors may occur when training with the Qwen2.5 model.
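A minimal sketch of what the first-stage freezing amounts to, assuming a LLaVA-style model with a vision tower, an MLP projector, and an LLM as submodules; the attribute names here are illustrative, not the repository's actual API:

```python
def configure_stage1(model):
    # Stage 1: freeze everything, then unfreeze only the MLP projector.
    for param in model.parameters():
        param.requires_grad = False
    for param in model.mm_projector.parameters():  # hypothetical attribute name
        param.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]  # params to optimize
```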
Second Stage: This stage trains the entire multimodal architecture; we provide the training script finetune_qwen2_5.sh. It integrates the pre-trained MLP module and performs supervised fine-tuning on the full model.
```bibtex
@misc{sun2025hardnegativecontrastivelearning,
  title={Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models},
  author={Kai Sun and Yushi Bai and Zhen Yang and Jiajie Zhang and Ji Qi and Lei Hou and Juanzi Li},
  year={2025},
  eprint={2505.20152},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.20152},
}
```