
Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models

| 🤗 HF Dataset | 📃 Paper |

We introduce MMGeoLM, a project designed to enhance the geometric problem-solving capabilities of Large Multimodal Models (LMMs). This work consists of three main components: (1) constructing a comprehensive geometric dataset, which includes image-text alignment data, image-based and text-based hard negatives, and supervised fine-tuning data; (2) training vision encoders with hard negatives to improve their perception of geometric elements; and (3) performing supervised fine-tuning of LMMs.

Geometric Dataset

Image-Text Alignment Dataset


We release MM-Math-Align, a dataset built upon MM-Math, which is derived from real geometry questions used in middle-school exams. Each sample contains the original geometric diagram, an image rendered by a Python script that approximately reconstructs the original diagram, a caption describing this positive image, 10 negative images generated from the positive image's script, and 10 corresponding negative captions. The dataset contains 4,021 samples in total.

The Hugging Face dataset download link for MM-Math-Align: 🤗 HF Dataset
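
For quick inspection, the dataset can be loaded with the Hugging Face datasets library. The snippet below is a minimal sketch: the dataset ID and the per-sample field names are assumptions based on the description above, so check the dataset card for the exact schema.

# Minimal loading sketch; the dataset ID and field names are assumptions --
# consult the Hugging Face dataset card for the exact schema.
from datasets import load_dataset

ds = load_dataset("THU-KEG/MM-Math-Align", split="train")  # assumed dataset ID
sample = ds[0]
# Each sample is expected to bundle: the original diagram, the reconstructed
# (positive) image, its caption, 10 negative images, and 10 negative captions.
print(sample.keys())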

Dataset Construction Script

If you want to construct your own image-text alignment data for your geometry dataset, you can follow the steps below:

  1. First, run the code_generation.py script. Its input is a geometry question and the corresponding answer, from which it generates a geometric diagram similar to the original one. We recommend the gemini-2.5-pro-preview model, as it has strong mathematical reasoning and coding capabilities.
  2. Next, use re_verification.py to perform a secondary verification of the generated script. Executing the verified script locally yields a geometric diagram similar to the original one.
  3. To construct negative captions for the geometric diagrams, refer to the prompt settings in prompts.py. This prompt has been tested and runs stably in practice.
  4. The scripts support multi-threaded execution for fast data construction (see the sketch after this list). However, be mindful of token usage, as the scripts require a model API key.
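
A minimal driver sketch for this pipeline is shown below. The imported names generate_script and verify_script are hypothetical stand-ins for the actual entry points in code_generation.py and re_verification.py; adapt them to the real interfaces.

# Hypothetical pipeline driver; generate_script and verify_script are
# illustrative stand-ins for the real entry points in the two scripts.
from concurrent.futures import ThreadPoolExecutor

from code_generation import generate_script   # hypothetical import
from re_verification import verify_script     # hypothetical import

def build_sample(question: str, answer: str):
    script = generate_script(question, answer)  # step 1: LLM writes plotting code
    return verify_script(script)                # step 2: secondary verification

qa_pairs = [("<geometry question>", "<answer>")]  # your own QA data

# Multi-threaded execution (step 4); keep max_workers modest to control
# API token usage.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda qa: build_sample(*qa), qa_pairs))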

TODO

  • Release the hard text negatives data
  • Release the SFT geometric data

Hard Negative Training

We modify the original CLIP training strategy to support an arbitrary number of negative samples, rather than restricting negatives to those within a batch. We provide two example scripts: run_negative_images.sh for image-based negatives and run_negative_texts.sh for text-based negatives.
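
Conceptually, the change appends per-sample hard-negative logits to the standard in-batch similarity matrix before the softmax. The snippet below is a minimal sketch of this idea, not the repository's exact implementation.

import torch
import torch.nn.functional as F

def hard_negative_clip_loss(img_emb, txt_emb, neg_emb, temperature=0.07):
    """Image-to-text contrastive loss with extra hard negatives.

    img_emb: (B, D) positive image embeddings
    txt_emb: (B, D) positive caption embeddings
    neg_emb: (B, K, D) K hard-negative embeddings per sample (negative
             captions here; for image negatives, swap the two towers)
    All embeddings are assumed to be L2-normalized.
    """
    B = img_emb.size(0)
    in_batch = img_emb @ txt_emb.t()                     # (B, B) in-batch logits
    hard = torch.einsum("bd,bkd->bk", img_emb, neg_emb)  # (B, K) hard-negative logits
    logits = torch.cat([in_batch, hard], dim=1) / temperature
    targets = torch.arange(B, device=img_emb.device)     # positives on the diagonal
    return F.cross_entropy(logits, targets)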

The data format for image-based hard negatives is as follows:

[
    {
        "positive_image_path": "/path/positive_image.png",
        "negative_image_path": ["/path/negative_image1.png", "/path/negative_image2.png"],
        "conversations": [
            {"from": "human", "value": "describe the image"},
            {"from": "gpt", "value": "Positive Caption."}
        ]
    }
]

The data format for text-based hard negatives is as follows:

[
    {
        "positive_image_path": "/path/positive_image.png",
        "negative_captions": ["Negative Caption 1", "Negative Caption 2"],
        "conversations": [
            {"from": "human", "value": "describe the image"},
            {"from": "gpt", "value": "Positive Caption."}
        ]
    }
]
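
Both formats can be consumed with the same lightweight reader. The sketch below follows the field names in the examples above and assumes a single human/gpt turn per sample.

import json

def iter_hard_negative_samples(path: str):
    with open(path, encoding="utf-8") as f:
        samples = json.load(f)
    for s in samples:
        positive_image = s["positive_image_path"]
        # Exactly one of the two fields is present, depending on negative type.
        negatives = s.get("negative_image_path") or s.get("negative_captions")
        prompt, response = s["conversations"]  # one human turn, one gpt turn
        yield positive_image, negatives, response["value"]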

Supervised Fine-tuning

We adopt the Qwen2.5-7B-Instruct model as the LLM backbone of our model. The Mammoth2-7B model mentioned in the paper can be trained following the LLaVA-1.5 training strategy.

First Stage: After training CLIP, we provide a training script for the MLP projector; please refer to pretrain_qwen2_5.sh. In this stage, both the vision encoder and the LLM are frozen. In the script, the conversation template must be set with --version qwen_1_5; otherwise, errors may occur when training with the Qwen2.5 model.
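
In LLaVA-style code, this stage amounts to disabling gradients everywhere except the projector. The sketch below illustrates the idea; the attribute name mm_projector follows the LLaVA convention and may differ in the actual codebase.

# Stage-1 freezing sketch: train only the MLP projector.
def freeze_for_projector_pretraining(model):
    for p in model.parameters():
        p.requires_grad = False                # freeze vision encoder and LLM
    for p in model.mm_projector.parameters():  # LLaVA-style attribute name (assumed)
        p.requires_grad = True                 # train only the projector
    return [p for p in model.parameters() if p.requires_grad]

# trainable = freeze_for_projector_pretraining(model)
# optimizer = torch.optim.AdamW(trainable, lr=1e-3)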

Second Stage: This stage trains the entire multimodal architecture; we provide the training script finetune_qwen2_5.sh. It loads the pre-trained MLP module and performs supervised fine-tuning on the full model.

Citation

@misc{sun2025hardnegativecontrastivelearning,
      title={Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models}, 
      author={Kai Sun and Yushi Bai and Zhen Yang and Jiajie Zhang and Ji Qi and Lei Hou and Juanzi Li},
      year={2025},
      eprint={2505.20152},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.20152}, 
}
