Mosaic-IT: Cost-Free Compositional Data Synthesis for Instruction Tuning (ACL'25)
Chinese Version: 知乎
This is the repo for the Mosaic-IT project, which introduces an augmentation method for Instruction Tuning that concurrently improves LLM performance and lowers training costs.
(Feel free to email minglii@umd.edu for any questions or feedback.)
- [2025/05] Our paper has been accepted to ACL 2025 Findings!
- [2024/05] We initialized the Mosaic-IT repo.
Finetuning large language models with a variety of instruction-response pairs has enhanced their capability to understand and follow instructions. Current instruction tuning primarily relies on teacher models or human intervention to generate and refine the instructions and responses, which are costly, non-sustainable, and may lack diversity. In this paper, we introduce Mosaic Instruction Tuning (Mosaic-IT), a human/model-free method that can efficiently create rich and diverse augmentations from existing instruction tuning data to enhance the finetuned LLM. Mosaic-IT randomly concatenates multiple instruction data into one and trains the model to produce the corresponding responses with predefined higher-level meta-instructions, strengthening its multi-step instruction-following and format-following skills. Our extensive evaluations demonstrate the superior performance and training efficiency of Mosaic-IT, which achieves consistent performance improvements over various benchmarks and an 80% reduction in training costs compared with original instruction tuning.
Illustration of Mosaic-IT. The upper section represents the original Instruction Tuning method. In the middle is our method with the Primary Mosaic strategy, which concatenates instructions and corresponding responses as an augmentation for better Instruction Tuning. On the right are corresponding data examples of each method. The lower section illustrates the three Auxiliary Mosaic Strategies for Mosaic-IT.
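For intuition, the Primary Mosaic strategy can be sketched roughly as follows. This is a minimal, illustrative sketch only: the function name, meta-instruction wording, and numbering format are assumptions for readability, not the repo's actual implementation (see `Mosaic-IT/mosaic_main.py` for that).

```python
import random

def mosaic_examples(dataset, max_num=10, seed=0):
    """Randomly group 1..max_num Alpaca-format examples into one training sample
    whose response answers every instruction in order (illustrative sketch)."""
    rng = random.Random(seed)
    pool = dataset[:]
    rng.shuffle(pool)
    mosaicked = []
    while pool:
        k = rng.randint(1, max_num)              # how many instructions to merge
        group, pool = pool[:k], pool[k:]
        # Hypothetical meta-instruction; the real predefined rules are richer.
        meta = f"Below are {len(group)} tasks. Answer each of them in order."
        instruction = meta + "\n\n" + "\n\n".join(
            f"Task {i + 1}: {ex['instruction']} {ex.get('input', '')}".strip()
            for i, ex in enumerate(group)
        )
        output = "\n\n".join(
            f"Answer {i + 1}: {ex['output']}" for i, ex in enumerate(group)
        )
        mosaicked.append({"instruction": instruction, "input": "", "output": output})
    return mosaicked
```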
- We extend the instruction tuning process to arbitrarily following multiple instructions at once, rather than one instruction at a time, which better unlocks the potential of existing instruction tuning datasets.
- Our Mosaic-IT (1) achieves better performance on various benchmarks, with (2) an 80% reduction in training cost, (3) without using any extra models.
- Our method is a potential way to alleviate memorization effects during training; see Further Discussion for details.
- The Mosaic Strategies and Predefined Rules in our method can easily be further extended, potentially improving Mosaic-IT performance even more.
Install the dependencies with `pip install -r requirements.txt`
Note: Mosaic-IT only requires the `transformers` package, so if you are using a different code base that already has `transformers` installed, you can run the code directly and install any missing packages manually.
python Mosaic-IT/mosaic_main.py \
--data_path data/alpaca_gpt4_data.json \
--save_path alpaca_gpt4_data_mosaicked.json \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--epo_num 4 \
--wrap_mode uniform \
--wrap_max_num 10 \
--version both
- `--data_path`: Input data, in Alpaca format.
- `--save_path`: Path to save the mosaicked data.
- `--model_name_or_path`: The model whose tokenizer is used to calculate token counts.
- `--epo_num`: The number of times the random mosaic process is run.
- `--wrap_mode`: How to decide the distribution of the number of instructions per mosaicked sample.
- `--wrap_max_num`: The maximum number of instructions to concatenate.
- `--version`: Which Mosaic Strategies to use.
We use the prompt and code base from FastChat:
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT: Hello.</s>USER: Who are you? ASSISTANT: I am ...</s>......
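For reference, a (mosaicked) conversation could be rendered into this Vicuna-style template as sketched below. The helper name is ours and the sketch is only an assumption of how the template is filled; in practice FastChat's own conversation templates handle this during training.

```python
# System prompt copied from the FastChat/Vicuna template shown above.
SYSTEM = ("A chat between a curious user and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the user's questions.")

def to_vicuna_prompt(turns):
    """Render a list of (user_message, assistant_message) pairs into the template."""
    text = SYSTEM + " "
    for user_msg, assistant_msg in turns:
        text += f"USER: {user_msg} ASSISTANT: {assistant_msg}</s>"
    return text

print(to_vicuna_prompt([("Hi", "Hello."), ("Who are you?", "I am ...")]))
```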
- Initialize the repo.
- Release paper on arXiv.
- Release data and model.
Please consider citing our papers if you find our code, data, or models useful. Thank you!
@article{li2024mosaic,
title={Mosaic-IT: Free Compositional Data Augmentation Improves Instruction Tuning},
author={Li, Ming and Chen, Pei and Wang, Chenguang and Zhao, Hongyu and Liang, Yijun and Hou, Yupeng and Liu, Fuxiao and Zhou, Tianyi},
journal={arXiv preprint arXiv:2405.13326},
year={2024}
}
@inproceedings{li-etal-2025-ruler,
title = "{R}ule{R}: Improving {LLM} Controllability by Rule-based Data Recycling",
author = "Li, Ming and
Chen, Han and
Wang, Chenguang and
Nguyen, Dang and
Li, Dianqi and
Zhou, Tianyi",
editor = "Chiruzzo, Luis and
Ritter, Alan and
Wang, Lu",
booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.naacl-short.78/",
doi = "10.18653/v1/2025.naacl-short.78",
pages = "926--943",
ISBN = "979-8-89176-190-2"
}
If you are interested in Data Selection for Instruction Tuning, please see Cherry_LLM and Superfiltering.
If you are interested in human/LLM-free Data Augmentation for Instruction Tuning, please see Mosaic-IT and RuleR.
If you are interested in Data Improvement for Instruction Tuning, please see Reflection_Tuning.
If you are interested in Knowledge Distillation in the LLM era, please see this Survey.