
LlamaDuo

🎉 Accepted to ACL 2025!
Our paper “LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs” has been accepted to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025).

The widespread adoption of cloud-based proprietary large language models (LLMs) has introduced significant challenges, including operational dependencies, privacy concerns, and the necessity of continuous internet connectivity. In this work, we introduce an LLMOps pipeline, "LlamaDuo", for the seamless migration of knowledge and abilities from service-oriented LLMs to smaller, locally manageable models. This pipeline is crucial for ensuring service continuity in the presence of operational failures, strict privacy policies, or offline requirements. Our LlamaDuo involves fine-tuning a small language model against the service LLM using a synthetic dataset generated by the latter. If the performance of the fine-tuned model falls short of expectations, it is automatically improved through additional fine-tuning using extra similar data generated by the service LLM. This multi-turn process guarantees that the smaller model can eventually match or even surpass the service LLM's capabilities in specific downstream tasks, offering a practical and scalable solution for managing AI deployments in constrained environments. Extensive experiments with leading-edge LLMs are conducted to demonstrate the effectiveness, adaptability, and affordability of LlamaDuo across various downstream tasks.

Table of contents

  • Motivation
  • Tech stacks
    • Tested service LLMs
    • Tested local LLMs
  • Overview
  • Hugging Face Hub authentication
  • Choosing service LLM
  • Coverage dataset
  • Fine-tuning
  • Batch inference
  • Evaluation
  • Synthetic data generation
  • Merging generated dataset
  • Synthetic Dataset from LlamaDuo
  • Citation

Motivation

We assume that a small LLM can achieve performance comparable to a service LLM on a specific task, and this project aims to demonstrate that possibility in a practically grounded manner. It also shows how to migrate smoothly from a service LLM to a small local LLM.

Assume that a service LLM is integrated into your service or system. From time to time, the service LLM may need to be replaced for the following reasons:

  • failure of the service LLM, which may have an operational impact on the business.
  • data privacy: you don't want to expose your private data to an external provider.
  • offline operation: the service LLM did a great job on the PoC, but now you need the same intelligence in an on-premise environment without an internet connection.
  • version control: service LLMs change versions from time to time and legacy versions eventually become obsolete, but you may want to keep the current behavior as is.
  • ...

To better prepare for such situations, this project suggests migrating from a service LLM to a small local LLM. Since we are satisfied with the results from the service LLM, we already know our inputs (prompts) and the desired outputs. We can therefore fine-tune a small LLM on the collected prompts to match the desired outputs. If the fine-tuned LLM's performance is still poor, we can grow the dataset by generating more similar data with the service LLM.

Tech stacks

For this project, the following tech stacks are chosen:

Additionally, this project implements desirable features when calling the Gemini API: concurrency and rate-limiting.

Overview


This project comes with a toolset for batch inference, evaluation, and synthetic data generation. Each tool can be run independently, but they can also be hooked up to form a pipeline. It is up to the end user to decide how best to wire them together; one possible wiring is sketched below.
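
As a hedged illustration only: the sketch below chains the scripts described later in this README (batch_inference.py, evaluation.py, data_gen.py, dataset_merge.py) into a single loop. The script names and --from-config flags come from the sections below; the loop structure, the number of rounds, and the idea of re-running fine-tuning in between are assumptions about how one might collate the tools, not a prescribed workflow.

# One possible way to wire the tools together (illustrative sketch).
import subprocess

def run(script, config):
    # Each tool reads all of its parameters from a YAML config file.
    subprocess.run(["python", script, "--from-config", config], check=True)

for improvement_round in range(3):  # assumed number of improvement rounds
    run("batch_inference.py", "config/batch_inference.yaml")
    run("evaluation.py", "config/evaluation.yaml")
    # Inspect the evaluation results pushed to the Hub and stop here if the
    # fine-tuned model is already good enough; otherwise generate more
    # synthetic data, merge it into the training set, and fine-tune again.
    run("data_gen.py", "config/synth_data_gen.yaml")
    run("dataset_merge.py", "config/dataset_merge.yaml")
    # ...re-run fine-tuning with the merged dataset (see "Fine-tuning")...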

Hugging Face Hub authentication

All steps in this project leverage the Hugging Face Hub for accessing and managing models and datasets. To avoid unexpected errors, we recommend authenticating with the Hugging Face Hub before proceeding. Authentication can be done with the following CLI command; simply paste your Hugging Face access token into the prompt that appears.

# Alternative way is setting HUGGING_FACE_HUB_TOKEN environment variable
# Hugging Face libraries will look up the HUGGING_FACE_HUB_TOKEN value
$ huggingface-cli login
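
If you prefer to authenticate from Python instead of the CLI, the huggingface_hub library exposes a login helper. This is a minimal sketch assuming the token is already exported in the HUGGING_FACE_HUB_TOKEN environment variable mentioned above.

# Programmatic alternative to `huggingface-cli login`.
import os
from huggingface_hub import login

# Reads the access token from the environment variable and logs in.
login(token=os.environ["HUGGING_FACE_HUB_TOKEN"])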

Choosing service LLM

The evaluation and synthetic data generation steps require access to a service LLM. This project leverages the genai-apis library to switch between different service LLMs. The supported service LLMs include Gemini API (AI Studio), Gemini API (Vertex AI), OpenAI API, Anthropic API, Anthropic API (Vertex AI), and Anthropic API (AWS Bedrock). To use one of these, follow the instructions below:

$ # openai, gemini, gemini-vertex, anthropic, anthropic-vertex, anthropic-bedrock
$ pip install genai-apis[gemini]

$ # for openai, gemini, and anthropic, set API key as below
$ # for *-vertex and *-bedrock, use gcloud or aws CLIs to get credentials
$ export SERVICE_LLM_API_KEY=XXXX

$ # for *-vertex and *-bedrock, setup the additional environment variables
$ export GCP_PROJECT_ID=XXXX
$ export GCP_LOCATION=XXXX
$ export AWS_LOCATION=XXXX

$ # choose the right service provider and model when running
$ # evaluation.py (evaluation) and data_gen.py (synthetic data generation)
$ python evaluation.py ... --service-llm-provider gemini --service-model-name gemini-1.0-pro

$ # additionally, setup service type specific generation config
$ # gemini: config/gemini_gen_configs.yaml
$ # openai: config/gpt_gen_configs.yaml
$ # claude: config/claude_gen_configs.yaml

Coverage dataset

The initial coverage dataset should follow the format below. This is essentially the same format that the synthetically generated datasets follow as well.

  • generators: the model used to generate the data
  • prompt_ids: --
  • seed_prompts: the base prompts used to generate the data
  • messages: the generated synthetic data
  • category: the category this data belongs to
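
For concreteness, here is a hypothetical example of a single row in this format. The column names come from the list above, but the values and the chat-style role/content layout of the messages field are illustrative assumptions, not actual project data.

# Hypothetical example of one coverage/synthetic dataset row.
example_row = {
    "generators": "gpt-4o",                     # model used to generate the data
    "prompt_ids": 0,                            # identifier (illustrative value)
    "seed_prompts": "Summarize the following article: ...",
    "messages": [                               # assumed chat-style layout
        {"role": "user", "content": "Summarize the following article: ..."},
        {"role": "assistant", "content": "The article explains ..."},
    ],
    "category": "summarization",
}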

Fine-tuning

We leverage Hugging Face's alignment-handbook to streamline LLM fine-tuning. All the detailed fine-tuning parameters for this project can be found in this config. Note that the same config can be reused for the batch inference step in the next section, ensuring there are no mismatched configurations.

Here are the recipes that we used to fine-tune the Gemma 2B and 7B, Mistral 7B v0.3, and LLaMA3 8B models.

Batch inference

Batch inference lets the fine-tuned LLM generate text and pushes the results to a Hugging Face Dataset repository.

To perform this, run the following command in the terminal:

# All parameters defined in the config/batch_inference.yaml file
# could be manually inputted as CLI arguments (arg names are the same)
$ python batch_inference.py --from-config config/batch_inference.yaml

Then, the resulting outputs will be pushed to the Hugging Face Dataset repository with the following structure:

  • instructions: the input prompts
  • target_responses: the desired outputs
  • candidate_responses: the model-generated outputs
  • model_id: the ID of the model that generated the outputs
  • model_sha: the version (revision) of the model
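
To sanity-check the pushed results, they can be loaded back with the datasets library. The repository ID below is a placeholder; use the one configured in config/batch_inference.yaml.

# Sketch: load and inspect the batch inference results from the Hub.
from datasets import load_dataset

results = load_dataset("your-username/your-batch-inference-results", split="train")
print(results.column_names)
# ['instructions', 'target_responses', 'candidate_responses', 'model_id', 'model_sha']
print(results[0]["candidate_responses"])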

Evaluation

The evaluation step assesses the text generated by the fine-tuned LLM with the help of the service LLM. The evaluation criteria are similarity and quality, judged by comparing the generated text to the given desired outputs.

To perform this, run the following command in the terminal:

# All parameters defined in the config/evaluation.yaml file
# could be manually inputted as CLI arguments (arg names are the same)
$ python evaluation.py --from-config config/evaluation.yaml

Then, the resulting outputs will be pushed to the Hugging Face Dataset repository with the following structure:

  • (omitted columns): all columns are copied over from the batch inference results
  • eval_prompts: the prompts given to the evaluator
  • similarity_scores: similarity score on a 0~100 scale
  • precision_scores: precision score on a 0~100 scale
  • evaluators: the model name used as the evaluator
  • dates: evaluation dates
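
A quick way to act on these results is to average the scores and decide whether another round of synthetic data generation is worthwhile. The repository ID, the assumption that each row holds a single numeric score, and the 80-point threshold are all illustrative, not part of the project.

# Sketch: summarize evaluation scores and decide on the next step.
from datasets import load_dataset

evals = load_dataset("your-username/your-evaluation-results", split="train")

# Assumes one numeric score per row in each column.
avg_similarity = sum(evals["similarity_scores"]) / len(evals)
avg_precision = sum(evals["precision_scores"]) / len(evals)
print(f"similarity={avg_similarity:.1f}, precision={avg_precision:.1f}")

if min(avg_similarity, avg_precision) < 80:  # illustrative threshold
    print("Consider generating more synthetic data (see the next section).")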

Synthetic data generation

Synthetic data generation produces data similar to the data used to fine-tune the LLM. This can be performed based on the evaluation results. For instance, if you are not satisfied with the evaluation results and think the training dataset is not large enough, you can create more similar data to boost the performance of the LLM.

To perform this, run the following command in the terminal:

# All parameters defined in the config/synth_data_gen.yaml file
# could be manually inputted as CLI arguments (arg names are the same)
$ python data_gen.py --from-config config/synth_data_gen.yaml

Then, the resulting outputs will be pushed to the Hugging Face Dataset repository with the following structure (the same format as the coverage dataset):

  • generators: the model used to generate the data
  • prompt_ids: --
  • seed_prompts: the base prompts used to generate the data
  • messages: the generated synthetic data
  • category: the category this data belongs to

Merging generated dataset

Synthetically generated datasets are a means of supplementing the original dataset. To take both the original and the synthetic datasets into account when fine-tuning a language model, they should be merged into a single dataset. This project provides a script for that purpose.

To perform this, run the following command in the terminal. If you have more than one synthetic dataset, consider running the same script iteratively:

# All parameters defined in the config/dataset_merge.yaml file
# could be manually inputted as CLI arguments (arg names are the same)
$ python dataset_merge.py --from-config config/dataset_merge.yaml
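
Conceptually, the merge step boils down to concatenating the coverage dataset with the synthetic dataset(s). The sketch below shows that idea with the datasets library; the repository IDs are placeholders, and the actual dataset_merge.py script reads them from config/dataset_merge.yaml and handles pushing the merged result for you.

# Conceptual sketch of merging the original and synthetic datasets.
from datasets import load_dataset, concatenate_datasets

original = load_dataset("your-username/coverage-dataset", split="train")
synthetic = load_dataset("your-username/synthetic-dataset", split="train")

merged = concatenate_datasets([original, synthetic])
merged.push_to_hub("your-username/merged-dataset")  # placeholder repository ID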

Synthetic Dataset from LlamaDuo

We openly provide all the datasets synthetically generated via GPT-4o, Gemini 1.5 Flash, and Claude 3 Sonnet. To use them, please check out the data directory.

  1. synth_classification_ds_gpt4o_dedup: 128K synthetic dataset for the classification task, generated by GPT-4o
  2. synth_closedqa_ds_gpt4o_dedup: 128K synthetic dataset for the closed QA task, generated by GPT-4o
  3. synth_coding_ds_gpt4o_dedup: 128K synthetic dataset for the coding task, generated by GPT-4o
  4. synth_summarize_ds_gpt4o_dedup: 256K synthetic dataset for the summarization task, generated by GPT-4o
  5. synth_summarize_ds_gemini1_5flash_dedup: 256K synthetic dataset for the summarization task, generated by Gemini 1.5 Flash
  6. synth_summarize_ds_claude3sonnet_dedup: 256K synthetic dataset for the summarization task, generated by Claude 3 Sonnet

Citation

If you use the data or code in this repo, please consider citing the following paper.

@inproceedings{park2025llmops,
  title     = {LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs},
  author    = {Chansung Park and Juyong Jiang and Fan Wang and Sayak Paul and Jing Tang},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)},
  year      = {2025},
  url       = {https://2025.aclweb.org/program/main_papers/},
}
