Fine-tuning sentence transformer models for Vietnamese information retrieval using any custom dataset.
ViIR is a flexible framework for fine-tuning bi-encoders and other transformer models for Vietnamese information retrieval. It supports three strategies:
- Baseline: Using pre-trained models without fine-tuning for benchmarking
- Positive-pair Tuning: Fine-tuning with query-document positive pairs
- Hard Negative Tuning: Fine-tuning with hard negatives (non-relevant documents that closely resemble the relevant one) so the model learns finer distinctions
This framework is designed to work with any Vietnamese dataset that contains query-document pairs, making it adaptable for legal documents, news articles, medical information, and more.
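To make the difference between positive-pair and hard negative tuning concrete, the sketch below turns positive pairs into training triples by picking, for each query, the non-matching documents that share the most words with it. It is only an illustration: this README does not say how ViIR mines hard negatives, and the function name `mine_hard_negatives` and the word-overlap (Jaccard) score are assumptions chosen to keep the example free of dependencies.

```python
from typing import Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard overlap between two texts (0.0 to 1.0)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def mine_hard_negatives(
    pairs: List[Tuple[str, str]], num_negatives: int = 1
) -> List[Dict[str, object]]:
    """For each (query, positive) pair, pick the most query-like other documents."""
    documents = [doc for _, doc in pairs]
    triples = []
    for query, positive in pairs:
        candidates = [d for d in documents if d != positive]
        # Lexically similar but non-matching documents count as "hard" here.
        candidates.sort(key=lambda d: jaccard(query, d), reverse=True)
        triples.append(
            {
                "query": query,
                "positive": positive,
                "negatives": candidates[:num_negatives],
            }
        )
    return triples


pairs = [
    ("thủ đô của Việt Nam", "Hà Nội là thủ đô của Việt Nam"),
    ("dân số của Việt Nam", "Việt Nam có dân số khoảng 100 triệu người"),
    ("món phở", "Phở là món ăn truyền thống"),
]
triples = mine_hard_negatives(pairs, num_negatives=1)
for triple in triples:
    print(triple["query"], "->", triple["negatives"])
```

In practice, hard negatives are often mined with a retriever such as BM25 or an earlier checkpoint of the model itself; the word-overlap score above is just the simplest stand-in.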
# Clone repository
git clone https://github.com/ntphuc149/ViIR.git
cd ViIR
# Install package and dependencies
pip install -e .
- Python 3.8+
- PyTorch 1.10+
- Transformers 4.16+
- Sentence-transformers 2.2.0+
- Scikit-learn 1.0.0+
- Pandas & NumPy
- Tqdm
- PyYAML
The framework expects your dataset to have at least the following columns:
- question or query: The search query text
- context or document: The document text
- (Optional) abstractive_answer: The ground truth answer
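As a rough illustration of the column requirements listed above, the following sketch reads a CSV and maps whichever accepted column names are present onto a single record shape. The helper names `pick_column` and `load_pairs` and the output keys are hypothetical; this is not the code in ViIR's preprocessing script.

```python
import csv
import io
from typing import Dict, List, Sequence

# Accepted aliases for each field, as listed above.
QUERY_COLUMNS = ("question", "query")
DOCUMENT_COLUMNS = ("context", "document")
ANSWER_COLUMN = "abstractive_answer"  # optional


def pick_column(fieldnames: Sequence[str], aliases: Sequence[str]) -> str:
    """Return the first alias that exists in the header, or fail loudly."""
    for name in aliases:
        if name in fieldnames:
            return name
    raise ValueError(f"Dataset needs one of these columns: {', '.join(aliases)}")


def load_pairs(csv_file) -> List[Dict[str, str]]:
    """Read query-document pairs from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    fieldnames = reader.fieldnames or []
    query_col = pick_column(fieldnames, QUERY_COLUMNS)
    doc_col = pick_column(fieldnames, DOCUMENT_COLUMNS)
    records = []
    for row in reader:
        record = {"query": row[query_col].strip(), "document": row[doc_col].strip()}
        if ANSWER_COLUMN in fieldnames:
            record["answer"] = row[ANSWER_COLUMN].strip()
        records.append(record)
    return records


sample = io.StringIO(
    "question,context,abstractive_answer\n"
    "Thủ đô của Việt Nam?,Hà Nội là thủ đô của Việt Nam.,Hà Nội\n"
)
records = load_pairs(sample)
print(records)
```

Running a check like this before preprocessing surfaces a missing or misnamed column early, with a message that names the accepted alternatives.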
python scripts/preprocess.py --input /path/to/your_dataset.csv --output data/processed/
The framework supports three training strategies that can be easily selected through configuration files:
# Baseline (no fine-tuning)
python scripts/train.py --config viir/config/baseline.yaml
# Positive-pair Tuning
python scripts/train.py --config viir/config/positive_pair.yaml
# Hard Negative Tuning
python scripts/train.py --config viir/config/hard_negative.yaml
You can also override the model, batch size, learning rate, and other parameters from the command line:
# Using PhoBERT model with custom hyperparameters
python scripts/train.py --config viir/config/hard_negative.yaml \
    --model_name vinai/phobert-base \
    --batch_size 32 \
    --learning_rate 2e-5 \
    --epochs 5
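One common way to let command-line flags take precedence over a configuration file is sketched below. The flag names match the command above, but everything else is assumed for illustration: the dictionary stands in for a parsed YAML file, only `model.name` is a key this README actually mentions, and the `training` section, its default values, and the `apply_overrides` helper are hypothetical.

```python
import argparse
import copy

# Stand-in for a parsed YAML config. The "training" keys and their values
# are assumed; only model.name is named in this README.
DEFAULT_CONFIG = {
    "model": {"name": "FacebookAI/xlm-roberta-base"},
    "training": {"batch_size": 16, "learning_rate": 2e-5, "epochs": 3},
}

# CLI flag -> (section, key) in the config. This mapping is hypothetical.
OVERRIDES = {
    "model_name": ("model", "name"),
    "batch_size": ("training", "batch_size"),
    "learning_rate": ("training", "learning_rate"),
    "epochs": ("training", "epochs"),
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", type=str)
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--learning_rate", type=float)
    parser.add_argument("--epochs", type=int)
    return parser


def apply_overrides(config: dict, args: argparse.Namespace) -> dict:
    """Return a copy of config with every explicitly passed flag applied."""
    merged = copy.deepcopy(config)
    for flag, (section, key) in OVERRIDES.items():
        value = getattr(args, flag)
        if value is not None:  # flags left unset keep the file's value
            merged[section][key] = value
    return merged


args = build_parser().parse_args(
    ["--model_name", "vinai/phobert-base", "--batch_size", "32"]
)
config = apply_overrides(DEFAULT_CONFIG, args)
print(config)
```

Leaving every flag's default as `None` is what makes it possible to tell "not passed" apart from "passed with the same value as the file".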
Evaluate your trained model with standard IR metrics including NDCG, MRR, precision, and recall:
python scripts/evaluate.py --model_path output/model/ --data_dir data/processed/ --split test
For comprehensive evaluation across all splits:
python scripts/evaluate.py --model_path output/model/ --data_dir data/processed/ --split all
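For readers unfamiliar with the metrics named above, here is a minimal reference sketch of each one for a single query, assuming binary relevance (a document is either relevant or not). The function names and the cutoff `k` are illustrative; the README does not state which cutoffs ViIR reports or how its evaluation module is implemented.

```python
import math
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """DCG of the top-k results divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Two toy queries: (ranked result ids, set of relevant ids).
runs = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d2", "d9"], {"d5"}),
]
ranked, relevant = runs[0]
# MRR is the mean of the reciprocal ranks over all queries.
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
print(precision_at_k(ranked, relevant, 3), recall_at_k(ranked, relevant, 3))
print(ndcg_at_k(ranked, relevant, 3), mrr)
```

In the first query the only relevant document sits at rank 2, so its reciprocal rank is 0.5; the second query retrieves nothing relevant and contributes 0, giving an MRR of 0.25.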
For convenience, you can run the entire pipeline in one command:
# Using the run.py script
python run.py --input /path/to/your_dataset.csv --strategy hard_negative
Switch between strategies and models:
# Using baseline strategy with default XLM-RoBERTa
python run.py --input /path/to/your_dataset.csv --strategy baseline
# Using positive-pair strategy with PhoBERT
python run.py --input /path/to/your_dataset.csv \
    --strategy positive_pair \
    --model_name vinai/phobert-base
# Using hard negative strategy with custom hyperparameters
python run.py --input /path/to/your_dataset.csv \
    --strategy hard_negative \
    --model_name vinai/phobert-base-v2 \
    --batch_size 16 \
    --learning_rate 3e-5 \
    --epochs 3
ViIR/
├── viir/ # Main package directory
│ ├── __init__.py # Package initialization
│ ├── config/ # Configuration files
│ ├── data/ # Data processing modules
│ ├── trainers/ # Training strategy implementations
│ ├── utils/ # Utility functions
│ ├── evaluation/ # Evaluation tools
│ └── main.py # Main module
├── scripts/ # CLI scripts
│ ├── preprocess.py # Data preprocessing
│ ├── train.py # Model training
│ └── evaluate.py # Model evaluation
├── run.py # Convenience script for running the pipeline
├── setup.py # Package setup
└── README.md # This file
You can use any model from the Hugging Face Hub, either by passing it on the command line or by changing the model.name parameter in the configuration files:
python run.py --input your_data.csv --strategy hard_negative --model_name vinai/phobert-base
model:
  name: "vinai/phobert-base" # Or any other Vietnamese language model
  max_seq_length: 256 # PhoBERT accepts at most 256 tokens; XLM-RoBERTa allows up to 512
  trust_remote_code: true
The framework has been tested with the following Vietnamese models:
- FacebookAI/xlm-roberta-base (default)
- FacebookAI/xlm-roberta-large
- vinai/phobert-base-v2
- vinai/phobert-large
- And other models compatible with Sentence Transformers
If your dataset has a different format, you can modify the viir/data/processor.py file to handle your specific data structure.
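When the difference is only in file type or field names, an alternative to editing the processor is to convert the data into the expected CSV columns beforehand. The sketch below does this for a JSON Lines source; the input field names `title` and `body`, the `FIELD_MAP` mapping, and the `jsonl_to_csv` helper are all hypothetical and would need to match your own data.

```python
import csv
import io
import json
from typing import Dict, Iterable

# Hypothetical source field names -> column names the framework expects.
FIELD_MAP = {"title": "question", "body": "context"}


def jsonl_to_csv(
    jsonl_lines: Iterable[str], csv_file, field_map: Dict[str, str] = FIELD_MAP
) -> int:
    """Write one CSV row per JSON line, renaming fields; return the row count."""
    writer = csv.DictWriter(csv_file, fieldnames=list(field_map.values()))
    writer.writeheader()
    count = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        writer.writerow(
            {target: record[source] for source, target in field_map.items()}
        )
        count += 1
    return count


source = [
    '{"title": "Thủ đô của Việt Nam?", "body": "Hà Nội là thủ đô của Việt Nam."}',
    "",
]
buffer = io.StringIO()
written = jsonl_to_csv(source, buffer)
print(buffer.getvalue())
```

With real files, pass an open JSON Lines file and an output file opened with `newline=""` and `encoding="utf-8"` in place of the in-memory objects used here.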
If you use this framework in your research or applications, please cite:
@misc{viir,
author = {Truong-Phuc Nguyen},
title = {ViIR: The Unified Framework for Fine-tuning Vietnamese Information Retrieval Models with Various Tuning Strategies},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ntphuc149/ViIR}}
}
This project is released under the MIT License.