ViIR: Information Retrieval Fine-tuning Framework for Vietnamese Documents

Fine-tuning sentence transformer models for Vietnamese information retrieval using any custom dataset.

Overview

ViIR is a flexible framework for fine-tuning Bi-Encoder and other transformer models for Vietnamese information retrieval tasks. The framework supports three main fine-tuning strategies:

Baseline: Using pre-trained models without fine-tuning for benchmarking
Positive-pair Tuning: Fine-tuning with query-document positive pairs
Hard Negative Tuning: Advanced fine-tuning with hard negatives for improved discrimination

This framework is designed to work with any Vietnamese dataset that contains query-document pairs, making it adaptable for legal documents, news articles, medical information, and more.

Installation

# Clone repository
git clone https://github.com/ntphuc149/ViIR.git
cd ViIR

# Install package and dependencies
pip install -e .

Requirements

Python 3.8+
PyTorch 1.10+
Transformers 4.16+
Sentence-transformers 2.2.0+
Scikit-learn 1.0.0+
Pandas & NumPy
Tqdm
PyYAML

Usage

1. Data Preprocessing

The framework expects your dataset to have at least the following columns:

question or query: The search query text
context or document: The document text
(Optional) abstractive_answer: The ground truth answer

python scripts/preprocess.py --input /path/to/your_dataset.csv --output data/processed/

2. Model Training

The framework supports three training strategies that can be easily selected through configuration files:

# Baseline (no fine-tuning)
python scripts/train.py --config viir/config/baseline.yaml

# Positive-pair Tuning
python scripts/train.py --config viir/config/positive_pair.yaml

# Hard Negative Tuning
python scripts/train.py --config viir/config/hard_negative.yaml

You can directly specify model, batch size, learning rate and other parameters via command line:

# Using PhoBERT model with custom hyperparameters
python scripts/train.py --config viir/config/hard_negative.yaml \
                        --model_name vinai/phobert-base \
                        --batch_size 32 \
                        --learning_rate 2e-5 \
                        --epochs 5

3. Model Evaluation

Evaluate your trained model with standard IR metrics including NDCG, MRR, precision, and recall:

python scripts/evaluate.py --model_path output/model/ --data_dir data/processed/ --split test

For comprehensive evaluation across all splits:

python scripts/evaluate.py --model_path output/model/ --data_dir data/processed/ --split all

4. Running the Complete Pipeline

For convenience, you can run the entire pipeline in one command:

# Using the run.py script
python run.py --input /path/to/your_dataset.csv --strategy hard_negative

Switch between strategies and models:

# Using baseline strategy with default XLM-RoBERTa
python run.py --input /path/to/your_dataset.csv --strategy baseline

# Using positive-pair strategy with PhoBERT
python run.py --input /path/to/your_dataset.csv \
              --strategy positive_pair \
              --model_name vinai/phobert-base

# Using hard negative strategy with custom hyperparameters
python run.py --input /path/to/your_dataset.csv \
              --strategy hard_negative \
              --model_name vinai/phobert-base-v2 \
              --batch_size 16 \
              --learning_rate 3e-5 \
              --epochs 3

Project Structure

ViIR/
├── viir/                  # Main package directory
│   ├── __init__.py        # Package initialization
│   ├── config/            # Configuration files
│   ├── data/              # Data processing modules
│   ├── trainers/          # Training strategy implementations
│   ├── utils/             # Utility functions
│   ├── evaluation/        # Evaluation tools
│   └── main.py            # Main module
├── scripts/               # CLI scripts
│   ├── preprocess.py      # Data preprocessing
│   ├── train.py           # Model training
│   └── evaluate.py        # Model evaluation
├── run.py                 # Convenience script for running the pipeline
├── setup.py               # Package setup
└── README.md              # This file

Customization

Using Different Models

You can use any model from the Hugging Face hub either by specifying it in the command line or by changing the model.name parameter in the configuration files:

Command line method:

python run.py --input your_data.csv --strategy hard_negative --model_name vinai/phobert-base

Configuration file method:

model:
  name: "vinai/phobert-base"  # Or any other Vietnamese language model
  max_seq_length: 512
  trust_remote_code: true

Supported Vietnamese Models

The framework has been tested with the following Vietnamese models:

FacebookAI/xlm-roberta-base (default)
FacebookAI/xlm-roberta-large
vinai/phobert-base-v2
vinai/phobert-large
And other models compatible with Sentence Transformers

Custom Dataset Format

If your dataset has a different format, you can modify the viir/data/processor.py file to handle your specific data structure.

Citation

If you use this framework in your research or applications, please cite:

@misc{viir,
  author = {Truong-Phuc Nguyen},
  title = {ViIR: The Unified Framework for Fine-tuning Vietnamese Information Retrieval Models with Various Tuning Strategies},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ntphuc149/ViIR}}
}

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
scripts		scripts
viir		viir
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
run.py		run.py
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

ViIR: Information Retrieval Fine-tuning Framework for Vietnamese Documents

Overview

Installation

Requirements

Usage

1. Data Preprocessing

2. Model Training

3. Model Evaluation

4. Running the Complete Pipeline

Project Structure

Customization

Using Different Models

Command line method:

Configuration file method:

Supported Vietnamese Models

Custom Dataset Format

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

ntphuc149/ViIR

Folders and files

Latest commit

History

Repository files navigation

ViIR: Information Retrieval Fine-tuning Framework for Vietnamese Documents

Overview

Installation

Requirements

Usage

1. Data Preprocessing

2. Model Training

3. Model Evaluation

4. Running the Complete Pipeline

Project Structure

Customization

Using Different Models

Command line method:

Configuration file method:

Supported Vietnamese Models

Custom Dataset Format

Citation

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages