This repository is the official implementation of MangaLMM.
First, set up the environment as shown below:
```bash
# (Optional) Use a conda environment to avoid conflicts with the base environment; unexpected issues may occur otherwise.
conda create -n manga python=3.12 -y
conda activate manga

# Install uv
pip install uv

# Make a virtual environment
uv venv --python 3.12

# Install dependencies (options: cpu, cu118, cu121, cu124, cu126). Our work was tested with CUDA 12.4.
uv sync --dev --extra cu124
uv sync --dev --extra cu124 --extra deepspeed --extra flashattn
uv sync --dev --extra cu124 --extra deepspeed --extra flashattn --extra notebook

# Activate the virtual environment
source .venv/bin/activate
```
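As an optional sanity check after activation, the snippet below confirms that PyTorch was installed with CUDA support; the import is standard, but the exact versions pulled in depend on the extras you selected above.

```python
# Optional sanity check: verify that PyTorch sees the GPUs.
import torch

print(torch.__version__)          # PyTorch version installed by uv sync
print(torch.cuda.is_available())  # should print True on a CUDA machine
print(torch.cuda.device_count())  # number of visible GPUs
```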
After setting up the environment, complete the following two steps:
- Download the Manga109 dataset.
- Download and preprocess our MangaOCR and MangaVQA training sets.
Because the Manga109 license does not permit redistribution, the dataset is not included in this repository; please download Manga109 from Hugging Face yourself. After downloading, place `Manga109_released_2023_12_07.zip` in the `data` folder.
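If you prefer a scripted download, the sketch below uses `huggingface_hub`. The dataset repo id is a placeholder that you must replace with the actual Manga109 page on Hugging Face; it is not part of this repository, and you may need to accept the dataset's license on its page first.

```python
# Hypothetical download sketch: replace the repo_id placeholder with the actual
# Manga109 dataset repository on Hugging Face.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="<manga109-dataset-repo>",            # placeholder, not a real repo id
    filename="Manga109_released_2023_12_07.zip",  # file name expected by this repository
    repo_type="dataset",
    local_dir="data",
)
```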
To clean the data, we download the training set from Hugging Face and preprocess it by removing fields with null values (e.g., `"image": null`, `"text": null`). The resulting dataset is saved as a locally accessible JSONL file, which makes inspection and debugging easier.
```bash
cd data/
unzip Manga109_released_2023_12_07.zip
python image_split.py  # split Manga109 images into train, valid, and test
python preprocess.py   # preprocess our MangaOCR and MangaVQA training sets
```
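For reference, the null-field cleanup described above roughly amounts to the sketch below; the record layout is an assumption, and the actual logic lives in `data/preprocess.py`.

```python
# Rough sketch of the null-field cleanup (illustrative; see data/preprocess.py for the real logic).
import json

def clean_records(records, out_path):
    """Drop keys whose value is null and write the remaining records as JSONL."""
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            cleaned = {k: v for k, v in record.items() if v is not None}
            f.write(json.dumps(cleaned, ensure_ascii=False) + "\n")
```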
To train MangaLMM as described in the paper, run:

```bash
sh scripts/MangaLMM.sh
```

The hyperparameters are specified in `scripts/MangaLMM.sh`. We used four A100 GPUs (80 GB each) for training.
Inference with MangaLMM on MangaOCR takes over 10 hours on a single A100 GPU, but only a few hours when using the script below with GNU `parallel`.

```bash
apt install parallel  # only needs to be run once
sh scripts/inference_OCR.sh ./outputs/MangaLMM 4  # for 4 GPUs
```

Once the script completes, run `evaluation/eval_OCR.ipynb` to evaluate the OCR performance.
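The OCR score reported in the results table below is an Hmean, i.e., the harmonic mean of precision and recall. The sketch below shows only this final aggregation; the matching criteria (box matching, text normalization) are defined in `evaluation/eval_OCR.ipynb` and are not reproduced here.

```python
def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, reported as a percentage."""
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)

# Illustrative numbers only (not MangaLMM's actual precision/recall):
print(round(hmean(0.80, 0.75), 1))  # 77.4
```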
To evaluate the MangaLMM model on MangaVQA, run:

```bash
export OPENAI_API_KEY="{input_openai_api_key}"
CUDA_VISIBLE_DEVICES=0 python evaluation/inference_VQA.py --model_path ./outputs/MangaLMM
```

The evaluation uses GPT-4o as the LLM-as-a-judge and takes approximately 40 minutes on a single A100 GPU.
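For reference, the judging step corresponds to a call like the minimal sketch below, using the OpenAI Python SDK; the actual prompt, scoring rubric, and response parsing live in `evaluation/inference_VQA.py` and will differ.

```python
# Minimal LLM-as-a-judge sketch (illustrative; not the repository's exact prompt or parsing).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, reference: str, prediction: str) -> str:
    """Ask GPT-4o to rate a predicted answer against the reference on a 0-10 scale."""
    prompt = (
        "Rate the predicted answer against the reference answer on a scale of 0 to 10.\n"
        f"Question: {question}\nReference: {reference}\nPrediction: {prediction}\n"
        "Reply with a single number."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```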
You can download the pretrained model here:

- MangaLMM, trained on the MangaOCR and MangaVQA training sets as described in the paper (a minimal loading sketch follows).
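The sketch below loads the checkpoint with Hugging Face Transformers. It assumes MangaLMM follows the Qwen2.5-VL architecture (its base model in the results table below) and uses the `hal-utokyo/MangaLMM` id that appears in the evaluation commands; a recent `transformers` release and `accelerate` are assumed for `device_map="auto"`.

```python
# Minimal loading sketch (assumes a Qwen2.5-VL-style checkpoint).
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "hal-utokyo/MangaLMM"  # checkpoint id taken from the evaluation commands below
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",  # requires accelerate
)
```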
To evaluate the pretrained MangaLMM model on MangaOCR, run:

```bash
apt install parallel  # only needs to be run once
sh scripts/inference_OCR.sh hal-utokyo/MangaLMM 4  # for 4 GPUs
```

Once the script completes, run `evaluation/eval_OCR.ipynb` to evaluate the OCR performance.
To evaluate the pretrained MangaLMM model on MangaVQA, run:

```bash
export OPENAI_API_KEY="{input_openai_api_key}"
CUDA_VISIBLE_DEVICES=0 python evaluation/inference_VQA.py --model_path hal-utokyo/MangaLMM
```
Our model achieves the following performance on MangaOCR and MangaVQA:
| Model name | MangaOCR (Hmean, %) | MangaVQA (LMM judge, /10.0) |
|---|---|---|
| GPT-4o | 0.0 | 5.76 |
| Gemini 2.5 Flash | 0.0 | 3.87 |
| Phi-4-Multimodal | 0.0 | 3.08 |
| Qwen2.5-VL 7B | 0.9 | 5.36 |
| MangaLMM (Ours) | 71.5 | 6.57 |
MIT License