On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation
This repository contains all code to support the paper:
"On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation"
[arXiv](https://arxiv.org/abs/2502.19285)
We developed a vision-language model for the pathology domain of melanocytic lesions. The model was trained and evaluated on a dataset of 19,636 melanocytic lesion cases, each consisting of one or more whole slide images (WSIs) and a pathology report. In total, the dataset comprised 42,433 H&E-stained WSIs and 2,132,008 words. We built upon the BLIP-2 framework, using BioGPT as the base language model and HIPT for WSI feature extraction. To evaluate the model, we assessed cross-modal retrieval performance and conducted a reader study to score the quality of the generated reports.
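As a reference for the cross-modal retrieval evaluation mentioned above, the sketch below shows one common way to compute recall@k from an image-report similarity matrix. This is a generic illustration, not the paper's exact evaluation code; the assumption is that the correct report for image `i` sits at index `i` of the candidate set.

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth match (same index) ranks in the top-k.

    similarity[i, j] is the score between query i (e.g. a WSI embedding) and
    candidate j (e.g. a report embedding); the correct match for query i is
    assumed to be candidate i.
    """
    ranks = np.argsort(-similarity, axis=1)  # candidate indices, best first
    targets = np.arange(similarity.shape[0])[:, None]
    hits = (ranks[:, :k] == targets).any(axis=1)  # match found in top-k?
    return float(hits.mean())

# Toy example with 3 image-report pairs (hypothetical scores):
sim = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],  # query 1 only retrieves its match at rank 2
    [0.1, 0.2, 0.7],
])
print(recall_at_k(sim, k=1))  # 2 of 3 matches ranked first
print(recall_at_k(sim, k=2))  # all matches within top-2
```

The same function covers both retrieval directions (image-to-report and report-to-text) by transposing the similarity matrix.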
We provide checkpoints for both the retrieval and report generation stages. All models are available from the corresponding HuggingFace repository.
The retrieval model was trained with 16 queries and was used for the retrieval results presented in the paper.
The final report generation models build upon the Stage 1 checkpoint trained with 64 queries and were used for the reader study results.
Final Stage 2 models:
If you find our work useful in your research, please consider citing our paper:
@article{lucassen2025importance,
title={On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation},
author={Lucassen, Ruben T and van de Luijtgaarden, Tijn and Moonemans, Sander P J and Breimer, Gerben E and Blokx, Willeke A M and Veta, Mitko},
year={2025},
eprint={2502.19285},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.19285}
}