MarkushGrapher

This is the repository for MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures.

Citation

If you find this repository useful, please consider citing:

@article{morin2025markushgrapherjointvisualtextual,
	title        = {{MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures}},
	author       = {Lucas Morin and Valéry Weber and Ahmed Nassar and Gerhard Ingmar Meijer and Luc Van Gool and Yawei Li and Peter Staar},
	year         = 2025,
	journal      = {arXiv preprint arXiv:2503.16096},
	url          = {https://arxiv.org/abs/2503.16096},
	eprint       = {2503.16096},
	archiveprefix = {arXiv},
	primaryclass = {cs.CV}
}

Installation

Create a virtual environment.

python3.10 -m venv markushgrapher-env
source markushgrapher-env/bin/activate

Install MarkushGrapher.

pip install -e .

Install the transformers fork. This fork contains the code for the MarkushGrapher architecture, which was written starting from a copy of the UDOP architecture.

git clone https://github.com/lucas-morin/transformers.git ./external/transformers
pip install -e ./external/transformers

Install MolScribe. This fork contains minor fixes for compatibility with albumentations.

git clone https://github.com/lucas-morin/MolScribe.git ./external/MolScribe
pip install -e ./external/MolScribe --no-deps
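To confirm that the three editable installs are visible from the active environment, a small check along these lines may help. This is only a sketch: the module names `markushgrapher`, `transformers`, and `molscribe` are assumptions inferred from the commands and repository names above, not confirmed import names, and `check_environment` is a hypothetical helper that is not part of the repository.

```python
import importlib.util
import sys

# Assumed import names; adjust if the installed packages expose different modules.
EXPECTED_MODULES = ("markushgrapher", "transformers", "molscribe")


def check_environment(modules=EXPECTED_MODULES, min_python=(3, 10)):
    """Return a dict reporting the Python version check and which modules resolve."""
    report = {"python_ok": sys.version_info[:2] >= min_python}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'MISSING'}")
```

Any entry reported as MISSING suggests that the corresponding install step did not complete in this virtual environment.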

Model

Download the MarkushGrapher model from HuggingFace.

huggingface-cli download ds4sd/MarkushGrapher --local-dir ./tmp/ --repo-type model && cp -r ./tmp/models . && rm -r ./tmp/
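The same download, copy, and cleanup sequence can probably be scripted in Python. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function. The helper names `download_repo` and `promote_models_dir` are hypothetical and not part of the repository.

```python
import shutil
from pathlib import Path


def download_repo(repo_id="ds4sd/MarkushGrapher", tmp_dir="./tmp"):
    """Download the model repository into tmp_dir (requires huggingface_hub)."""
    # Imported lazily so the copy helper below works without the package.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id=repo_id, repo_type="model", local_dir=tmp_dir)
    return Path(tmp_dir)


def promote_models_dir(tmp_dir="./tmp", dest_root="."):
    """Copy tmp_dir/models to dest_root/models, then delete tmp_dir."""
    src = Path(tmp_dir) / "models"
    if not src.is_dir():
        raise FileNotFoundError(f"expected a 'models' folder in {tmp_dir}")
    dest = Path(dest_root) / "models"
    # dirs_exist_ok merges into an existing ./models directory, like `cp -r`.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    shutil.rmtree(tmp_dir)
    return dest
```

Calling `promote_models_dir(download_repo())` from the repository root should leave a `./models` folder and no `./tmp` folder, matching the shell one-liner.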

Download the MolScribe model from HuggingFace.

wget https://huggingface.co/yujieq/MolScribe/resolve/main/swin_base_char_aux_1m680k.pth -P ./external/MolScribe/ckpts/
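Before running inference, it may be worth checking that both downloads landed where the commands above place them. The following is a minimal sketch. The paths are taken from those commands, while `missing_artifacts` is a hypothetical helper, and the check only tests for existence, not for file integrity.

```python
from pathlib import Path

# Paths taken from the download commands above, relative to the repository root.
EXPECTED_ARTIFACTS = (
    "models",
    "external/MolScribe/ckpts/swin_base_char_aux_1m680k.pth",
)


def missing_artifacts(root=".", expected=EXPECTED_ARTIFACTS):
    """Return the expected paths that do not exist under root."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_artifacts()
    print("all artifacts present" if not missing else f"missing: {missing}")
```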

Datasets

Download the datasets from HuggingFace.

huggingface-cli download ds4sd/MarkushGrapher-Datasets --local-dir ./data/hf --repo-type dataset

For training, we use:

MarkushGrapher-Synthetic-Training (Synthetic dataset)

For benchmarking, we use:

M2S (Multi-modal real-world dataset)
USPTO-Markush (Image-only real-world dataset)
MarkushGrapher-Synthetic (Synthetic dataset)

The synthetic datasets are generated using MarkushGenerator.
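After the download, a quick inventory of `./data/hf` can help locate each dataset. The sketch below makes no assumption about the folder layout inside the download. It only counts files per top-level entry, and `summarize_download` is a hypothetical helper.

```python
from pathlib import Path


def summarize_download(root="./data/hf"):
    """Map each visible top-level entry under root to the number of files it holds."""
    summary = {}
    for entry in sorted(Path(root).iterdir()):
        # Skip hidden entries (for example, metadata a download tool may leave behind).
        if entry.name.startswith("."):
            continue
        if entry.is_dir():
            summary[entry.name] = sum(1 for p in entry.rglob("*") if p.is_file())
        else:
            summary[entry.name] = 1
    return summary


if __name__ == "__main__":
    for name, count in summarize_download().items():
        print(f"{name}: {count} file(s)")
```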

Inference

Note: MarkushGrapher cannot currently process images without OCR annotations. The model requires OCR bounding boxes and the corresponding text as input.

Select a dataset by setting the dataset_path parameter in MarkushGrapher/config/dataset_predict.yaml.
Run MarkushGrapher.

python3.10 -m markushgrapher.eval config/predict.yaml

Visualize the predictions in MarkushGrapher/data/visualization/prediction/.
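Switching datasets between runs means editing the `dataset_path` value by hand. A small text-level helper such as the one below could automate that step. It is a sketch under one assumption: that the key appears in the config as a single `dataset_path: value` line. The actual structure of `dataset_predict.yaml` is not shown here, and `set_dataset_path` is a hypothetical helper.

```python
import re

# Matches one `dataset_path: value` line, capturing indentation and key.
_DATASET_PATH_LINE = re.compile(r"^([ \t]*dataset_path[ \t]*:[ \t]*).*$", re.MULTILINE)


def set_dataset_path(yaml_text, new_path):
    """Replace the value of the first dataset_path line, keeping its indentation."""
    # A function replacement avoids backslash escapes being interpreted in new_path.
    updated, count = _DATASET_PATH_LINE.subn(
        lambda m: m.group(1) + new_path, yaml_text, count=1
    )
    if count == 0:
        raise KeyError("no dataset_path entry found")
    return updated
```

The function works on the file contents as a string, so the caller reads the config, applies the change, and writes the result back. Editing at the text level leaves any comments and key order in the file untouched.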

Training

Select the training configuration in MarkushGrapher/config/train.yaml and MarkushGrapher/config/datasets/datasets.yaml.
Run the training script.

PYTHONUNBUFFERED=1 CUDA_VISIBLE_DEVICES=0 python3.10 -m markushgrapher.train config/train.yaml
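When training is launched from another Python script (for example, a job scheduler wrapper), the same invocation can be expressed with `subprocess`. This is a sketch: the module name, config path, and environment variables come from the command above, while the helper names `build_training_call` and `run_training` are hypothetical.

```python
import os
import subprocess


def build_training_call(config="config/train.yaml", gpu="0", python="python3.10"):
    """Return (command, env) equivalent to the shell invocation above."""
    command = [python, "-m", "markushgrapher.train", config]
    env = dict(os.environ)
    env["PYTHONUNBUFFERED"] = "1"  # emit log lines without buffering
    env["CUDA_VISIBLE_DEVICES"] = gpu  # restrict the run to the listed GPU(s)
    return command, env


def run_training(**kwargs):
    """Launch training and raise CalledProcessError if it exits with an error."""
    command, env = build_training_call(**kwargs)
    subprocess.run(command, env=env, check=True)
```

Keeping the command construction separate from the launch makes it possible to inspect or log the exact call before starting a long run.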

Acknowledgments

MarkushGrapher builds on the code of UDOP and uses the MolScribe model.

MarkushGrapher was trained from the pre-trained UDOP weights available on HuggingFace (checkpoint: udop-unimodel-large-512-300k-steps.zip).
