This is the official implementation of the paper SelfDocSeg: A self-supervised vision-based approach towards Document Segmentation by S. Maity, S. Biswas, S. Manna, A. Banerjee, J. Lladós, S. Bhattacharya, and U. Pal, published in the proceedings of ICDAR 2023.
- 18 Aug 2023: Pre-release is available! A stable release will follow if required. If you face any problem, please read the FAQ & Issues section and raise an issue if necessary.
- Requirements
- Getting Started
- FAQ & Issues
- Acknowledgement
- Citation
| Methods | Self-Supervision | mAP on DocLayNet |
|---|---|---|
| Supervised Mask RCNN | | 72.4 |
| BYOL + Mask RCNN | ✔️ | 63.5 |
| SelfDocSeg + Mask RCNN | ✔️ | 74.3 |
- Python 3.9
- torch 1.12.0, torchvision 0.13.0, torchaudio 0.12.0
- pytorch-lightning 1.8.1
- lightly 1.2.35
- torchinfo 1.7.1
- torchmetrics 0.11
- tensorboard 2.11
- scipy 1.9.3
- numpy 1.23
- scikit-learn 1.1.3
- opencv-python 4.6
- pillow 9.3
- pandas 1.5
- seaborn 0.12.1
- matplotlib 3.6
- tabulate 0.9
- tqdm 4.64
- pyyaml 6.0
- yacs 0.1.8
- pycocotools 2.0
- detectron2 0.6
- Dataset
- Pretraining
- Finetuning
- For the self-supervised pretraining of SelfDocSeg we use the DocLayNet dataset. It is also available for download on HuggingFace. The annotations are in COCO format. The dataset should be extracted in the following structure; a quick sanity-check sketch follows the tree.
```
dataset                          # Dataset root directory
└── DocLayNet                    # DocLayNet dataset root directory
    ├── PNG                      # Directory containing all images
    │   ├── <image_file>.png
    │   ├── <image_file>.png
    │   └── ...
    └── COCO                     # Directory containing annotations
        ├── train.json           # train set annotation
        ├── val.json             # validation set annotation
        └── test.json            # test set annotation
```
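As a quick sanity check of this layout, the COCO annotations can be loaded with `pycocotools` (already in the requirements) and matched against the `PNG` directory. This is a minimal sketch, not part of the repo scripts; the `dataset/DocLayNet` root below is just the example path from the tree.

```python
import os

from pycocotools.coco import COCO

root = "dataset/DocLayNet"                               # assumed root from the tree above
coco = COCO(os.path.join(root, "COCO", "train.json"))    # load train split annotations
print(len(coco.imgs), "images,", len(coco.cats), "categories")

# Every annotated image should exist under PNG/
missing = [img["file_name"] for img in coco.imgs.values()
           if not os.path.isfile(os.path.join(root, "PNG", img["file_name"]))]
print(len(missing), "annotated images missing from PNG/")
```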
- As DocLayNet is not a classification dataset, we use the document classification dataset RVL-CDIP for the linear evaluation protocol, to make sure that the model generalizes well. It is also available for download on HuggingFace. The original dataset comes with separate image and annotation files. It needs to be restructured into `torchvision.datasets.ImageFolder` format as shown below, so that each dataset split has one directory per label name containing the corresponding images. We use the train + val splits for training and the test split for testing in the kNN linear evaluation. A minimal loading sketch follows the tree.
```
dataset                          # Dataset root directory
└── RVLCDIP_<split>              # RVL-CDIP dataset split root directory, e.g. 'RVLCDIP_train', 'RVLCDIP_test'
    ├── <label 0>                # Directory containing all images with label 0
    │   ├── <image_file>.tif
    │   ├── <image_file>.tif
    │   └── ...
    ├── <label 1>                # Directory containing all images with label 1
    │   ├── <image_file>.tif
    │   ├── <image_file>.tif
    │   └── ...
    ├── <label 2>                # Directory containing all images with label 2
    │   └── ...
    ├── ...
    └── <label 15>               # Directory containing all images with label 15
        ├── <image_file>.tif
        ├── <image_file>.tif
        └── ...
```
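A minimal loading sketch (assuming the split directories above; transforms omitted) showing how the restructured splits are read by `torchvision.datasets.ImageFolder`:

```python
from torchvision import datasets

# Each split root contains one sub-directory per label, as in the tree above.
train_set = datasets.ImageFolder("dataset/RVLCDIP_train")   # train (+ val) split for kNN training
test_set = datasets.ImageFolder("dataset/RVLCDIP_test")     # test split for kNN evaluation

print(len(train_set.classes), "classes found:", train_set.classes[:3], "...")
print(len(train_set), "training images /", len(test_set), "test images")
```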
- Run the script `pretraining/train_ssl.py` as
  ```
  python pretraining/train_ssl.py --knn_train_root /path/to/train/ --knn_eval_root /path/to/test/ --dataset_root /path/to/pretraining/image/directory/
  ```
  where `/path/to/train/` and `/path/to/test/` refer to the RVL-CDIP kNN training split root directory 'RVLCDIP_train' and testing split root directory 'RVLCDIP_test' respectively, and `/path/to/pretraining/image/directory/` refers to the DocLayNet image directory. The complete set of options with default values is given below.
  ```
  python pretraining/train_ssl.py --num_eval_classes 16 --dataset_name DocLayNet --knn_train_root /path/to/train/ --knn_eval_root /path/to/test/ --dataset_root /path/to/pretraining/image/directory/ --logs_root ./benchmark_logs --num_workers 0 --max_epochs 800 --batchsize 8 --n_runs 1 --learning_rate 0.2 --lr_decay 5e-4 --wt_momentum 0.99 --bin_threshold 239 --kernel_shape rect --kernel_size 3 --kernel_iter 2 --eeta 0.001 --alpha 1.0 --beta 1.0
  ```
  If you want to resume training from a previous checkpoint, add `--resume /path/to/checkpoint/` to the command. To use multiple GPUs, pass the `--distributed` flag; as additional controls, the `--sync_batchnorm` and `--gather_distributed` flags synchronize batchnorms and gather features before loss calculation, respectively, across GPUs. Run `python pretraining/train_ssl.py --help` for the details.
- The checkpoints and logs are saved in the `./benchmark_logs/<dataset_name>/version_<version num>/SelfDocSeg` directory. The `<version num>` depends on how many times training has been run and is automatically incremented from the largest `<version num>` available. If `--n_runs` is greater than 1, `/run<run_number>` subdirectories are created to save data from each run. Checkpoints are saved both at the last epoch and at the best kNN accuracy, in a `checkpoints` subdirectory under the aforementioned run directory.
- Run
  ```
  python pretraining/extract_encoder.py --checkpoint /path/to/saved/checkpoint.ckpt --weight_save_path /path/to/save/weights.pth --num_eval_classes 16 --knn_train_root /path/to/train/ --knn_eval_root /path/to/test/ --dataset_root /path/to/pretraining/image/directory/
  ```
  and the encoder weights will be extracted from the checkpoint and saved, in the default Torchvision ResNet-50 format, as the `.pth` file given in `--weight_save_path`. The paths `/path/to/train/` and `/path/to/test/` refer to the RVL-CDIP kNN training split root directory 'RVLCDIP_train' and testing split root directory 'RVLCDIP_test' respectively, and `/path/to/pretraining/image/directory/` refers to the DocLayNet image directory. A sketch of loading the extracted weights follows this step.
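  Since the extracted weights are in the default Torchvision ResNet-50 format, they should load into a plain `torchvision.models.resnet50`. A minimal sketch, not part of the repo scripts; the weight path is a placeholder:

  ```python
  import torch
  import torchvision

  # Plain ResNet-50 with randomly initialised weights, then load the extracted encoder.
  backbone = torchvision.models.resnet50(weights=None)
  state_dict = torch.load("/path/to/save/weights.pth", map_location="cpu")

  # strict=False tolerates keys that may not match exactly (e.g. a missing fc head).
  missing, unexpected = backbone.load_state_dict(state_dict, strict=False)
  print("missing keys:", missing)
  print("unexpected keys:", unexpected)
  ```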
- Before finetuning the pretrained encoder on document segmentation, the weights need to be converted to the Detectron2 format by running the following.
  ```
  python finetuning/convert-torchvision-to-d2.py /path/to/save/weights.pth /path/to/save/d2/weights.pkl
  ```
  `/path/to/save/weights.pth` is the path to the encoder weights extracted after pretraining, and `/path/to/save/d2/weights.pkl` is the file path where the converted weights are to be saved in `.pkl` format.
- Run the following command to start finetuning. The path `/path/to/DocLayNet/root/` refers to the root directory of the DocLayNet dataset in COCO format.
  ```
  python finetuning/train_net.py --num-gpus 1 --config-file finetuning/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml MODEL.WEIGHTS /path/to/save/d2/weights.pkl --dataset_root /path/to/DocLayNet/root/
  ```
  The training configuration is defined in the `finetuning/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml` file and can be modified either directly there or by passing arguments on the command line. The path to the weights file can also be provided in the `.yaml` config file, in the `WEIGHTS` key under `MODEL` (see the sketch after this step). To train with multiple GPUs, provide the number of available GPUs with the `--num-gpus` argument. The learning rate and batch size might need to be adjusted accordingly in the `.yaml` config file or on the command line, e.g. `SOLVER.IMS_PER_BATCH 16 SOLVER.BASE_LR 0.02` for 8 GPUs.
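  For reference, the same overrides could be placed directly in the `.yaml` config file. A minimal sketch of such a fragment, using the keys mentioned above (the values mirror the 8-GPU example; the weight path is a placeholder):

  ```yaml
  MODEL:
    WEIGHTS: "/path/to/save/d2/weights.pkl"   # converted encoder weights
  SOLVER:
    IMS_PER_BATCH: 16                         # total batch size across GPUs, e.g. for 8 GPUs
    BASE_LR: 0.02                             # learning rate adjusted to the larger batch
  ```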
- The default path to save the logs and checkpoints is set in the `finetuning/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml` file as `finetuning/output/doclaynet/mask_rcnn/rn50/`. The checkpoint after finetuning can be used to perform the evaluation on the DocLayNet dataset by adding the `--eval-only` flag along with the checkpoint on the command line as below.
  ```
  python finetuning/train_net.py --config-file finetuning/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml --eval-only MODEL.WEIGHTS /path/to/finetuning/checkpoint.pkl
  ```
For visualization, run the following command.
```
python visualize_json_results.py --input /path/to/output/evaluated/file.json --output /path/to/visualization/save/directory/ --dataset_root /path/to/DocLayNet/root/
```
`/path/to/output/evaluated/file.json` refers to the `.json` file created during evaluation with Detectron2 in the output directory, which defaults to `finetuning/output/doclaynet/mask_rcnn/rn50`. `/path/to/visualization/save/directory/` refers to the directory where the visualization results will be saved. `/path/to/DocLayNet/root/` refers to the root directory of the DocLayNet dataset in COCO format. The confidence score threshold is set to 0.6 by default and can be overridden with the `--conf-threshold` command-line option, e.g. `--conf-threshold 0.6`.
- The `--num_eval_classes 16` argument refers to the 16 classes in the RVL-CDIP dataset used for linear evaluation.
- The pretraining can be done with any dataset by setting the pretraining dataset image folder via `--dataset_root /path/to/pretraining/image/directory/`, and any dataset split in `torchvision.datasets.ImageFolder` format can be used for linear evaluation by giving the proper root paths to the split and the number of classes, e.g. `--num_eval_classes 16 --knn_train_root /path/to/train/ --knn_eval_root /path/to/test/`.
- The finetuning code in Detectron2 currently supports the DocLayNet dataset only. If you wish to finetune on any other dataset, we recommend preparing the dataset in COCO format. Get help from the Detectron2 - Custom Dataset Tutorial.
- The pretraining phase provides trained encoder weights in Torchvision format after extraction. Thus they can be used with any Mask RCNN implementation in PyTorch, or any object detection framework, instead of Detectron2 (see the sketch after this list).
- SelfDocSeg does not depend on textual guidance and hence can be used for documents in any language.
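For illustration, a minimal sketch (not part of the repo) of reusing the extracted encoder with torchvision's own Mask R-CNN instead of Detectron2. Here `num_classes=12` (11 DocLayNet categories plus background) and the weight path are assumptions to adapt to your setup:

```python
import torch
import torchvision

# Standard torchvision Mask R-CNN with a ResNet-50 FPN backbone, randomly initialised
# (weights_backbone=None avoids downloading ImageNet weights that would be overwritten).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(
    weights=None, weights_backbone=None, num_classes=12)

# The FPN backbone body is a plain torchvision ResNet-50, so the extracted encoder
# weights are expected to line up; strict=False tolerates the absent fc head.
encoder_state = torch.load("/path/to/save/weights.pth", map_location="cpu")
model.backbone.body.load_state_dict(encoder_state, strict=False)
```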
If there is any query, please raise an issue. We shall try our best to help you out!
The code is implemented with the help of two wonderful open-source repositories, Lightly and Detectron2.
If you use our code for your research, please cite our paper. Many thanks!
```
@inproceedings{maity2023selfdocseg,
  title={SelfDocSeg: A Self-Supervised vision-based Approach towards Document Segmentation},
  author={Subhajit Maity and Sanket Biswas and Siladittya Manna and Ayan Banerjee and Josep Lladós and Saumik Bhattacharya and Umapada Pal},
  booktitle={International Conference on Document Analysis and Recognition (ICDAR)},
  year={2023}
}
```