PicAxe 1.0.0 Release Notes:
Thursday, February 14th, 2025
Welcome to PicAxe 1.0.0! This program automatically extracts figures from PDFs that contain text and images, returning images as PNGs. PicAxe was developed to extract figures that are not embedded separately from text in PDF syntax. Please enjoy two PicAxe pipelines, PicAxe-YOLO and PicAxe-OCR, for testing and use.
This release of PicAxe was developed by UChicago computer science master's students Krishna Kamath, Qilin Zhou, and Bruno Felalaga, with supervision and testing by Dr. Anna Clemencia Guerrero (Santa Fe Institute). The project was advised by Dr. Aaron K. Dinner (UChicago) and Dr. Julia Damerow (Arizona State University). Ground truth data collection was performed by Maria Guerrero.
Features
Both PicAxe pipelines include a function to remove tables before extracting images. PicAxe-OCR eliminates text with PaddleOCR and then extracts the remaining marks, whereas PicAxe-YOLO performs figure detection with two pre-trained YOLOv8 models.
For details on each pipeline, please see the README file in its respective folder.
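The OCR-style idea above (remove text, then keep what is left) can be illustrated with a toy sketch. This is not PicAxe's actual code: the page is a small hand-written grid rather than a rendered PDF, the text boxes are hard-coded where a real run would get them from PaddleOCR, the function names `mask_boxes` and `remaining_regions` are hypothetical, and grouping leftover pixels by connected components is an assumption about what "extracting remaining marks" could look like.

```python
from collections import deque


def mask_boxes(page, boxes):
    """Return a copy of page with each (top, left, bottom, right) box set to 0."""
    cleaned = [row[:] for row in page]
    for top, left, bottom, right in boxes:
        for r in range(top, bottom + 1):
            for c in range(left, right + 1):
                cleaned[r][c] = 0
    return cleaned


def remaining_regions(page):
    """Return bounding boxes of 4-connected groups of nonzero (ink) pixels."""
    height, width = len(page), len(page[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r in range(height):
        for c in range(width):
            if page[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and page[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append((top, left, bottom, right))
    return regions


# Toy 6x8 "page": 1 = ink, 0 = blank.
page = [
    [1, 1, 1, 1, 0, 0, 0, 0],  # a line of text
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0],  # a small figure
    [0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],  # another line of text
]

# Hypothetical text boxes, standing in for OCR detections.
text_boxes = [(0, 0, 0, 3), (5, 0, 5, 2)]

figures = remaining_regions(mask_boxes(page, text_boxes))
print(figures)  # [(2, 5, 3, 6)]
```

In this toy case only the 2x2 block survives the masking step, so it is the single region reported. On real pages, stray marks and missed text would also survive, which is one reason extraction results should always be checked.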
Known Issues
- Extraction results will not be perfect from either pipeline. Users should always check the results of extraction before performing further data analysis. For more details about how we are working to improve extraction results, please see the main README file.
- Package dependencies can cause issues (noted in the respective README files), so we have provided Docker files. Docker images that are not pulled for some time will be deleted, so a given image may no longer be available.