A Python application for viewing PDF files and extracting structured data through annotations.
- High-quality PDF viewing with support for multiple pages
- Text selection and annotation capabilities
- Data extraction from PDFs with field mapping
- Date standardization for extracted data
- SQLite database storage for annotations
- CSV export for extracted data
Requirements:
- Python 3.9+
- PySide6
- PyMuPDF (fitz)
- python-dateutil
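The date standardization listed above might look roughly like the sketch below, which uses `dateutil.parser.parse` to turn free-form date strings into ISO 8601 (`YYYY-MM-DD`). The function name `standardize_date` and the choice of ISO 8601 as the target format are assumptions for illustration, not the application's actual code.

```python
from dateutil import parser


def standardize_date(raw: str, dayfirst: bool = False) -> str:
    """Parse a free-form date string and return it as YYYY-MM-DD.

    Hypothetical helper: the name and the ISO 8601 output format are
    assumptions, not taken from the pdf_data_viewer package.
    """
    # dayfirst=True reads ambiguous numeric dates such as 05/03/2024
    # as day/month/year instead of month/day/year.
    parsed = parser.parse(raw, dayfirst=dayfirst)
    return parsed.date().isoformat()


if __name__ == "__main__":
    for text in ("March 5, 2024", "15 Jan 2024"):
        print(text, "->", standardize_date(text))
```

Ambiguous numeric dates are the main pitfall here: the same string can resolve to two different days depending on `dayfirst`, so the caller needs to know which convention the source document follows.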
Installation:
- Clone the repository:
git clone https://github.com/yourusername/pdf-data-py.git
cd pdf-data-py
- Install the package:
pip install -e .
Run the application:
python -m pdf_data_viewer.main
Or use the entry point:
pdf-data-viewer
The application can extract the following fields:
- Document name
- Customer name
- Buyer name
- Buyer email
- Buyer phone
- Buyer job position
- Currency
- RFQ date
- Due date
- Line item number
- Material number
- Part number
- Description
- Full description
- Quantity
- Unit of measure
- Requested delivery date
- Delivery point
- Manufacturer name
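As an illustration of how extracted fields could end up in a CSV export, here is a minimal sketch using the standard-library `csv` module. The column subset in `FIELDS` and the function names are assumptions chosen for the example; the real exporter may use different columns and ordering.

```python
import csv
import io

# Hypothetical subset of the fields listed above, used as CSV columns.
FIELDS = ["Document name", "Customer name", "Currency", "RFQ date", "Quantity"]


def export_csv(records, out):
    """Write one row per record; fields missing from a record stay empty."""
    writer = csv.DictWriter(
        out,
        fieldnames=FIELDS,
        restval="",             # fill missing fields with an empty cell
        extrasaction="ignore",  # skip keys that are not export columns
    )
    writer.writeheader()
    writer.writerows(records)


def export_to_string(records):
    """Convenience wrapper that returns the CSV text instead of writing a file."""
    buffer = io.StringIO()
    export_csv(records, buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(export_to_string([{"Document name": "rfq.pdf", "Currency": "EUR"}]))
```

When writing to a real file instead of a string buffer, open it with `newline=""` so the `csv` module controls line endings.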
pdf-data-py/
├── data/                  # Data directory
│   ├── annotations.db     # SQLite database file
│   └── exports/           # CSV export directory
└── pdf_data_viewer/       # Main package
    ├── core/              # Core functionality
    ├── database/          # Database operations
    ├── ui/                # User interface
    └── utils/             # Utility functions
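To give a rough idea of what storing annotations in a SQLite file such as `data/annotations.db` can involve, the sketch below uses the standard-library `sqlite3` module. The table name, its columns, and the helper functions are all hypothetical; the actual schema in the `database/` package is not shown in this README.

```python
import sqlite3

# Hypothetical schema: one row per annotated value on a page.
SCHEMA = """
CREATE TABLE IF NOT EXISTS annotations (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    document TEXT    NOT NULL,
    page     INTEGER NOT NULL,
    field    TEXT    NOT NULL,
    value    TEXT    NOT NULL
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_annotation(conn, document, page, field, value):
    """Insert one annotation and return its row id."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO annotations (document, page, field, value) "
            "VALUES (?, ?, ?, ?)",
            (document, page, field, value),
        )
    return cursor.lastrowid


def annotations_for(conn, document):
    """Return (page, field, value) tuples for one document, in insert order."""
    rows = conn.execute(
        "SELECT page, field, value FROM annotations "
        "WHERE document = ? ORDER BY id",
        (document,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = open_db()  # pass "data/annotations.db" to persist to disk
    save_annotation(db, "rfq.pdf", 0, "Currency", "EUR")
    print(annotations_for(db, "rfq.pdf"))
```

Parameterized queries (`?` placeholders) keep text selected from a PDF from being interpreted as SQL.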
Build the package:
python setup.py build