A Python command-line tool that extracts text, tables, and images from PDF files, with optional OCR, built on a clean, modular architecture.
- Text Extraction: Extract plain text from PDFs using PyMuPDF (fitz)
- Table Extraction: Extract tables as CSV or JSON using pdfplumber
- Image Extraction: Extract embedded images from PDFs
- OCR Processing: Apply OCR to image-based PDFs using pytesseract
- Python 3.11 or higher
- Tesseract OCR (for OCR capabilities)
- Windows: Download and run the Tesseract installer for Windows
- macOS:
brew install tesseract
- Ubuntu/Debian:
sudo apt install tesseract-ocr
# Clone the repository
git clone https://github.com/sfkbstnc/pdf-extractor-cli.git
cd pdf-extractor-cli
# Install dependencies
pip install -r requirements.txt
# Make the CLI executable (Linux/macOS)
chmod +x pdf_extractor_cli.py
python pdf_extractor_cli.py --file [PDF_FILE] [OPTIONS]
Option | Description |
---|---|
--file | Path to the PDF file (required) |
--text | Extract plain text |
--tables | Extract tables |
--images | Extract images |
--ocr | Apply OCR to images |
--pages | Page numbers or ranges to process (e.g. 1-3,5,7) |
--output | Directory to save outputs (default: output/) |
--table-format | Format for table output (csv or json, default: csv) |
--verbose | Enable verbose logging |
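
For readers who want to see how the options above fit together, here is a minimal `argparse` sketch. It is not taken from the project's source: the function name `build_parser` is hypothetical, and only the flag names and defaults come from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative, not the project's code)."""
    parser = argparse.ArgumentParser(prog="pdf_extractor_cli.py")
    parser.add_argument("--file", required=True, help="Path to the PDF file")
    parser.add_argument("--text", action="store_true", help="Extract plain text")
    parser.add_argument("--tables", action="store_true", help="Extract tables")
    parser.add_argument("--images", action="store_true", help="Extract images")
    parser.add_argument("--ocr", action="store_true", help="Apply OCR to images")
    parser.add_argument("--pages", default=None, help="Page numbers or ranges, e.g. 1-3,5,7")
    parser.add_argument("--output", default="output/", help="Directory to save outputs")
    parser.add_argument(
        "--table-format",
        choices=["csv", "json"],
        default="csv",
        help="Format for table output",
    )
    parser.add_argument("--verbose", action="store_true", help="Enable verbose logging")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--file", "document.pdf", "--tables", "--pages", "1-3", "--table-format", "json"]
)
print(args.file, args.tables, args.pages, args.table_format)
```

Note that `argparse` exposes `--table-format` as `args.table_format`, and that flags left out of the command line fall back to the defaults listed in the table.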
python pdf_extractor_cli.py --file document.pdf --text
python pdf_extractor_cli.py --file document.pdf --tables --pages 1-3 --table-format json
python pdf_extractor_cli.py --file document.pdf --images --pages 5-10
python pdf_extractor_cli.py --file scanned_document.pdf --ocr
python pdf_extractor_cli.py --file document.pdf --text --tables --images --ocr
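
The `--pages` values used above (`1-3`, `5-10`, or a mixed form such as `1-3,5,7`) have to be expanded into individual page numbers at some point. The sketch below shows one plausible way to do that. The helper name `parse_pages` is hypothetical, and treating page numbers as 1-based is an assumption drawn from the examples rather than from the tool's source.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec such as "1-3,5,7" into sorted, unique page numbers.

    Assumes 1-based page numbers (an assumption, not confirmed by the tool).
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(parse_pages("1-3,5,7"))  # [1, 2, 3, 5, 7]
```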
--- Page 1 ---
Annual Report 2023
Company XYZ
Our mission is to provide the best services...
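
Output in this shape could be produced with PyMuPDF roughly as sketched below. Both function names are hypothetical, and the blank line between pages is an assumption. `fitz.open` and `page.get_text()` are real PyMuPDF calls, imported lazily here so the formatting helper can be tried without PyMuPDF installed.

```python
def format_pages(page_texts: list[str]) -> str:
    """Join per-page text under "--- Page N ---" headers (layout is an assumption)."""
    blocks = []
    for number, text in enumerate(page_texts, start=1):
        blocks.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(blocks)


def extract_text(pdf_path: str) -> str:
    """Read every page of a PDF with PyMuPDF and format the result."""
    import fitz  # PyMuPDF, imported lazily so format_pages has no hard dependency

    with fitz.open(pdf_path) as doc:
        return format_pages([page.get_text() for page in doc])


print(format_pages(["Annual Report 2023\nCompany XYZ\n"]))
```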
Name,Age,Position
John Doe,35,CEO
Jane Smith,32,CTO
Mike Johnson,28,Developer
Images are saved to the output/[filename]_images/ directory with filenames like page1_image1.png.
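
As a small illustration of that naming scheme, the sketch below builds such a path with `pathlib`. The helper `image_output_path` is hypothetical, and it assumes that `[filename]` means the PDF's name without its extension and that page and image counters start at 1.

```python
from pathlib import Path


def image_output_path(
    pdf_path: str, output_dir: str, page_number: int, image_number: int, ext: str = "png"
) -> Path:
    """Build output/[filename]_images/pageN_imageM.ext (naming details are assumptions)."""
    stem = Path(pdf_path).stem  # "document.pdf" -> "document"
    return Path(output_dir) / f"{stem}_images" / f"page{page_number}_image{image_number}.{ext}"


print(image_output_path("document.pdf", "output", 1, 1).as_posix())
```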
pdf-extractor-cli/
├── pdf_extractor/
│ ├── __init__.py
│ ├── main.py
│ ├── utils.py
│ ├── text_extractor.py
│ ├── table_extractor.py
│ ├── image_extractor.py
│ └── ocr_extractor.py
├── examples/
│ └── README.txt
├── output/
├── pdf_extractor_cli.py
├── requirements.txt
├── pyproject.toml
└── README.md
If you encounter the following error:
Error: tesseract is not installed or it's not in your PATH. See README file for more information.
Solutions:
- Ensure Tesseract is installed: Follow the installation instructions for your OS in the Prerequisites section.
- Add Tesseract to PATH:
  - Windows: Add the Tesseract installation directory (e.g., C:\Program Files\Tesseract-OCR) to your system's PATH variable.
  - Linux/macOS: Verify the installation with which tesseract.
- Specify the Tesseract path directly:
  - If you know where Tesseract is installed but can't modify PATH, you can set it in your code:
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # Windows example
If no tables are found in your PDF, verify that:
- Your PDF actually contains tables (not just spaces/tabs formatting text)
- The tables have clear borders or structural elements that pdfplumber can identify
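
If a table has no ruled borders, one thing worth trying outside the CLI is pdfplumber's text-based detection, which infers columns and rows from word alignment instead of drawn lines. The sketch below is an experiment, not this tool's behavior: the function name is hypothetical, while `vertical_strategy` and `horizontal_strategy` are real pdfplumber table settings.

```python
# Text-based strategies infer cell boundaries from aligned words instead of drawn lines.
TEXT_STRATEGY = {"vertical_strategy": "text", "horizontal_strategy": "text"}


def extract_tables_with_fallback(pdf_path: str) -> list:
    """Try pdfplumber's default line-based detection, then fall back to text alignment."""
    import pdfplumber  # imported lazily; requires pdfplumber to be installed

    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables = page.extract_tables()  # default settings look for ruled lines
            if not tables:
                tables = page.extract_tables(table_settings=TEXT_STRATEGY)
            results.extend(tables)
    return results
```

The text strategy can also report "tables" in ordinary multi-column prose, so its results are best reviewed by hand.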
- PyMuPDF (fitz): For text and image extraction
- pdfplumber: For table extraction
- pandas: For data handling
- pytesseract: For OCR
- Pillow (PIL): For image processing
- rich: For colorful terminal output
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.