PDF Extractor CLI

A Python command-line tool that extracts text, tables, and images from PDF files, with optional OCR for image-based documents. The code is organized into one module per extraction type.

Features

Text Extraction: Extract plain text from PDFs using PyMuPDF (fitz)
Table Extraction: Extract tables as CSV or JSON using pdfplumber
Image Extraction: Extract embedded images from PDFs
OCR Processing: Apply OCR to image-based PDFs using pytesseract

Installation

Prerequisites

Python 3.11 or higher
Tesseract OCR (for OCR capabilities)
- Windows: Download and run a Tesseract installer (see the Tesseract project's installation documentation)
- macOS: `brew install tesseract`
- Ubuntu/Debian: `sudo apt install tesseract-ocr`

Installation Steps

# Clone the repository
git clone https://github.com/sfkbstnc/pdf-extractor-cli.git
cd pdf-extractor-cli

# Install dependencies
pip install -r requirements.txt

# Make the CLI executable (Linux/macOS)
chmod +x pdf_extractor_cli.py

Usage

python pdf_extractor_cli.py --file [PDF_FILE] [OPTIONS]

Options

| Option | Description |
| --- | --- |
| `--file` | Path to the PDF file (required) |
| `--text` | Extract plain text |
| `--tables` | Extract tables |
| `--images` | Extract images |
| `--ocr` | Apply OCR to images |
| `--pages` | Page numbers or ranges to process (e.g. `1-3,5,7`) |
| `--output` | Directory to save outputs (default: `output/`) |
| `--table-format` | Format for table output (`csv` or `json`, default: `csv`) |
| `--verbose` | Enable verbose logging |
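The `--pages` value combines single pages and ranges. The sketch below shows one way such a spec could be expanded into page numbers. The function name `parse_pages` is hypothetical, and 1-based, inclusive ranges are assumed from the example `1-3,5,7`; the tool's actual parsing may differ.

```python
def parse_pages(spec):
    """Expand a spec such as '1-3,5,7' into a sorted list of page numbers.

    Assumption: pages are 1-based and ranges are inclusive.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(parse_pages("1-3,5,7"))
```

Using a set removes duplicates when ranges overlap, so `2,2-4` yields each page once.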

Examples

Extract text from a PDF:

python pdf_extractor_cli.py --file document.pdf --text

Extract tables from pages 1-3 and save as JSON:

python pdf_extractor_cli.py --file document.pdf --tables --pages 1-3 --table-format json

Extract images from specific pages:

python pdf_extractor_cli.py --file document.pdf --images --pages 5-10

Apply OCR to extract text from an image-based PDF:

python pdf_extractor_cli.py --file scanned_document.pdf --ocr

Extract everything from a PDF:

python pdf_extractor_cli.py --file document.pdf --text --tables --images --ocr
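The CLI takes one PDF per invocation, so processing a folder means calling it once per file. The following is a minimal sketch of that, using only the flags listed under Options. The helper names `build_command` and `run_batch` are hypothetical, and the script is assumed to be run from the repository root.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf_path, output_dir="output", pages=None):
    """Build the argument list for one text-and-tables extraction run."""
    cmd = [
        sys.executable, "pdf_extractor_cli.py",
        "--file", str(pdf_path),
        "--text", "--tables",
        "--output", str(output_dir),
    ]
    if pages:
        cmd += ["--pages", pages]
    return cmd


def run_batch(folder, output_dir="output"):
    """Run the CLI once for every PDF in `folder`, stopping on the first failure."""
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        subprocess.run(build_command(pdf_path, output_dir), check=True)


# Show the command that would be run for a single file.
print(build_command("document.pdf", pages="1-3"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.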

Output Examples

Text Extraction

--- Page 1 ---
Annual Report 2023
Company XYZ

Our mission is to provide the best services...
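For readers who want to reproduce this layout, here is a minimal sketch of per-page text extraction with PyMuPDF. The names `format_pages` and `extract_text` are hypothetical, and the `--- Page N ---` header is copied from the sample output above; the project's own `text_extractor.py` may be structured differently.

```python
def format_pages(page_texts, first_page=1):
    """Join per-page text under '--- Page N ---' headers."""
    blocks = []
    for offset, text in enumerate(page_texts):
        blocks.append(f"--- Page {first_page + offset} ---\n{text.strip()}")
    return "\n\n".join(blocks) + "\n"


def extract_text(pdf_path):
    """Read every page with PyMuPDF and return the formatted text."""
    import fitz  # PyMuPDF; imported lazily so format_pages works without it

    with fitz.open(pdf_path) as doc:
        # The generator is consumed inside the with block, while doc is open.
        return format_pages(page.get_text() for page in doc)


print(format_pages(["Annual Report 2023\nCompany XYZ"]), end="")
```

Keeping the formatting separate from the PDF access makes the header layout easy to check without a PDF at hand.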

Table Extraction (CSV)

Name,Age,Position
John Doe,35,CEO
Jane Smith,32,CTO
Mike Johnson,28,Developer
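A table extracted by pdfplumber is a list of rows, each a list of cell strings (or `None` for empty cells). The sketch below shows how such a table could be serialized to CSV or JSON. It uses the standard library `csv` and `json` modules rather than pandas, and the JSON layout (one record per row, keyed by the first row) is an assumption, not the tool's documented format. The function names are hypothetical.

```python
import csv
import io
import json


def table_to_csv(rows):
    """Serialize a table (list of rows) to CSV text; None cells become empty."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def table_to_json(rows):
    """Serialize a table to a JSON list of records, using the first row as header."""
    header, *body = rows
    records = [dict(zip(header, row)) for row in body]
    return json.dumps(records, indent=2)


table = [
    ["Name", "Age", "Position"],
    ["John Doe", "35", "CEO"],
    ["Jane Smith", "32", "CTO"],
]
print(table_to_csv(table), end="")
print(table_to_json(table))
```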

Image Extraction

Images are saved to the `output/[filename]_images/` directory with filenames like `page1_image1.png`.
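The naming scheme can be sketched as a small path helper. The function `image_output_path` is hypothetical. It assumes that `[filename]` means the PDF name without its extension and that images are always written as PNG, neither of which is confirmed beyond the example above.

```python
from pathlib import Path


def image_output_path(pdf_path, page_number, image_index, output_dir="output"):
    """Return the path an extracted image would be saved to.

    Assumption: '[filename]' is the PDF's name without the '.pdf' suffix.
    """
    stem = Path(pdf_path).stem
    return Path(output_dir) / f"{stem}_images" / f"page{page_number}_image{image_index}.png"


print(image_output_path("docs/report.pdf", 1, 1).as_posix())
```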

Project Structure

pdf-extractor-cli/
├── pdf_extractor/
│   ├── __init__.py
│   ├── main.py
│   ├── utils.py
│   ├── text_extractor.py
│   ├── table_extractor.py
│   ├── image_extractor.py
│   └── ocr_extractor.py
├── examples/
│   └── README.txt
├── output/
├── pdf_extractor_cli.py
├── requirements.txt
├── pyproject.toml
└── README.md

Troubleshooting

OCR Issues

Tesseract Not Found Error

If you encounter the following error:

Error: tesseract is not installed or it's not in your PATH. See README file for more information.

Solutions:

Ensure Tesseract is installed: Follow the installation instructions for your OS in the Prerequisites section.
Add Tesseract to PATH:
- Windows: Add the Tesseract installation directory (e.g., `C:\Program Files\Tesseract-OCR`) to your system's PATH variable.
- Linux/macOS: Verify the installation with `which tesseract`.

Specify Tesseract Path directly:

If you know where tesseract is installed but can't modify PATH, you can set it in your code:

import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # Windows example
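To check for the executable before running OCR, a lookup along the following lines may help. It searches PATH with `shutil.which` and then falls back to a list of known locations. The function `resolve_tesseract` is a hypothetical helper, not part of this project or of pytesseract.

```python
import shutil
from pathlib import Path


def resolve_tesseract(command="tesseract", fallback_paths=()):
    """Return the path to the Tesseract executable, or None if it cannot be found."""
    found = shutil.which(command)  # search the directories on PATH first
    if found:
        return found
    for candidate in fallback_paths:
        if Path(candidate).is_file():
            return str(candidate)
    return None


path = resolve_tesseract(
    fallback_paths=[r"C:\Program Files\Tesseract-OCR\tesseract.exe"]
)
if path is None:
    print("Tesseract not found; install it or add it to PATH.")
else:
    print(f"Using Tesseract at {path}")
    # With pytesseract installed, the resolved path could then be applied:
    # import pytesseract
    # pytesseract.pytesseract.tesseract_cmd = path
```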

Table Extraction Issues

If no tables are found in your PDF, verify that:

Your PDF actually contains tables (not just spaces/tabs formatting text)
The tables have clear borders or structural elements that pdfplumber can identify

Dependencies

PyMuPDF (fitz): For text and image extraction
pdfplumber: For table extraction
pandas: For data handling
pytesseract: For OCR
Pillow (PIL): For image processing
rich: For colorful terminal output

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
