Chonkie PDF

A Python CLI tool that splits large PDF files into smaller chunks, with compression for problematic PDFs in which individual pages retain the size of the entire document.

🚀 Features

Smart PDF Chunking: Split PDFs by page count while respecting size limits
Advanced Compression: Multiple compression libraries for optimal results
Automatic Problem Detection: Identifies PDFs with oversized pages
Flexible Compression Options: Choose compression quality and methods
Detailed Reporting: Comprehensive reports with compression statistics
Fallback Support: Works even without optional compression libraries

🔧 Compression Libraries

The tool supports several compression libraries and tries them in this order of preference:

pikepdf (Recommended) - Advanced compression with:
- Lossless content stream compression
- Image quality optimization
- Object deduplication
- Orphaned object removal
pypdf - Modern compression with:
- Content stream compression
- Object deduplication
- Efficient PDF optimization
PyPDF2 (Fallback) - Basic compression:
- Always available
- Basic content stream compression
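
The preference order above can be sketched as a simple availability check. This is a minimal illustration, not the tool's actual code: the names `PREFERRED_BACKENDS`, `available_backends`, and `pick_backend` are hypothetical, and the only assumption is that each library is importable under its usual module name.

```python
from importlib.util import find_spec

# Import names of the three libraries, in the order of preference listed above.
PREFERRED_BACKENDS = ("pikepdf", "pypdf", "PyPDF2")


def available_backends():
    """Return the installed backends, keeping the preference order."""
    return [name for name in PREFERRED_BACKENDS if find_spec(name) is not None]


def pick_backend():
    """Return the most preferred installed backend, or None if none is installed."""
    found = available_backends()
    return found[0] if found else None


if __name__ == "__main__":
    print("Available:", available_backends())
    print("Selected:", pick_backend())
```

Using `find_spec` checks for a library without importing it, so the check stays cheap even when all three are installed.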

📦 Installation

From requirements.txt

pip install -r requirements.txt

🎯 Usage

Run the tool:
```
python main.py
```
Follow the setup instructions (if needed):
- The app will automatically check if the files/ directory exists
- If not found, it will show clear instructions to create it
- If empty, it will remind you to add PDF files
Place your PDF files in the files/ directory when prompted
Configure settings:
- Enter maximum chunk size (e.g., 1024 KB)
- Choose whether to enable compression (Y/n)
- Set image compression quality (1-100, default 60)

First Time Setup

The application will guide you through setup automatically:

🔍 Checking setup...
⚠️  WARNING: 'files' directory not found!
📋 SETUP INSTRUCTIONS:
   1. Create a 'files' directory in the current folder
   2. Place your PDF files inside the 'files' directory
   3. Run the application again

💡 Quick setup commands:
   mkdir files
   # Then copy your PDF files to the files folder
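
A setup check of this kind could look roughly like the sketch below. The function name `check_setup` and its return values are assumptions made for illustration; only the `files` directory name comes from this README.

```python
from pathlib import Path


def check_setup(base_dir="."):
    """Report whether the 'files' directory is missing, empty, or ready.

    Returns one of "missing", "empty", or "ready" (hypothetical status names).
    """
    files_dir = Path(base_dir) / "files"
    if not files_dir.is_dir():
        return "missing"
    # Only PDF files count; other files in the directory are ignored.
    pdfs = [p for p in files_dir.iterdir() if p.suffix.lower() == ".pdf"]
    return "ready" if pdfs else "empty"


if __name__ == "__main__":
    print(check_setup())
```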

📁 Directory Structure

├── files/                  # Place your PDF files here
├── chunks/                 # Generated chunks will be saved here
│   ├── filename1/         # Chunks for filename1.pdf
│   ├── filename2/         # Chunks for filename2.pdf
│   └── chunking_report.txt # Detailed processing report
├── utils/                  # Utility functions
├── main.py                 # Main application
└── requirements.txt        # Dependencies

🔍 How It Works

Problem Detection

The tool automatically detects problematic PDFs where:

Individual pages are unusually large (>80% of target chunk size)
Pages retain the size of the entire document
Embedded objects cause size inflation

Compression Strategy

Pre-analysis: Check if original PDF needs compression
Original Compression: Compress the source PDF if beneficial
Chunk-level Compression: Compress individual chunks if they're still large
Quality Control: Only keep compressed versions if they provide >5% size reduction
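
The quality-control step can be illustrated with a small calculation. This is a sketch under stated assumptions: the function names are hypothetical, and the 5% threshold is the one quoted above. The sample figures are taken from the example output further down in this README.

```python
def reduction_percent(original_kb, compressed_kb):
    """Size reduction as a percentage of the original size."""
    if original_kb <= 0:
        raise ValueError("original size must be positive")
    return (1 - compressed_kb / original_kb) * 100


def keep_compressed(original_kb, compressed_kb, min_benefit_percent=5.0):
    """Keep the compressed file only if it saves more than the minimum benefit."""
    return reduction_percent(original_kb, compressed_kb) > min_benefit_percent


if __name__ == "__main__":
    saved = reduction_percent(15234.56, 3456.78)
    print(f"{saved:.1f}% reduction")          # 77.3% reduction
    print(keep_compressed(15234.56, 3456.78))  # True
    print(keep_compressed(1000.0, 960.0))      # False: only 4% smaller
```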

Compression Techniques

Content Stream Compression: Lossless compression of PDF content
Image Optimization: Reduce image quality while maintaining readability
Object Deduplication: Remove duplicate objects and references
Orphaned Object Removal: Clean up unused PDF objects

📊 Output

Chunks

Individual PDF files split by pages
Automatically compressed if beneficial
Named with clear numbering: filename-1.pdf, filename-2.pdf, etc.
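
One way to picture the splitting is a greedy pass over per-page sizes, as sketched below. This is an illustration only and may differ from what the tool does: `plan_chunks` and `chunk_filename` are hypothetical names, and summing per-page sizes is an approximation, since pages in a real PDF can share fonts and images.

```python
def plan_chunks(page_sizes_kb, max_chunk_kb):
    """Group consecutive pages so each chunk's estimated size stays within the limit.

    Returns a list of chunks, each a list of 1-based page numbers. A page that
    exceeds the limit on its own still gets a chunk to itself.
    """
    chunks = []
    current = []
    current_size = 0.0
    for page_number, size in enumerate(page_sizes_kb, start=1):
        # Start a new chunk when adding this page would exceed the limit.
        if current and current_size + size > max_chunk_kb:
            chunks.append(current)
            current, current_size = [], 0.0
        current.append(page_number)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


def chunk_filename(stem, index):
    """Build a chunk name in the filename-1.pdf, filename-2.pdf style."""
    return f"{stem}-{index}.pdf"


if __name__ == "__main__":
    plan = plan_chunks([400, 400, 400, 1200, 100], max_chunk_kb=1024)
    for index, pages in enumerate(plan, start=1):
        print(chunk_filename("large_document", index), pages)
```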

Report

Detailed chunking_report.txt includes:

Processing statistics
Compression ratios
Available compression methods
Per-file breakdown
Oversized chunk warnings

⚙️ Configuration Options

Compression Quality

1-30: High compression, lower quality (good for text-heavy documents)
31-70: Balanced compression and quality (recommended)
71-100: Low compression, high quality (good for image-heavy documents)
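
The quality prompt and the three bands above could be handled along these lines. This is a hedged sketch: `parse_quality` and `quality_band` are hypothetical helpers, and the default of 60 is the one shown in the usage prompt.

```python
def parse_quality(raw, default=60):
    """Turn the prompt answer into a quality value between 1 and 100.

    An empty answer falls back to the default; anything else must be an
    integer in range, otherwise ValueError is raised.
    """
    text = raw.strip()
    if not text:
        return default
    value = int(text)  # raises ValueError for non-numeric input
    if not 1 <= value <= 100:
        raise ValueError("quality must be between 1 and 100")
    return value


def quality_band(quality):
    """Map a quality value to one of the three bands described above."""
    if quality <= 30:
        return "high compression"
    if quality <= 70:
        return "balanced"
    return "high quality"


if __name__ == "__main__":
    quality = parse_quality("")
    print(quality, quality_band(quality))
```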

Size Thresholds

Compression Trigger: Pages >80% of max size get compressed
Minimum Benefit: Compression must provide >5% size reduction
Oversized Warning: Pages exceeding max size are flagged
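
The compression trigger and the oversized warning can be expressed as a single check per page. The sketch below assumes a hypothetical `classify_page` helper; the 80% trigger is the figure quoted above, and the sample page size comes from the example output in this README.

```python
def classify_page(page_kb, max_chunk_kb, trigger=0.80):
    """Return (needs_compression, oversized) for a single page.

    needs_compression: the page is larger than `trigger` times the maximum chunk size.
    oversized: the page alone exceeds the maximum chunk size.
    """
    needs_compression = page_kb > trigger * max_chunk_kb
    oversized = page_kb > max_chunk_kb
    return needs_compression, oversized


if __name__ == "__main__":
    print(classify_page(500.0, 1024.0))     # (False, False)
    print(classify_page(900.0, 1024.0))     # (True, False)
    print(classify_page(14892.34, 1024.0))  # (True, True)
```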

🛠️ Troubleshooting

Common Issues

"No compression libraries available"

pip install pikepdf pypdf

"Single page too large even after compression"

Try lower compression quality (20-40)
Increase maximum chunk size
Check if PDF contains high-resolution images

"Compression not helping"

Some PDFs are already optimized
Text-only PDFs may not compress much
Try different compression libraries

Performance Tips

pikepdf is fastest for image-heavy PDFs
pypdf is good for mixed content
Lower compression quality = faster processing
Larger chunk sizes = fewer output files, but each file is bigger

📈 Example Output

🚀 PDF Chunking Tool with Compression Started
============================================================

🔧 Available Compression Libraries:
   ✅ pikepdf (Advanced compression with image optimization)
   ✅ pypdf (Modern compression with object deduplication)
   ✅ PyPDF2 (Basic compression - always available)

📏 Enter maximum chunk size in KB (e.g., 1024): 1024
🗜️  Enable PDF compression? (Y/n): Y
🎨 Image compression quality (1-100, default 60): 60

🔍 Found 2 PDF files to process
📊 Maximum chunk size: 1024.0 KB
🗜️  Compression: Enabled
🎨 Image quality: 60%

📋 Progress: 1/2

📄 Processing: large_document.pdf
   Original size: 15234.56 KB
   Total pages: 45
   ⚠️  Single page size (14892.34 KB) is large, will attempt compression
   🗜️  Attempting to compress original PDF...
      Compressing PDF: large_document.pdf (15234.56 KB)
      ✅ Compressed using pikepdf: 3456.78 KB (77.3% reduction)
   ✅ Using compressed version for chunking
   ✅ Chunk 1: 15 pages, 987.65 KB
   ✅ Chunk 2: 15 pages, 1023.45 KB
   ✅ Chunk 3: 15 pages, 945.68 KB
   🎉 Successfully created 3 chunks

🤝 Contributing

Feel free to submit issues, feature requests, or pull requests to improve the tool!

📄 License

This project is open source. Feel free to use and modify as needed.

👨‍💻 Author

Built with ❤️ by Afshin Nahian Tripto
