A Python CLI tool that splits large PDF files into smaller chunks, with compression options for problematic PDFs in which each extracted page retains the size of the entire document.
- Smart PDF Chunking: Split PDFs by page count while respecting size limits
- Advanced Compression: Multiple compression libraries for optimal results
- Automatic Problem Detection: Identifies PDFs with oversized pages
- Flexible Compression Options: Choose compression quality and methods
- Detailed Reporting: Comprehensive reports with compression statistics
- Fallback Support: Works even without optional compression libraries
The tool supports multiple compression libraries, in order of preference:
- pikepdf (Recommended) - Advanced compression with:
  - Lossless content stream compression
  - Image quality optimization
  - Object deduplication
  - Orphaned object removal
- pypdf - Modern compression with:
  - Content stream compression
  - Object deduplication
  - Efficient PDF optimization
- PyPDF2 (Fallback) - Basic compression:
  - Always available
  - Basic content stream compression
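The preference order above could be resolved at startup roughly as follows. This is a minimal sketch, not the tool's actual code: the function names `available_libraries` and `pick_library` are hypothetical, and only the three package names come from the list above.

```python
from importlib.util import find_spec

# Preference order as listed above; these are the import names of each package.
PREFERRED_LIBRARIES = ("pikepdf", "pypdf", "PyPDF2")


def available_libraries(candidates=PREFERRED_LIBRARIES):
    """Return the candidates that can be imported, keeping preference order."""
    return [name for name in candidates if find_spec(name) is not None]


def pick_library(candidates=PREFERRED_LIBRARIES):
    """Return the most preferred installed library, or None if none is found."""
    found = available_libraries(candidates)
    return found[0] if found else None


if __name__ == "__main__":
    print("Available:", available_libraries())
    print("Selected:", pick_library())
```

Checking with `find_spec` avoids importing every library just to see whether it is installed.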
pip install -r requirements.txt
- Run the tool:
  python main.py
- Follow the setup instructions (if needed):
  - The app will automatically check if the files/ directory exists
  - If not found, it will show clear instructions to create it
  - If empty, it will remind you to add PDF files
- Place your PDF files in the files/ directory when prompted
- Configure settings:
  - Enter maximum chunk size (e.g., 1024 KB)
  - Choose whether to enable compression (Y/n)
  - Set image compression quality (1-100, default 60)
The application will guide you through setup automatically:
🔍 Checking setup...
⚠️ WARNING: 'files' directory not found!
📋 SETUP INSTRUCTIONS:
1. Create a 'files' directory in the current folder
2. Place your PDF files inside the 'files' directory
3. Run the application again
💡 Quick setup commands:
mkdir files
# Then copy your PDF files to the files folder
├── files/ # Place your PDF files here
├── chunks/ # Generated chunks will be saved here
│ ├── filename1/ # Chunks for filename1.pdf
│ ├── filename2/ # Chunks for filename2.pdf
│ └── chunking_report.txt # Detailed processing report
├── utils/ # Utility functions
├── main.py # Main application
├── requirements.txt # Dependencies
The tool automatically detects problematic PDFs where:
- Individual pages are unusually large (>80% of target chunk size)
- Pages retain the size of the entire document
- Embedded objects cause size inflation
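The 80% rule above can be expressed as a small check. The sketch below assumes the per-page sizes have already been measured (for example by writing each page to its own file); the function names are hypothetical and not taken from the tool.

```python
def is_problematic_page(page_size_kb, max_chunk_kb, threshold=0.8):
    """True when a single page exceeds the given fraction of the chunk limit."""
    return page_size_kb > threshold * max_chunk_kb


def find_problematic_pages(page_sizes_kb, max_chunk_kb, threshold=0.8):
    """Return the 1-based page numbers whose size exceeds the threshold."""
    return [
        number
        for number, size in enumerate(page_sizes_kb, start=1)
        if is_problematic_page(size, max_chunk_kb, threshold)
    ]


if __name__ == "__main__":
    # Hypothetical page sizes in KB, checked against a 1024 KB chunk limit.
    print(find_problematic_pages([100, 900, 500, 1500], 1024))
```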
- Pre-analysis: Check if original PDF needs compression
- Original Compression: Compress the source PDF if beneficial
- Chunk-level Compression: Compress individual chunks if they're still large
- Quality Control: Only keep compressed versions if they provide >5% size reduction
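The quality-control step comes down to a percentage comparison. Here is a minimal sketch under the assumption that sizes are compared in KB; `reduction_percent` and `keep_compressed` are illustrative names, not the tool's API.

```python
def reduction_percent(original_kb, compressed_kb):
    """Size reduction as a percentage of the original size."""
    if original_kb <= 0:
        raise ValueError("original size must be positive")
    return (original_kb - compressed_kb) / original_kb * 100


def keep_compressed(original_kb, compressed_kb, min_benefit_percent=5.0):
    """Keep the compressed file only if it saves more than the minimum benefit."""
    return reduction_percent(original_kb, compressed_kb) > min_benefit_percent


if __name__ == "__main__":
    # Sizes taken from the sample run in this README: 15234.56 KB -> 3456.78 KB.
    print(f"{reduction_percent(15234.56, 3456.78):.1f}% reduction")
    print("Keep compressed version:", keep_compressed(15234.56, 3456.78))
```

A file that shrinks from 1000 KB to 960 KB (4%) would be discarded under this rule, while one that shrinks to 900 KB (10%) would be kept.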
- Content Stream Compression: Lossless compression of PDF content
- Image Optimization: Reduce image quality while maintaining readability
- Object Deduplication: Remove duplicate objects and references
- Orphaned Object Removal: Clean up unused PDF objects
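As a rough illustration of the lossless techniques, the sketch below recompresses streams and drops unreferenced page resources with pikepdf. It assumes pikepdf is installed, the function name `compress_with_pikepdf` is hypothetical, and it does not cover image quality reduction; the tool's own implementation may differ.

```python
def compress_with_pikepdf(src_path, dst_path):
    """Losslessly recompress a PDF. Assumes pikepdf is installed."""
    import pikepdf  # imported lazily so this module loads without pikepdf

    with pikepdf.open(src_path) as pdf:
        # Remove resource entries that pages no longer reference.
        pdf.remove_unreferenced_resources()
        pdf.save(
            dst_path,
            compress_streams=True,  # compress streams that are stored uncompressed
            recompress_flate=True,  # re-deflate streams that are already compressed
            object_stream_mode=pikepdf.ObjectStreamMode.generate,  # pack objects into object streams
        )
```

Calling `compress_with_pikepdf("files/input.pdf", "compressed.pdf")` (both paths are placeholders) writes a recompressed copy and leaves the source file untouched.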
- Individual PDF files split by pages
- Automatically compressed if beneficial
- Named with clear numbering: filename-1.pdf, filename-2.pdf, etc.
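The naming scheme, combined with the per-file folders under chunks/, could be produced by a helper like the one below. This is a sketch only; `chunk_path` is a hypothetical name and the tool may build its paths differently.

```python
from pathlib import Path


def chunk_path(source_pdf, index, output_root="chunks"):
    """Build the output path for the index-th chunk (1-based) of a source PDF."""
    stem = Path(source_pdf).stem  # file name without directory or extension
    return Path(output_root) / stem / f"{stem}-{index}.pdf"


if __name__ == "__main__":
    for i in range(1, 4):
        print(chunk_path("files/large_document.pdf", i))
```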
Detailed chunking_report.txt includes:
- Processing statistics
- Compression ratios
- Available compression methods
- Per-file breakdown
- Oversized chunk warnings
- 1-30: High compression, lower quality (good for text-heavy documents)
- 31-70: Balanced compression and quality (recommended)
- 71-100: Low compression, high quality (good for image-heavy documents)
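The quality prompt and the three ranges above can be sketched as follows. The function names and tier labels are assumptions made for illustration; only the 1-100 range, the default of 60, and the range boundaries come from this README.

```python
def parse_quality(raw, default=60):
    """Parse the quality prompt input; empty input falls back to the default."""
    text = raw.strip()
    if not text:
        return default
    value = int(text)  # raises ValueError for non-numeric input
    if not 1 <= value <= 100:
        raise ValueError("quality must be between 1 and 100")
    return value


def quality_tier(quality):
    """Map a quality value to one of the three ranges described above."""
    if not 1 <= quality <= 100:
        raise ValueError("quality must be between 1 and 100")
    if quality <= 30:
        return "high compression"
    if quality <= 70:
        return "balanced"
    return "low compression"


if __name__ == "__main__":
    quality = parse_quality("")  # simulate pressing Enter at the prompt
    print(quality, quality_tier(quality))
```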
- Compression Trigger: Pages >80% of max size get compressed
- Minimum Benefit: Compression must provide >5% size reduction
- Oversized Warning: Pages exceeding max size are flagged
"No compression libraries available"
pip install pikepdf pypdf
"Single page too large even after compression"
- Try lower compression quality (20-40)
- Increase maximum chunk size
- Check if PDF contains high-resolution images
"Compression not helping"
- Some PDFs are already optimized
- Text-only PDFs may not compress much
- Try different compression libraries
- pikepdf is fastest for image-heavy PDFs
- pypdf is good for mixed content
- Lower compression quality = faster processing
- Larger chunk sizes = fewer files but potentially larger individual chunks
🚀 PDF Chunking Tool with Compression Started
============================================================
🔧 Available Compression Libraries:
✅ pikepdf (Advanced compression with image optimization)
✅ pypdf (Modern compression with object deduplication)
✅ PyPDF2 (Basic compression - always available)
📏 Enter maximum chunk size in KB (e.g., 1024): 1024
🗜️ Enable PDF compression? (Y/n): Y
🎨 Image compression quality (1-100, default 60): 60
🔍 Found 2 PDF files to process
📊 Maximum chunk size: 1024.0 KB
🗜️ Compression: Enabled
🎨 Image quality: 60%
📋 Progress: 1/2
📄 Processing: large_document.pdf
Original size: 15234.56 KB
Total pages: 45
⚠️ Single page size (14892.34 KB) is large, will attempt compression
🗜️ Attempting to compress original PDF...
Compressing PDF: large_document.pdf (15234.56 KB)
✅ Compressed using pikepdf: 3456.78 KB (77.3% reduction)
✅ Using compressed version for chunking
✅ Chunk 1: 15 pages, 987.65 KB
✅ Chunk 2: 15 pages, 1023.45 KB
✅ Chunk 3: 15 pages, 945.68 KB
🎉 Successfully created 3 chunks
Feel free to submit issues, feature requests, or pull requests to improve the tool!
This project is open source. Feel free to use and modify as needed.
- GitHub: @TriptoAfsin
- Email: AfsinTripto@gmail.com
- LinkedIn: Afshin Nahian Tripto
Built with ❤️ by Afshin Nahian Tripto