An MCP (Model Context Protocol) server that provides intelligent document ingestion capabilities using the Docling toolkit. Convert any document (PDF, DOCX, images, HTML, etc.) into clean Markdown for AI processing and RAG pipelines.
- Universal File Support: PDFs, DOCX/XLSX/PPTX, images (PNG/JPEG/TIFF/BMP/WEBP), HTML, Markdown, CSV, audio files, and more
- Flexible Input: Process local files or remote URLs
- Multiple Processing Pipelines: Standard (fast, high-quality), VLM (vision-language models), ASR (audio transcription)
- Intelligent Auto-Detection: Automatically selects optimal settings based on file type and content
- Queue Management: Handles concurrent requests with proper job queuing
- Mac M2 Optimized: Efficient memory usage and MLX acceleration support
- Clean Markdown Output: High-quality structured text ready for AI consumption
- Python 3.9+ (recommended: 3.11+)
- macOS (optimized for Apple Silicon M2)
- 8GB+ RAM recommended
- Clone and install dependencies:
```bash
git clone <repository-url>
cd doc-ingestor-mcp
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
- Install Docling with Mac optimizations:
```bash
# Core Docling
pip install docling

# MLX acceleration (Apple Silicon only); quotes protect the brackets in zsh
pip install "docling[mlx]"

# Optional: additional OCR engines
pip install easyocr
brew install tesseract   # tesseract via Homebrew
```
- Start the MCP server:
```bash
python -m doc_ingestor_mcp
```
The server will start and listen for MCP connections using stdio transport.
The server provides the following MCP tools:
`convert_document`: Converts any supported document to Markdown.
Parameters:
- `source` (required): File path or URL to the document
- `pipeline` (optional): Processing pipeline: `"standard"`, `"vlm"`, or `"asr"`
- `options` (optional): Additional processing options
Example:
```json
{
  "name": "convert_document",
  "arguments": {
    "source": "https://arxiv.org/pdf/2408.09869",
    "pipeline": "standard"
  }
}
```
Response:
```json
{
  "content": [
    {
      "type": "text",
      "text": "# Document Title\n\nConverted markdown content here..."
    }
  ]
}
```
`convert_document_advanced`: Advanced conversion with detailed configuration options.
Parameters:
- `source` (required): File path or URL
- `pipeline` (optional): `"standard"`, `"vlm"`, or `"asr"`
- `ocr_enabled` (optional): Enable/disable OCR (default: auto-detect)
- `ocr_language` (optional): OCR language codes (e.g., `"eng,spa"`)
- `table_mode` (optional): `"fast"` or `"accurate"`
- `pdf_backend` (optional): `"dlparse_v4"` or `"pypdfium2"`
- `enable_enrichments` (optional): Enable code/formula/picture enrichments
Example:
```json
{
  "name": "convert_document_advanced",
  "arguments": {
    "source": "./scanned-document.pdf",
    "pipeline": "standard",
    "ocr_enabled": true,
    "ocr_language": "eng",
    "table_mode": "accurate"
  }
}
```
Check the status of ongoing conversions (useful for large files).
Parameters:
- `job_id` (required): Job identifier returned from conversion requests
Returns all supported input and output formats.
Response:
```json
{
  "input_formats": ["pdf", "docx", "xlsx", "pptx", "png", "jpeg", "html", "md", "csv", "mp3", "wav"],
  "output_formats": ["markdown", "html", "json", "text", "doctags"],
  "pipelines": ["standard", "vlm", "asr"]
}
```
Convert a local PDF with default settings:

```json
{
  "name": "convert_document",
  "arguments": {
    "source": "./research-paper.pdf"
  }
}
```
Use the VLM pipeline for a complex remote document:

```json
{
  "name": "convert_document",
  "arguments": {
    "source": "https://example.com/complex-document.pdf",
    "pipeline": "vlm"
  }
}
```
Transcribe an audio recording with the ASR pipeline:

```json
{
  "name": "convert_document",
  "arguments": {
    "source": "./meeting-recording.mp3",
    "pipeline": "asr"
  }
}
```
Run OCR on a scanned document with accurate table detection:

```json
{
  "name": "convert_document_advanced",
  "arguments": {
    "source": "./scanned-invoice.pdf",
    "ocr_enabled": true,
    "ocr_language": "eng",
    "table_mode": "accurate"
  }
}
```
**Standard Pipeline**
- Best for: Born-digital PDFs, Office documents, clean layouts
- Features: Advanced layout analysis, table structure recovery, optional OCR
- Performance: Fast, memory-efficient
- Use when: Document has programmatic text and standard layouts
**VLM Pipeline**
- Best for: Complex layouts, handwritten notes, screenshots, scanned documents
- Features: Vision-language model processing, end-to-end page understanding
- Performance: Slower, higher memory usage, MLX-accelerated on M2
- Use when: Standard pipeline fails or document has unusual layouts
**ASR Pipeline**
- Best for: Audio files (meetings, lectures, interviews)
- Features: Whisper-based transcription, multiple model sizes
- Performance: CPU/GPU intensive depending on model size
- Use when: Processing audio content
The server automatically selects optimal settings:
- File Type Detection: Based on extension and content analysis
- OCR Decision: Enabled for scanned PDFs and images, disabled for text-based documents
- Pipeline Selection: Standard for most documents, VLM suggested for images and complex layouts
- Backend Selection: Native parser (dlparse_v4) for quality, pypdfium2 for speed/compatibility
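The extension-based part of these heuristics can be sketched as follows. This is a minimal illustration: `select_pipeline` and the extension sets are assumptions made for the example, not the server's actual API.

```python
from pathlib import Path

# Illustrative extension sets; the real server also inspects file content.
IMAGE_EXTS = {".png", ".jpeg", ".jpg", ".tiff", ".bmp", ".webp"}
AUDIO_EXTS = {".mp3", ".wav"}

def select_pipeline(source: str) -> str:
    """Hypothetical helper mirroring the pipeline-selection heuristic."""
    ext = Path(source).suffix.lower()
    if ext in AUDIO_EXTS:
        return "asr"       # audio transcription
    if ext in IMAGE_EXTS:
        return "vlm"       # vision-language model for image-heavy input
    return "standard"      # fast default for PDFs and Office documents

print(select_pipeline("./meeting.mp3"))  # asr
```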
- Large Files: Automatic chunking and streaming processing
- Queue System: Prevents memory overflow from concurrent requests
- Cleanup: Automatic temporary file cleanup after processing
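The queuing behaviour can be approximated with an `asyncio.Semaphore`. The sketch below, with assumed names (`run_job`) and an assumed queue size of 3, shows how extra requests wait for a free slot instead of all running at once:

```python
import asyncio

async def run_job(slots: asyncio.Semaphore, job_id: int) -> str:
    """Stand-in for one conversion; holds a queue slot while it runs."""
    async with slots:              # waits when all slots are taken
        await asyncio.sleep(0.01)  # placeholder for the real conversion
        return f"job-{job_id}: done"

async def main() -> None:
    slots = asyncio.Semaphore(3)   # at most 3 conversions in flight
    results = await asyncio.gather(*(run_job(slots, i) for i in range(5)))
    print(results)

asyncio.run(main())
```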
- VLM models run with MLX optimization on Apple Silicon
- Reduced memory footprint compared to standard PyTorch
- Automatic fallback to CPU if MLX unavailable
```bash
# Environment variables for optimization
export DOCLING_MAX_MEMORY_GB=6   # Limit memory usage
export DOCLING_QUEUE_SIZE=3      # Max concurrent jobs
export DOCLING_ENABLE_MLX=true   # Enable MLX acceleration
```
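Inside the server these variables might be read roughly like this (a sketch; the fallback values and parsing are assumptions, not documented behaviour):

```python
import os

# Read the tuning variables, falling back to the values shown above.
max_memory_gb = float(os.environ.get("DOCLING_MAX_MEMORY_GB", "6"))
queue_size = int(os.environ.get("DOCLING_QUEUE_SIZE", "3"))
enable_mlx = os.environ.get("DOCLING_ENABLE_MLX", "true").lower() == "true"

print(f"memory={max_memory_gb}GB queue={queue_size} mlx={enable_mlx}")
```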
- Network timeouts for URL-based files
- Fallback pipelines if primary fails
- Alternative OCR engines if primary fails
Errors are returned as structured JSON with suggestions for recovery:

```json
{
  "error": {
    "type": "ConversionError",
    "message": "Failed to process document",
    "details": "Specific error information",
    "suggestions": ["Try VLM pipeline", "Enable OCR"]
  }
}
```
| Issue | Cause | Solution |
|---|---|---|
| Memory error with large PDF | Insufficient RAM | Split document or reduce queue size |
| Poor OCR quality | Wrong language/engine | Specify language with `ocr_language` |
| Scrambled text order | PDF parsing issues | Try `"pdf_backend": "pypdfium2"` |
| Tables not detected | Layout complexity | Use `"table_mode": "accurate"` |
| Slow processing | Large/complex document | Try `"pipeline": "standard"` first |
Add this to your Claude Desktop configuration file (`~/Library/Application Support/Claude/claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "doc-ingestor": {
      "command": "python",
      "args": ["-m", "doc_ingestor_mcp"],
      "cwd": "/path/to/doc-ingestor-mcp"
    }
  }
}
```
- Test basic functionality:
```bash
# Start the server in debug mode
python -m doc_ingestor_mcp --debug

# In another terminal, test with a sample file
echo '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": {"name": "convert_document", "arguments": {"source": "test.pdf"}}}' | python -m doc_ingestor_mcp
```
- Test with Claude Desktop:
- Restart Claude Desktop after adding the MCP configuration
- In a new conversation, try: "Can you convert this PDF to markdown?" and attach a PDF file
- The server should appear in Claude's available tools
- Test different file types:
```bash
# Test with different pipelines
python test_server.py
```
Create `test_server.py`:
```python
import asyncio

from doc_ingestor_mcp.server import DocIngestorMCPServer
from doc_ingestor_mcp.config import load_config


async def test_conversion():
    config = load_config("config.yaml")
    server = DocIngestorMCPServer(config)

    # Test basic conversion
    result = await server._handle_convert_document({
        "source": "https://arxiv.org/pdf/2408.09869",
        "pipeline": "standard"
    })
    print("Conversion successful!")
    print(f"Output length: {len(result[0].text)} characters")


if __name__ == "__main__":
    asyncio.run(test_conversion())
```
- PDFs: Up to 500MB (auto-chunked)
- Images: Up to 50MB per image
- Audio: Up to 2GB (processed in segments)
- Office Docs: Up to 200MB
- URLs: 10-minute timeout for downloads
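A client can pre-check these limits before submitting a file. The sketch below is illustrative: `within_limit` and the `LIMITS_MB` table are assumptions restating the limits above, not part of the server's API.

```python
import os

# Documented per-type limits, in megabytes (illustrative mapping).
LIMITS_MB = {
    ".pdf": 500,
    ".png": 50, ".jpeg": 50, ".jpg": 50,
    ".mp3": 2048, ".wav": 2048,
    ".docx": 200, ".xlsx": 200, ".pptx": 200,
}

def within_limit(path: str) -> bool:
    """Hypothetical pre-flight check against the documented size limits."""
    limit_mb = LIMITS_MB.get(os.path.splitext(path)[1].lower())
    if limit_mb is None:
        return True  # no documented limit for this extension
    return os.path.getsize(path) <= limit_mb * 1024 * 1024
```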
- Local Processing: All processing happens locally by default
- Remote Services: Optional (disabled by default) for VLM APIs
- File Cleanup: Temporary files automatically deleted
- URL Validation: Safe URL patterns enforced
```bash
# Run the server with verbose logging
python -m doc_ingestor_mcp --debug

# Follow the server log
tail -f ./logs/server.log

# Run the test script
python test_server.py
```
**`ModuleNotFoundError: No module named 'docling'`**
- Run `pip install docling` in the active virtual environment

**"MLX not available" warnings**
- This is normal on non-Apple Silicon Macs
- MLX acceleration is optional and will fall back to CPU

**"Queue is full" errors**
- Wait for current jobs to complete
- Increase `max_queue_size` in config.yaml

**"Download failed" for URLs**
- Check internet connection
- Verify the URL is accessible
- Some sites may block automated downloads

**Memory errors with large files**
- Reduce `max_memory_gb` in config.yaml
- Try smaller files first
- Use `pipeline: "standard"` instead of `"vlm"`

**OCR not working**
- Install tesseract: `brew install tesseract`
- Install easyocr: `pip install easyocr`
- Check language settings in config.yaml
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
MIT License - see LICENSE file for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Docling Project Docs