A serverless AWS Lambda pipeline that processes PDF files uploaded to S3, converts them to PNG images, and creates Label Studio annotation tasks.
The pipeline uses:
- S3: File storage (PDFs uploaded to `upload/`, PNGs stored in `raw/`, tasks in `ingest/`)
- SQS: Decouples S3 events from Lambda processing
- Lambda: Processes files using PyMuPDF for PDF-to-PNG conversion
- PyMuPDF: Fast PDF rendering library for high-quality PNG output
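The S3 to SQS to Lambda flow means the handler receives SQS records whose bodies wrap S3 event notifications. A minimal sketch of unpacking them, assuming the bucket notification targets the queue directly (no SNS in between); the function name `iter_s3_objects` is hypothetical and not taken from `main.py`:

```python
import json
from urllib.parse import unquote_plus


def iter_s3_objects(event):
    """Yield (bucket, key) for every S3 object referenced in an SQS-delivered event."""
    for record in event.get("Records", []):
        # Each SQS record body is the JSON text of an S3 event notification.
        body = json.loads(record["body"])
        # S3 test messages carry no "Records" key, so default to an empty list.
        for s3_record in body.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded (a space is sent as "+").
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            yield bucket, key
```

Decoding the key matters for uploads with spaces or special characters in their names; without it, the later S3 reads would target a key that does not exist.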
- 📄 PDF Processing: Converts multi-page PDFs to individual PNG images
- 🖼️ PNG Support: Direct PNG file processing and task creation
- 📊 Label Studio Integration: Automatically creates annotation tasks
- ⚡ Serverless: Scales automatically with upload volume
- 🔄 Error Handling: Catches invalid or corrupted files, timeouts, and S3 access errors, and logs failures to CloudWatch
PDF files:
- Input: `s3://bucket/upload/document.pdf`
- Output: `s3://bucket/raw/document_0001.png`, `document_0002.png`, etc. (one PNG per page)
- Tasks: Creates JSON task files in `s3://bucket/ingest/TASK_XXXXXXX.json`
- Original: PDF remains in the `upload/` folder
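The PDF naming scheme above can be sketched as a pair of small helpers. This is an illustration under assumptions, not the code in `main.py`: the helper names are hypothetical, and the `"image"` field in the task assumes the Label Studio project's labeling config reads its image from `$image`.

```python
from pathlib import PurePosixPath


def page_key(upload_key, page_number, raw_prefix="raw/"):
    """Map 'upload/document.pdf' and a 1-based page number to 'raw/document_0001.png'."""
    stem = PurePosixPath(upload_key).stem
    return f"{raw_prefix}{stem}_{page_number:04d}.png"


def build_task(bucket, image_key):
    """Build a minimal Label Studio task that points at one image in S3."""
    # Assumption: the labeling config references the image as $image.
    return {"data": {"image": f"s3://{bucket}/{image_key}"}}
```

For example, `build_task("bucket", page_key("upload/document.pdf", 2))` produces a task whose image is `s3://bucket/raw/document_0002.png`.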
PNG files:
- Input: `s3://bucket/upload/image.png`
- Output: `s3://bucket/raw/image.png` (moved from `upload/`)
- Tasks: Creates a JSON task file in `s3://bucket/ingest/TASK_XXXXXXX.json`
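S3 has no rename operation, so "moving" a PNG from `upload/` to `raw/` amounts to a copy followed by a delete. A minimal sketch, assuming a boto3 S3 client is passed in; the helper names `move_object` and `raw_key_for` are hypothetical:

```python
def raw_key_for(upload_key, upload_prefix="upload/", raw_prefix="raw/"):
    """Map 'upload/image.png' to 'raw/image.png'."""
    if not upload_key.startswith(upload_prefix):
        raise ValueError(f"{upload_key!r} is not under {upload_prefix!r}")
    return raw_prefix + upload_key[len(upload_prefix):]


def move_object(s3, bucket, src_key, dst_key):
    """Move an object within a bucket by copying it and then deleting the source."""
    s3.copy_object(
        Bucket=bucket,
        Key=dst_key,
        CopySource={"Bucket": bucket, "Key": src_key},
    )
    # Reached only if the copy call returned without raising.
    s3.delete_object(Bucket=bucket, Key=src_key)
```

With a real client this would be called as `move_object(boto3.client("s3"), "bucket", "upload/image.png", raw_key_for("upload/image.png"))`. Passing the client as an argument keeps the function easy to test with a stub.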
- AWS CLI configured with appropriate permissions
- Terraform >= 1.6
- Python 3.11
- pip
- zip
# Make scripts executable (if not already)
chmod +x scripts/build_layer.sh scripts/deploy.sh
# Deploy everything
./scripts/deploy.sh
Alternatively, run each step manually:
- Build PyMuPDF Layer: `./scripts/build_layer.sh`
- Deploy Infrastructure: `terraform init`, then `terraform plan`, then `terraform apply`
# Install development dependencies
pip install -r requirements.txt
# Run tests
pytest __test__/
ingest-pipeline/
├── src/ingest_pipeline/
│   ├── main.py              # Lambda handler
│   └── __init__.py
├── __test__/
│   ├── test_main.py         # Test suite
│   └── artifacts/
│       └── test_pdf.pdf     # Test file
├── scripts/
│   ├── build_layer.sh       # Build PyMuPDF layer
│   └── deploy.sh            # Complete deployment
├── terraform.tf             # Infrastructure as code
├── requirements.txt         # Python dependencies
└── README.md
The pipeline is configured for:
- S3 Bucket: `brij-v1-bucket`
- Upload Prefix: `upload/`
- Raw Storage: `raw/`
- Task Storage: `ingest/`
- Lambda Timeout: 15 minutes
- SQS Visibility Timeout: 25 minutes
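These settings can be gathered into a single object so the handler does not scatter literals. The sketch below is an assumption about how that might look, not how `main.py` is written; the class name, the `INGEST_BUCKET` environment variable, and the sanity check are all hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    bucket: str = "brij-v1-bucket"
    upload_prefix: str = "upload/"
    raw_prefix: str = "raw/"
    task_prefix: str = "ingest/"
    lambda_timeout_s: int = 15 * 60
    sqs_visibility_s: int = 25 * 60

    def validate(self):
        # A message should stay hidden at least as long as one invocation can run;
        # otherwise SQS may redeliver it while the first attempt is still in flight.
        if self.sqs_visibility_s < self.lambda_timeout_s:
            raise ValueError("SQS visibility timeout is shorter than the Lambda timeout")
        return self


def load_config(env=os.environ):
    """Read overrides from environment variables (the variable name is hypothetical)."""
    return PipelineConfig(bucket=env.get("INGEST_BUCKET", PipelineConfig.bucket)).validate()
```

The check encodes why the visibility timeout (25 minutes) is set above the Lambda timeout (15 minutes).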
- CloudWatch Logs: Lambda execution logs
- SQS Metrics: Queue depth and processing rates
- S3 Events: File upload notifications
PNG files are generated with:
- 2x scaling: High-resolution output for better annotation quality
- Lossless compression: Preserves image fidelity
- Standard PNG format: Compatible with Label Studio
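A rough sketch of how 2x rendering can be done with PyMuPDF follows. It assumes the PDF bytes have already been read from S3; the function names are hypothetical, and the actual handler may differ. At a zoom of 1.0 PyMuPDF renders at 72 dpi, so a zoom of 2.0 corresponds to 144 dpi.

```python
def pixel_size(width_pt, height_pt, zoom=2.0):
    """Approximate pixel dimensions of a page rendered at `zoom` (1.0 is 72 dpi)."""
    return round(width_pt * zoom), round(height_pt * zoom)


def render_pages(pdf_bytes, zoom=2.0):
    """Yield (page_number, png_bytes) for each page, rendered at `zoom`x scale."""
    import fitz  # PyMuPDF, supplied by the Lambda layer; imported lazily

    matrix = fitz.Matrix(zoom, zoom)
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page_number, page in enumerate(doc, start=1):
            pix = page.get_pixmap(matrix=matrix)
            # PNG encoding is lossless, so no fidelity is lost at this step.
            yield page_number, pix.tobytes("png")
```

A US Letter page (612 x 792 points) rendered this way comes out at roughly 1224 x 1584 pixels. Because pixel count grows with the square of the zoom factor, memory use for large pages grows accordingly.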
The pipeline handles:
- Invalid PDF files
- Corrupted uploads
- Processing timeouts
- S3 access errors
- Memory limitations
Failed processing attempts are logged to CloudWatch for debugging.
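One way to keep a single bad file from failing a whole SQS batch is Lambda's partial batch response. The sketch below assumes `ReportBatchItemFailures` is enabled on the event source mapping, which the source does not confirm; `handle_batch` and `process_record` are hypothetical names.

```python
import logging

logger = logging.getLogger(__name__)


def handle_batch(event, process_record):
    """Process each SQS record independently and report only the failures."""
    failures = []
    for record in event.get("Records", []):
        message_id = record["messageId"]
        try:
            process_record(record)
        except Exception:
            # logger.exception includes the traceback, which ends up in CloudWatch Logs.
            logger.exception("Failed to process message %s", message_id)
            failures.append({"itemIdentifier": message_id})
    # With partial batch responses enabled, only the listed messages are retried.
    return {"batchItemFailures": failures}
```

Messages that succeed are removed from the queue, while failed ones become visible again after the visibility timeout and are retried.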
- Lambda: Only runs when files are uploaded
- SQS: Buffers requests to handle traffic spikes
- S3: Lifecycle policies can archive old files
- Layers: PyMuPDF shared across deployments
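As an illustration of the lifecycle point, the rule structure S3 expects can be built as plain data. This is a sketch only: the 90-day threshold and the `GLACIER` storage class are hypothetical choices, and in this project the rule would more naturally live in `terraform.tf`.

```python
def archive_rule(prefix, days, storage_class="GLACIER"):
    """Build one S3 lifecycle rule that transitions objects under `prefix` after `days`."""
    return {
        "ID": f"archive-{prefix.strip('/')}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [{"Days": days, "StorageClass": storage_class}],
    }


def lifecycle_configuration(rules):
    """Wrap rules in the shape expected by put_bucket_lifecycle_configuration."""
    return {"Rules": list(rules)}
```

With boto3 this could be applied as `boto3.client("s3").put_bucket_lifecycle_configuration(Bucket="brij-v1-bucket", LifecycleConfiguration=lifecycle_configuration([archive_rule("raw/", 90)]))`. Note that this call replaces any lifecycle configuration already on the bucket.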
graph TD;
subgraph "Your Local Machine"
A["uv (for local dev)"]
B["scripts/deploy.sh"] -- runs --> C["scripts/build_layer.sh"];
C -- uses --> D["pip install -t"];
D -- creates --> E["pymupdf_layer.zip"];
B -- runs --> F["terraform apply"];
end
subgraph "AWS Cloud"
G["S3 Bucket"] -- triggers --> H["SQS Queue"] -- triggers --> I["Lambda Function"];
J["aws_lambda_layer_version (PyMuPDF)"];
I -- uses --> J;
end
F -- uploads --> E;
F -- provisions --> G;
F -- provisions --> H;
F -- provisions & attaches layer --> I;
F -- creates resource for --> J;
subgraph "Inside Lambda Function"
K["main.py"] -- can now --> L["import fitz"];
end
I -- contains --> K;