An API that converts PDFs stored in Google Cloud Storage to Markdown format using OCR or direct text extraction.
gcspdf2mdapi is a Flask-based API service that converts PDF documents stored in Google Cloud Storage to Markdown format. It offers two conversion methods:
- OCR-based conversion: Uses Tesseract OCR via pytesseract to extract text from PDF pages rendered as images. This method is helpful for scanned documents or PDFs with text embedded in images.
- Direct text extraction: Leverages PyMuPDF (fitz) and pymupdf4llm to extract text content directly from PDF documents while preserving structure.
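The two methods can be sketched roughly as follows. This is an illustrative outline, not the service's actual code: the function names, the 300 DPI default, and the "## Page N" heading layout are assumptions made for the example. The third-party imports sit inside the functions so the helper can be read and tested without them installed.

```python
import io


def pages_to_markdown(page_texts):
    """Join per-page text into one Markdown string (assumed layout: one heading per page)."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        sections.append(f"## Page {number}\n\n{text.strip()}")
    return "\n\n".join(sections)


def convert_with_ocr(pdf_bytes, dpi=300):
    """OCR path: render each page to an image, then run Tesseract on it."""
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    texts = []
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            pixmap = page.get_pixmap(dpi=dpi)  # rasterize the page
            image = Image.open(io.BytesIO(pixmap.tobytes("png")))
            texts.append(pytesseract.image_to_string(image))
    return pages_to_markdown(texts)


def convert_direct(pdf_bytes):
    """Direct path: let pymupdf4llm turn the embedded text layer into Markdown."""
    import fitz  # PyMuPDF
    import pymupdf4llm

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        return pymupdf4llm.to_markdown(doc)
```

The OCR path works on any PDF but is slower and depends on render quality; the direct path only returns useful output when the PDF has an embedded text layer.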
Key technologies used:
- Flask: Web framework for the API endpoints
- PyMuPDF: PDF parsing and rendering
- pymupdf4llm: Converts PDF content to structured markdown
- pytesseract & Pillow: OCR processing
- Google Cloud Storage: For accessing PDF documents
The API is containerized with Docker and can be deployed to any environment that runs containers.
The API exposes two endpoints: POST /convert, which converts a PDF stored in Google Cloud Storage to Markdown, and GET /, which reports the service status.
POST /convert
Request body:
{
"file": "gs://bucket-name/path/to/file.pdf",
"mode": "ocr|direct"
}
Parameters:
- file: GCS path to the PDF file (must start with gs://)
- mode: (Optional) Conversion method
  - ocr: Uses Optical Character Recognition (default)
  - direct: Uses direct text extraction
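As an illustration of these rules, the sketch below validates a decoded request body and splits the GCS path into bucket and object name. The function name `parse_convert_request` and the error messages are hypothetical; the service's own validation and error format may differ.

```python
ALLOWED_MODES = ("ocr", "direct")


def parse_convert_request(body):
    """Return (bucket, blob_name, mode) for a valid body, else raise ValueError."""
    if not isinstance(body, dict):
        raise ValueError("request body must be a JSON object")

    file_uri = body.get("file")
    if not isinstance(file_uri, str) or not file_uri.startswith("gs://"):
        raise ValueError("'file' must be a GCS path starting with gs://")

    # Everything before the first "/" is the bucket, the rest is the object path.
    bucket, _, blob_name = file_uri[len("gs://"):].partition("/")
    if not bucket or not blob_name:
        raise ValueError("'file' must include both a bucket and an object path")

    mode = body.get("mode", "ocr")  # OCR is the documented default
    if mode not in ALLOWED_MODES:
        raise ValueError("'mode' must be 'ocr' or 'direct'")

    return bucket, blob_name, mode
```

For example, `parse_convert_request({"file": "gs://bucket-name/path/to/file.pdf"})` returns `("bucket-name", "path/to/file.pdf", "ocr")`.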
Response:
{
"markdown": "Extracted markdown content..."
}
GET /
Returns API status:
{
"status": "ok"
}
Convert using OCR (default):
curl -X POST https://your-api-endpoint/convert \
-H "Content-Type: application/json" \
-d '{"file": "gs://my-bucket/documents/report.pdf"}'
Convert using direct text extraction:
curl -X POST https://your-api-endpoint/convert \
-H "Content-Type: application/json" \
-d '{"file": "gs://my-bucket/documents/report.pdf", "mode": "direct"}'
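The same calls can be made from Python. The sketch below uses only the standard library; `build_convert_request`, `convert`, and the 300-second timeout are choices made for this example, and `https://your-api-endpoint` is the same placeholder as in the curl commands.

```python
import json
import urllib.request


def build_convert_request(base_url, gcs_path, mode=None):
    """Build the POST /convert request; 'mode' is omitted to get the default (OCR)."""
    payload = {"file": gcs_path}
    if mode is not None:
        payload["mode"] = mode
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/convert",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert(base_url, gcs_path, mode=None, timeout=300):
    """Send the request and return the 'markdown' field of the JSON response."""
    request = build_convert_request(base_url, gcs_path, mode)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["markdown"]
```

Usage would look like `convert("https://your-api-endpoint", "gs://my-bucket/documents/report.pdf", mode="direct")`. A generous timeout is a reasonable starting point because OCR on long documents can take a while.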