UnitVectorY-Labs/gcspdf2mdapi

gcspdf2mdapi

An API that converts PDFs stored in Google Cloud Storage to Markdown format using OCR or direct text extraction.

Overview

gcspdf2mdapi is a Flask-based API service that converts PDF documents stored in Google Cloud Storage to Markdown format. It offers two conversion methods:

  1. OCR-based conversion: Uses Tesseract OCR via pytesseract to extract text from PDF pages rendered as images. This method is helpful for scanned documents or PDFs with text embedded in images.

  2. Direct text extraction: Leverages PyMuPDF (fitz) and pymupdf4llm to extract text content directly from PDF documents while preserving structure.
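The two methods can be sketched roughly as follows. This is a minimal illustration built on the libraries named in this README, not the project's actual implementation: the function names, the `CONVERTERS` table, and the 300 DPI render setting are assumptions made for the example. Third-party imports sit inside each function so the module loads even when only one method's dependencies are installed.

```python
def ocr_to_markdown(pdf_bytes: bytes, dpi: int = 300) -> str:
    """Render each page to an image and run Tesseract on it (hypothetical helper)."""
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    page_texts = []
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            # Rasterize the page; the default pixmap is RGB without alpha.
            pix = page.get_pixmap(dpi=dpi)
            image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            page_texts.append(pytesseract.image_to_string(image))
    # Plain OCR text has no structure, so pages are simply joined by blank lines.
    return "\n\n".join(page_texts)


def direct_to_markdown(pdf_bytes: bytes) -> str:
    """Extract the embedded text layer as Markdown (hypothetical helper)."""
    import fitz  # PyMuPDF
    import pymupdf4llm

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        return pymupdf4llm.to_markdown(doc)


# Assumed dispatch table keyed by the "mode" values the API accepts.
CONVERTERS = {"ocr": ocr_to_markdown, "direct": direct_to_markdown}
```

Direct extraction only works when the PDF carries a text layer; a scanned document typically yields little or no text that way, which is the case the OCR path covers.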

Key technologies used:

  • Flask: Web framework for the API endpoints
  • PyMuPDF: PDF parsing and rendering
  • pymupdf4llm: Converts PDF content to structured Markdown
  • pytesseract & Pillow: OCR processing
  • Google Cloud Storage: For accessing PDF documents

The API is containerized with Docker and can be deployed to any environment that runs containers.

Usage

The API exposes two endpoints: POST /convert, which converts a PDF stored in Google Cloud Storage to Markdown, and GET /, which reports the service status.

Endpoints

Convert PDF to Markdown

POST /convert

Request body:

{
  "file": "gs://bucket-name/path/to/file.pdf",
  "mode": "ocr|direct"
}

Parameters:

  • file: GCS path to the PDF file (must start with gs://)
  • mode: (Optional) Conversion method
    • ocr: Uses Optical Character Recognition (default)
    • direct: Uses direct text extraction
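As an illustration of the parameter rules above, the sketch below validates a request body and splits the GCS path into bucket and object name. The function name `parse_convert_request` and the error messages are hypothetical; how the service itself validates input is not described here.

```python
VALID_MODES = ("ocr", "direct")
DEFAULT_MODE = "ocr"  # OCR is documented as the default mode


def parse_convert_request(body: dict) -> tuple[str, str, str]:
    """Return (bucket, object_name, mode), or raise ValueError on bad input."""
    file_uri = body.get("file")
    if not isinstance(file_uri, str) or not file_uri.startswith("gs://"):
        raise ValueError("'file' must be a GCS path starting with gs://")

    # Everything up to the first "/" is the bucket; the rest is the object name.
    bucket, _, object_name = file_uri[len("gs://"):].partition("/")
    if not bucket or not object_name:
        raise ValueError("'file' must include both a bucket and an object path")

    mode = body.get("mode", DEFAULT_MODE)
    if mode not in VALID_MODES:
        raise ValueError(f"'mode' must be one of {VALID_MODES}")

    return bucket, object_name, mode
```

Splitting with `partition` rather than a URL parser keeps object names containing `?` or `#` intact.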

Response:

{
  "markdown": "Extracted markdown content..."
}

Health Check

GET /

Returns API status:

{
  "status": "ok"
}
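A client-side probe for this endpoint might look like the sketch below. It uses only the standard library; the helper names and the 5-second timeout are assumptions, and the base URL is whatever address the service is deployed at.

```python
import json
import urllib.request


def is_ok(body: bytes) -> bool:
    """Return True if the body is the JSON object {"status": "ok"}."""
    try:
        payload = json.loads(body)
    except ValueError:  # malformed JSON or undecodable bytes
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_health(base_url: str, timeout: float = 5.0) -> bool:
    """GET the root path and report whether the service says it is healthy."""
    try:
        with urllib.request.urlopen(f"{base_url.rstrip('/')}/", timeout=timeout) as response:
            return is_ok(response.read())
    except OSError:  # covers connection failures, timeouts, and HTTP error statuses
        return False
```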

Examples

Convert using OCR (default):

curl -X POST https://your-api-endpoint/convert \
  -H "Content-Type: application/json" \
  -d '{"file": "gs://my-bucket/documents/report.pdf"}'

Convert using direct text extraction:

curl -X POST https://your-api-endpoint/convert \
  -H "Content-Type: application/json" \
  -d '{"file": "gs://my-bucket/documents/report.pdf", "mode": "direct"}'
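The same calls can be made from Python. The following is a minimal standard-library sketch that mirrors the curl commands above; `build_convert_request`, `convert`, and the 300-second timeout are illustrative choices, and `https://your-api-endpoint` remains a placeholder for the deployed address.

```python
import json
import urllib.request
from typing import Optional


def build_convert_request(base_url: str, file_uri: str,
                          mode: Optional[str] = None) -> urllib.request.Request:
    """Build the POST /convert request without sending it."""
    payload = {"file": file_uri}
    if mode is not None:
        # Leaving "mode" out lets the service apply its default (ocr).
        payload["mode"] = mode
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/convert",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert(base_url: str, file_uri: str, mode: Optional[str] = None,
            timeout: float = 300.0) -> str:
    """Send the request and return the "markdown" field of the response."""
    request = build_convert_request(base_url, file_uri, mode)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["markdown"]
```

Calling `convert("https://your-api-endpoint", "gs://my-bucket/documents/report.pdf", mode="direct")` corresponds to the second curl example. The generous timeout is a guess that OCR on long documents may take a while.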
