A FastAPI-based server that provides REST API endpoints for converting various file formats to Markdown using Microsoft's MarkItDown library.
- Multiple File Format Support: Convert PDF, DOCX, XLSX, PPTX, and HTML files to Markdown
- Async & Thread-Safe: Built with FastAPI for high performance and concurrent request handling
- Docker Ready: Fully containerized with Docker and Docker Compose support
- REST API: Clean RESTful endpoints for easy integration
- Error Handling: Comprehensive error handling and logging
- Health Checks: Built-in health check endpoints
- CORS Enabled: Cross-origin resource sharing support
Format | Endpoint | File Extensions |
---|---|---|
PDF | /parse_pdf | .pdf |
Word Documents | /parse_docx | .docx |
Excel Spreadsheets | /parse_xlsx | .xlsx, .xls |
PowerPoint Presentations | /parse_pptx | .pptx, .ppt |
HTML Files | /parse_html | .html, .htm |
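When a client handles mixed uploads, the table above can be turned into a small lookup. The following is a minimal sketch: the `ENDPOINTS` dictionary and the `endpoint_for` helper are hypothetical names introduced for illustration and are not part of the server.

```python
from pathlib import Path

# Extension-to-endpoint mapping copied from the table above.
ENDPOINTS = {
    ".pdf": "/parse_pdf",
    ".docx": "/parse_docx",
    ".xlsx": "/parse_xlsx",
    ".xls": "/parse_xlsx",
    ".pptx": "/parse_pptx",
    ".ppt": "/parse_pptx",
    ".html": "/parse_html",
    ".htm": "/parse_html",
}


def endpoint_for(filename: str) -> str:
    """Return the endpoint path for a file, chosen by its extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return ENDPOINTS[suffix]
    except KeyError:
        # Extensions outside the table are rejected on the client side.
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None
```

The lookup is case-insensitive, so `report.PDF` and `report.pdf` resolve to the same endpoint.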
- Clone the repository:
    git clone <repository-url>
    cd Kolosal-RMS-MarkItDown
- Create a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
    pip install -r requirements.txt
- Run the server:
    python main.py
  The API will be available at http://localhost:8000
- Build and run with Docker:
    docker build -t markitdown-api .
    docker run -p 8000:8000 markitdown-api
- Or use Docker Compose:
    docker-compose up -d
Once the server is running, visit the interactive API documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
curl -X POST "http://localhost:8000/parse_pdf" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@document.pdf"

curl -X POST "http://localhost:8000/parse_docx" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@document.docx"

curl -X POST "http://localhost:8000/parse_xlsx" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@spreadsheet.xlsx"

curl -X POST "http://localhost:8000/parse_pptx" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@presentation.pptx"

curl -X POST "http://localhost:8000/parse_html" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@webpage.html"
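The same upload can be made from Python. The sketch below uses only the standard library and assumes the form field is named `file`, as in the curl commands; `build_multipart` and `convert` are hypothetical helper names, not part of this project.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def convert(path: str, endpoint: str, base_url: str = "http://localhost:8000") -> dict:
    """POST a local file to a conversion endpoint and return the decoded JSON."""
    file_path = Path(path)
    body, content_type = build_multipart("file", file_path.name, file_path.read_bytes())
    request = urllib.request.Request(
        base_url + endpoint,
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `convert("document.pdf", "/parse_pdf")` should return the parsed JSON response as a dictionary.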
All endpoints return a JSON response with the following structure:
{
  "success": true,
  "filename": "document.pdf",
  "markdown_content": "# Document Title\n\nDocument content in markdown format...",
  "title": "Document Title",
  "metadata": {
    "original_filename": "document.pdf",
    "file_size": 1024576,
    "mime_type": "application/pdf"
  }
}
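On the client side, a response of this shape can be loaded into a typed object. This is a sketch only: `ConversionResult` and `parse_result` are hypothetical names, and falling back to defaults for missing fields is an assumption, since the shape of error responses is not documented here.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ConversionResult:
    success: bool
    filename: str
    markdown_content: str
    title: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def parse_result(raw: str) -> ConversionResult:
    """Decode a JSON response body into a ConversionResult."""
    payload = json.loads(raw)
    return ConversionResult(
        # Missing keys fall back to defaults (an assumption, see above).
        success=bool(payload.get("success", False)),
        filename=payload.get("filename", ""),
        markdown_content=payload.get("markdown_content", ""),
        title=payload.get("title"),
        metadata=payload.get("metadata") or {},
    )
```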
The server provides a health check endpoint:
curl http://localhost:8000/health
Response:
{
  "status": "healthy",
  "service": "markitdown-api"
}
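A deployment script or test can poll this endpoint before sending work. The following is a minimal sketch using the standard library; `is_healthy` is a hypothetical helper, and its `opener` parameter exists only so the function can be exercised without a running server.

```python
import json
import urllib.request


def is_healthy(
    base_url: str = "http://localhost:8000",
    timeout: float = 2.0,
    opener=urllib.request.urlopen,
) -> bool:
    """Return True if GET /health answers with {"status": "healthy"}."""
    try:
        with opener(f"{base_url}/health", timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers connection failures and timeouts (URLError is a
        # subclass); ValueError covers a body that is not valid JSON.
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"
```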
- Async Processing: All file operations are handled asynchronously
- Thread Pool: CPU-intensive conversions run in a dedicated thread pool
- Concurrent Requests: Supports multiple simultaneous file conversions
- Memory Efficient: Uses streaming for file processing
- Error Recovery: Graceful error handling without server crashes
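The thread-pool pattern described above can be illustrated in isolation. This sketch is not the server's actual code: `convert_blocking` is a hypothetical stand-in for a blocking conversion call, and the pool size of 4 is an assumed value.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def convert_blocking(name: str) -> str:
    """Hypothetical stand-in for a blocking conversion call."""
    time.sleep(0.1)  # simulate blocking work
    return f"# {name}"


# Dedicated pool for conversions; the size is an assumption.
executor = ThreadPoolExecutor(max_workers=4)


async def convert_async(name: str) -> str:
    """Run the blocking call in the pool so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, convert_blocking, name)


async def convert_many(names: list[str]) -> list[str]:
    """Convert several inputs concurrently, preserving input order."""
    return await asyncio.gather(*(convert_async(n) for n in names))
```

Calling `asyncio.run(convert_many(["a.pdf", "b.docx"]))` runs both conversions in worker threads while the event loop remains available for other requests.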
The server can be configured through environment variables:
- HOST: Server host (default: 0.0.0.0)
- PORT: Server port (default: 8000)
- LOG_LEVEL: Logging level (default: info)
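One way to read these variables with their documented defaults is sketched below. The `Settings` class and `load_settings` function are hypothetical names for illustration; how `main.py` actually reads its configuration is not shown in this README.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    log_level: str


def load_settings(env=os.environ) -> Settings:
    """Read HOST, PORT, and LOG_LEVEL, falling back to the documented defaults."""
    return Settings(
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),  # raises ValueError if not an integer
        log_level=env.get("LOG_LEVEL", "info"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to test, for example `load_settings({"PORT": "9000"})`.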
docker build -t kolosal-markitdown-api .

docker run -d \
  --name markitdown-api \
  -p 8000:8000 \
  kolosal-markitdown-api
# Start services
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
This project is built using Microsoft's MarkItDown library, which provides the core functionality for converting various file formats to Markdown.
- MarkItDown: https://github.com/microsoft/markitdown
- Microsoft: For creating and maintaining the MarkItDown library
- FastAPI: For the excellent async web framework
- Uvicorn: For the ASGI server implementation
For support and questions, please:
- Check the documentation
- Search existing issues in the repository
- Create a new issue if needed
Kolosal Inc - Retrieval Management Service Implementation