Open
Labels
enhancement (New feature or request)
Description
DocProc is currently implemented in Python. Python keeps the code simple and makes prototyping fast, but it is becoming a bottleneck in production workloads because of:
- High memory consumption during large document parsing
- Slower execution times for concurrent processing
- Difficulty in scaling under load without complex optimization
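To put numbers on the points above before any migration work starts, a baseline measurement could look roughly like the sketch below. It uses only the standard library; `parse_document` is a hypothetical stand-in for a real DocProc stage, not a function from the current codebase.

```python
import time
import tracemalloc


def parse_document(text: str) -> list[str]:
    # Hypothetical stand-in for a DocProc parsing stage (assumption).
    return text.split()


def measure(func, *args):
    """Return (result, elapsed_seconds, peak_bytes) for one call.

    tracemalloc slows execution down, so the elapsed time is only a rough
    indicator; use a separate timing run for precise numbers.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    sample = "lorem ipsum dolor sit amet " * 100_000
    tokens, elapsed, peak = measure(parse_document, sample)
    print(f"{len(tokens)} tokens, {elapsed:.3f}s, peak {peak / 1e6:.1f} MB")
```

Recording these figures per stage would give the later benchmark task a concrete baseline to compare against.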
Proposal:
Migrate the performance-critical parts of the DocProc backend to a compiled, memory-safe systems language: Rust or Go.
Benefits:
- ⚡ Improved performance: Faster execution, especially for CPU-intensive tasks (e.g., parsing, tokenization)
- 🧠 Lower memory usage: Avoids the per-object overhead of Python's boxed objects and reference counting (note that Go is still garbage-collected; Rust is not)
- 🔒 Safety: Rust provides memory safety guarantees without GC
- 📦 Deployment: Single static binaries, easier CI/CD and containerization
- 🤝 Interoperability: Can be embedded into the existing Python pipeline via FFI if needed (e.g., PyO3 for Rust, cgo for Go)
Tasks:
- Identify performance-critical modules in the current pipeline
- Choose target language (Rust vs Go) based on I/O needs, team comfort, and ecosystem
- Prototype one processing stage (e.g., text extraction or format parsing)
- Benchmark vs Python equivalent
- Plan and execute incremental migration (module-by-module)
- Update build and deployment workflows to include compiled binaries
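The first and fourth tasks (finding hot spots and benchmarking against the Python equivalent) could start from a small harness like the one below. This is a sketch under assumptions: `extract_text`, `tokenize`, and `run_pipeline` are hypothetical placeholders for real pipeline stages, and a Rust or Go prototype would be added to the `candidates` dict once it is callable from Python.

```python
import cProfile
import io
import pstats
import timeit


def extract_text(raw: str) -> str:
    # Hypothetical stage: collapse non-empty lines into one string.
    return " ".join(line.strip() for line in raw.splitlines() if line.strip())


def tokenize(text: str) -> list[str]:
    # Hypothetical stage: lowercase whitespace tokenization.
    return text.lower().split()


def run_pipeline(raw: str) -> list[str]:
    return tokenize(extract_text(raw))


def profile_top(func, *args, limit: int = 5) -> str:
    """Profile one call and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


def benchmark(candidates: dict, arg, repeat: int = 5, number: int = 10) -> dict:
    """Return the best per-call time in seconds for each implementation."""
    results = {}
    for name, func in candidates.items():
        timings = timeit.repeat(lambda f=func: f(arg), repeat=repeat, number=number)
        results[name] = min(timings) / number
    return results


if __name__ == "__main__":
    sample = "Some Heading\n\n  body text here  \n" * 50_000
    print(profile_top(run_pipeline, sample))
    for name, seconds in benchmark({"python": run_pipeline}, sample).items():
        print(f"{name}: {seconds * 1000:.2f} ms per call")
```

Taking the minimum of several repeats reduces noise from other processes, which matters when the numbers are used to justify a migration.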
Considerations:
- Maintain feature parity with the Python version
- Ensure cross-platform compatibility (Linux, macOS)
- Allow fallback to Python for less performance-critical tasks (if needed)