# doc-extract-parse-index

doc-extract-parse-index streamlines the workflow of extracting, parsing, and indexing documents through a web interface. It supports various document formats and provides an easy-to-use API and UI for managing document data. It uses LlamaCloud to parse and extract structured information from unstructured documents such as PDFs, and stores the results in a PostgreSQL database. The backend is built with Flask and the frontend with React, providing a modern web application experience.
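To make the data flow concrete, here is a minimal, self-contained sketch of the parse → extract → store pipeline described above. The extractor is a stub standing in for the LlamaCloud/LlamaExtract call, SQLite stands in for PostgreSQL, and the `ExtractedDoc` fields are illustrative assumptions, not the project's actual schema:

```python
import json
import sqlite3
from dataclasses import asdict, dataclass

# Hypothetical schema: the fields the real app extracts depend on the
# LlamaExtract agent configuration, not on this sketch.
@dataclass
class ExtractedDoc:
    filename: str
    title: str
    body: str

def extract(filename: str, raw_text: str) -> ExtractedDoc:
    """Stub for the extraction step. In the real app this call goes to
    LlamaCloud; here we fake it by splitting off the first line as a title."""
    title, _, body = raw_text.partition("\n")
    return ExtractedDoc(filename=filename, title=title.strip(), body=body.strip())

def store(conn: sqlite3.Connection, doc: ExtractedDoc) -> int:
    """Persist the extracted record; SQLite stands in for PostgreSQL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(id INTEGER PRIMARY KEY, filename TEXT, data TEXT)"
    )
    cur = conn.execute(
        "INSERT INTO documents (filename, data) VALUES (?, ?)",
        (doc.filename, json.dumps(asdict(doc))),
    )
    conn.commit()
    return cur.lastrowid

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
doc = extract("report.pdf", "Quarterly Report\nRevenue was flat.")
row_id = store(conn, doc)
```

The real backend performs the same three steps per upload; only the extraction call and the database driver differ.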
## Demo

Document_Extract_LlamaCloud_Demo-VEED.mp4
## Prerequisites

- Python 3.13.3+
- Node.js 18+ (if the frontend is used)
- uv or pip (Python package manager)
- Docker (for containerized setup)
- Docker Compose (for multi-service orchestration)
- LlamaCloud (for document parsing and extraction)
- Flask (backend web framework)
- PostgreSQL (database)
- React (frontend)
- Dependencies as listed in `requirements.txt` or `package.json`
## Project Structure

```
doc-extract-parse-index/
├── backend/              # Backend source code (API, models, services)
│   ├── app.py
│   ├── requirements.txt
│   └── ...
├── frontend/             # Frontend source code (optional)
│   ├── package.json
│   └── ...
├── uploads/              # Uploaded or processed documents
├── Dockerfile            # Dockerfile for backend (and/or frontend)
├── docker-compose.yml    # Docker Compose configuration
├── README.md
└── ...
```
## Setup

1. Clone the repository:

   ```bash
   git clone <repo-url>
   cd doc-extract-parse-index
   ```

2. Backend setup: create a `.env` file in the backend directory; you can use `.env.example` as a template. From the project root:

   ```bash
   cp .env.example .env
   ```
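   The variable names below are illustrative assumptions, not the project's actual keys; check `.env.example` for the real ones:

   ```
   # Hypothetical values -- replace with your own credentials
   DATABASE_URL=postgresql://user:password@localhost:5432/docdb
   LLAMA_CLOUD_API_KEY=llx-your-key-here
   FLASK_ENV=development
   ```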
   Then update the `.env` file with your PostgreSQL database credentials and other LlamaCloud configuration, and start the backend service.

   Using `pip`:

   ```bash
   cd backend
   python3 -m venv venv
   source venv/bin/activate
   pip install -r requirements.txt
   python app.py
   ```

   Using `uv`:

   ```bash
   cd backend
   uv venv --python 3.13.3 venv
   source venv/bin/activate
   uv sync
   uv run app.py
   ```
3. Frontend setup (if needed):

   ```bash
   cd frontend
   npm install
   npm start
   ```

4. Access the app:

   - Backend API: http://localhost:5000
   - Frontend: http://localhost:3000
## Docker Setup

1. Build and run using Docker Compose:

   ```bash
   docker-compose up --build
   ```
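   For reference, a minimal `docker-compose.yml` along these lines would wire the three services together. The service names, image tags, ports, and credentials below are illustrative assumptions, not the repository's actual file:

   ```yaml
   services:
     db:
       image: postgres:16
       environment:
         POSTGRES_PASSWORD: password
         POSTGRES_DB: docdb
     backend:
       build: ./backend
       ports:
         - "5000:5000"
       depends_on:
         - db
     frontend:
       build: ./frontend
       ports:
         - "3000:3000"
   ```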
2. Access the app:

   - Backend API: http://localhost:5000
   - Frontend: http://localhost:3000
## Usage

- Upload documents via the web UI or API.
- Extract and parse content automatically.
- Search and retrieve indexed documents. For simplicity, documents are currently stored in a local PostgreSQL database rather than indexed in LlamaCloud.
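The search step can be sketched as a plain SQL query over the stored rows. SQLite stands in for PostgreSQL here, and the table layout and `search` helper are assumptions for illustration; LlamaCloud-backed semantic search is on the To Do list below:

```python
import sqlite3

# In-memory stand-in for the PostgreSQL table the backend writes to.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, filename TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO documents (filename, body) VALUES (?, ?)",
    [
        ("invoice.pdf", "Total due: 120 USD"),
        ("report.pdf", "Quarterly revenue summary"),
    ],
)

def search(conn: sqlite3.Connection, term: str) -> list[str]:
    """Naive substring search over filename and body; a semantic index
    would replace this once LlamaCloud indexing is implemented."""
    like = f"%{term}%"
    rows = conn.execute(
        "SELECT filename FROM documents WHERE body LIKE ? OR filename LIKE ?",
        (like, like),
    ).fetchall()
    return [row[0] for row in rows]

results = search(conn, "revenue")
```

Matching is case-insensitive for ASCII because that is SQLite's default `LIKE` behavior; PostgreSQL would use `ILIKE` for the same effect.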
## License

MIT License
## Contributing

Contributions are welcome! Please submit a pull request or open an issue for discussion. Check the `CONTRIBUTING.md` file for more details and To Do items before you start contributing.
## Acknowledgments

This project is a comprehensive tutorial on using LlamaExtract, a tool by LlamaIndex, to automatically extract structured information from unstructured documents like PDFs and images. Thanks to the creator of LlamaCloud for providing such an informative resource, and a big thanks to Alejandro AO for the initial codebase and inspiration in his YouTube video: LlamaExtract Tutorial.

## To Do
- Use LlamaCloud to index uploaded documents.
- Implement advanced search features using LlamaCloud on indexed documents.
- Add unit tests for backend services.
- Improve frontend UI/UX.
- Add more document format support (e.g., PDF, DOCX).
- Implement user authentication and authorization.
- Optimize performance for large document sets.
- Add error handling and logging.
- Create comprehensive documentation for API endpoints.
- Set up CI/CD pipeline for automated testing and deployment.
- Implement rate limiting and security measures for the API.
- Add support for multiple languages in document processing.