AlwaysSany/doc-extract-parse-index

Introduction

doc-extract-parse-index streamlines the workflow of extracting, parsing, and indexing documents via a web interface. It supports various document formats and provides an easy-to-use API and UI for managing document data. It uses LlamaCloud to parse and extract structured information from unstructured documents such as PDFs, and stores the results in a PostgreSQL database. The project is built with Flask for the backend and React for the frontend, providing a modern web application experience.
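The upload → parse/extract → index flow described above can be sketched as follows. This is an illustrative stand-in, not the project's actual API: in the real application the extraction step calls LlamaCloud and the indexing step writes to PostgreSQL.

```python
# Illustrative sketch of the document pipeline; all names here are
# hypothetical stand-ins, not the project's real functions.
from dataclasses import dataclass


@dataclass
class ExtractedDoc:
    filename: str
    fields: dict


def parse_and_extract(filename: str, raw: bytes) -> ExtractedDoc:
    # The real project sends the file to LlamaCloud (LlamaExtract) here;
    # this stub just records the file size as a trivial "extracted field".
    return ExtractedDoc(filename=filename, fields={"size_bytes": len(raw)})


def index_document(doc: ExtractedDoc, store: dict) -> None:
    # The real project persists extracted fields to PostgreSQL;
    # an in-memory dict stands in for the database here.
    store[doc.filename] = doc.fields


store: dict = {}
index_document(parse_and_extract("report.pdf", b"%PDF-1.4"), store)
```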

Demo

Document_Extract_LlamaCloud_Demo-VEED.mp4

Project Requirements

  • Python 3.13.3+
  • Node.js 18+ (for the frontend)
  • uv or pip (Python package manager)
  • Docker (for containerized setup)
  • Docker Compose (for multi-service orchestration)
  • LlamaCloud (for document indexing and search capabilities)
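A quick way to verify the tooling prerequisites are installed (illustrative; adjust the command list to match your setup):

```shell
# Print the version of each required tool, or note that it is missing.
for cmd in python3 node docker docker-compose; do
  if command -v "$cmd" >/dev/null 2>&1; then
    printf '%s: %s\n' "$cmd" "$("$cmd" --version 2>&1 | head -n 1)"
  else
    printf '%s: not found\n' "$cmd"
  fi
done
```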

Dependencies

  • Flask (backend web framework)
  • PostgreSQL (database)
  • React (frontend)
  • Dependencies as listed in requirements.txt or package.json

Project Structure

doc-extract-parse-index/
├── backend/                # Backend source code (API, models, services)
│   ├── app.py
│   ├── requirements.txt
│   └── ...
├── frontend/               # Frontend source code (optional)
│   ├── package.json
│   └── ...
├── uploads/                # Uploaded or processed documents
├── Dockerfile              # Dockerfile for backend (and/or frontend)
├── docker-compose.yml      # Docker Compose configuration
├── README.md
└── ...

Setup Instructions

Standalone Setup

  1. Clone the repository:

    git clone <repo-url>
    cd doc-extract-parse-index
  2. Backend Setup: Create and update the environment variables in a .env file in the backend directory; you can use .env.example as a template. From the project root directory:

    cp .env.example .env

    Then, update the .env file with your PostgreSQL database credentials and your LlamaCloud configuration.

    Then run the backend service first.

    Using pip:

    cd backend
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    python app.py

    Using uv:

    cd backend
    uv venv --python 3.13.3 venv
    source venv/bin/activate 
    uv sync
    uv run app.py
  3. Frontend Setup (if needed):

    cd frontend
    npm install
    npm start
  4. Access the app:

    • Backend API: http://localhost:5000
    • Frontend: http://localhost:3000
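A minimal .env might look like the following. The variable names here are illustrative guesses; the authoritative list is in the repository's .env.example.

```env
# Hypothetical example values -- consult .env.example for the real variable names.
LLAMA_CLOUD_API_KEY=llx-your-api-key
POSTGRES_USER=postgres
POSTGRES_PASSWORD=changeme
POSTGRES_DB=docdb
DATABASE_URL=postgresql://postgres:changeme@localhost:5432/docdb
```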

Dockerized Setup

  1. Build and run using Docker Compose:

    docker-compose up --build
  2. Access the app:

    • Backend API: http://localhost:5000
    • Frontend: http://localhost:3000
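For reference, a compose file for this kind of stack typically looks like the sketch below. Service names, ports, and build contexts are assumptions based on the project structure above; the repository's own docker-compose.yml is authoritative.

```yaml
# Illustrative sketch only -- see the repository's docker-compose.yml.
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: changeme
      POSTGRES_DB: docdb
    volumes:
      - pgdata:/var/lib/postgresql/data
  backend:
    build: ./backend
    env_file: .env
    ports:
      - "5000:5000"
    depends_on:
      - db
  frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    depends_on:
      - backend
volumes:
  pgdata:
```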

Usage

  • Upload documents via the web UI or API.
  • Extract and parse content automatically.
  • Search and retrieve indexed documents. For simplicity, documents are currently stored in a local PostgreSQL database rather than indexed in LlamaCloud.
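As an illustration of uploading via the API, the snippet below builds a multipart/form-data request by hand using only the Python standard library. The endpoint path /api/upload and the field name "file" are assumptions, so check the Flask routes in backend/app.py for the real ones.

```python
import urllib.request
import uuid


def build_upload_request(url: str, filename: str, data: bytes) -> urllib.request.Request:
    """Build a multipart/form-data POST request carrying a single file field."""
    boundary = uuid.uuid4().hex
    body = (
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + data
        + f"\r\n--{boundary}--\r\n".encode()
    )
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    return req


# Hypothetical endpoint -- verify against the backend's routes.
req = build_upload_request("http://localhost:5000/api/upload", "report.pdf", b"%PDF-1.4")
# To actually send it (requires the backend to be running):
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status)
```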

License

MIT License

Contributing

Contributions are welcome! Please submit a pull request or open an issue for discussion. Check the CONTRIBUTING.md file for more details and To Do items before you start contributing.

Inspiration

This project is a comprehensive tutorial on using LlamaExtract, a tool by LlamaIndex, to automatically extract structured information from unstructured documents like PDFs and images. Thanks to the creator of LlamaCloud for providing such an informative resource, and a big thanks to Alejandro AO for the initial codebase and the inspiration from his YouTube video: LlamaExtract Tutorial.

ToDo's

  • Use LlamaCloud to index uploaded documents.
  • Implement advanced search features using LlamaCloud on indexed documents.
  • Add unit tests for backend services.
  • Improve frontend UI/UX.
  • Add more document format support (e.g., PDF, DOCX).
  • Implement user authentication and authorization.
  • Optimize performance for large document sets.
  • Add error handling and logging.
  • Create comprehensive documentation for API endpoints.
  • Set up CI/CD pipeline for automated testing and deployment.
  • Implement rate limiting and security measures for the API.
  • Add support for multiple languages in document processing.
