AlwaysSany/doc-extract-parse-index

Introduction

doc-extract-parse-index streamlines the workflow of extracting, parsing, and indexing documents via a web interface. It supports various document formats and provides an easy-to-use API and UI for managing document data. It uses LlamaCloud to parse and extract structured information from unstructured documents such as PDFs, and stores the results in a PostgreSQL database. The project is built with Flask for the backend and React for the frontend, providing a modern web application experience.
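The upload → parse/extract → index flow described above can be sketched as follows. This is an illustrative stand-in, not the project's actual API: in the real application the extraction step calls LlamaCloud and the indexing step writes to PostgreSQL.

```python
# Illustrative sketch of the document pipeline; all names here are
# hypothetical stand-ins, not the project's real functions.
from dataclasses import dataclass


@dataclass
class ExtractedDoc:
    filename: str
    fields: dict


def parse_and_extract(filename: str, raw: bytes) -> ExtractedDoc:
    # The real project sends the file to LlamaCloud (LlamaExtract) here;
    # this stub just records the file size as a trivial "extracted field".
    return ExtractedDoc(filename=filename, fields={"size_bytes": len(raw)})


def index_document(doc: ExtractedDoc, store: dict) -> None:
    # The real project persists extracted fields to PostgreSQL;
    # an in-memory dict stands in for the database here.
    store[doc.filename] = doc.fields


store: dict = {}
index_document(parse_and_extract("report.pdf", b"%PDF-1.4"), store)
```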

Demo

Document_Extract_LlamaCloud_Demo-VEED.mp4

Project Requirements

  • Python 3.13.3+
  • Node.js 18+ (for the frontend)
  • uv or pip (Python package manager)
  • Docker (for containerized setup)
  • Docker Compose (for multi-service orchestration)
  • LlamaCloud (for document indexing and search capabilities)
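A quick way to verify the tooling prerequisites are installed (illustrative; adjust the command list to match your setup):

```shell
# Print the version of each required tool, or note that it is missing.
for cmd in python3 node docker docker-compose; do
  if command -v "$cmd" >/dev/null 2>&1; then
    printf '%s: %s\n' "$cmd" "$("$cmd" --version 2>&1 | head -n 1)"
  else
    printf '%s: not found\n' "$cmd"
  fi
done
```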

Dependencies

  • Flask (backend web framework)
  • PostgreSQL (database)
  • React (frontend)
  • Dependencies as listed in requirements.txt or package.json

Project Structure

doc-extract-parse-index/
├── backend/                # Backend source code (API, models, services)
│   ├── app.py
│   ├── requirements.txt
│   └── ...
├── frontend/               # Frontend source code (optional)
│   ├── package.json
│   └── ...
├── uploads/                # Uploaded or processed documents
├── Dockerfile              # Dockerfile for backend (and/or frontend)
├── docker-compose.yml      # Docker Compose configuration
├── README.md
└── ...

Setup Instructions

Standalone Setup

  1. Clone the repository:

    git clone <repo-url>
    cd doc-extract-parse-index
  2. Backend Setup: Create and update the environment variables in a .env file in the backend directory; you can use .env.example as a template. From the project root directory:

    cp .env.example .env

    Then, update the .env file with your PostgreSQL database credentials and your LlamaCloud configuration.

    Then run the backend service first.

    Using pip:

    cd backend
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    python app.py

    Using uv:

    cd backend
    uv venv --python 3.13.3 venv
    source venv/bin/activate 
    uv sync
    uv run app.py
  3. Frontend Setup (if needed):

    cd frontend
    npm install
    npm start
  4. Access the app:

    • Backend API: http://localhost:5000
    • Frontend: http://localhost:3000
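A minimal .env might look like the following. The variable names here are illustrative guesses; the authoritative list is in the repository's .env.example.

```env
# Hypothetical example values -- consult .env.example for the real variable names.
LLAMA_CLOUD_API_KEY=llx-your-api-key
POSTGRES_USER=postgres
POSTGRES_PASSWORD=changeme
POSTGRES_DB=docdb
DATABASE_URL=postgresql://postgres:changeme@localhost:5432/docdb
```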

Dockerized Setup

  1. Build and run using Docker Compose:

    docker-compose up --build
  2. Access the app:

    • Backend API: http://localhost:5000
    • Frontend: http://localhost:3000
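For reference, a compose file for this kind of stack typically looks like the sketch below. Service names, ports, and build contexts are assumptions based on the project structure above; the repository's own docker-compose.yml is authoritative.

```yaml
# Illustrative sketch only -- see the repository's docker-compose.yml.
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: changeme
      POSTGRES_DB: docdb
    volumes:
      - pgdata:/var/lib/postgresql/data
  backend:
    build: ./backend
    env_file: .env
    ports:
      - "5000:5000"
    depends_on:
      - db
  frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    depends_on:
      - backend
volumes:
  pgdata:
```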

Usage

  • Upload documents via the web UI or API.
  • Extract and parse content automatically.
  • Search and retrieve indexed documents. For simplicity, documents are currently stored in a local PostgreSQL database rather than indexed in LlamaCloud.
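As an illustration of uploading via the API, the snippet below builds a multipart/form-data request by hand using only the Python standard library. The endpoint path /api/upload and the field name "file" are assumptions, so check the Flask routes in backend/app.py for the real ones.

```python
import urllib.request
import uuid


def build_upload_request(url: str, filename: str, data: bytes) -> urllib.request.Request:
    """Build a multipart/form-data POST request carrying a single file field."""
    boundary = uuid.uuid4().hex
    body = (
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + data
        + f"\r\n--{boundary}--\r\n".encode()
    )
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    return req


# Hypothetical endpoint -- verify against the backend's routes.
req = build_upload_request("http://localhost:5000/api/upload", "report.pdf", b"%PDF-1.4")
# To actually send it (requires the backend to be running):
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status)
```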

License

MIT License

Contributing

Contributions are welcome! Please submit a pull request or open an issue for discussion. Check the CONTRIBUTING.md file for more details and To Do items before you start contributing.

Inspiration

This project is a comprehensive tutorial on using LlamaExtract, a tool by LlamaIndex, to automatically extract structured information from unstructured documents like PDFs and images. Thanks to the creator of LlamaCloud for providing such an informative resource, and a big thanks to Alejandro AO for the initial codebase and the inspiration from his YouTube video: LlamaExtract Tutorial.

ToDo's

  • Use LlamaCloud to index uploaded documents.
  • Implement advanced search features using LlamaCloud on indexed documents.
  • Add unit tests for backend services.
  • Improve frontend UI/UX.
  • Add more document format support (e.g., PDF, DOCX).
  • Implement user authentication and authorization.
  • Optimize performance for large document sets.
  • Add error handling and logging.
  • Create comprehensive documentation for API endpoints.
  • Set up CI/CD pipeline for automated testing and deployment.
  • Implement rate limiting and security measures for the API.
  • Add support for multiple languages in document processing.
