AlphaExtract is a cutting-edge PDF summarization tool that leverages state-of-the-art AI models to extract and synthesize information from PDF documents. Built with Meta's LLaMA 4 MOE Maverick model and powered by Groq's inference engine, it provides blazing-fast, high-precision summaries for any PDF document.
- Intelligent PDF Processing: Convert PDFs to images and extract detailed information
- Advanced Summarization: Generate comprehensive, well-structured summaries using LLaMA 4 MOE Maverick
- Professional PDF Export: Download summaries as beautifully formatted PDF documents
- Modern Web Interface: Clean, responsive UI built with Streamlit
- Parallel Processing: Multi-threaded extraction for improved performance
- Docker Support: Easy deployment with containerization
- CI/CD Integration: Automated Docker image builds and pushes
- Architecture
- Technical Stack
- Requirements
- Installation
- Usage
- Docker Deployment
- Project Structure
- Screenshots
AlphaExtract follows a pipeline architecture with three main components:
- PDF Processing: Converts PDF documents to images for processing
- Detail Extraction: Uses LLaMA 4 MOE Maverick to extract detailed information from each page
- Summary Generation: Synthesizes extracted information into a coherent, analytical summary
The pipeline is optimized for parallel processing and handles documents of varying lengths efficiently.
- Language: Python 3.10
- Web Framework: Streamlit
- AI Models: Meta's LLaMA 4 MOE Maverick
- Inference Engine: Groq
- PDF Processing: pdf2image, ReportLab
- Package Management: uv
- Containerization: Docker
- CI/CD: GitHub Actions
- Python 3.10 or higher
- Dependencies listed in
pyproject.toml
- Groq API key for inference
- Poppler utils for PDF processing
-
Clone the repository:
git clone https://github.com/yourusername/AlphaExtract.git cd AlphaExtract
-
Install dependencies using uv:
curl -LsSf https://astral.sh/uv/install.sh | sh uv sync
-
Set up environment variables:
export GROQ_API_KEY=your_api_key_here
-
Run the application:
streamlit run main.py
- Access the web interface at
http://localhost:7860
- Upload your PDF document using the sidebar
- Wait for the processing to complete
- View the generated summary
- Download the summary as a PDF document
-
Build the Docker image:
docker build -t alphaextract .
-
Run the container:
docker run -p 7860:7860 -e GROQ_API_KEY=your_api_key_here alphaextract
The application will be available at http://localhost:7860
.
AlphaExtract/
├── .github/
│ └── workflows/
│ └── dockerhubPush.yaml
├── src/
│ ├── components/
│ │ ├── extractPdfDetails.py
│ │ └── summaryEngine.py
│ ├── pipelines/
│ │ └── pipeline.py
│ └── utils/
│ ├── functions.py
│ └── logger.py
├── config.ini
├── Dockerfile
├── main.py
├── prompts.yaml
└── pyproject.toml
main.py
: Streamlit web application entry pointsrc/components/
: Core processing modulessrc/pipelines/
: Pipeline orchestrationconfig.ini
: Configuration settingsprompts.yaml
: LLM system promptsDockerfile
: Container configuration.github/workflows/
: CI/CD configuration
-
Project Demo
Complete demonstration of PDF upload, processing, and summary generation
-
Application Interface
The clean and intuitive application interface
This project is licensed under the MIT License.
Created with ❤️ by Rauhan Ahmed Siddiqui.
For questions or support, please open an issue on the GitHub repository.