Cognivia AI is an AI-powered PDF search and question-answering system built with LangChain, Pinecone, OpenAI, and Supabase. Upload PDFs, ask questions, and get answers grounded in your documents, with persistent conversation memory.

M-Husnain-Ali/Cognivia-AI

πŸ” Cognivia AI


Cognivia AI combines retrieval-augmented generation, vector search, and persistent memory to provide intelligent document analysis and natural-language interaction with your PDFs.

Author: Muhammad Husnain Ali

πŸ› οΈ Technologies Used

Core Technologies

  • Python 3.8+ - Primary implementation language
  • LangChain - Orchestration framework for the RAG pipeline
  • Streamlit - Web interface
  • OpenAI GPT-3.5-turbo - Large language model for responses

Data Processing & Storage

  • Pinecone - Vector database for similarity search
  • Supabase - PostgreSQL database for conversation history
  • PyPDF2 - PDF processing library

AI/ML Components

  • OpenAI Embeddings - text-embedding-3-small model (512 dimensions)
  • Vector Search - Semantic similarity matching
  • Conversation Memory - Context-aware chat history

Development Tools

  • Python Virtual Environment - Dependency isolation
  • Environment Variables - Secure configuration management
  • SQL - Database schema management

πŸš€ Features

  • Advanced PDF Processing

    • Automatic text extraction and semantic chunking
    • Support for multiple PDF uploads
    • Intelligent document metadata preservation
    • OCR support for scanned documents
  • Optimized RAG (Retrieval-Augmented Generation)

    • Document-Only Responses: Strictly answers based on uploaded documents
    • Existing Document Support: Automatically detects and works with pre-existing PDFs
    • Similarity Threshold Filtering: Configurable relevance scoring
    • Generic Response Detection: Prevents hallucination and general knowledge responses
    • Source Attribution: Always cites document sources with page numbers
    • Context Validation: Ensures answers are grounded in document content
  • AI-Powered Question Answering

    • Natural language understanding with document constraints
    • Context-aware responses from your PDFs only
    • Multi-document correlation and analysis
    • Intelligent "I don't know" responses when information isn't available
  • Enterprise-Grade Vector Search

    • High-performance similarity matching with thresholds
    • Scalable document indexing
    • Real-time search capabilities
    • Configurable search parameters and document limits
  • Smart Conversation Management

    • Persistent chat history with Supabase
    • Context retention across sessions
    • Document-aware conversation flow
    • Multi-user support with session isolation
  • Modern Chatbot Interface

    • Chat Bubble Design: WhatsApp-style message interface
    • Real-time Conversations: Instant responses with typing indicators
    • Source Document Display: Expandable source citations
    • Responsive Design: Mobile-friendly chat experience
    • Document Status Tracking: Upload progress and document counts
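
The similarity-threshold filtering described above can be sketched as a pure function: retrieved chunks that score below the configured threshold are dropped before the LLM ever sees them, which is what lets the system say "I don't know" instead of guessing. This is an illustrative sketch only, not the project's actual implementation; the function and variable names are hypothetical.

```python
import math

SIMILARITY_THRESHOLD = 0.7  # same default as the .env setting

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filter_relevant(query_vec, hits, threshold=SIMILARITY_THRESHOLD):
    """Keep only (chunk, vector) hits whose similarity clears the
    threshold, sorted best-first; everything below it is discarded."""
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in hits]
    return sorted(
        [(score, chunk) for score, chunk in scored if score >= threshold],
        reverse=True,
    )
```

In the real system Pinecone computes the similarity scores server-side; the thresholding logic is the part being illustrated here.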

πŸ—οΈ Architecture

  • Frontend: Streamlit web interface
  • LLM: OpenAI GPT-3.5-turbo for intelligent responses
  • Embeddings: OpenAI text-embedding-3-small (512 dimensions)
  • Vector Store: Pinecone for document similarity search
  • Memory: Supabase PostgreSQL for conversation persistence
  • PDF Processing: PyPDF2 with intelligent text chunking
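
The chunking step in that pipeline can be illustrated with a minimal sliding-window chunker. The real pdf_processor.py presumably also respects semantic boundaries, but the chunk-size/overlap mechanics look roughly like this (names here are hypothetical; the default sizes match the CHUNK_SIZE and CHUNK_OVERLAP settings shown later in this README):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Split text into overlapping windows. Each chunk starts
    (chunk_size - chunk_overlap) characters after the previous one, so
    context that straddles a boundary appears in both chunks."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk is then embedded with text-embedding-3-small and upserted into Pinecone along with its source metadata (filename, page number) for citation.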

βš™οΈ Requirements

  • Python 3.8+
  • OpenAI API key
  • Pinecone account and API key
  • Supabase project (for conversation memory)

πŸš€ Quick Setup

1. Clone and Setup Virtual Environment

# Clone the repository
git clone <repository-url>
cd ai-pdf-search-engine

# Create and activate virtual environment
# For Windows
python -m venv venv
.\venv\Scripts\activate

# For macOS/Linux
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

2. Configure Environment

Create a .env file in the project root (a template is provided in .env.example):

# Required API Keys
OPENAI_API_KEY=your_openai_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here
PINECONE_INDEX_NAME=your_index_name
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_DIMENSION=512
SUPABASE_URL=your_supabase_project_url
SUPABASE_KEY=your_supabase_anon_key

# RAG Optimization Settings (Optional)
SIMILARITY_THRESHOLD=0.7          # Document relevance threshold (0.0-1.0)
MAX_DOCUMENTS_PER_QUERY=5         # Maximum documents to retrieve
LLM_TEMPERATURE=0                 # Response creativity (0.0-1.0)
MAX_TOKENS=1000                   # Maximum response length

# Chatbot Settings (Optional)
MAX_CHAT_HISTORY=20               # Messages to keep in memory
ENABLE_SOURCE_DISPLAY=true        # Show source documents
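
In config.py these settings are presumably read once at startup. A minimal sketch of that pattern using only the standard library (the real project loads .env via python-dotenv first) with the documented defaults:

```python
import os

# Optional settings fall back to the documented defaults when unset.
SIMILARITY_THRESHOLD = float(os.getenv("SIMILARITY_THRESHOLD", "0.7"))
MAX_DOCUMENTS_PER_QUERY = int(os.getenv("MAX_DOCUMENTS_PER_QUERY", "5"))
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0"))
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "1000"))
MAX_CHAT_HISTORY = int(os.getenv("MAX_CHAT_HISTORY", "20"))
ENABLE_SOURCE_DISPLAY = os.getenv("ENABLE_SOURCE_DISPLAY", "true").lower() == "true"

def require(name: str) -> str:
    """Fail fast with a clear error for required keys (e.g. OPENAI_API_KEY)
    instead of crashing later inside a library call."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

Reading everything through one module keeps the rest of the code free of os.getenv calls and makes the defaults visible in one place.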

3. Setup Supabase Tables

  1. Navigate to your Supabase project dashboard
  2. Go to the SQL Editor
  3. Open the provided setup_supabase.sql file in the project root
  4. Execute the SQL commands to:
    • Create chat sessions and messages tables
    • Set up appropriate indexes
    • Enable Row Level Security (RLS)
    • Configure access policies

The SQL file includes all necessary table definitions, indexes, and security policies for the chat system.
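
Conceptually, that memory layer maps chat sessions to ordered message rows, trimmed to MAX_CHAT_HISTORY. The in-memory stand-in below sketches the behavior supabase_memory.py gets from those tables; it is illustrative only, and the actual table and column names live in setup_supabase.sql:

```python
from collections import defaultdict

MAX_CHAT_HISTORY = 20  # matches the optional .env setting

class InMemoryChatMemory:
    """Stand-in for the Supabase-backed memory: one list of
    {"role", "content"} rows per session id, oldest first."""

    def __init__(self, max_history: int = MAX_CHAT_HISTORY):
        self.max_history = max_history
        self._messages = defaultdict(list)

    def add_message(self, session_id: str, role: str, content: str) -> None:
        rows = self._messages[session_id]
        rows.append({"role": role, "content": content})
        # Keep only the most recent max_history rows, analogous to a
        # trimmed ORDER BY created_at DESC LIMIT n query against Supabase.
        del rows[:-self.max_history]

    def get_history(self, session_id: str):
        return list(self._messages[session_id])
```

Keying everything by session id is what gives the app multi-user support with session isolation: two users never share a history list.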

4. Run Application

# Option 1: Use the runner script (recommended)
python run_app.py

# Option 2: Run directly with Streamlit
# Make sure your virtual environment is activated
# For Windows
.\venv\Scripts\activate

# For macOS/Linux
source venv/bin/activate

# Run the application
streamlit run app.py

5. Test the System

Run the test scripts to verify functionality:

# Test that the system only responds based on documents
python test_optimized_rag.py

# Demo working with existing documents (if any)
python demo_existing_docs.py

6. Deactivating Virtual Environment

When you're done working on the project, you can deactivate the virtual environment:

deactivate

πŸ—οΈ Project Structure

ai-pdf-search-engine/
β”œβ”€β”€ app.py                 # Streamlit web interface
β”œβ”€β”€ config.py             # Configuration and environment variables
β”œβ”€β”€ pdf_processor.py      # PDF text extraction and chunking
β”œβ”€β”€ vector_store.py       # Pinecone vector database integration
β”œβ”€β”€ qa_system.py          # Question-answering logic
β”œβ”€β”€ pdf_search_engine.py  # Main orchestration class
β”œβ”€β”€ supabase_memory.py    # Conversation memory with Supabase
β”œβ”€β”€ requirements.txt      # Python dependencies
β”œβ”€β”€ .env.example         # Environment variables template
β”œβ”€β”€ setup_supabase.sql   # Database schema for memory
β”œβ”€β”€ .gitignore          # Git ignore configuration
└── README.md           # This file

πŸ’‘ Advanced Configuration

RAG Optimization

# Fine-tune document retrieval and response quality
SIMILARITY_THRESHOLD = 0.7     # Higher = more strict document relevance
MAX_DOCUMENTS_PER_QUERY = 5    # More documents = better context, slower response
LLM_TEMPERATURE = 0            # 0 = deterministic, 1 = creative responses
MAX_TOKENS = 1000              # Longer responses vs. faster generation

Performance Tuning

# config.py
CHUNK_SIZE = 1000          # Adjust based on document complexity
CHUNK_OVERLAP = 200        # Increase for better context preservation
MAX_CHAT_HISTORY = 20      # Balance memory vs. performance
CACHE_TTL = 3600          # Cache lifetime in seconds

Scaling Considerations

  • Recommended Pinecone tier: Standard or Enterprise
  • Minimum RAM: 4GB
  • Recommended CPU: 4 cores
  • Storage: 10GB+ for document cache

πŸ”§ Troubleshooting

Common Issues

  1. PDF Processing Fails

    • Ensure PDF is not password protected
    • Check file permissions
    • Verify PDF is not corrupted
  2. Vector Store Errors

    • Confirm Pinecone API key is valid
    • Check index dimensions match configuration
    • Verify network connectivity
  3. Memory Issues

    • Clear browser cache
    • Restart application
    • Check Supabase connection
  4. Existing Documents Not Found

    • Verify correct Pinecone index name in .env
    • Check if using different API keys
    • Run python demo_existing_docs.py to diagnose
    • Use "Refresh Documents" button in the app

🀝 Contributing

We welcome contributions! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit changes (git commit -m 'Add AmazingFeature')
  4. Push to branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

πŸ™ Acknowledgments

  • OpenAI team for their powerful language models
  • Pinecone for vector search capabilities
  • Supabase team for the excellent database platform
  • LangChain community for the framework
  • All contributors and users of this project

πŸ“ž Support


Made with ❀️ by Muhammad Husnain Ali
