Cognivia AI leverages cutting-edge technologies to provide intelligent document analysis and natural language interactions.
Author: Muhammad Husnain Ali
- Python - Primary programming language
- LangChain - Framework for building LLM applications
- OpenAI GPT-3.5-turbo - Large Language Model for text processing
- Streamlit - Web application framework
- Pinecone - Vector database for similarity search
- Supabase - PostgreSQL database for conversation history
- PyPDF2 - PDF processing library
- OpenAI Embeddings - text-embedding-3-small model (512 dimensions)
- Vector Search - Semantic similarity matching
- Conversation Memory - Context-aware chat history
- Python Virtual Environment - Dependency isolation
- Environment Variables - Secure configuration management
- SQL - Database schema management
Advanced PDF Processing
- Automatic text extraction and semantic chunking
- Support for multiple PDF uploads
- Intelligent document metadata preservation
- OCR support for scanned documents
Optimized RAG (Retrieval-Augmented Generation)
- Document-Only Responses: Strictly answers based on uploaded documents
- Existing Document Support: Automatically detects and works with pre-existing PDFs
- Similarity Threshold Filtering: Configurable relevance scoring
- Generic Response Detection: Prevents hallucination and general knowledge responses
- Source Attribution: Always cites document sources with page numbers
- Context Validation: Ensures answers are grounded in document content
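The similarity-threshold filtering described above can be sketched in a few lines. This is an illustrative stand-in, not the project's actual API: it assumes cosine similarity over the stored embeddings, and the function names are hypothetical.

```python
def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def filter_by_threshold(query_vec, candidates, threshold=0.7):
    """Keep only chunks whose similarity to the query meets the threshold.

    `candidates` is a list of (chunk_text, embedding) pairs; chunks below
    the threshold are dropped so the LLM never sees irrelevant context.
    """
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in candidates]
    return [(text, score) for text, score in scored if score >= threshold]
```

In the real pipeline Pinecone returns the scores alongside each match, so the same cutoff is applied to its results rather than recomputed locally.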
AI-Powered Question Answering
- Natural language understanding with document constraints
- Context-aware responses from your PDFs only
- Multi-document correlation and analysis
- Intelligent "I don't know" responses when information isn't available
Enterprise-Grade Vector Search
- High-performance similarity matching with thresholds
- Scalable document indexing
- Real-time search capabilities
- Configurable search parameters and document limits
Smart Conversation Management
- Persistent chat history with Supabase
- Context retention across sessions
- Document-aware conversation flow
- Multi-user support with session isolation
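Session isolation and bounded history can be pictured with an in-memory stand-in. The real implementation persists to Supabase (see `supabase_memory.py`); this hypothetical sketch only shows the shape of per-session storage with a `MAX_CHAT_HISTORY`-style cap:

```python
from collections import defaultdict, deque

class SessionMemory:
    """In-memory stand-in for the Supabase-backed history: each session id
    gets its own bounded message list, so users never see each other's chat."""

    def __init__(self, max_history=20):
        self.max_history = max_history
        # deque(maxlen=...) silently evicts the oldest message when full.
        self._sessions = defaultdict(lambda: deque(maxlen=self.max_history))

    def add(self, session_id, role, content):
        self._sessions[session_id].append({"role": role, "content": content})

    def history(self, session_id):
        return list(self._sessions[session_id])
```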
Modern Chatbot Interface
- Chat Bubble Design: WhatsApp-style message interface
- Real-time Conversations: Instant responses with typing indicators
- Source Document Display: Expandable source citations
- Responsive Design: Mobile-friendly chat experience
- Document Status Tracking: Upload progress and document counts
- Frontend: Streamlit web interface
- LLM: OpenAI GPT-3.5-turbo for intelligent responses
- Embeddings: OpenAI text-embedding-3-small (512 dimensions)
- Vector Store: Pinecone for document similarity search
- Memory: Supabase PostgreSQL for conversation persistence
- PDF Processing: PyPDF2 with intelligent text chunking
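How these components fit together can be sketched as a small orchestration class. The stubbed callables stand in for OpenAI embeddings, Pinecone, and GPT-3.5-turbo; the real classes live in the modules listed under the project structure, and this wiring is illustrative only:

```python
class PDFSearchEngine:
    """Minimal wiring sketch of the architecture above."""

    def __init__(self, embed, search, generate, memory):
        self.embed = embed        # text -> vector (OpenAI embeddings)
        self.search = search      # vector -> relevant chunks (Pinecone)
        self.generate = generate  # (question, context) -> answer (LLM)
        self.memory = memory      # list used to persist turns (Supabase)

    def ask(self, question):
        # Retrieve chunks relevant to the question, then answer from them.
        chunks = self.search(self.embed(question))
        answer = self.generate(question, "\n".join(chunks))
        self.memory.append({"question": question, "answer": answer})
        return answer
```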
- Python 3.8+
- OpenAI API key
- Pinecone account and API key
- Supabase project (for conversation memory)
```bash
# Clone the repository
git clone <repository-url>
cd ai-pdf-search-engine

# Create and activate virtual environment
# For Windows
python -m venv venv
.\venv\Scripts\activate
# For macOS/Linux
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```
Create a `.env` file in the project root:
```env
# Required API Keys
OPENAI_API_KEY=your_openai_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here
PINECONE_INDEX_NAME=your_index_name
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_DIMENSION=512
SUPABASE_URL=your_supabase_project_url
SUPABASE_KEY=your_supabase_anon_key

# RAG Optimization Settings (Optional)
SIMILARITY_THRESHOLD=0.7     # Document relevance threshold (0.0-1.0)
MAX_DOCUMENTS_PER_QUERY=5    # Maximum documents to retrieve
LLM_TEMPERATURE=0            # Response creativity (0.0-1.0)
MAX_TOKENS=1000              # Maximum response length

# Chatbot Settings (Optional)
MAX_CHAT_HISTORY=20          # Messages to keep in memory
ENABLE_SOURCE_DISPLAY=true   # Show source documents
```
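The optional settings above are typically read with safe fallbacks so the app runs with sensible defaults when they are unset. A sketch of how a `config.py` might load them (the `require` helper is illustrative, not the project's actual code):

```python
import os

# Optional settings fall back to the documented defaults when unset.
SIMILARITY_THRESHOLD = float(os.getenv("SIMILARITY_THRESHOLD", "0.7"))
MAX_DOCUMENTS_PER_QUERY = int(os.getenv("MAX_DOCUMENTS_PER_QUERY", "5"))
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0"))
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "1000"))
MAX_CHAT_HISTORY = int(os.getenv("MAX_CHAT_HISTORY", "20"))
ENABLE_SOURCE_DISPLAY = os.getenv("ENABLE_SOURCE_DISPLAY", "true").lower() == "true"

def require(name: str) -> str:
    """Required keys should fail fast with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```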
- Navigate to your Supabase project dashboard
- Go to the SQL Editor
- Open the provided `setup_supabase.sql` file in the project root
- Execute the SQL commands to:
  - Create the chat sessions and messages tables
  - Set up appropriate indexes
  - Enable Row Level Security (RLS)
  - Configure access policies
The SQL file includes all necessary table definitions, indexes, and security policies for the chat system.
```bash
# Option 1: Use the runner script (recommended)
python run_app.py

# Option 2: Run directly with Streamlit
# Make sure your virtual environment is activated
# For Windows
.\venv\Scripts\activate
# For macOS/Linux
source venv/bin/activate

# Run the application
streamlit run app.py
```
Run the test scripts to verify functionality:
```bash
# Test that the system only responds based on documents
python test_optimized_rag.py

# Demo working with existing documents (if any)
python demo_existing_docs.py
```
When you're done working on the project, you can deactivate the virtual environment:
```bash
deactivate
```
```
ai-pdf-search-engine/
├── app.py                 # Streamlit web interface
├── config.py              # Configuration and environment variables
├── pdf_processor.py       # PDF text extraction and chunking
├── vector_store.py        # Pinecone vector database integration
├── qa_system.py           # Question-answering logic
├── pdf_search_engine.py   # Main orchestration class
├── supabase_memory.py     # Conversation memory with Supabase
├── requirements.txt       # Python dependencies
├── .env.example           # Environment variables template
├── setup_supabase.sql     # Database schema for memory
├── .gitignore             # Git ignore configuration
└── README.md              # This file
```
```python
# Fine-tune document retrieval and response quality
SIMILARITY_THRESHOLD = 0.7     # Higher = stricter document relevance
MAX_DOCUMENTS_PER_QUERY = 5    # More documents = better context, slower response
LLM_TEMPERATURE = 0            # 0 = deterministic, 1 = creative responses
MAX_TOKENS = 1000              # Longer responses vs. faster generation
```
```python
# config.py
CHUNK_SIZE = 1000        # Adjust based on document complexity
CHUNK_OVERLAP = 200      # Increase for better context preservation
MAX_CHAT_HISTORY = 20    # Balance memory vs. performance
CACHE_TTL = 3600         # Cache lifetime in seconds
```
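The interplay of `CHUNK_SIZE` and `CHUNK_OVERLAP` can be illustrated with a character-level splitter. This is a simplified sketch: the real pipeline chunks semantically rather than at fixed character offsets, but the overlap mechanics are the same.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows with overlap.

    Overlapping chunks preserve context that would otherwise be cut at
    chunk boundaries, at the cost of some duplicated storage.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With `chunk_size=4` and `chunk_overlap=2`, `"abcdefghij"` splits into `abcd`, `cdef`, `efgh`, `ghij`, `ij`: each window repeats the last two characters of its predecessor.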
- Recommended Pinecone tier: Standard or Enterprise
- Minimum RAM: 4GB
- Recommended CPU: 4 cores
- Storage: 10GB+ for document cache
PDF Processing Fails
- Ensure PDF is not password protected
- Check file permissions
- Verify PDF is not corrupted
Vector Store Errors
- Confirm Pinecone API key is valid
- Check index dimensions match configuration
- Verify network connectivity
Memory Issues
- Clear browser cache
- Restart application
- Check Supabase connection
Existing Documents Not Found
- Verify the correct Pinecone index name is set in `.env`
- Check whether you are using different API keys
- Run `python demo_existing_docs.py` to diagnose
- Use the "Refresh Documents" button in the app
We welcome contributions! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- OpenAI team for their powerful language models
- Pinecone for vector search capabilities
- Supabase team for the excellent database platform
- LangChain community for the framework
- All contributors and users of this project
- 📧 Email: m.husnainali.work@gmail.com
- 🐛 Issues: GitHub Issues

Made with ❤️ by Muhammad Husnain Ali