A comprehensive Retrieval-Augmented Generation (RAG) system that combines voice input, multimodal document processing, and intelligent search capabilities across multiple sources. Built with FastAPI and Next.js, this system provides real-time AI-powered question answering with visual grounding and citation support.
- Real-time Speech-to-Text: Streaming voice input using Google Cloud Speech-to-Text
- WebSocket-based Audio Processing: Low-latency voice recognition with partial results
- Multi-language Support: Configurable language detection and transcription
- Voice-enabled Chat Interface: Natural conversation flow with voice commands
- Advanced PDF Processing: Extract text, images, charts, and tables from PDFs
- Image Understanding: AI-powered analysis of charts, diagrams, and visual content
- OCR Integration: Text extraction from scanned documents and images
- Smart Chunking: Intelligent text segmentation with context preservation
- Visual Grounding: Link answers to specific document images and pages
- Local RAG System: Vector-based document retrieval with ChromaDB
- Web Search Integration: Real-time web search via SERP API
- Google Drive MCP: Model Context Protocol integration for Drive documents
- Parallel Search Execution: Simultaneous queries across all sources (pattern sketched after this feature list)
- Smart Result Fusion: Intelligent combination of results from multiple sources
- Comprehensive Citations: Detailed source attribution for every answer
- Visual Citations: Click-through access to source images and documents
- Confidence Scoring: Reliability indicators for each source
- Source Traceability: Full audit trail of information sources
- Interactive Content Viewer: In-app display of PDFs, images, and web content
- WebSocket Communication: Real-time chat and voice processing
- Streaming Responses: Progressive answer generation
- Live Transcription: Real-time speech-to-text with partial results
- Concurrent Processing: Parallel execution of search and generation tasks
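The parallel search and fusion steps above follow a fan-out/fan-in pattern. The sketch below shows that pattern in TypeScript with placeholder per-source functions; the real work happens in the Python backend (ChromaDB, SERP API, Google Drive), so treat this as an illustration only.

```typescript
// Fan-out/fan-in sketch of parallel multi-source search with naive score-based
// fusion. The per-source functions are placeholder stubs, not project code.
type SourceResult = { source: string; text: string; score: number };

const searchLocal = async (_query: string): Promise<SourceResult[]> => []; // stub
const searchWeb = async (_query: string): Promise<SourceResult[]> => [];   // stub
const searchDrive = async (_query: string): Promise<SourceResult[]> => []; // stub

async function multiSourceSearch(query: string): Promise<SourceResult[]> {
  // Fire all three searches at once instead of sequentially.
  const settled = await Promise.allSettled([
    searchLocal(query),
    searchWeb(query),
    searchDrive(query),
  ]);

  // Keep whatever succeeded, even if one source failed or timed out.
  const merged = settled
    .filter((r): r is PromiseFulfilledResult<SourceResult[]> => r.status === 'fulfilled')
    .flatMap((r) => r.value);

  // Naive fusion: rank all hits by score and keep the best ones.
  return merged.sort((a, b) => b.score - a.score).slice(0, 10);
}
```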
┌──────────────────┐     ┌────────────────────┐     ┌──────────────────┐
│     Frontend     │     │      Backend       │     │     External     │
│     (Next.js)    │     │     (FastAPI)      │     │     Services     │
├──────────────────┤     ├────────────────────┤     ├──────────────────┤
│ • Voice Input    │◄───►│ • STT Service      │     │ • Gemini/Claude  │
│ • Chat UI        │     │ • RAG Engine       │◄───►│ • Google Drive   │
│ • Citations      │     │ • Web Search       │     │ • SERP API       │
│ • Image Display  │     │ • Document Proc.   │     │ • ChromaDB       │
└──────────────────┘     └────────────────────┘     └──────────────────┘
- Framework: FastAPI 0.104+ with async/await support
- AI Providers: Google Gemini 1.5 Pro, Anthropic Claude 3 Sonnet
- Vector Database: ChromaDB for embedding storage and retrieval
- Speech Processing: Google Cloud Speech-to-Text API
- Document Processing: PyPDF2, Pillow, pytesseract for OCR
- Search Integration: SERP API for web search, Google Drive API
- WebSocket: Real-time communication with connection management
- Authentication: OAuth 2.0 for Google services
- Framework: Next.js 14 with App Router
- Language: TypeScript for type safety
- UI Library: React 18 with Tailwind CSS
- State Management: Zustand for client state
- Audio Processing: Web Audio API with WebRTC
- Real-time: WebSocket client with auto-reconnection
- Testing: Jest and React Testing Library
- Embeddings: sentence-transformers/all-MiniLM-L6-v2
- Vector Search: Similarity search with configurable thresholds
- Multimodal AI: Vision models for image understanding
- Text Generation: Context-aware response generation
- Confidence Scoring: Relevance and reliability metrics
- Python 3.9+ (3.11 recommended)
- Node.js 18+ with npm/yarn
- Google Cloud Account (for Speech-to-Text)
- AI Provider Account (Gemini or Claude)
git clone <repository-url>
cd agnt
cd backend
pip install -r requirements.txt
cp env.example .env
# Edit .env with your configuration (see Configuration section)
- Create a project in Google Cloud Console
- Enable Speech-to-Text API and Drive API
- Create a service account and download the JSON key
- Set GOOGLE_CLOUD_SERVICE_ACCOUNT_PATH in your .env file
# ChromaDB will be initialized automatically on first run
# Data will be stored in ./chroma_db/ directory
cd frontend
npm install
# or
yarn install
# Create .env.local file
echo "NEXT_PUBLIC_API_URL=http://localhost:8000" > .env.local
echo "NEXT_PUBLIC_WS_URL=ws://localhost:8000" >> .env.local
cd backend
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
cd frontend
npm run dev
# or
yarn dev
- Frontend: http://localhost:3000
- Backend API Docs: http://localhost:8000/docs
- Backend Health Check: http://localhost:8000/health
# Choose your AI provider
AI_PROVIDER=claude # or "gemini"
# API Keys (get one based on your provider choice)
CLAUDE_API_KEY=your_claude_api_key
GEMINI_API_KEY=your_gemini_api_key
# Google Cloud Speech-to-Text (required for voice features)
GOOGLE_CLOUD_SERVICE_ACCOUNT_PATH=/path/to/service-account.json
STT_PROVIDER=google
GOOGLE_SPEECH_MODEL=latest_long
# Web Search (choose one)
SERP_API_KEY=your_serp_api_key # Recommended
GOOGLE_API_KEY=your_google_api_key # Alternative
# Google Drive Integration
GOOGLE_DRIVE_CLIENT_ID=your_client_id
GOOGLE_DRIVE_CLIENT_SECRET=your_client_secret
# Vector Database
CHROMA_PERSIST_DIRECTORY=./chroma_db
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
# Performance Tuning
MAX_SEARCH_RESULTS=10
SIMILARITY_THRESHOLD=0.7
MAX_TOKENS_PER_CHUNK=1000
CHUNK_OVERLAP=200
MAX_CONCURRENT_REQUESTS=100
# Security
CORS_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
SECRET_KEY=your-secret-key-change-in-production
RATE_LIMIT_PER_MINUTE=60
# Feature Flags
ENABLE_WEB_SEARCH=true
ENABLE_GOOGLE_DRIVE=true
ENABLE_VOICE_INPUT=true
ENABLE_IMAGE_ANALYSIS=true
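SIMILARITY_THRESHOLD, MAX_TOKENS_PER_CHUNK, and CHUNK_OVERLAP above govern how documents are split and which retrieved chunks are kept. The sketch below illustrates both steps in TypeScript purely for orientation; the backend implements them in Python on top of ChromaDB, and its chunker is token-aware rather than word-based.

```typescript
// Sliding-window chunking with overlap (word counts stand in for tokens here).
function chunkText(text: string, maxTokens = 1000, overlap = 200): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = maxTokens - overlap; // how far the window advances each pass
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxTokens).join(' '));
    if (start + maxTokens >= words.length) break; // final window reached the end
  }
  return chunks;
}

// Threshold-based filtering of retrieved chunks by cosine similarity.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function filterByThreshold(
  queryVec: number[],
  chunks: { text: string; vec: number[] }[],
  threshold = 0.7, // mirrors SIMILARITY_THRESHOLD
  k = 5,
) {
  return chunks
    .map((c) => ({ text: c.text, score: cosineSimilarity(queryVec, c.vec) }))
    .filter((h) => h.score >= threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```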
- Claude: Better reasoning, more conservative responses
- Gemini: Faster processing, better multimodal understanding
# Gemini Models
GEMINI_CHAT_MODEL=gemini-1.5-pro
GEMINI_VISION_MODEL=gemini-1.5-pro-vision
# Claude Models
CLAUDE_CHAT_MODEL=claude-3-sonnet-20240229
CLAUDE_VISION_MODEL=claude-3-sonnet-20240229
GET /health
Returns system status and service availability.
POST /upload
Content-Type: multipart/form-data
file: <PDF file>
Upload and process a PDF document with image extraction.
Response:
{
"success": true,
"document_id": "uuid",
"filename": "document.pdf",
"pages_processed": 10,
"images_extracted": 5,
"text_chunks": 25,
"processing_time_ms": 1500
}
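From a browser or Node 18+ client, the upload can be issued with fetch and FormData. This is a sketch using the "file" field documented above, with error handling kept minimal; adjust the base URL for your deployment.

```typescript
// Upload a PDF to /upload as multipart/form-data.
async function uploadPdf(file: File) {
  const form = new FormData();
  form.append('file', file, file.name);

  const res = await fetch('http://localhost:8000/upload', {
    method: 'POST',
    body: form, // fetch sets the multipart boundary header automatically
  });
  if (!res.ok) throw new Error(`Upload failed: ${res.status}`);
  return res.json(); // { success, document_id, filename, pages_processed, ... }
}
```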
POST /query
Content-Type: application/json
{
"query": "What is the main conclusion of the research?",
"num_results": 5,
"include_web_search": true,
"include_drive_search": true
}
Response:
{
"answer": "Based on the research findings...",
"citations": [
{
"id": "cite_1",
"source_type": "document",
"citation_type": "text",
"title": "Research Paper.pdf",
"content": "The main conclusion shows...",
"page_number": 15,
"confidence_score": 0.95
}
],
"confidence_score": 0.87,
"processing_time_ms": 2300
}
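The same request can also be issued directly with fetch. A sketch mirroring the request body shown above; adjust the base URL for your deployment.

```typescript
// POST a question to /query and read the structured answer.
async function askQuestion(query: string) {
  const res = await fetch('http://localhost:8000/query', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      query,
      num_results: 5,
      include_web_search: true,
      include_drive_search: true,
    }),
  });
  if (!res.ok) throw new Error(`Query failed: ${res.status}`);
  return res.json(); // { answer, citations, confidence_score, ... }
}
```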
GET /citation/{citation_id}
Retrieve full content and metadata for a specific citation.
const ws = new WebSocket('ws://localhost:8000/ws/stt');
// Send audio data
ws.send(audioBuffer);
// Receive transcription
ws.onmessage = (event) => {
const result = JSON.parse(event.data);
console.log(result.text, result.confidence);
};
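Feeding microphone audio into that socket from the browser could look like the MediaRecorder-based sketch below. The container/codec and chunk interval are assumptions, not the project's documented format; match them to whatever the backend's STT service expects.

```typescript
// Capture microphone audio and stream chunks into the STT WebSocket above.
async function streamMicrophone(socket: WebSocket): Promise<MediaRecorder> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm;codecs=opus' });

  recorder.ondataavailable = async (event) => {
    if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
      socket.send(await event.data.arrayBuffer()); // raw audio bytes
    }
  };

  recorder.start(250); // emit a chunk roughly every 250 ms
  return recorder;     // call recorder.stop() to end the stream
}
```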
const ws = new WebSocket('ws://localhost:8000/ws/chat');
// Send message
ws.send(JSON.stringify({
type: 'query',
message: 'Hello, how can you help me?',
session_id: 'session_123'
}));
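Responses arrive on the same socket. The exact message schema is not documented here, so the handler below is only a sketch; the field names ('type', 'text', 'answer', 'citations') and the UI hooks are assumptions.

```typescript
// Sketch of handling streamed chat responses; field names are assumptions.
const chatSocket = new WebSocket('ws://localhost:8000/ws/chat');

// Placeholder UI hooks, purely for illustration.
const appendToAnswer = (text: string) => console.log('partial:', text);
const showAnswer = (answer: string, citations: unknown[]) =>
  console.log('final:', answer, citations);

chatSocket.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === 'partial') {
    appendToAnswer(msg.text);              // streamed fragment of the answer
  } else if (msg.type === 'complete') {
    showAnswer(msg.answer, msg.citations); // final answer with citations
  }
};
```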
The frontend includes a comprehensive API client in src/lib/api.ts:
import { QueryRequest, QueryResponse } from '@/types/api';
import { api } from '@/lib/api'; // assumed export of the client in src/lib/api.ts
// Query the system
const response = await api.query({
query: 'What is machine learning?',
num_results: 5
});
// Upload document
const result = await api.uploadDocument(file);
// Get citation details
const citation = await api.getCitation(citationId);
- Open the application at http://localhost:3000
- Type your question in the chat input
- View the AI-generated response with citations
- Click citations to view source content
- Click the microphone icon in the chat interface
- Speak your question clearly
- Watch real-time transcription appear
- Release to send the query
- Receive voice-enabled response
- Click the upload button or drag files into the interface
- Select a PDF document (with images/charts)
- Wait for processing to complete
- Ask questions about the document content
- View responses with page-specific citations
# Query with specific filters
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{
"query": "quarterly revenue trends",
"num_results": 10,
"include_web_search": true,
"filters": {
"date_range": "2023-2024",
"document_type": "financial"
}
}'
cd backend
pytest -v
pytest --cov=app tests/ # With coverage
cd frontend
npm test
npm run test:watch # Watch mode
# Backend
cd backend
black .
flake8 .
# Frontend
cd frontend
npm run lint
npm run type-check
- Feature Development
  - Create feature branch from main
  - Add tests for new functionality
  - Update documentation as needed
- Testing
  - Run full test suite
  - Test with different AI providers
  - Verify WebSocket functionality
- Code Review
  - Check API compatibility
  - Verify error handling
  - Test edge cases
agnt/
├── backend/                  # FastAPI backend
│   ├── app/
│   │   ├── config.py         # Configuration management
│   │   ├── models/           # Pydantic schemas
│   │   ├── services/         # Business logic
│   │   └── websocket/        # WebSocket handlers
│   ├── main.py               # FastAPI application
│   └── requirements.txt      # Python dependencies
├── frontend/                 # Next.js frontend
│   ├── src/
│   │   ├── app/              # App router pages
│   │   ├── components/       # React components
│   │   ├── lib/              # Utilities and API client
│   │   └── store/            # State management
│   └── package.json          # Node.js dependencies
└── README.md                 # This file
# Check Google Cloud credentials
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
# Verify API is enabled
gcloud services list --enabled | grep speech
# Test authentication
python -c "from google.cloud import speech; print('Auth OK')"
# Reset ChromaDB
rm -rf backend/chroma_db/
# Restart backend to reinitialize
# Update CORS_ORIGINS in .env
CORS_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
# Check firewall settings
# Verify WebSocket URL in frontend config
# Check backend logs for connection errors
- Reduce image resolution in processing
- Increase MAX_CONCURRENT_REQUESTS
- Use SSD storage for the database
- Adjust EMBEDDING_BATCH_SIZE
- Limit MAX_TOKENS_PER_CHUNK
- Monitor vector database size
- Enable caching with Redis
- Optimize similarity threshold
- Use parallel search execution
# Backend
LOG_LEVEL=DEBUG
# Frontend
NEXT_PUBLIC_DEBUG=true
# Check service status
curl http://localhost:8000/health
# View logs
tail -f backend/app.log
# Build and run with Docker
docker-compose up --build -d
# Production environment variables
DEBUG=false
RELOAD=false
LOG_LEVEL=INFO
CORS_ORIGINS=https://yourdomain.com
- Use PostgreSQL for metadata storage
- Implement Redis for caching
- Set up load balancing for multiple instances
- Configure CDN for static assets
- Change default SECRET_KEY
- Enable HTTPS in production
- Implement rate limiting
- Secure API endpoints
- Validate file uploads
- Monitor for suspicious activity
- Fork the repository
- Create a feature branch
- Install development dependencies
- Run tests to ensure everything works
- Make your changes
- Add tests for new functionality
- Submit a pull request
- Python: Follow PEP 8, use type hints
- TypeScript: Use strict mode, proper interfaces
- Documentation: Update README for new features
- Testing: Maintain test coverage above 80%
- Check existing issues first
- Provide detailed reproduction steps
- Include system information
- Add relevant logs and error messages
MIT License - see LICENSE file for details.
- Documentation: Check this README and API docs
- Issues: Create GitHub issue for bugs
- Discussions: Use GitHub Discussions for questions