Optimize & Test Prompts for LLMs with PromptOps
A comprehensive platform for testing and evaluating large language model (LLM) prompts through systematic perturbation analysis and robustness testing. This project enables researchers and developers to assess prompt reliability across different LLM providers and testing scenarios.
- Multi-LLM Support: Integration with OpenAI GPT and Google Gemini
- Prompt Testing Framework: Systematic testing with perturbation analysis
- Interactive Web Interface: Drag-and-drop project builder with real-time testing
- Robustness Analysis: 10+ perturbation types including taxonomy, NER, temporal, negation, and fairness
- Score Comparison: Visualization and analysis of model performance across different configurations
- Applicability Check: Automated assessment of which perturbations are relevant for your data
- Sentiment Analysis: Test sentiment classification robustness
- Question Answering: Evaluate QA performance with and without context
- Custom Prompts: Build and test your own prompt configurations
- OpenAI: GPT-3.5, GPT-4, GPT-4o
- Google Gemini: Gemini 2.0 Flash
- React-based UI with TypeScript
- Drag-and-drop interface for building test configurations
- Real-time dashboards with Chart.js visualizations
- Authentication system with NextAuth.js
- Project management with MongoDB integration
- RESTful API with automatic documentation (see the example below)
- Celery task queue for async processing
- Redis for caching and session management
- Comprehensive testing suite with 10+ perturbation types
- Multi-format export (JSON, CSV, Excel)
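Because the backend is FastAPI, the automatic documentation mentioned above is available out of the box once the service is running (assuming FastAPI's default settings and the port used elsewhere in this README):

```bash
# FastAPI serves interactive docs at /docs and a machine-readable schema at /openapi.json by default
curl http://localhost:5328/openapi.json
# Interactive docs in the browser: http://localhost:5328/docs
```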
- Docker containerization with multi-service orchestration
- Nginx reverse proxy with SSL support
- Horizontal scaling with worker processes
- Health monitoring and logging
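As an example of the horizontal scaling noted above, Docker Compose can scale a worker service with the `--scale` flag (a sketch; the service name `worker` is an assumption and should match the name in `docker-compose.yml`):

```bash
docker-compose up -d --scale worker=4   # run four worker containers
```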
- Docker and Docker Compose
- Node.js 18+ (for development)
- Python 3.8+ (for local development)
- Clone the repository

  ```bash
  git clone https://github.com/MUICT-SERU/SP2024-Noppomummum.git
  cd SP2024-Noppomummum
  ```

- Set up environment variables

  ```bash
  cp .env.example .env
  cp api/.env.example api/.env
  # Edit the files with your configuration
  ```

- Start the application

  ```bash
  docker-compose up -d
  ```

- Access the application
  - Web Interface: http://localhost:3000
- Install dependencies

  ```bash
  # Frontend
  pnpm install

  # Backend
  cd api && pip install -r requirements.txt
  ```

- Start development services

  ```bash
  # Start MongoDB and Redis
  docker-compose up redis -d

  # Start backend
  cd api && uvicorn index:app --reload --port 5328

  # Start frontend
  pnpm dev
  ```
Frontend (.env)

```env
NEXTAUTH_SECRET=your-nextauth-secret-key-here
NEXTAUTH_URL=http://localhost:3000
MONGODB_URI=mongodb://localhost:27017/promptops
FASTAPI_URL=http://localhost:5328
NEXT_PUBLIC_FASTAPI_URL=http://localhost:5328
NEXT_PUBLIC_API_KEY=your-api-key-here
API_KEY_ENCRYPTION_KEY=your-32-character-encryption-key-here
```
Backend (api/.env)

```env
API_KEY=your-api-key-here
REDIS_URL=redis://redis:6379
ALLOWED_ORIGINS=http://localhost:3000
HTTPS_REDIRECT=FALSE
API_KEY_ENCRYPTION_KEY=your-32-character-encryption-key-here
```
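If you need a value for `API_KEY_ENCRYPTION_KEY`, one way to produce a random 32-character string is Python's `secrets` module (a sketch; any cryptographically random 32-character key of the expected format should do):

```bash
python -c "import secrets; print(secrets.token_hex(16))"  # 16 random bytes -> 32 hex characters
```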
Configure your LLM provider API keys in the web interface:
- OpenAI API Key
- Google Gemini API Key
- Navigate to the dashboard and click "New Project"
- Select a project type: Sentiment Analysis, QA with Context, or QA without Context
- Configure LLM settings: choose your model provider and parameters
- Upload test data: a CSV file with your prompts and expected results (see the example after this list)
- Select perturbations: choose which robustness tests to apply
- Run tests: execute the test suite and view the results
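For illustration, a sentiment-analysis upload might look like this (the column names `prompt` and `expected` are hypothetical; use whatever schema your project type expects):

```csv
prompt,expected
"The service was quick and friendly.",positive
"My order arrived broken and two weeks late.",negative
```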
- Similarity Scores: Cosine similarity between original and perturbed responses
- Robustness Metrics: Performance degradation under perturbations
- Applicability Analysis: Which perturbations are relevant for your data
- Comparative Analysis: Side-by-side model performance comparison
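To make the similarity score concrete, here is a minimal sketch of cosine similarity over sentence embeddings, assuming the `sentence-transformers` package (the embedding model the platform actually uses is not specified here):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Assumption: any sentence-embedding model serves to illustrate the idea
model = SentenceTransformer("all-MiniLM-L6-v2")

def similarity_score(original: str, perturbed: str) -> float:
    """Cosine similarity between embeddings of two LLM responses (1.0 = identical direction)."""
    a, b = model.encode([original, perturbed])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(similarity_score("The sentiment is positive.", "The sentiment is clearly positive."))
```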
- Taxonomy: Semantic word replacement using WordNet hierarchies
- Named Entity Recognition (NER): Entity substitution while preserving context
- Temporal: Time-related modifications and temporal logic changes
- Negation: Logical negation insertion and removal
- Coreference: Pronoun resolution and reference changes
- Semantic Role Labeling (SRL): Argument structure modification
- Logic: Logical operator and connector changes
- Vocabulary: Synonym replacement and lexical variations
- Fairness: Bias detection through demographic attribute changes
- Robustness: General stress testing with noise and variations
Each perturbation type includes:
- Applicability checking: Automatic detection if perturbation applies to your data
- Severity levels: Control the intensity of modifications
- Context preservation: Maintain semantic meaning while introducing variations
- Batch processing: Apply perturbations to entire datasets efficiently
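To make the idea concrete, here is a toy sketch of a negation perturbation (an illustration only, not the platform's implementation, which would also apply applicability checks and severity control):

```python
import re

def negate(text: str) -> str:
    """Toy negation perturbation: insert 'not' after the first auxiliary/copular verb."""
    return re.sub(r"\b(is|are|was|were|do|does|did)\b", r"\1 not", text, count=1)

original = "The product is worth the price."
print(negate(original))  # -> "The product is not worth the price."
# A robust sentiment prompt should flip its label on the perturbed input.
```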
This project is based on research from "Test It Before You Trust It: Applying Software Testing for Trustworthy In-context Learning" and builds upon various open-source libraries and frameworks:
- Next.js and React ecosystem for the frontend interface
- FastAPI and Python data science stack for backend processing
- Redis and Celery for distributed task processing
- Chart.js and Recharts for data visualization
- Docker for containerization and deployment
- NLTK, spaCy, and transformers for natural language processing
- OpenAI, Google AI for LLM integrations