EvalX is a comprehensive evaluation framework for Large Language Model applications that combines traditional metrics, LLM-as-judge evaluations, and intelligent agentic orchestration with research-grade validation.
- Agentic Orchestration: Natural language instructions → automatic evaluation planning
- Comprehensive Metrics: Traditional + LLM-as-judge + hybrid approaches
- Research-Grade Validation: Statistical analysis, confidence intervals, meta-evaluation
- Multimodal Support: Vision-language, code, audio evaluation
- Production Ready: Async processing, caching, CLI interface
- Adaptive Selection: AI-powered optimal metric selection
EvalX includes the industry's first meta-evaluation system that assesses the quality of evaluation metrics themselves:
- Reliability assessment through test-retest analysis
- Validity measurement against ground truth
- Bias detection across demographic groups
- Interpretability scoring
Adaptive selection automatically picks the optimal metrics based on (see the sketch after this list):
- Task type and domain
- Quality requirements (research vs. production)
- Computational constraints
- Fairness requirements
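A minimal sketch of how this could be driven through the natural-language entry point shown in the quick start below; the constraint wording in the instruction and the `suite.metrics` attribute are illustrative assumptions, not documented API.

```python
import evalx

# Hypothetical sketch: state the task, quality bar, and constraints in plain
# language and let the adaptive selector plan the metric set.
suite = evalx.EvaluationSuite.from_instruction(
    "Evaluate a medical Q&A assistant for accuracy and groundedness; "
    "research-grade rigor, low compute budget, and check for demographic bias"
)

# Inspect the planned metrics before running (attribute name assumed for illustration).
print(suite.metrics)
```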
pip install evalx
For development:
pip install evalx[dev]
For research features:
pip install evalx[research]
For production deployment:
pip install evalx[production]
import evalx
# Create evaluation suite from natural language instruction
suite = evalx.EvaluationSuite.from_instruction(
"Evaluate my chatbot responses for helpfulness and accuracy"
)
# Your data
data = [
{
"input": "What's the capital of France?",
"output": "The capital of France is Paris.",
"reference": "Paris is the capital city of France."
}
]
# Run evaluation
results = await suite.evaluate_async(data)
print(results.summary())
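The top-level `await` works in notebooks and other async contexts; in a plain script you would wrap the call with `asyncio.run`, for example:

```python
import asyncio

# Run the async evaluation from synchronous code
results = asyncio.run(suite.evaluate_async(data))
print(results.summary())
```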
from evalx import MetricSuite
# Create custom metric combination
suite = MetricSuite()
suite.add_traditional_metric("bleu_score")
suite.add_traditional_metric("semantic_similarity", threshold=0.8)
suite.add_llm_judge("accuracy", model="gpt-4")
results = suite.evaluate(data)
from evalx import ResearchSuite
# Comprehensive statistical analysis
suite = ResearchSuite(
metrics=["accuracy", "helpfulness", "bleu"],
confidence_level=0.95,
bootstrap_samples=1000
)
results = await suite.evaluate_research_grade(data)
print(f"Mean Β± Std: {results.mean:.3f} Β± {results.std:.3f}")
print(f"95% CI: [{results.confidence_interval[0]:.3f}, {results.confidence_interval[1]:.3f}]")
from evalx.metrics.multimodal import MultimodalInput, ImageCaptionQualityMetric
# Image captioning evaluation
input_data = MultimodalInput(
input_text="Describe this image",
output_text="A beautiful sunset over the ocean",
image="path/to/image.jpg"
)
metric = ImageCaptionQualityMetric()
result = metric.evaluate(input_data)
from evalx.meta_evaluation import MetaEvaluator
# Evaluate your metrics' quality
meta_evaluator = MetaEvaluator()
quality_report = meta_evaluator.evaluate_metric_quality(
metric=my_metric,
evaluation_data=test_data,
ground_truth=human_ratings
)
print(f"Metric Quality: {quality_report.overall_quality:.3f}")
print(f"Reliability: {quality_report.reliability:.3f}")
print(f"Validity: {quality_report.validity:.3f}")
print(f"Bias Score: {quality_report.bias:.3f}")
# Evaluate using natural language
evalx evaluate "Check my chatbot for helpfulness" --data data.json
# Research-grade evaluation
evalx research --data data.json --metrics accuracy helpfulness --confidence 0.95
# List available metrics
evalx metrics --list
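The file passed to `--data` is assumed to use the same record schema as the Python quick-start example, e.g.:

```json
[
  {
    "input": "What's the capital of France?",
    "output": "The capital of France is Paris.",
    "reference": "Paris is the capital city of France."
  }
]
```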
- BLEU: N-gram overlap with smoothing
- ROUGE: Recall-oriented evaluation (ROUGE-1, ROUGE-2, ROUGE-L)
- METEOR: Semantic matching with synonyms and stemming
- BERTScore: Contextual embedding similarity
- Semantic Similarity: Sentence transformer-based
- Exact Match: String matching with normalization
- Levenshtein: Edit distance with word/character level
- Accuracy: Factual correctness assessment
- Helpfulness: Response utility evaluation
- Coherence: Logical consistency measurement
- Groundedness: Source attribution verification
- Relevance: Query-response alignment
- Image-Text Alignment: CLIP-based similarity
- Image Caption Quality: Comprehensive captioning assessment
- Code Correctness: Syntax, execution, security analysis
- Audio Quality: Signal processing metrics
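As a rough illustration, the catalog above maps onto the `MetricSuite` API from the custom-evaluation example; identifiers other than `bleu_score`, `semantic_similarity`, and `accuracy` are assumed here, so check the actual names with `evalx metrics --list`.

```python
from evalx import MetricSuite

suite = MetricSuite()
# Traditional metrics (names shown in the custom-evaluation example above)
suite.add_traditional_metric("bleu_score")
suite.add_traditional_metric("semantic_similarity", threshold=0.8)
# LLM-as-judge metrics from the catalog (identifiers assumed for illustration)
suite.add_llm_judge("groundedness", model="gpt-4")
suite.add_llm_judge("relevance", model="gpt-4")

results = suite.evaluate(data)
```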
Feature | EvalX | DeepEval | LangChain | Ragas |
---|---|---|---|---|
Meta-evaluation | ✅ Unique | ❌ | ❌ | ❌ |
Statistical rigor | ✅ Best | Basic | Basic | Good |
Multimodal support | ✅ Comprehensive | Limited | Limited | Limited |
Adaptive selection | ✅ Unique | ❌ | ❌ | ❌ |
Natural language interface | ✅ Full | ❌ | ❌ | ❌ |
Production ready | ✅ Complete | Good | Basic | Good |
We welcome contributions!
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
- Built for the AI evaluation community
- Inspired by advances in LLM evaluation research
- Designed for both researchers and practitioners
EvalX: Making AI evaluation comprehensive, reliable, and accessible.