Releases: LSeu-Open/LLMScoreEngine
v0.6.1 Beta
LLMScoreEngine v0.6.1 – Enhanced HTML Report
This release delivers a major overhaul of the graphical reporting system, bringing a more polished, performant, and secure experience.
📊 Graphical Reporting Enhancements
- ✅ Refactored CSS: Completely rewritten for improved visual design, responsiveness, and interactivity.
- ⚡ Optimized JavaScript: Performance improvements to ensure smoother rendering and faster interactions.
- 🔒 Enhanced Security: Added Content-Security-Policy (CSP) headers to strengthen report security.
- 🎨 Improved UX: Refined interfaces for both the Leaderboards and Model Comparison Tool — cleaner, more intuitive, and user-friendly.
Enjoy a faster, safer, and visually refined reporting experience with every evaluation.
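For readers curious about the CSP hardening, the snippet below shows one common way a self-contained HTML report can declare a policy via a `<meta>` tag. This is only a minimal sketch: the directives shown are placeholder assumptions, not necessarily the policy actually shipped in the report.

```python
# Illustration only: the directives below are placeholder assumptions,
# not necessarily the policy used by the LLMScoreEngine report.
CSP_META = (
    '<meta http-equiv="Content-Security-Policy" '
    "content=\"default-src 'self'; script-src 'self'; "
    "style-src 'self' 'unsafe-inline'; img-src 'self' data:\">"
)

def render_report_head(title: str) -> str:
    """Build the <head> of a self-contained HTML report, including the CSP tag."""
    return f"<head><title>{title}</title>{CSP_META}</head>"

print(render_report_head("LLMScoreEngine Report"))
```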
v0.6.0 Beta
LLMScoreEngine v0.6.0 - New Feature & Enhanced Technical Scoring
This update introduces a powerful Command Line Interface (CLI) feature to the LLM-Score-Engine, enabling the generation of comprehensive graphical reports from model evaluation data. The new functionality allows users to produce detailed, visually rich analyses of their models' performance metrics.
New Feature: Graphical Reporting
- Graphical Report (`--graph`): Generate an in-depth graphical report from your latest CSV output (via the `--csv` option) with a single command (see the usage example after this list). This interactive report includes:
  - An intuitive and interactive Leaderboard for comparing LLM performance
  - A Model Comparison Tool for detailed side-by-side analysis
  - A Cost/Efficiency-Performance metrics Leaderboard
  - Comprehensive graphical exploration of all evaluation metrics produced by the engine
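As a rough usage sketch, assuming a placeholder entry point named `llmscoreengine` (the notes do not state the actual command name), the report can be produced right after a scoring run that writes the CSV:

```python
# Hypothetical invocation: "llmscoreengine" is a placeholder module name.
# The --all, --csv and --graph flags are the ones documented in these notes.
import subprocess

subprocess.run(
    ["python", "-m", "llmscoreengine", "--all", "--csv", "--graph"],
    check=True,  # raise if the scoring run fails
)
```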
Enhanced Technical Scoring Methodology
- The JSON model configuration files now support both input and output pricing specifications for all models.
- This expanded pricing data enables a more accurate, real-world technical scoring methodology that better reflects actual operational costs.
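The release notes do not show the exact JSON schema, so the field names below are assumptions; the sketch only illustrates how separate input and output prices could feed a cost-aware technical score.

```python
import json

# Hypothetical model entry: "input_price" / "output_price" and their units
# (USD per 1M tokens) are assumptions, not the engine's actual schema.
model_entry = {
    "name": "example-model",
    "pricing": {
        "input_price": 0.50,
        "output_price": 1.50,
    },
}

# One possible blend (the 75/25 ratio is illustrative) that a cost-aware score could use.
p = model_entry["pricing"]
blended_cost = 0.75 * p["input_price"] + 0.25 * p["output_price"]
print(json.dumps(model_entry, indent=2))
print(f"blended cost: {blended_cost:.2f} USD per 1M tokens")
```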
v0.5.2 Beta
LLMScoreEngine v0.5.2 - Bug Fix Release
This quick update addresses a critical bug affecting the main scoring function.
Bug Fixes
- Refactored the `calculate_final_score` method to compute a unified `overall_benchmark_score` from the entity and dev benchmark results. This score is then explicitly passed into `calculate_technical_score` for the size-performance ratio calculation. This change resolves an implicit dependency where `calculate_technical_score` relied on a `benchmark_score` that was not guaranteed to be present in its inputs. The new implementation makes the scoring logic more robust and self-contained, and ensures that the technical evaluation is always based on the model's calculated benchmark performance (a simplified sketch of this flow follows below).
- Resolved a number of linter errors to improve code quality.
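A simplified sketch of the flow described above; the real methods belong to the scorer class and take more parameters, and the aggregation and combination used here are assumptions, not the engine's exact formulas.

```python
# Simplified sketch, not the actual implementation.
def calculate_technical_score(model_size_b: float,
                              overall_benchmark_score: float) -> float:
    # The size-performance ratio always uses the explicitly passed benchmark score.
    return overall_benchmark_score / max(model_size_b, 1.0)

def calculate_final_score(entity_score: float, dev_score: float,
                          model_size_b: float) -> float:
    # Unify the two benchmark results into a single overall_benchmark_score ...
    overall_benchmark_score = (entity_score + dev_score) / 2  # assumed aggregation
    # ... and pass it explicitly instead of hoping it is present in the inputs.
    technical = calculate_technical_score(model_size_b, overall_benchmark_score)
    return entity_score + dev_score + technical  # assumed combination

print(calculate_final_score(entity_score=24.0, dev_score=21.0, model_size_b=70.0))
```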
v0.5.1 Beta
LLMScoreEngine v0.5.1 - Bug Fix Release
This quick update addresses a critical bug affecting the main scoring function.
Bug Fixes
- Fixed processing of models with '.' in their names: the main scoring function now correctly handles models with periods in their names, such as "Claude 3.7 Sonnet" (the sketch below illustrates the kind of pitfall involved).
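The notes do not describe the root cause, so the snippet below only illustrates the kind of pitfall a period in a model name can trigger when a data file path is derived from it; it is not the actual fix.

```python
from pathlib import Path

def model_data_path(models_dir: Path, model_name: str) -> Path:
    # Appending the suffix explicitly keeps "Claude 3.7 Sonnet" intact.
    # Splitting the name on "." (e.g. with os.path.splitext) would truncate
    # it to "Claude 3", and the model file would never be found.
    return models_dir / f"{model_name}.json"

print(model_data_path(Path("Models"), "Claude 3.7 Sonnet"))
```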
v0.5.0 Beta
LLMScoreEngine v0.5.0 - New Features, Major Refactoring and Testing Overhaul
This update represents a major refactoring of the LLMScoreEngine, moving from a system with hardcoded values to a centralized, configuration-driven architecture. This is complemented by a complete, from-scratch `pytest` testing framework to ensure the reliability and correctness of the application.
New Features
- Batch Scoring (`--all`): Score all models in the `Models/` directory with a single command.
- Dynamic Configuration (`--config`): Provide an external Python configuration file to experiment with scoring parameters without modifying the core code.
- CSV Report Generation (`--csv`): Generate a consolidated CSV report of all model scores, saved in the `Results/` directory.
- Quiet Output Mode (`--quiet`): Suppress all informational output to show only the final model scores, ideal for automated scripts. (A sketch of how these flags might be wired appears after this list.)
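A minimal sketch of how these flags might be wired with `argparse`; only the flag names come from the notes, everything else (program name, help text, handler logic) is an assumption.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="llmscoreengine")  # placeholder name
    parser.add_argument("--all", action="store_true",
                        help="score every model in the Models/ directory")
    parser.add_argument("--config", metavar="PATH",
                        help="external Python configuration file")
    parser.add_argument("--csv", action="store_true",
                        help="write a consolidated CSV report to Results/")
    parser.add_argument("--quiet", action="store_true",
                        help="print only the final model scores")
    return parser

if __name__ == "__main__":
    print(build_parser().parse_args())
```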
Refactoring & Bug Fixes
- Architecture:
  - Refactored the application to use a centralized, immutable configuration file (`config/scoring_config.py`) instead of hardcoded values (see the first sketch after this list).
  - Improved import paths and module structure to prevent `ImportError` issues and clarify the package API.
- Scoring Logic:
  - The `ModelScorer` class was rewritten to source all parameters from the central configuration.
  - Corrected scoring formulas in `hf_score.py` to align with the project documentation.
- Bug Fixes:
  - Fixed a critical bug that caused benchmark scores to be ignored during calculations.
  - Resolved a case-insensitivity bug in `data/loaders.py` that prevented model files from being found.
- Testing:
  - Built a comprehensive `pytest` testing framework from scratch, with unit and end-to-end tests covering all critical modules.
  - Used mocking for external APIs to ensure tests are fast and deterministic (see the second sketch after this list).
  - Improved test isolation and added verbose logging for clearer debugging.
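First sketch: one common way to keep a central configuration immutable is a frozen dataclass. The field names and values below are placeholders (only the 30-point benchmark budgets appear elsewhere in these notes); the real `config/scoring_config.py` may look quite different.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScoringConfig:
    entity_benchmark_points: int = 30   # from the v0.4.0 notes
    dev_benchmark_points: int = 30      # from the v0.4.0 notes
    community_points: int = 20          # assumed split
    technical_points: int = 20          # assumed split

CONFIG = ScoringConfig()
# CONFIG.technical_points = 25  # would raise dataclasses.FrozenInstanceError
```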
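Second sketch: the pattern used to keep tests fast and deterministic, with the external API call replaced by a mock. The loader here is a stand-in invented for the example, not the project's real code.

```python
from unittest.mock import patch

class Loaders:
    """Stand-in for a loader module that would normally call an external API."""
    @staticmethod
    def fetch_model_card(name: str) -> dict:
        raise RuntimeError("would perform a real network request")

def test_fetch_model_card_is_mocked():
    fake_card = {"downloads": 1_000_000, "likes": 2_500}
    with patch.object(Loaders, "fetch_model_card", return_value=fake_card):
        assert Loaders.fetch_model_card("example-model")["likes"] == 2_500
```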
v0.4.1 Beta
Small beta release.
- General code cleanup: removed unnecessary dev comments to improve readability.
- Updated `hf_score.py` to allow CLI usage.
v0.4.0 Beta
This beta release introduces significant enhancements to the LLMScoreEngine, focusing on a more nuanced approach to benchmark scoring, community evaluation, and technical assessments.
🎯 Rethinking Entity Benchmarks (Now 30 points total)
The Entity Benchmarks category has been updated to provide a more comprehensive evaluation:
- Expanded Coverage: Scores are now incorporated from 10 distinct entities, categorized into three key specialties (a worked weighting example follows this list).
- Specialized Breakdown:
  - Generalist (each contributes 10% to the Entity Benchmark score):
    - Artificial Analysis Intelligence Score
    - OpenCompass LLM Average Score
    - LLM Explorer Score
    - Livebench Average Score
    - Open LLM Leaderboard Average Score
    - UGI Leaderboard UGI Score
    - Dubersor LLM Leaderboard Total Score
  - Coding (each contributes 10% to the Entity Benchmark score):
    - BigCodeBench Leaderboard Score
    - EvalPlus Leaderboard Pass@1 Score
  - Vision (contributes 10% to the Entity Benchmark score):
    - Open VLM Leaderboard Average Score
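Worked weighting example: with 10 entities each contributing 10% and the category worth 30 points, the score reduces to a simple weighted sum. The individual scores below are invented for illustration, and the 0-100 normalization is an assumption; see the Scoring Framework Description for the exact rules.

```python
entity_scores = {  # illustrative numbers, not real leaderboard results
    "Artificial Analysis": 62.0, "OpenCompass": 71.5, "LLM Explorer": 58.0,
    "Livebench": 55.2, "Open LLM Leaderboard": 43.8, "UGI Leaderboard": 39.0,
    "Dubersor LLM Leaderboard": 61.0, "BigCodeBench": 47.5, "EvalPlus": 80.1,
    "Open VLM Leaderboard": 66.3,
}

WEIGHT = 0.10  # each of the 10 entities contributes 10%
normalized = sum(WEIGHT * (score / 100.0) for score in entity_scores.values())
entity_benchmark_points = 30 * normalized  # category is worth 30 points
print(f"{entity_benchmark_points:.2f} / 30")
```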
🛠️ Structured Dev Benchmarks (30 points total)
This release formalizes and details the structure for Dev Benchmarks, ensuring a wide-ranging assessment of model capabilities. The system is designed to cover multiple facets of modern SOTA model performance:
- Comprehensive Categories: Models are evaluated across eight key areas:
  - General Knowledge & Reasoning
  - Instruction Following
  - Math
  - Coding
  - Multilingual Capabilities
  - Context Handling
  - Function Calling (Tool Use & Agents)
  - Vision (Multimodal Capabilities)
- Emphasis on Data Integrity: Higher weight is assigned to benchmarks that are mindful of potential data contamination.
- Weighted Evaluation: Each specific benchmark within these categories has an assigned weight, reflecting its relative importance in the overall Dev Benchmark score.
- Robust Scoring System: The scoring methodology includes normalization of benchmark scores and a system for proportionally redistributing weights when data for specific benchmarks is unavailable, ensuring fair and consistent comparisons.
- In-depth Methodology: For a detailed breakdown of individual benchmarks, their weights, contamination status, and the precise scoring formulas, please refer to the Scoring Framework Description.
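The proportional weight redistribution mentioned above can be pictured with the small sketch below; the exact formula used by the engine is defined in the Scoring Framework Description, and the weights shown are invented for illustration.

```python
def redistribute(weights: dict[str, float], available: set[str]) -> dict[str, float]:
    """Drop benchmarks with no data and renormalize the remaining weights
    so they still sum to 1.0, preserving their relative proportions."""
    kept = {name: w for name, w in weights.items() if name in available}
    total = sum(kept.values())
    return {name: w / total for name, w in kept.items()}

weights = {"math": 0.4, "coding": 0.4, "vision": 0.2}   # illustrative weights
print(redistribute(weights, available={"math", "coding"}))
# -> {'math': 0.5, 'coding': 0.5}
```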
👥 Revamped Community Score
The Community Score now offers a more holistic view of a model's standing:
- Hugging Face Score Integration: Alongside the LMSys Arena Elo, a Hugging Face score has been added. This new metric is calculated based on factors like downloads, likes, and the model's age.
- Further Details: For a deeper understanding of this calculation, please refer to the Scoring Framework Description.
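Purely as an illustration of the kind of inputs involved (downloads, likes, model age), a toy score might look like the sketch below. Every coefficient is invented; the actual Hugging Face score formula is defined in the Scoring Framework Description.

```python
import math
from datetime import date

def illustrative_hf_score(downloads: int, likes: int, released: date) -> float:
    # Invented placeholder formula, NOT the engine's real calculation.
    age_days = max((date.today() - released).days, 1)
    popularity = math.log10(downloads + 1) + math.log10(likes + 1)
    return popularity / math.log10(age_days + 10)

print(round(illustrative_hf_score(2_000_000, 3_500, date(2024, 6, 1)), 2))
```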
⚙️ Enhanced Technical Score
The Technical Score has been refined for greater accuracy and fairness:
- Smoother Points Distribution:
  - Price: Points are now distributed linearly.
  - Context Window: Points follow a logarithmic distribution.
  - These changes address previous issues with tiered or stepped scoring.
- Improved Model Size to Performance Ratio: This metric now utilizes parameter count and architecture type to more effectively assess a model's performance relative to its size and architectural efficiency.
- Further Details: Learn more about these improvements in the Scoring Framework Description.
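A sketch of the two distributions described above: points for price fall off linearly between two bounds, while points for context window grow logarithmically (diminishing returns). The point budgets and bounds are assumptions, not the engine's configured values.

```python
import math

def price_points(price: float, max_points: float = 10.0,
                 best: float = 0.1, worst: float = 20.0) -> float:
    # Cheaper is better; points decrease linearly between the two bounds.
    clamped = min(max(price, best), worst)
    return max_points * (worst - clamped) / (worst - best)

def context_points(context_window: int, max_points: float = 10.0,
                   ceiling: int = 2_000_000) -> float:
    # Logarithmic growth: doubling the window adds a constant number of points.
    return max_points * min(math.log10(context_window) / math.log10(ceiling), 1.0)

print(round(price_points(1.0), 2), round(context_points(128_000), 2))
```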
v0.3.1 Beta
Initial Release Following Migration
This marks the official launch of the AI Scoring Framework, now called LLMScoreEngine, in its new dedicated repository. Key updates include:
- Migrated the scoring framework from AIEnhancedWork to this repository for centralized management.
- Reorganized the project structure to enhance maintainability, scalability, and clarity.
- Upgraded the README file with detailed documentation, usage examples, and streamlined instructions for easier onboarding.
- Added CLI support for command-line interactions, enabling faster and more flexible workflows.
This release lays the foundation for future features, improvements, and contributions. 🚀