Releases: LSeu-Open/LLMScoreEngine
v0.6.1 Beta
LLMScoreEngine v0.6.1 – Enhanced HTML Report
This release delivers a major overhaul of the graphical reporting system, bringing a more polished, performant, and secure experience.
📊 Graphical Reporting Enhancements
- ✅ Refactored CSS: Completely rewritten for improved visual design, responsiveness, and interactivity.
- ⚡ Optimized JavaScript: Performance improvements to ensure smoother rendering and faster interactions.
- 🔒 Enhanced Security: Added Content-Security-Policy (CSP) headers to strengthen report security.
- 🎨 Improved UX: Refined interfaces for both the Leaderboards and Model Comparison Tool — cleaner, more intuitive, and user-friendly.
Enjoy a faster, safer, and visually refined reporting experience with every evaluation.
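For readers curious about the CSP hardening, the snippet below shows one common way a self-contained HTML report can declare a policy via a `<meta>` tag. This is only a minimal sketch: the directives shown are placeholder assumptions, not necessarily the policy actually shipped in the report.

```python
# Illustration only: the directives below are placeholder assumptions,
# not necessarily the policy used by the LLMScoreEngine report.
CSP_META = (
    '<meta http-equiv="Content-Security-Policy" '
    "content=\"default-src 'self'; script-src 'self'; "
    "style-src 'self' 'unsafe-inline'; img-src 'self' data:\">"
)

def render_report_head(title: str) -> str:
    """Build the <head> of a self-contained HTML report, including the CSP tag."""
    return f"<head><title>{title}</title>{CSP_META}</head>"

print(render_report_head("LLMScoreEngine Report"))
```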
v0.6.0 Beta
LLMScoreEngine v0.6.0 - New Feature & Enhanced Technical Scoring
This update introduces a powerful Command Line Interface (CLI) feature to the LLM-Score-Engine, enabling the generation of comprehensive graphical reports from model evaluation data. The new functionality allows users to produce detailed, visually rich analyses of their models' performance metrics.
New Feature: Graphical Reporting
- Graphical Report (`--graph`): Generate an in-depth graphical report from your latest CSV output (via the `--csv` option) with a single command (see the usage example after this list). This interactive report includes:
  - An intuitive and interactive Leaderboard for comparing LLM performance
  - A Model Comparison Tool for detailed side-by-side analysis
  - A Cost/Efficiency-Performance metrics Leaderboard
  - Comprehensive graphical exploration of all evaluation metrics produced by the engine
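As a rough usage sketch, assuming a placeholder entry point named `llmscoreengine` (the notes do not state the actual command name), the report can be produced right after a scoring run that writes the CSV:

```python
# Hypothetical invocation: "llmscoreengine" is a placeholder module name.
# The --all, --csv and --graph flags are the ones documented in these notes.
import subprocess

subprocess.run(
    ["python", "-m", "llmscoreengine", "--all", "--csv", "--graph"],
    check=True,  # raise if the scoring run fails
)
```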
Enhanced Technical Scoring Methodology
- The JSON model configuration files now support both input and output pricing specifications for all models.
- This expanded pricing data enables a more accurate, real-world technical scoring methodology that better reflects actual operational costs.
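The release notes do not show the exact JSON schema, so the field names below are assumptions; the sketch only illustrates how separate input and output prices could feed a cost-aware technical score.

```python
import json

# Hypothetical model entry: "input_price" / "output_price" and their units
# (USD per 1M tokens) are assumptions, not the engine's actual schema.
model_entry = {
    "name": "example-model",
    "pricing": {
        "input_price": 0.50,
        "output_price": 1.50,
    },
}

# One possible blend (the 75/25 ratio is illustrative) that a cost-aware score could use.
p = model_entry["pricing"]
blended_cost = 0.75 * p["input_price"] + 0.25 * p["output_price"]
print(json.dumps(model_entry, indent=2))
print(f"blended cost: {blended_cost:.2f} USD per 1M tokens")
```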
v0.5.2 Beta
LLMScoreEngine v0.5.2 - Bug Fix Release
This quick update addresses a critical bug affecting the main scoring function.
Bug Fixes
- Refactored the `calculate_final_score` method to compute a unified `overall_benchmark_score` from the entity and dev benchmark results. This score is then explicitly passed into `calculate_technical_score` for the size-performance ratio calculation. This change resolves an implicit dependency where `calculate_technical_score` relied on a `benchmark_score` that was not guaranteed to be present in its inputs. The new implementation makes the scoring logic more robust and self-contained, and ensures that the technical evaluation is always based on the model's calculated benchmark performance (a simplified sketch of this flow follows below).
- Resolved a number of linter errors to improve code quality.
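A simplified sketch of the flow described above; the real methods belong to the scorer class and take more parameters, and the aggregation and combination used here are assumptions, not the engine's exact formulas.

```python
# Simplified sketch, not the actual implementation.
def calculate_technical_score(model_size_b: float,
                              overall_benchmark_score: float) -> float:
    # The size-performance ratio always uses the explicitly passed benchmark score.
    return overall_benchmark_score / max(model_size_b, 1.0)

def calculate_final_score(entity_score: float, dev_score: float,
                          model_size_b: float) -> float:
    # Unify the two benchmark results into a single overall_benchmark_score ...
    overall_benchmark_score = (entity_score + dev_score) / 2  # assumed aggregation
    # ... and pass it explicitly instead of hoping it is present in the inputs.
    technical = calculate_technical_score(model_size_b, overall_benchmark_score)
    return entity_score + dev_score + technical  # assumed combination

print(calculate_final_score(entity_score=24.0, dev_score=21.0, model_size_b=70.0))
```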
v0.5.1 Beta
LLMScoreEngine v0.5.1 - Bug Fix Release
This quick update addresses a critical bug affecting the main scoring function.
Bug Fixes
- Fixed processing of models with '.' in their names: the main scoring function now correctly handles models with periods in their names, such as "Claude 3.7 Sonnet" (the sketch below illustrates the kind of pitfall involved).
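The notes do not describe the root cause, so the snippet below only illustrates the kind of pitfall a period in a model name can trigger when a data file path is derived from it; it is not the actual fix.

```python
from pathlib import Path

def model_data_path(models_dir: Path, model_name: str) -> Path:
    # Appending the suffix explicitly keeps "Claude 3.7 Sonnet" intact.
    # Splitting the name on "." (e.g. with os.path.splitext) would truncate
    # it to "Claude 3", and the model file would never be found.
    return models_dir / f"{model_name}.json"

print(model_data_path(Path("Models"), "Claude 3.7 Sonnet"))
```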
v0.5.0 Beta
LLMScoreEngine v0.5.0 - New Features, Major Refactoring and Testing Overhaul
This update represents a major refactoring of the LLMScoreEngine, moving from a system with hardcoded values to a centralized, configuration-driven architecture. This is complemented by a complete, from-scratch `pytest` testing framework to ensure the reliability and correctness of the application.
New Features
- Batch Scoring (`--all`): Score all models in the `Models/` directory with a single command.
- Dynamic Configuration (`--config`): Provide an external Python configuration file to experiment with scoring parameters without modifying the core code.
- CSV Report Generation (`--csv`): Generate a consolidated CSV report of all model scores, saved in the `Results/` directory.
- Quiet Output Mode (`--quiet`): Suppress all informational output to show only the final model scores, ideal for automated scripts. (A sketch of how these flags might be wired appears after this list.)
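A minimal sketch of how these flags might be wired with `argparse`; only the flag names come from the notes, everything else (program name, help text, handler logic) is an assumption.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="llmscoreengine")  # placeholder name
    parser.add_argument("--all", action="store_true",
                        help="score every model in the Models/ directory")
    parser.add_argument("--config", metavar="PATH",
                        help="external Python configuration file")
    parser.add_argument("--csv", action="store_true",
                        help="write a consolidated CSV report to Results/")
    parser.add_argument("--quiet", action="store_true",
                        help="print only the final model scores")
    return parser

if __name__ == "__main__":
    print(build_parser().parse_args())
```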
Refactoring & Bug Fixes
- Architecture:
  - Refactored the application to use a centralized, immutable configuration file (`config/scoring_config.py`) instead of hardcoded values (see the first sketch after this list).
  - Improved import paths and module structure to prevent `ImportError` issues and clarify the package API.
- Scoring Logic:
  - The `ModelScorer` class was rewritten to source all parameters from the central configuration.
  - Corrected scoring formulas in `hf_score.py` to align with the project documentation.
- Bug Fixes:
  - Fixed a critical bug that caused benchmark scores to be ignored during calculations.
  - Resolved a case-insensitivity bug in `data/loaders.py` that prevented model files from being found.
- Testing:
  - Built a comprehensive `pytest` testing framework from scratch, with unit and end-to-end tests covering all critical modules.
  - Used mocking for external APIs to ensure tests are fast and deterministic (see the second sketch after this list).
  - Improved test isolation and added verbose logging for clearer debugging.
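First sketch: one common way to keep a central configuration immutable is a frozen dataclass. The field names and values below are placeholders (only the 30-point benchmark budgets appear elsewhere in these notes); the real `config/scoring_config.py` may look quite different.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScoringConfig:
    entity_benchmark_points: int = 30   # from the v0.4.0 notes
    dev_benchmark_points: int = 30      # from the v0.4.0 notes
    community_points: int = 20          # assumed split
    technical_points: int = 20          # assumed split

CONFIG = ScoringConfig()
# CONFIG.technical_points = 25  # would raise dataclasses.FrozenInstanceError
```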
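Second sketch: the pattern used to keep tests fast and deterministic, with the external API call replaced by a mock. The loader here is a stand-in invented for the example, not the project's real code.

```python
from unittest.mock import patch

class Loaders:
    """Stand-in for a loader module that would normally call an external API."""
    @staticmethod
    def fetch_model_card(name: str) -> dict:
        raise RuntimeError("would perform a real network request")

def test_fetch_model_card_is_mocked():
    fake_card = {"downloads": 1_000_000, "likes": 2_500}
    with patch.object(Loaders, "fetch_model_card", return_value=fake_card):
        assert Loaders.fetch_model_card("example-model")["likes"] == 2_500
```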
v0.4.1 Beta
Small beta release.
- General code cleanup: removed unnecessary dev comments to improve readability.
- Updated `hf_score.py` to allow CLI usage.
v0.4.0 Beta
This beta release introduces significant enhancements to the LLMScoreEngine, focusing on a more nuanced approach to benchmark scoring, community evaluation, and technical assessments.
🎯 Rethinking Entity Benchmarks (Now 30 points total)
The Entity Benchmarks category has been updated to provide a more comprehensive evaluation:
- Expanded Coverage: Scores are now incorporated from 10 distinct entities, categorized into three key specialties (a worked weighting example follows this list).
- Specialized Breakdown:
  - Generalist (each contributes 10% to the Entity Benchmark score):
    - Artificial Analysis Intelligence Score
    - OpenCompass LLM Average Score
    - LLM Explorer Score
    - Livebench Average Score
    - Open LLM Leaderboard Average Score
    - UGI Leaderboard UGI Score
    - Dubersor LLM Leaderboard Total Score
  - Coding (each contributes 10% to the Entity Benchmark score):
    - BigCodeBench Leaderboard Score
    - EvalPlus Leaderboard Pass@1 Score
  - Vision (contributes 10% to the Entity Benchmark score):
    - Open VLM Leaderboard Average Score
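Worked weighting example: with 10 entities each contributing 10% and the category worth 30 points, the score reduces to a simple weighted sum. The individual scores below are invented for illustration, and the 0-100 normalization is an assumption; see the Scoring Framework Description for the exact rules.

```python
entity_scores = {  # illustrative numbers, not real leaderboard results
    "Artificial Analysis": 62.0, "OpenCompass": 71.5, "LLM Explorer": 58.0,
    "Livebench": 55.2, "Open LLM Leaderboard": 43.8, "UGI Leaderboard": 39.0,
    "Dubersor LLM Leaderboard": 61.0, "BigCodeBench": 47.5, "EvalPlus": 80.1,
    "Open VLM Leaderboard": 66.3,
}

WEIGHT = 0.10  # each of the 10 entities contributes 10%
normalized = sum(WEIGHT * (score / 100.0) for score in entity_scores.values())
entity_benchmark_points = 30 * normalized  # category is worth 30 points
print(f"{entity_benchmark_points:.2f} / 30")
```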
🛠️ Structured Dev Benchmarks (30 points total)
This release formalizes and details the structure for Dev Benchmarks, ensuring a wide-ranging assessment of model capabilities. The system is designed to cover multiple facets of modern SOTA model performance:
- Comprehensive Categories: Models are evaluated across eight key areas:
  - General Knowledge & Reasoning
  - Instruction Following
  - Math
  - Coding
  - Multilingual Capabilities
  - Context Handling
  - Function Calling (Tool Use & Agents)
  - Vision (Multimodal Capabilities)
- Emphasis on Data Integrity: Higher weight is assigned to benchmarks that are mindful of potential data contamination.
- Weighted Evaluation: Each specific benchmark within these categories has an assigned weight, reflecting its relative importance in the overall Dev Benchmark score.
- Robust Scoring System: The scoring methodology includes normalization of benchmark scores and a system for proportionally redistributing weights when data for specific benchmarks is unavailable, ensuring fair and consistent comparisons.
- In-depth Methodology: For a detailed breakdown of individual benchmarks, their weights, contamination status, and the precise scoring formulas, please refer to the Scoring Framework Description.
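The proportional weight redistribution mentioned above can be pictured with the small sketch below; the exact formula used by the engine is defined in the Scoring Framework Description, and the weights shown are invented for illustration.

```python
def redistribute(weights: dict[str, float], available: set[str]) -> dict[str, float]:
    """Drop benchmarks with no data and renormalize the remaining weights
    so they still sum to 1.0, preserving their relative proportions."""
    kept = {name: w for name, w in weights.items() if name in available}
    total = sum(kept.values())
    return {name: w / total for name, w in kept.items()}

weights = {"math": 0.4, "coding": 0.4, "vision": 0.2}   # illustrative weights
print(redistribute(weights, available={"math", "coding"}))
# -> {'math': 0.5, 'coding': 0.5}
```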
👥 Revamped Community Score
The Community Score now offers a more holistic view of a model's standing:
- Hugging Face Score Integration: Alongside the LMSys Arena Elo, a Hugging Face score has been added. This new metric is calculated based on factors like downloads, likes, and the model's age.
- Further Details: For a deeper understanding of this calculation, please refer to the Scoring Framework Description.
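Purely as an illustration of the kind of inputs involved (downloads, likes, model age), a toy score might look like the sketch below. Every coefficient is invented; the actual Hugging Face score formula is defined in the Scoring Framework Description.

```python
import math
from datetime import date

def illustrative_hf_score(downloads: int, likes: int, released: date) -> float:
    # Invented placeholder formula, NOT the engine's real calculation.
    age_days = max((date.today() - released).days, 1)
    popularity = math.log10(downloads + 1) + math.log10(likes + 1)
    return popularity / math.log10(age_days + 10)

print(round(illustrative_hf_score(2_000_000, 3_500, date(2024, 6, 1)), 2))
```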
⚙️ Enhanced Technical Score
The Technical Score has been refined for greater accuracy and fairness:
- Smoother Points Distribution:
  - Price: Points are now distributed linearly.
  - Context Window: Points follow a logarithmic distribution.
  - These changes address previous issues with tiered or stepped scoring.
- Improved Model Size to Performance Ratio: This metric now utilizes parameter count and architecture type to more effectively assess a model's performance relative to its size and architectural efficiency.
- Further Details: Learn more about these improvements in the Scoring Framework Description.
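A sketch of the two distributions described above: points for price fall off linearly between two bounds, while points for context window grow logarithmically (diminishing returns). The point budgets and bounds are assumptions, not the engine's configured values.

```python
import math

def price_points(price: float, max_points: float = 10.0,
                 best: float = 0.1, worst: float = 20.0) -> float:
    # Cheaper is better; points decrease linearly between the two bounds.
    clamped = min(max(price, best), worst)
    return max_points * (worst - clamped) / (worst - best)

def context_points(context_window: int, max_points: float = 10.0,
                   ceiling: int = 2_000_000) -> float:
    # Logarithmic growth: doubling the window adds a constant number of points.
    return max_points * min(math.log10(context_window) / math.log10(ceiling), 1.0)

print(round(price_points(1.0), 2), round(context_points(128_000), 2))
```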
v0.3.1 Beta
Initial Release Following Migration
This marks the official launch of the AI Scoring Framework, now called LLMScoreEngine, in its new dedicated repository. Key updates include:
- Migrated the scoring framework from AIEnhancedWork to this repository for centralized management.
- Reorganized the project structure to enhance maintainability, scalability, and clarity.
- Upgraded the README file with detailed documentation, usage examples, and streamlined instructions for easier onboarding.
- Added CLI support for command-line interactions, enabling faster and more flexible workflows.
This release lays the foundation for future features, improvements, and contributions. 🚀