Releases: LSeu-Open/LLMScoreEngine

v0.6.1 Beta

07 Aug 12:00
97207f0

LLMScoreEngine v0.6.1 – Enhanced HTML Report

This release delivers a major overhaul of the graphical reporting system, bringing a more polished, performant, and secure experience.

📊 Graphical Reporting Enhancements

  • Refactored CSS: Completely rewritten for improved visual design, responsiveness, and interactivity.
  • Optimized JavaScript: Performance improvements to ensure smoother rendering and faster interactions.
  • 🔒 Enhanced Security: Added Content-Security-Policy (CSP) headers to strengthen report security.
  • 🎨 Improved UX: Refined interfaces for both the Leaderboards and Model Comparison Tool — cleaner, more intuitive, and user-friendly.

Every evaluation now delivers a faster, safer, and visually refined report.

v0.6.0 Beta

18 Jul 12:17
90ef8eb

LLMScoreEngine v0.6.0 - New Feature & Enhanced Technical Scoring

This update introduces a powerful command-line interface (CLI) feature to LLMScoreEngine, enabling the generation of comprehensive graphical reports from model evaluation data. The new functionality lets users produce detailed, visually rich analyses of their models' performance metrics.

New Feature: Graphical Reporting

  • Graphical Report (--graph):
    Generate an in-depth graphical report from your latest CSV output (produced via the --csv option) with a single command.
    This interactive report includes:
    • An intuitive and interactive Leaderboard for comparing LLM performances
    • A Model Comparison Tool for detailed side-by-side analysis
    • Cost/Efficiency-Performance metrics Leaderboard
    • Comprehensive graphical exploration of all evaluation metrics produced by the engine
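
As a rough illustration of how this could be scripted (the module entry point below is an assumption, not the engine's documented invocation; --all, --csv, and --graph are the flags documented in these notes):

```python
# Hypothetical automation sketch: produce the consolidated CSV first, then
# build the graphical report from it. The "llm_score_engine" module name is
# an assumption; the flags are the ones documented in these release notes.
import subprocess

subprocess.run(["python", "-m", "llm_score_engine", "--all", "--csv"], check=True)
subprocess.run(["python", "-m", "llm_score_engine", "--graph"], check=True)
```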

Enhanced Technical Scoring Methodology

  • The JSON model configuration files now support both input and output pricing specifications for all models.
  • This expanded pricing data enables a more accurate, real-world technical scoring methodology that better reflects actual operational costs.
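
For intuition, separate input and output prices can be blended into a single cost figure; the 3:1 input-to-output token ratio below is an illustrative assumption, not the engine's documented weighting:

```python
# Illustrative sketch only: blending per-million-token input and output
# prices into one cost figure. The 3:1 token-ratio assumption is
# hypothetical, not the engine's documented methodology.
def blended_price(input_price: float, output_price: float,
                  input_ratio: float = 0.75) -> float:
    """Blend input/output prices into a single per-million-token cost."""
    return input_ratio * input_price + (1.0 - input_ratio) * output_price

print(blended_price(3.00, 15.00))  # e.g. $3/M input, $15/M output -> 6.0
```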

v0.5.2 Beta

20 Jun 13:39
36f06a7

LLMScoreEngine v0.5.2 - Bug Fix Release

This quick update addresses a critical bug affecting the main scoring function.

Bug Fixes

  • Refactored the calculate_final_score method to compute a unified overall_benchmark_score from the entity and dev benchmark results, and to pass that score explicitly into calculate_technical_score for the size-performance ratio calculation. This resolves an implicit dependency in which calculate_technical_score relied on a benchmark_score that was not guaranteed to be present in its inputs. The new implementation makes the scoring logic more robust and self-contained, and ensures the technical evaluation is always based on the model's calculated benchmark performance (see the sketch after this list).

  • Resolved a number of linter errors to improve code quality.
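
In outline, the fix looks something like this minimal sketch; the class and method names match those described above, but the bodies and signatures are simplified assumptions:

```python
# Simplified sketch of the refactor described above; method bodies and
# signatures are illustrative assumptions, not the engine's actual API.
class ModelScorer:
    def _entity_score(self, results: dict) -> float:
        return sum(results.values()) / max(len(results), 1)

    def _dev_score(self, results: dict) -> float:
        return sum(results.values()) / max(len(results), 1)

    def calculate_technical_score(self, benchmark_score: float) -> float:
        # The size-performance ratio now always receives an explicit,
        # guaranteed-present benchmark score.
        return benchmark_score * 0.5  # placeholder ratio

    def calculate_final_score(self, entity: dict, dev: dict) -> float:
        # Unified benchmark score computed first, then passed explicitly
        # rather than assumed to exist in the technical-score inputs.
        overall_benchmark_score = self._entity_score(entity) + self._dev_score(dev)
        technical = self.calculate_technical_score(
            benchmark_score=overall_benchmark_score
        )
        return overall_benchmark_score + technical

print(ModelScorer().calculate_final_score({"a": 20.0}, {"b": 25.0}))  # 67.5
```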

v0.5.1 Beta

15 Jun 14:28
a63271c

LLMScoreEngine v0.5.1 - Bug Fix Release

This quick update addresses a critical bug affecting the main scoring function.

Bug Fixes

  • Fixed processing of models with '.' in their names: corrected the main scoring function to properly handle models with periods ('.') in their names, such as "Claude 3.7 Sonnet".
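
The pitfall is generic to filename handling: splitting on the first '.' truncates such names. A hypothetical illustration of the bug class, not the engine's actual code:

```python
# Hypothetical illustration of the bug class, not the engine's actual code.
from pathlib import Path

name = "Claude 3.7 Sonnet.json"
print(name.split(".")[0])  # 'Claude 3' -- wrong: split at the first period
print(Path(name).stem)     # 'Claude 3.7 Sonnet' -- strip only the extension
```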

v0.5.0 Beta

14 Jun 16:44
b161102

LLMScoreEngine v0.5.0 - New Features, Major Refactoring and Testing Overhaul

This update represents a major refactoring of the LLMScoreEngine, moving from a system with hardcoded values to a centralized, configuration-driven architecture. This is complemented by a complete, from-scratch pytest testing framework to ensure the reliability and correctness of the application.

New Features

  • Batch Scoring (--all): Score all models in the Models/ directory with a single command.
  • Dynamic Configuration (--config): Provide an external Python configuration file to experiment with scoring parameters without modifying the core code (a sketch follows this list).
  • CSV Report Generation (--csv): Generate a consolidated CSV report of all model scores, saved in the Results/ directory.
  • Quiet Output Mode (--quiet): Suppress all informational output to show only the final model scores, ideal for automated scripts.
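
An external configuration file passed via --config might look like the sketch below; the field names, and the split of the points beyond the documented 30 + 30 for benchmarks, are illustrative assumptions, not the engine's actual keys:

```python
# my_scoring_config.py -- hypothetical external configuration for --config.
# Field names are illustrative assumptions, not the engine's actual keys.
# The 30-point entity and dev budgets match these release notes; the split
# of the remaining points is assumed.
ENTITY_BENCHMARK_POINTS = 30
DEV_BENCHMARK_POINTS = 30
COMMUNITY_POINTS = 20
TECHNICAL_POINTS = 20
```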

Refactoring & Bug Fixes

  • Architecture:

    • Refactored the application to use a centralized, immutable configuration file (config/scoring_config.py) instead of hardcoded values (see the sketch after this list).
    • Improved import paths and module structure to prevent ImportError issues and clarify the package API.
  • Scoring Logic:

    • The ModelScorer class was rewritten to source all parameters from the central configuration.
    • Corrected scoring formulas in hf_score.py to align with the project documentation.
  • Bug Fixes:

    • Fixed a critical bug that caused benchmark scores to be ignored during calculations.
    • Resolved a case-insensitivity bug in data/loaders.py that prevented model files from being found.
  • Testing:

    • Built a comprehensive pytest testing framework from scratch, with unit and end-to-end tests covering all critical modules.
    • Used mocking for external APIs to ensure tests are fast and deterministic.
    • Improved test isolation and added verbose logging for clearer debugging.
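
A centralized, immutable configuration of this kind is commonly expressed as a frozen dataclass; a minimal sketch, assuming nothing about the actual contents of config/scoring_config.py:

```python
# Minimal sketch of a centralized, immutable configuration module, in the
# spirit of config/scoring_config.py; the field names are assumptions.
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class ScoringConfig:
    entity_benchmark_points: int = 30
    dev_benchmark_points: int = 30

CONFIG = ScoringConfig()
try:
    CONFIG.entity_benchmark_points = 40  # mutation is rejected at runtime
except FrozenInstanceError:
    print("config is immutable")
```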

v0.4.1 Beta

23 May 20:19
6a7d6a4

Small beta release.

  • General code cleanup, removing unnecessary dev comments to improve readability
  • Updated hf_score.py to allow CLI usage

v0.4.0 Beta

17 May 12:37
fee73ea

This beta release introduces significant enhancements to the LLMScoreEngine, focusing on a more nuanced approach to benchmark scoring, community evaluation, and technical assessments.

🎯 Rethinking Entity Benchmarks (now 30 points total)

The Entity Benchmarks category has been updated to provide a more comprehensive evaluation:

  • Expanded Coverage: Scores are now incorporated from 10 distinct entities, categorized into three key specialties (an aggregation sketch follows this list).

  • Specialized Breakdown:

    • Generalist (each contributes 10% to the Entity Benchmark score):

      • Artificial Analysis Intelligence Score
      • OpenCompass LLM Average Score
      • LLM Explorer Score
      • Livebench Average Score
      • Open LLM Leaderboard Average Score
      • UGI Leaderboard UGI Score
      • Dubersor LLM Leaderboard Total Score
    • Coding (each contributes 10% to the Entity Benchmark score):

      • BigCodeBench Leaderboard Score
      • EvalPlus Leaderboard Pass@1 Score
    • Vision (contributes 10% to the Entity Benchmark score):

      • Open VLM Leaderboard Average Score
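
Numerically, ten entity scores at 10% each reduce to a simple mean, scaled to the category's 30 points. A minimal sketch, assuming each entity score is already normalized to a 0-100 scale:

```python
# Sketch of the 10-entities-at-10%-each aggregation, assuming each entity
# score is already normalized to 0-100; the values are hypothetical.
entity_scores = {
    "Artificial Analysis": 78.0,
    "OpenCompass": 71.5,
    "BigCodeBench": 64.0,
    # ... the remaining seven entities, each also weighted at 10%
}
mean_score = sum(entity_scores.values()) / len(entity_scores)
points = mean_score / 100 * 30  # scale to the 30-point category total
print(round(points, 2))
```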

🛠️ Structured Dev Benchmarks (30 points total)

This release formalizes and details the structure for Dev Benchmarks, ensuring a wide-ranging assessment of model capabilities. The system is designed to cover multiple facets of modern SOTA model performance:

  • Comprehensive Categories: Models are evaluated across eight key areas:
    • General Knowledge & Reasoning
    • Instruction Following
    • Math
    • Coding
    • Multilingual Capabilities
    • Context Handling
    • Function Calling (Tool Use & Agents)
    • Vision (Multimodal Capabilities)
  • Emphasis on Data Integrity: Higher weight is assigned to benchmarks that guard against potential data contamination.
  • Weighted Evaluation: Each specific benchmark within these categories has an assigned weight, reflecting its relative importance in the overall Dev Benchmark score.
  • Robust Scoring System: The scoring methodology includes normalization of benchmark scores and a system for proportionally redistributing weights when data for specific benchmarks is unavailable, ensuring fair and consistent comparisons (see the sketch after this list).
  • In-depth Methodology: For a detailed breakdown of individual benchmarks, their weights, contamination status, and the precise scoring formulas, please refer to the Scoring Framework Description.
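
The redistribution step can be pictured as rescaling the weights of the available benchmarks so they still sum to 1; a minimal sketch with hypothetical weights and scores:

```python
# Sketch of proportional weight redistribution when some benchmark data is
# missing; the weights and scores below are hypothetical.
def weighted_score(scores: dict, weights: dict) -> float:
    available = {k: w for k, w in weights.items() if scores.get(k) is not None}
    total = sum(available.values())
    # Rescale the remaining weights so they sum to 1, preserving their ratios.
    return sum(scores[k] * (w / total) for k, w in available.items())

weights = {"math": 0.4, "coding": 0.4, "vision": 0.2}
scores = {"math": 80.0, "coding": 70.0, "vision": None}  # vision unavailable
print(weighted_score(scores, weights))  # math/coding reweighted to 0.5 each -> 75.0
```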

👥 Revamped Community Score

The Community Score now offers a more holistic view of a model's standing:

  • Hugging Face Score Integration: Alongside the LMSys Arena Elo, a Hugging Face score has been added. This new metric is calculated based on factors like downloads, likes, and the model's age (a sketch follows this list).
  • Further Details: For a deeper understanding of this calculation, please refer to the Scoring Framework Description.
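
The exact formula lives in the Scoring Framework Description; the sketch below only illustrates the general shape of such a popularity metric, with assumed weights and transforms:

```python
# Hypothetical shape of a popularity score built from downloads, likes, and
# model age; the weights and transforms are assumptions, not the documented
# formula (see the Scoring Framework Description for that).
import math

def hf_score(downloads: int, likes: int, age_days: int) -> float:
    momentum = math.log10(downloads + 1) + 0.5 * math.log10(likes + 1)
    return momentum / math.log10(age_days + 10)  # gently damp older models

print(round(hf_score(downloads=2_000_000, likes=3_500, age_days=120), 3))
```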

⚙️ Enhanced Technical Score

The Technical Score has been refined for greater accuracy and fairness:

  • Smoother Points Distribution:
    • Price: Points are now distributed linearly.
    • Context Window: Points follow a logarithmic distribution.
    • These changes address previous issues with tiered or stepped scoring (see the sketch after this list).
  • Improved Model Size to Performance Ratio: This metric now utilizes parameter count and architecture type to more effectively assess a model's performance relative to its size and architectural efficiency.
  • Further Details: Learn more about these improvements in the Scoring Framework Description.
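
The two curves can be sketched as follows; the price ceiling, context cap, and point budgets are illustrative assumptions, not the engine's documented values:

```python
# Illustrative sketch of a linear (price) and a logarithmic (context window)
# point curve; the bounds and point budgets are assumptions.
import math

def price_points(price: float, max_price: float = 20.0,
                 budget: float = 10.0) -> float:
    # Linear: cheaper models earn proportionally more points.
    return budget * max(0.0, 1.0 - price / max_price)

def context_points(ctx: int, max_ctx: int = 1_000_000,
                   budget: float = 10.0) -> float:
    # Logarithmic: diminishing returns as the window grows.
    return budget * min(1.0, math.log10(ctx) / math.log10(max_ctx))

print(price_points(5.0), round(context_points(128_000), 2))  # 7.5 8.51
```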

v0.3.1 Beta

05 May 17:15
aa893a7

Initial Release Following Migration

This marks the official launch of the AI Scoring Framework, now called LLMScoreEngine, in its new dedicated repository. Key updates include:

  • Migrated the scoring framework from AIEnhancedWork to this repository for centralized management.
  • Reorganized the project structure to enhance maintainability, scalability, and clarity.
  • Upgraded the README file with detailed documentation, usage examples, and streamlined instructions for easier onboarding.
  • Added CLI support for command-line interactions, enabling faster and more flexible workflows.

This release lays the foundation for future features, improvements, and contributions. 🚀