Releases · PabloCabaleiro/pondera
v0.6.2
v0.6.1
Fixed
- Renamed `toolsets` parameter to `tools` in the Judge constructor and internal implementation for consistency with the PydanticAI API
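A minimal sketch of the rename, assuming `Judge` is importable from the package root and that PydanticAI tools are plain callables (see the README for the real constructor signature):

```python
from pondera import Judge  # assumed import path

def word_count(text: str) -> int:
    """A trivial tool the judge model could call during evaluation."""
    return len(text.split())

# v0.6.0 named this parameter `toolsets`; since v0.6.1 it is `tools`,
# matching the PydanticAI API.
judge = Judge(tools=[word_count])
```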
v0.6.0
Added
- Judge now accepts a `toolsets` parameter in its constructor to provide PydanticAI tools/toolsets for use during evaluation
v0.5.0
Changed
- Capturing `RunnerError` and `TimeoutError` when executing evaluation cases.
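A minimal sketch of the capture behavior, assuming the error classes (introduced in v0.4.0 below) live in a hypothetical `pondera.errors` module:

```python
from pondera.errors import RunnerError, TimeoutError  # hypothetical module path

async def run_cases(cases, execute_case):
    """Run each case, recording failures instead of aborting the batch."""
    results = []
    for case in cases:
        try:
            results.append(await execute_case(case))
        except (RunnerError, TimeoutError) as exc:
            # A failing or timed-out case is recorded so the
            # remaining cases still execute.
            results.append({"case": case, "error": str(exc)})
    return results
```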
v0.4.1
Fixed
- Internal: route all evaluations through the unified `multi_evaluate` function (the single-case path now uses the same aggregation pipeline).
- Prompting: moved part of the prior user prompt into the system prompt and refined the system prompt wording for clarity and consistency.
v0.4.0
Added
- Persist the judge prompt as `judge_prompt.txt`, include a `judge_prompt` field in `Judgment`, plus a `has_judge_prompt` flag in `meta.json`.
- Enforce per-case `timeout_s` via `asyncio.wait_for` around runner and judge execution (raises `asyncio.TimeoutError`); see the sketch after this list.
- Tests for runner and judge timeout behavior.
- Validation: per-criterion threshold keys now validated via Pydantic (`CaseJudge` field validator + `EvaluationResult` model validator), eliminating the silent fallback to 0 scores.
- Structured error classes introduced: `RunnerError`, `JudgeError`, `TimeoutError` (subclass of `asyncio.TimeoutError`), and `ValidationError`, with wrapping of raw exceptions in runner/judge execution and the YAML load path.
- Basic logging: added standard-library logging calls (logger name `pondera`) in the core API execution path, plus a simple availability test.
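A minimal sketch of the timeout enforcement, using only the `timeout_s` name, the `pondera` logger name, and `asyncio.wait_for` from the notes above; everything else is illustrative:

```python
import asyncio
import logging

logger = logging.getLogger("pondera")  # logger name stated in the notes

async def run_with_timeout(coro, timeout_s: float):
    """Enforce a per-case time budget around a runner or judge coroutine."""
    try:
        # wait_for cancels the wrapped coroutine and raises
        # asyncio.TimeoutError once timeout_s elapses.
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        logger.error("case exceeded timeout_s=%s", timeout_s)
        raise
```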
Changed
- API now always returns `MultiEvaluationResult` (a single run is wrapped with one `EvaluationResult`) for a stable schema.
- Unified pass/fail logic: removed duplicated threshold code by reusing `compute_pass` for multi-evaluation aggregation.
- Removed the ad-hoc runtime threshold-key validation function in favor of model-level validators.
- Timeout raising now uses the project `TimeoutError` (still an `asyncio.TimeoutError` subclass) for consistent catching.
- Fail-fast on missing criterion scores when per-criterion thresholds are provided (no silent 0 default); `compute_pass` now raises `ValidationError` (see the sketch after this list).
- BREAKING: normalized naming: removed `Judgment.pass_fail` and its dual serialization; a single boolean field `evaluation_passed` is used everywhere (tests and artifacts updated, no backward alias).
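A minimal sketch of the fail-fast behavior; `compute_pass`'s real signature is not shown in these notes, and `ValidationError` here is a local stand-in for pondera's structured error class:

```python
class ValidationError(Exception):
    """Stand-in for pondera's structured ValidationError."""

def compute_pass(scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    """Pass only if every thresholded criterion has a score meeting its minimum."""
    missing = set(thresholds) - set(scores)
    if missing:
        # Previously a missing criterion silently scored 0; now it fails fast.
        raise ValidationError(f"missing criterion scores: {sorted(missing)}")
    return all(scores[name] >= minimum for name, minimum in thresholds.items())
```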
Fixed
- Updated tests to align with unified return type.
- Consistent pass/fail evaluation across single and multi-run cases (previous divergence removed).
- Test expectations updated to reflect new fail-fast validation for missing per-criterion threshold keys.
v0.3.0
Added
- Protocol for `Judge` (allows custom override implementations)
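A minimal sketch of what such a protocol enables; the actual method names and signatures are assumptions:

```python
from typing import Protocol

class Judge(Protocol):
    """Sketch of the protocol; the real interface is not shown in these notes."""
    async def judge(self, question: str, answer: str) -> dict: ...

class KeywordJudge:
    """A custom implementation satisfying the protocol structurally."""
    async def judge(self, question: str, answer: str) -> dict:
        return {"score": 1.0 if "because" in answer else 0.0}
```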
Changed
- Improved results message
- Updated README
- Removed duplicate settings by consolidating `judge_model` and `default_model`
Removed
- CLI (temporarily removed; may return later)
Fixed
- Artifacts report in the API (fixed a typo in "artifacts")
v0.2.0
- Model agnostic: improved the agents module to support different models using PydanticAI.
- Judging: multi-judge aggregation (mean/median/majority) and caching by case hash (see the sketch below).
- Artifacts: propagate runner artifacts to the Judge.
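A minimal sketch of mean/median/majority aggregation; pondera's actual function names and score schema are not shown in these notes:

```python
import statistics

def aggregate(scores: list[float], mode: str = "mean") -> float:
    """Illustrative aggregation over per-judge scores."""
    if mode == "mean":
        return statistics.mean(scores)
    if mode == "median":
        return statistics.median(scores)
    if mode == "majority":
        # Majority vote over pass/fail at an assumed 0.5 cutoff.
        passing = sum(score >= 0.5 for score in scores)
        return 1.0 if passing > len(scores) / 2 else 0.0
    raise ValueError(f"unknown aggregation mode: {mode}")
```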
v0.1
- Schemas: `CaseSpec`, `RubricCriterion`/`Rubric`, `RunResult`, `Judgment`.
- Judge: `PydanticAIJudge` (typed JSON result, model-agnostic).
- API: `evaluate_case(...)` (sync wrapper calling the async core).
- CLI: `pondera run <cases_dir> --runner ... --artifacts ...`.
- Pytest helpers: `load_cases()`, `run_case()`; sample test file using parametrize (see the sketch after this list).
- Artifacts: `answer.md`, `judgment.json`, `summary.md`, `meta.json`.
- Docs: README, YAML schema reference, quickstart examples.
- Tests: basic tests added.
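A minimal sketch of the pytest integration; the import path and helper signatures are assumptions based only on the names above:

```python
import pytest
from pondera import load_cases, run_case  # hypothetical import path

@pytest.mark.parametrize("case", load_cases("cases/"))
def test_case(case):
    judgment = run_case(case)
    # evaluation_passed is the field name as of v0.4.0; v0.1's Judgment
    # exposed pass/fail under the since-removed pass_fail name.
    assert judgment.evaluation_passed
```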