Releases · PabloCabaleiro/pondera
v0.6.2
v0.6.1
Fixed
- Renamed `toolsets` parameter to `tools` in the Judge constructor and internal implementation for consistency with the PydanticAI API
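A minimal sketch of the rename, assuming `Judge` is importable from the package root and that PydanticAI tools are plain callables (see the README for the real constructor signature):

```python
from pondera import Judge  # assumed import path

def word_count(text: str) -> int:
    """A trivial tool the judge model could call during evaluation."""
    return len(text.split())

# v0.6.0 named this parameter `toolsets`; since v0.6.1 it is `tools`,
# matching the PydanticAI API.
judge = Judge(tools=[word_count])
```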
v0.6.0
Added
- Judge now accepts a `toolsets` parameter in its constructor to provide PydanticAI tools/toolsets for use during evaluation
v0.5.0
Changed
- Capturing `RunnerError` and `TimeoutError` when executing evaluation cases.
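A minimal sketch of the capture behavior, assuming the error classes (introduced in v0.4.0 below) live in a hypothetical `pondera.errors` module:

```python
from pondera.errors import RunnerError, TimeoutError  # hypothetical module path

async def run_cases(cases, execute_case):
    """Run each case, recording failures instead of aborting the batch."""
    results = []
    for case in cases:
        try:
            results.append(await execute_case(case))
        except (RunnerError, TimeoutError) as exc:
            # A failing or timed-out case is recorded so the
            # remaining cases still execute.
            results.append({"case": case, "error": str(exc)})
    return results
```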
v0.4.1
Fixed
- Internal: route all evaluations through the unified `multi_evaluate` function (the single-case path now uses the same aggregation pipeline).
- Prompting: moved part of the prior user prompt into the system prompt and refined the system prompt wording for clarity and consistency.
v0.4.0
Added
- Persist the judge prompt as `judge_prompt.txt`, include a `judge_prompt` field in `Judgment`, plus a `has_judge_prompt` flag in `meta.json`.
- Enforce per-case `timeout_s` via `asyncio.wait_for` around runner and judge execution (raises `asyncio.TimeoutError`); see the sketch after this list.
- Tests for runner and judge timeout behavior.
- Validation: per-criterion threshold keys now validated via Pydantic (`CaseJudge` field validator + `EvaluationResult` model validator), eliminating the silent fallback to 0 scores.
- Structured error classes introduced: `RunnerError`, `JudgeError`, `TimeoutError` (subclass of `asyncio.TimeoutError`), and `ValidationError`, with wrapping of raw exceptions in runner/judge execution and the YAML load path.
- Basic logging: added standard-library logging calls (logger name `pondera`) in the core API execution path, plus a simple availability test.
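A minimal sketch of the timeout enforcement, using only the `timeout_s` name, the `pondera` logger name, and `asyncio.wait_for` from the notes above; everything else is illustrative:

```python
import asyncio
import logging

logger = logging.getLogger("pondera")  # logger name stated in the notes

async def run_with_timeout(coro, timeout_s: float):
    """Enforce a per-case time budget around a runner or judge coroutine."""
    try:
        # wait_for cancels the wrapped coroutine and raises
        # asyncio.TimeoutError once timeout_s elapses.
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        logger.error("case exceeded timeout_s=%s", timeout_s)
        raise
```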
Changed
- API now always returns `MultiEvaluationResult` (a single run is wrapped with one `EvaluationResult`) for a stable schema.
- Unified pass/fail logic: removed duplicated threshold code by reusing `compute_pass` for multi-evaluation aggregation.
- Removed the ad-hoc runtime threshold-key validation function in favor of model-level validators.
- Timeout raising now uses the project `TimeoutError` (still an `asyncio.TimeoutError` subclass) for consistent catching.
- Fail-fast on missing criterion scores when per-criterion thresholds are provided (no silent 0 default); `compute_pass` now raises `ValidationError` (see the sketch after this list).
- BREAKING: normalized naming: removed `Judgment.pass_fail` and its dual serialization; a single boolean field `evaluation_passed` is used everywhere (tests and artifacts updated, no backward alias).
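A minimal sketch of the fail-fast behavior; `compute_pass`'s real signature is not shown in these notes, and `ValidationError` here is a local stand-in for pondera's structured error class:

```python
class ValidationError(Exception):
    """Stand-in for pondera's structured ValidationError."""

def compute_pass(scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    """Pass only if every thresholded criterion has a score meeting its minimum."""
    missing = set(thresholds) - set(scores)
    if missing:
        # Previously a missing criterion silently scored 0; now it fails fast.
        raise ValidationError(f"missing criterion scores: {sorted(missing)}")
    return all(scores[name] >= minimum for name, minimum in thresholds.items())
```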
Fixed
- Updated tests to align with unified return type.
- Consistent pass/fail evaluation across single and multi-run cases (previous divergence removed).
- Test expectations updated to reflect new fail-fast validation for missing per-criterion threshold keys.
v0.3.0
Added
- Protocol for `Judge` (allows custom override implementations)
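A minimal sketch of what such a protocol enables; the actual method names and signatures are assumptions:

```python
from typing import Protocol

class Judge(Protocol):
    """Sketch of the protocol; the real interface is not shown in these notes."""
    async def judge(self, question: str, answer: str) -> dict: ...

class KeywordJudge:
    """A custom implementation satisfying the protocol structurally."""
    async def judge(self, question: str, answer: str) -> dict:
        return {"score": 1.0 if "because" in answer else 0.0}
```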
Changed
- Improved results message
- Updated README
- Removed duplicate settings by consolidating `judge_model` and `default_model`
Removed
- CLI (temporarily removed; may return later)
Fixed
- Artifacts report in the API (fixed a typo in "artifacts")
v0.2.0
- Model agnostic: improved the agents module to support different models using PydanticAI.
- Judging: multi-judge aggregation (mean/median/majority) and caching by case hash (see the sketch below).
- Artifacts: propagate runner artifacts to the Judge.
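A minimal sketch of mean/median/majority aggregation; pondera's actual function names and score schema are not shown in these notes:

```python
import statistics

def aggregate(scores: list[float], mode: str = "mean") -> float:
    """Illustrative aggregation over per-judge scores."""
    if mode == "mean":
        return statistics.mean(scores)
    if mode == "median":
        return statistics.median(scores)
    if mode == "majority":
        # Majority vote over pass/fail at an assumed 0.5 cutoff.
        passing = sum(score >= 0.5 for score in scores)
        return 1.0 if passing > len(scores) / 2 else 0.0
    raise ValueError(f"unknown aggregation mode: {mode}")
```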
v0.1
- Schemas: `CaseSpec`, `RubricCriterion`/`Rubric`, `RunResult`, `Judgment`.
- Judge: `PydanticAIJudge` (typed JSON result, model-agnostic).
- API: `evaluate_case(...)` (sync wrapper calling the async core).
- CLI: `pondera run <cases_dir> --runner ... --artifacts ...`.
- Pytest helpers: `load_cases()`, `run_case()`; sample test file using parametrize (see the sketch after this list).
- Artifacts: `answer.md`, `judgment.json`, `summary.md`, `meta.json`.
- Docs: README, YAML schema reference, quickstart examples.
- Tests: basic tests added.
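A minimal sketch of the pytest integration; the import path and helper signatures are assumptions based only on the names above:

```python
import pytest
from pondera import load_cases, run_case  # hypothetical import path

@pytest.mark.parametrize("case", load_cases("cases/"))
def test_case(case):
    judgment = run_case(case)
    # evaluation_passed is the field name as of v0.4.0; v0.1's Judgment
    # exposed pass/fail under the since-removed pass_fail name.
    assert judgment.evaluation_passed
```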