LLM and Agent Evaluations #3
madhurprash started this conversation in Ideas
Foundation models are inherently stochastic: even when given identical inputs and inference parameters, they can produce different outputs across runs. This non-deterministic behavior creates significant challenges when building sophisticated generative AI systems and AI agents that rely on these models as their core reasoning engines.
The challenge becomes particularly acute when building AI agents that need to execute multi-step workflows, make consistent decisions, or maintain coherent behavior across extended interactions. Unlike traditional software systems that produce predictable outputs, AI agents powered by foundation models may exhibit variable responses to the same scenarios, potentially leading to inconsistent task execution, unpredictable decision-making patterns, and difficulty in debugging or validating agent behavior.
This stochastic nature conflicts with the expectations of reliability and reproducibility that are essential for deploying AI agents in production environments, where consistent performance and predictable outcomes are critical for user trust and system dependability. As we build increasingly complex agentic systems that chain multiple model calls, perform reasoning over time, or interact with external tools and APIs, managing and mitigating this inherent randomness becomes a fundamental engineering challenge.
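As a concrete illustration, here is a minimal sketch of how this variability can be observed empirically. It assumes the OpenAI Python SDK with an API key in the environment; the model id and prompt are placeholders, and the same pattern applies to any hosted model API.

```python
# Minimal sketch: observe output variability for a fixed prompt.
# Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set;
# the model id and prompt below are illustrative placeholders.
from collections import Counter
from openai import OpenAI

client = OpenAI()
PROMPT = "List three risks of deploying LLM agents in production."

outputs = []
for _ in range(5):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model id
        temperature=0,          # even at temperature 0, outputs may differ
        messages=[{"role": "user", "content": PROMPT}],
    )
    outputs.append(resp.choices[0].message.content)

# Count distinct completions; more than one indicates non-determinism.
distinct = Counter(outputs)
print(f"{len(distinct)} distinct outputs across {len(outputs)} identical calls")
```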
This discussion covers a few components of model and agent evaluation, along with frameworks that can help with each. These are just a few examples, intended as a starting point for evaluating the various models and agents in a multi-agent architecture.
Model evaluations
Foundation model benchmarking tool (FMBench): https://github.com/aws-samples/foundation-model-benchmarking-tool
FMEval: https://github.com/aws/fmeval
OpenAI Evals: https://github.com/openai/evals
LangFuse for LLM evals: https://langfuse.com/docs/scores/overview
Arize AI evaluations: https://arize.com/docs/ax/concepts/how-does-evaluation-work
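Most of the model-eval frameworks above follow the same basic pattern: run a dataset of prompts through the model, score each output against a reference or rubric, and aggregate the scores. The sketch below shows that pattern in plain Python so the frameworks are easier to map onto; the `generate` callable is a hypothetical stand-in for whichever model client you use, not an API from any of the listed tools.

```python
# Framework-agnostic sketch of a model evaluation loop:
# prompt -> model output -> per-item score -> aggregate metric.
# `generate` is a hypothetical stand-in for your model client.
from typing import Callable

def exact_match(output: str, reference: str) -> float:
    """Score 1.0 if the normalized output matches the reference, else 0.0."""
    return float(output.strip().lower() == reference.strip().lower())

def run_eval(dataset: list[dict], generate: Callable[[str], str]) -> float:
    """Run every prompt through the model and return the mean score."""
    scores = [exact_match(generate(item["prompt"]), item["reference"])
              for item in dataset]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Tiny illustrative dataset; real suites hold hundreds of cases.
    dataset = [
        {"prompt": "What is the capital of France?", "reference": "Paris"},
        {"prompt": "2 + 2 = ?", "reference": "4"},
    ]
    # Replace this stub with a real model call (e.g., via one of the SDKs above).
    stub_model = lambda prompt: "Paris" if "France" in prompt else "4"
    print(f"accuracy = {run_eval(dataset, stub_model):.2f}")
```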
Agent evaluations
HAL - Holistic Agent Leaderboard: https://hal.cs.princeton.edu/
Agent evals: https://github.com/awslabs/agent-evaluation
Galileo: https://docs.galileo.ai/galileo/gen-ai-studio-products/galileo-evaluate/how-to/evaluate-and-optimize-agents--chains-or-multi-step-workflows
RAGAS: https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/agents/
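Agent evaluation typically looks beyond the final answer at the trajectory: which tools were called, with which arguments, and in what order. The sketch below is a simplified stand-in for what the frameworks above automate; it compares a recorded tool-call trace against an expected trace. The trace format and helper names are assumptions for illustration, not the API of any listed framework.

```python
# Simplified sketch of trajectory-based agent evaluation:
# compare the tools an agent actually called against an expected sequence.
# The trace format and field names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

def tool_call_accuracy(actual: list[ToolCall], expected: list[ToolCall]) -> float:
    """Fraction of expected tool calls matched in order by name and arguments."""
    if not expected:
        return 1.0
    matches = sum(
        1 for a, e in zip(actual, expected)
        if a.name == e.name and a.args == e.args
    )
    return matches / len(expected)

if __name__ == "__main__":
    expected = [
        ToolCall("search_flights", {"from": "SEA", "to": "JFK"}),
        ToolCall("book_flight", {"flight_id": "UA123"}),
    ]
    # A recorded trace from one agent run (e.g., parsed from framework logs).
    actual = [
        ToolCall("search_flights", {"from": "SEA", "to": "JFK"}),
        ToolCall("book_flight", {"flight_id": "DL456"}),
    ]
    print(f"tool-call accuracy = {tool_call_accuracy(actual, expected):.2f}")
```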