Agent Development Working Group Plan | DRAFT #1
-
This is a starting point for demonstrating the construction of a highly autonomous agent system that possesses meta-tooling and meta-agentic capabilities. The system is designed to go beyond traditional static agent implementations by creating agents that can dynamically understand, adapt, and evolve their own capabilities over time. The core use case revolves around building agents that can not only perform tasks but also create and manage other agents, modify their own tool sets, and adapt to new challenges by generating custom tools at runtime.

The system addresses the fundamental limitation of conventional AI agents that rely on predefined, static tool sets. Instead, this implementation creates an agent ecosystem where individual agents can assess their current capabilities, identify gaps in their functionality, and dynamically create new tools or even spawn new specialized agents to handle specific tasks. This meta-level reasoning and self-modification capability represents a significant advancement toward truly autonomous AI systems that can operate effectively in unpredictable environments without human intervention for capability expansion.

The practical applications of this system extend across numerous domains where adaptive problem-solving is crucial. For instance, in financial analysis, the system can dynamically create specialized agents for different market analysis tasks, generate custom data processing tools for new financial instruments, or adapt to changing regulatory requirements by modifying its analytical capabilities. In software development environments, the system can create specialized debugging agents, generate custom testing tools, or spawn agents dedicated to specific programming languages or frameworks as needed.

Code implementation for reference: https://github.com/madhurprash/meta-tools-and-agents
-
@westonbrown Thanks for this fantastic start. I've had some time to digest this now and here are some of my questions and thoughts:
General Notes:
-
@westonbrown - Let's reconcile this discussion into a proposal as well - #2
-
Hi all,
The working group coalesced around the idea of thinking through the “anatomy of an agent”, considering the following ideas, in no particular order:
The goal of this mini-initiative would be to:
This will help us align as we continue forward with the meta-everything-agent. Thoughts? Cheers.
-
Agent Development Working Group Plan
Problem Statement: Organizations need AI agents to automate workflows, but face an impossible choice: expensive custom development for each use case or rigid pre-built solutions that don't adapt. Current agents remain static after deployment, unable to learn and adapt from experience or evolve with changing requirements.
Solution: This project develops a universal self-evolving agent architecture that adapts to any domain through configuration changes alone. Using the strands-agents SDK and cloud infrastructure, the same codebase powers everything from simple customer service to complex software development by adjusting only prompts and tools. The system intelligently scales from single agents for basic tasks to coordinated teams for complex projects, while continuously learning from every deployment. Validation across five industries will test whether specialized behaviors can emerge from configuration rather than custom code, delivering on the promise of one codebase, infinite business applications.
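To make the configuration-only idea concrete, here is a minimal Python sketch. The DomainConfig fields, domain names, and build_agent helper are illustrative assumptions, not the project's actual schema; the point is that only the injected data varies between domains while the code path stays identical.

```python
from dataclasses import dataclass, field

@dataclass
class DomainConfig:
    """Everything that varies per domain; the engine itself never changes."""
    system_prompt: str
    tools: list[str] = field(default_factory=list)
    max_team_size: int = 1

CUSTOMER_SERVICE = DomainConfig(
    system_prompt="You are a support agent. Resolve tickets using the knowledge base.",
    tools=["knowledge_base_retrieve", "ticket_system"],
    max_team_size=1,  # simple queries stay single-agent
)

SOFTWARE_DEVELOPMENT = DomainConfig(
    system_prompt="You are an engineering lead. Plan, implement, and test.",
    tools=["fs_read", "fs_write", "git"],
    max_team_size=6,  # complex builds may spawn a coordinated team
)

def build_agent(config: DomainConfig) -> dict:
    # Stand-in for instantiating a real agent; the same code path serves
    # every domain, with behavior determined entirely by the config.
    return {"prompt": config.system_prompt, "tools": config.tools}

print(build_agent(CUSTOMER_SERVICE))
print(build_agent(SOFTWARE_DEVELOPMENT))
```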
Goals & Success:
What We Want to Achieve
[Primary Goal] - Showcase Production-Proven Patterns: Demonstrate reliable and adaptable agentic systems, sharing lessons from real-world deployments across various industries.
[Secondary Goal] - Build Self-Evolving Systems: Create a framework where agents can autonomously improve, optimize workflows, evolve tools, and spawn specialized sub-agents based on environmental feedback.
[Tertiary Goal] - Drive Practical Innovation: Move beyond hype by delivering reproducible reference architectures and best practices that solve real business problems, and by answering core questions in agentic design.
How We'll Know We Succeeded:
By the end, we want to have:
[ ] Validated the core thesis that a single, general-purpose architecture can self-evolve for specialized domains.
[ ] A comprehensive white paper on agentic development best practices.
[ ] A library of production-ready reference architectures for common use cases.
[ ] A well-researched blog post detailing findings on core agentic questions.
[ ] At least 4 successful production deployments across different industry verticals.
[ ] Achieved target metrics on key benchmarks (e.g., 50% autonomous issue resolution on SWE-bench, 70% task completion on GAIA).
Core Open Questions
Technical Stack
Solution Architecture
Core Event Loop (Universal Pattern)
This core loop operates identically whether handling customer queries or building complex systems. The intelligence lies in the spawn decision - simple tasks proceed with a single agent, while complex tasks trigger team formation.
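As a rough illustration of this loop, the Python sketch below branches on a spawn decision. The complexity heuristic, role names, and helper functions are placeholders; a real implementation would delegate the complexity judgment to the model.

```python
def assess_complexity(task: str) -> str:
    # Stand-in for an LLM judgment; the real loop would ask the model.
    return "complex" if len(task.split()) > 12 else "simple"

def execute_single_agent(task: str) -> str:
    return f"single agent handled: {task}"

def spawn_team(task: str) -> list[str]:
    # Hypothetical roles; the real system derives these from the task spec.
    return ["planner", "builder", "reviewer"]

def coordinate(team: list[str], task: str) -> str:
    return f"{' -> '.join(team)} completed: {task}"

def event_loop(task: str) -> str:
    """The same loop runs for every domain; only the spawn decision branches."""
    if assess_complexity(task) == "simple":
        return execute_single_agent(task)
    return coordinate(spawn_team(task), task)

print(event_loop("answer this customer email"))
print(event_loop("design, implement, test, and deploy a multi-service agent platform with CI"))
```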
Team Formation Hierarchy (When Needed)
Complex tasks trigger hierarchical team formation. Each team operates with domain-appropriate patterns (swarm for ideation, supervisor for development). Simple tasks skip this entirely.
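A minimal sketch of pattern selection during team formation, assuming the swarm-for-ideation and supervisor-for-development mapping described above; the phase names and role labels are illustrative.

```python
# Maps a work phase to the coordination pattern it uses, per the text above.
PATTERN_BY_PHASE = {
    "ideation": "swarm",         # peer agents explore in parallel
    "development": "supervisor", # one lead delegates to specialists
}

def form_team(phase: str, roles: list[str]) -> dict:
    """Simple tasks never reach this function; they run single-agent."""
    pattern = PATTERN_BY_PHASE.get(phase, "single")
    return {"phase": phase, "pattern": pattern, "roles": roles}

print(form_team("ideation", ["researcher_a", "researcher_b", "researcher_c"]))
print(form_team("development", ["sdm", "frontend_swe", "backend_swe"]))
```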
Example: Agent Development Workflow
This workflow demonstrates maximum complexity. For simpler tasks like "answer this customer email," the system would skip team spawning entirely and execute directly.
Core Agentic Capabilities
Core capabilities such as planning, tool use, memory, reasoning, prompt optimization, learning, quantization, observability, and security are built into the platform, providing the foundation for robust autonomous operation. The orchestration logic is implemented with LangGraph, a graph-based workflow engine for LLM agents, which supports multi-agent control flows and state management. The design also anticipates integration of AWS Strands multi-agent features, including open standards like Agent-to-Agent (A2A) communication and the Model Context Protocol (MCP), to ensure interoperability and enterprise-grade reliability.
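Since the text names LangGraph as the orchestration engine, here is a minimal sketch of how the routing described above could be wired with LangGraph's StateGraph. The node logic is placeholder; only the graph-construction calls reflect the library's public API, and those should still be checked against the version in use.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    task: str
    route: str
    result: str

def plan(state: AgentState) -> AgentState:
    # Stand-in for an LLM complexity assessment.
    state["route"] = "team" if len(state["task"]) > 60 else "single"
    return state

def single_agent(state: AgentState) -> AgentState:
    state["result"] = f"single agent handled: {state['task']}"
    return state

def spawn_team(state: AgentState) -> AgentState:
    state["result"] = f"team handled: {state['task']}"
    return state

graph = StateGraph(AgentState)
graph.add_node("plan", plan)
graph.add_node("single", single_agent)
graph.add_node("team", spawn_team)
graph.set_entry_point("plan")
graph.add_conditional_edges("plan", lambda s: s["route"],
                            {"single": "single", "team": "team"})
graph.add_edge("single", END)
graph.add_edge("team", END)

app = graph.compile()
print(app.invoke({"task": "answer this customer email", "route": "", "result": ""}))
```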
The overall architecture nests multiple multi-agent patterns. A higher-level supervisor agent controls the system, while the teams beneath it may be driven by other patterns, such as an agent swarm or agents-as-tools, replicating the inner workings of a software development team.
This multi-agent architecture would include all the necessary primitives: tools, MCP servers, long-term and short-term memory for the various sub-agents and teams, and guardrails. Most importantly, a human-in-the-loop workflow operates at each step: a human validates the use-case specification produced by the agents, revises it, hands it to the next team of agents, reviews the code and unit tests, and finally oversees the agent in production, maintaining oversight of documentation, specification, and code generation throughout the agent development cycle.
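As one way to picture these staged human-in-the-loop gates, the sketch below blocks at each checkpoint until a reviewer approves. The stage names follow the text; the console-input approver is a stand-in for whatever review interface the system actually uses.

```python
# Checkpoints where a human must approve before the pipeline advances.
STAGES = ["product_spec", "code_review", "deployment"]

def console_approver(stage: str, artifact: str) -> bool:
    """Stand-in reviewer: a real system would use a proper review UI."""
    return input(f"[{stage}] approve? {artifact!r} (y/n): ").strip().lower() == "y"

def run_pipeline(spec: str, approve=console_approver):
    artifact = spec
    for stage in STAGES:
        if not approve(stage, artifact):
            # Rejection loops the artifact back to the owning agent team.
            print(f"revision requested at {stage}")
            return None
        artifact = f"{artifact} -> passed {stage}"
    return artifact

# Non-interactive demo: auto-approve every stage.
print(run_pipeline("customer-support triage agent spec", approve=lambda s, a: True))
```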
Dynamic Tool & Resource Injection
Key Capabilities & Features
Core Architecture Patterns
Project Plan
Phase 1: Foundation Tasks (Weeks 1-4)
• Design meta-agent hierarchy - Implement top-level meta-agent with domain spawning logic for Product, Software, and Project meta-agents, each with specialized agent creation templates
• Build dynamic resource injection - Create injection framework for prompt templates, tool bindings (MCP endpoints, APIs), memory policies, guardrails, communication protocols, and human loop configurations
• Implement Mem0 memory architecture - Deploy Mem0 for hierarchical memory management with policies for short-term context, long-term knowledge, cross-agent sharing, and learning persistence
• Create base agent spawning system - Build foundational agent class using the strands-agents SDK with dynamic instantiation, including role-specific prompt injection and tool binding at spawn time (see the sketch after this list)
• Set up human-in-loop infrastructure - Implement approval gates, review checkpoints, and feedback loops with configurable intervention points for Product Spec, Code Review, and Deployment stages
• Deploy sandboxed team environments - Configure Docker containers with network isolation for safe multi-agent execution, including resource limits and inter-agent communication channels
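A minimal sketch of the spawn-time injection referenced in the "base agent spawning" task above. The Agent import follows the strands-agents SDK's documented entry point; the role templates, tool handling, and spawn_agent helper are illustrative assumptions, not the project's actual design.

```python
from strands import Agent

# Hypothetical role-specific prompt templates injected at spawn time.
ROLE_TEMPLATES = {
    "swe": "You are a software engineer. Write and test code for: {task}",
    "tpm": "You are a technical program manager. Track risks for: {task}",
}

def spawn_agent(role: str, task: str, tools: list | None = None) -> Agent:
    """Instantiate an agent with its prompt and tools bound at spawn time."""
    return Agent(
        system_prompt=ROLE_TEMPLATES[role].format(task=task),
        tools=tools or [],
    )

swe = spawn_agent("swe", "implement the resource-injection framework")
result = swe("Start with a failing unit test for the injector.")
```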
Repository Structure
Key Primitives & Capabilities
Meta-agent/tooling - Spawn agents and tools at runtime
A top-level "meta-agent" layer automatically spawns and configures dedicated sub-teams of agents, each tailored to a different domain such as Software Development, Product Management, and Project Oversight. When a new project specification arrives, the Meta-Agent dynamically instantiates a Software Meta-Agent (which in turn spawns SWE, Agent-Integration, and SDM sub-agents), a Product Meta-Agent (responsible for user stories, acceptance criteria, and market fit), and a Project Meta-Agent (which configures TPM and timeline-management agents). Each Meta-Agent injects its own prompt templates, tool bindings (e.g., MCP endpoints, CI/CD APIs), memory policies, and human-in-loop checkpoints, ensuring that every logical domain is staffed with a fully configured, best-practice agent team before the workflow even begins.
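As a rough sketch of this layer, the code below has a top-level entry point that staffs one meta-agent per domain. The domain-to-team mapping follows the text; the class structure and staffing logic are illustrative assumptions.

```python
# Domain teams as described in the text above.
DOMAIN_TEAMS = {
    "software": ["swe", "agent_integration", "sdm"],
    "product": ["user_stories", "acceptance_criteria", "market_fit"],
    "project": ["tpm", "timeline_management"],
}

class MetaAgent:
    def __init__(self, domain: str):
        self.domain = domain
        self.sub_agents: list[str] = []

    def staff_team(self) -> None:
        # The real layer would also inject prompts, tool bindings,
        # memory policies, and human-in-loop checkpoints here.
        self.sub_agents = DOMAIN_TEAMS[self.domain]

def on_new_spec(spec: str) -> list[MetaAgent]:
    """Top-level meta-agent: returns one fully staffed meta-agent per domain."""
    metas = [MetaAgent(domain) for domain in DOMAIN_TEAMS]
    for meta in metas:
        meta.staff_team()
    return metas

teams = on_new_spec("build a customer-support triage agent")
print({m.domain: m.sub_agents for m in teams})
```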
Industry Use Cases for Validation
The platform's core capability is building other agent systems through coordinated multi-agent teams:
Cross-Industry Validation
• Software Development - Tools: fs_write, Git, fs_read; Team: SDM, Frontend/Backend/Agent SWEs
• Customer Support - Tools: knowledge base retrieve, ticket system; Team: Triage, specialist, escalation agents
• Financial Analysis - Tools: market data, Excel, reporting; Team: Research, analysis, report agents
• Healthcare - Tools: EHR, clinical guidelines, scheduling; Team: Triage, specialist consult, follow-up agents
This project will test a fundamental principle: self-evolving intelligence is domain-agnostic. What appears as a complex agent development system is actually a general-purpose architecture that adapts to any domain through configuration.
In short, the hypothesis is that the future isn't about building specialized agents for each use case; it's about deploying one intelligent system that adapts itself to become whatever each situation requires.
Agent Testing & Evaluation Methodology
Comprehensive evaluation is critical for ensuring agents perform reliably across diverse real-world scenarios. The shift toward modular, multi-step reasoning, emergent behavior, and dynamic orchestration highlights the importance of a methodical, framework-driven approach to testing. The methodology employs LLM-as-a-judge automated evaluation, which assesses performance against predefined criteria, alongside traditional metrics, to capture the full spectrum of agent behavior from individual tool calls to complete workflow execution.
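A minimal sketch of pairing an LLM-as-a-judge score with a traditional metric. The criteria, judge prompt, and call_model stub are assumptions; in practice the stub would be replaced by a real model client and the judgments validated against human ratings.

```python
import json

# Hypothetical evaluation criteria for the judge model.
CRITERIA = ["task_completion", "tool_use_correctness", "reasoning_quality"]

def call_model(prompt: str) -> str:
    # Placeholder: returns a canned judgment so the sketch runs offline.
    return json.dumps({c: 4 for c in CRITERIA})

def judge(task: str, transcript: str) -> dict[str, int]:
    """LLM-as-a-judge: score a transcript 1-5 on each criterion."""
    prompt = (
        f"Score this agent transcript 1-5 on {', '.join(CRITERIA)}.\n"
        f"Task: {task}\nTranscript: {transcript}\nReturn JSON."
    )
    return json.loads(call_model(prompt))

def exact_match(expected: str, actual: str) -> bool:
    """Traditional metric: did the final answer match exactly?"""
    return expected.strip() == actual.strip()

scores = judge("resolve the ticket", "agent retrieved KB doc, replied, closed ticket")
print(scores, exact_match("closed", "closed"))
```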
Evaluation Dimensions & Metrics
Testing Strategy
Team & Responsibilities
Core Team
How We Work
Meetings: Alternating Thursdays at 9:00 AM PST
Communication: Google Group (wg-development@agentic-community.com) and Google Meet.
Decisions: Consensus among working group leads, with input from members.
Code/Docs: All work will be managed in a central GitHub repository.
Success Tracking
Do we need help with anything from the wider community?
Simple Metrics We'll Track