If you have Ollama running, these examples should work out of the box on Unix systems (Linux, macOS).
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -U uv aiofiles autogen-agentchat "autogen-ext[magentic-one,ollama,docker,mcp,web-surfer]"
export OLLAMA_API_BASE=http://localhost:11434
ollama pull mistral:7b-instruct
# ollama pull mistral-small3.1  # I was able to web-surf with this model
```
The model must have tooling capability (and vision for web surfing).
Note that unless the Ollama model is listed here, I need to provide `model_info` to the `OllamaChatCompletionClient` constructor.
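A minimal sketch, assuming a recent `autogen-ext` with the `ollama` extra and the `ModelInfo` fields from `autogen_core.models`:

```python
from autogen_core.models import ModelInfo
from autogen_ext.models.ollama import OllamaChatCompletionClient

# Declare the model's capabilities yourself when it is not in the built-in list.
model_client = OllamaChatCompletionClient(
    model="mistral:7b-instruct",
    model_info=ModelInfo(
        vision=False,            # set True for multimodal models (needed for web surfing)
        function_calling=True,   # required for tool use
        json_output=True,
        structured_output=True,
        family="unknown",
    ),
)
```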
If you are in doubt which models support function calling and vision, you can check with these scripts, which scan all your local Ollama models:
```bash
./scripts/ollama-tooling-support.sh
./scripts/ollama-vision-support.sh /path/to/an/image.jpg
```
After grasping these examples, we should be able to build a small team of agents (architect, developer, and tester) that build a project together until it compiles and all tests pass.
The simplest form of using AutoGen is just sending a message to the LLM and getting a response back.
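For example, a minimal round trip through the raw model client (no agent involved); the model is the one pulled above and may need `model_info` as noted:

```python
import asyncio

from autogen_core.models import UserMessage
from autogen_ext.models.ollama import OllamaChatCompletionClient


async def main() -> None:
    model_client = OllamaChatCompletionClient(model="mistral:7b-instruct")
    result = await model_client.create(
        [UserMessage(content="What is the capital of France?", source="user")]
    )
    print(result.content)
    await model_client.close()


asyncio.run(main())
```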
`AssistantAgent` is a customizable wrapper around the LLM that can be given context, handle structured and streaming responses, call functions, interact with other agents in teams, and more.
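A small sketch of wrapping the same model client in an `AssistantAgent` and running a single task:

```python
import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.ollama import OllamaChatCompletionClient


async def main() -> None:
    model_client = OllamaChatCompletionClient(model="mistral:7b-instruct")
    agent = AssistantAgent(
        name="assistant",
        model_client=model_client,
        system_message="You are a helpful assistant.",
    )
    # run() returns a TaskResult whose last message is the agent's answer.
    result = await agent.run(task="Explain what a vector database is in two sentences.")
    print(result.messages[-1].content)


asyncio.run(main())
```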
This agent was designed to execute code either in a local (`LocalCommandLineCodeExecutor`) or an isolated Docker (`DockerCommandLineCodeExecutor`) environment.
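A sketch of the Docker variant, assuming Docker is running locally; the `coding` work directory is an arbitrary choice:

```python
import asyncio
from pathlib import Path

from autogen_agentchat.agents import CodeExecutorAgent
from autogen_ext.code_executors.docker import DockerCommandLineCodeExecutor


async def main() -> None:
    Path("coding").mkdir(exist_ok=True)
    executor = DockerCommandLineCodeExecutor(work_dir="coding")
    await executor.start()
    try:
        code_executor = CodeExecutorAgent("code_executor", code_executor=executor)
        # The agent extracts code blocks from the incoming message and executes them.
        result = await code_executor.run(task="```python\nprint('Hello from the sandbox')\n```")
        print(result.messages[-1].content)
    finally:
        await executor.stop()


asyncio.run(main())
```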
`MultimodalWebSurfer` is a multimodal agent that acts as a web surfer, able to search the web and visit web pages.
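A rough sketch; it assumes a vision- and tool-capable model (e.g. `mistral-small3.1`, as noted above) and that Playwright's browser dependencies are installed:

```python
import asyncio

from autogen_ext.agents.web_surfer import MultimodalWebSurfer
from autogen_ext.models.ollama import OllamaChatCompletionClient


async def main() -> None:
    model_client = OllamaChatCompletionClient(model="mistral-small3.1")
    web_surfer = MultimodalWebSurfer("web_surfer", model_client=model_client, headless=True)
    result = await web_surfer.run(task="Find the latest AutoGen release notes.")
    print(result.messages[-1].content)
    await web_surfer.close()  # shuts down the headless browser


asyncio.run(main())
```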
A custom-made agent that writes given code to a named file.
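A sketch of such a custom agent built on `BaseChatAgent`; the class name and file handling are illustrative, and the message API names are assumed from recent `autogen-agentchat` versions:

```python
from typing import Sequence

import aiofiles
from autogen_agentchat.agents import BaseChatAgent
from autogen_agentchat.base import Response
from autogen_agentchat.messages import BaseChatMessage, TextMessage
from autogen_core import CancellationToken


class FileWriterAgent(BaseChatAgent):
    """Writes the text of the last incoming message (e.g. generated code) to a named file."""

    def __init__(self, name: str, filename: str) -> None:
        super().__init__(name, description="Writes given code to a named file.")
        self._filename = filename

    @property
    def produced_message_types(self) -> Sequence[type[BaseChatMessage]]:
        return (TextMessage,)

    async def on_messages(
        self, messages: Sequence[BaseChatMessage], cancellation_token: CancellationToken
    ) -> Response:
        content = messages[-1].to_text() if messages else ""
        async with aiofiles.open(self._filename, "w") as f:
            await f.write(content)
        return Response(
            chat_message=TextMessage(
                content=f"Wrote {len(content)} characters to {self._filename}", source=self.name
            )
        )

    async def on_reset(self, cancellation_token: CancellationToken) -> None:
        pass
```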
Token streaming is useful for applications that require real-time interaction with the LLM, such as chatbots.
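A sketch of token streaming with `model_client_stream=True`, using the `Console` helper to print chunks as they arrive:

```python
import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.ui import Console
from autogen_ext.models.ollama import OllamaChatCompletionClient


async def main() -> None:
    model_client = OllamaChatCompletionClient(model="mistral:7b-instruct")
    agent = AssistantAgent(
        name="assistant",
        model_client=model_client,
        model_client_stream=True,  # emit streaming chunk events instead of one final message
    )
    await Console(agent.run_stream(task="Write a haiku about agents."))


asyncio.run(main())
```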
Structured responses allow you to define a schema for the responses, which can be useful for tasks like classification or structured data extraction.
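A sketch using a Pydantic schema; `output_content_type` is assumed from recent `autogen-agentchat` versions, and the ticket example is made up:

```python
import asyncio

from pydantic import BaseModel

from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.ollama import OllamaChatCompletionClient


class TicketClassification(BaseModel):
    category: str
    sentiment: str


async def main() -> None:
    model_client = OllamaChatCompletionClient(model="mistral:7b-instruct")
    agent = AssistantAgent(
        name="classifier",
        model_client=model_client,
        system_message="Classify the support ticket.",
        output_content_type=TicketClassification,  # the final message carries a TicketClassification
    )
    result = await agent.run(task="My invoice is wrong and I am very unhappy about it.")
    print(result.messages[-1].content)


asyncio.run(main())
```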
- `AssistantAgent` sends the request to the LLM with the function/tool definition.
- The LLM returns a `function_call` (structured tool request), e.g. `{ "tool_call": { "name": "get_weather", "arguments": {"city": "Paris"} } }`.
- `AssistantAgent` handles the `function_call`:
  - calls the corresponding local Python function,
  - feeds the function result back to the LLM,
  - and the LLM produces the final answer.
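A minimal sketch of this flow; the `get_weather` tool is a hypothetical stand-in for a real function:

```python
import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.ollama import OllamaChatCompletionClient


async def get_weather(city: str) -> str:
    """Hypothetical tool: return a canned weather report for the city."""
    return f"The weather in {city} is 22 degrees and sunny."


async def main() -> None:
    model_client = OllamaChatCompletionClient(model="mistral:7b-instruct")
    agent = AssistantAgent(
        name="weather_assistant",
        model_client=model_client,
        tools=[get_weather],       # plain Python functions are wrapped as tools
        reflect_on_tool_use=True,  # feed the tool result back to the LLM for a final answer
    )
    result = await agent.run(task="What is the weather in Paris?")
    print(result.messages[-1].content)


asyncio.run(main())
```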
- `AssistantAgent` sends the prompt to the LLM with tool metadata (JSON schema, tool name, etc.).
- If the LLM supports function calling, it responds with a `tool_call`.
- `AssistantAgent` routes the call to `McpWorkbench`, which calls the `mcp-server-fetch` tool.
- The `mcp-server-fetch` tool does the fetch (like downloading the Wikipedia page) and returns the result.
- `AssistantAgent` continues reasoning or summarizing with the result.
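A sketch of wiring this up, assuming a recent `autogen-agentchat` where `AssistantAgent` accepts a `workbench`; `mcp-server-fetch` is launched over stdio via `uvx`:

```python
import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.ollama import OllamaChatCompletionClient
from autogen_ext.tools.mcp import McpWorkbench, StdioServerParams


async def main() -> None:
    model_client = OllamaChatCompletionClient(model="mistral:7b-instruct")
    # mcp-server-fetch runs as a local subprocess speaking MCP over stdio.
    fetch_server = StdioServerParams(command="uvx", args=["mcp-server-fetch"])
    async with McpWorkbench(server_params=fetch_server) as workbench:
        agent = AssistantAgent(
            name="fetcher",
            model_client=model_client,
            workbench=workbench,  # tools are discovered from the MCP server
        )
        result = await agent.run(task="Summarize https://en.wikipedia.org/wiki/Autonomous_agent")
        print(result.messages[-1].content)


asyncio.run(main())
```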
A team of agents sharing the same context, where each agent takes a turn and broadcasts its response to the rest of the team.
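A sketch of a two-agent round-robin team; the writer/critic roles are just an example:

```python
import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.conditions import TextMentionTermination
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_agentchat.ui import Console
from autogen_ext.models.ollama import OllamaChatCompletionClient


async def main() -> None:
    model_client = OllamaChatCompletionClient(model="mistral:7b-instruct")
    writer = AssistantAgent("writer", model_client=model_client, system_message="Write a short poem.")
    critic = AssistantAgent(
        "critic",
        model_client=model_client,
        system_message="Review the poem. Reply with TERMINATE when it is good enough.",
    )
    team = RoundRobinGroupChat([writer, critic], termination_condition=TextMentionTermination("TERMINATE"))
    await Console(team.run_stream(task="A poem about autumn."))


asyncio.run(main())
```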
A team of agents that selects the next speaker with a selector function. One of the team agents is a `UserProxyAgent`, which provides input from the user.
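A sketch of a selector team with a user in the loop; the hand-back-to-user policy in `selector_func` is a made-up example:

```python
import asyncio
from typing import Sequence

from autogen_agentchat.agents import AssistantAgent, UserProxyAgent
from autogen_agentchat.conditions import TextMentionTermination
from autogen_agentchat.messages import BaseAgentEvent, BaseChatMessage
from autogen_agentchat.teams import SelectorGroupChat
from autogen_agentchat.ui import Console
from autogen_ext.models.ollama import OllamaChatCompletionClient


def selector_func(messages: Sequence[BaseAgentEvent | BaseChatMessage]) -> str | None:
    # Hypothetical policy: let the user speak after every assistant turn;
    # returning None falls back to the model-based speaker selection.
    if messages and messages[-1].source == "assistant":
        return "user"
    return None


async def main() -> None:
    model_client = OllamaChatCompletionClient(model="mistral:7b-instruct")
    assistant = AssistantAgent("assistant", model_client=model_client)
    user = UserProxyAgent("user")  # prompts for console input by default
    team = SelectorGroupChat(
        [assistant, user],
        model_client=model_client,  # used to pick the next speaker when selector_func returns None
        selector_func=selector_func,
        termination_condition=TextMentionTermination("TERMINATE"),
    )
    await Console(team.run_stream(task="Help me plan a weekend trip."))


asyncio.run(main())
```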
- `MaxMessageTermination`: Stops after a specified number of messages have been produced, including both agent and task messages.
- `TextMentionTermination`: Stops when specific text or a string is mentioned in a message (e.g., “TERMINATE”).
- `TokenUsageTermination`: Stops when a certain number of prompt or completion tokens are used. This requires the agents to report token usage in their messages.
- `TimeoutTermination`: Stops after a specified duration in seconds.
- `HandoffTermination`: Stops when a handoff to a specific target is requested, to allow another agent to provide input.
- `SourceMatchTermination`: Stops after a specific agent responds.
- `ExternalTermination`: Enables programmatic control of termination from outside the run. This is useful for UI integration (e.g., “Stop” buttons in chat interfaces).
- `StopMessageTermination`: Stops when a `StopMessage` is produced by an agent.
- `TextMessageTermination`: Stops when a `TextMessage` is produced by an agent.
- `FunctionCallTermination`: Stops when a `ToolCallExecutionEvent` containing a `FunctionExecutionResult` with a matching name is produced by an agent.
- `FunctionalTermination`: Stops when a function expression evaluates to `True` on the last delta sequence of messages.
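Conditions can also be combined; a short sketch:

```python
from autogen_agentchat.conditions import MaxMessageTermination, TextMentionTermination

# `|` stops when either condition fires, `&` only when both have fired.
termination = MaxMessageTermination(max_messages=20) | TextMentionTermination("TERMINATE")
```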
Agent and Team interfaces have `save_state` and `load_state` methods, so their state can be persisted to and restored from a file or a database.
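A sketch of persisting team state to a JSON file (a database record would work the same way, since the state is a JSON-serializable mapping):

```python
import asyncio
import json

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_ext.models.ollama import OllamaChatCompletionClient


async def main() -> None:
    model_client = OllamaChatCompletionClient(model="mistral:7b-instruct")
    agent = AssistantAgent("assistant", model_client=model_client)
    team = RoundRobinGroupChat([agent], max_turns=1)

    await team.run(task="Remember that my favourite colour is green.")

    # save_state returns a JSON-serializable mapping; here it is written to a file.
    state = await team.save_state()
    with open("team_state.json", "w") as f:
        json.dump(state, f)

    # Later (or in another process): restore the conversation state.
    with open("team_state.json") as f:
        await team.load_state(json.load(f))


asyncio.run(main())
```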
- Note that advanced agents with deep reasoning or a large context should use large 70B models on a GPU with at least 48 GB of VRAM, such as an NVIDIA Quadro RTX 8000 48GB. I tend to prototype with local models through Ollama or a llama-cpp server because the OpenAI bill can be quite high, especially with GPT-4o models.
- Be sure that Ollama is really using the underlying GPU (e.g., NVIDIA with the CUDA toolkit compiled in); otherwise it will be very slow.
- Increase `OLLAMA_CONTEXT_LENGTH` in case the context of your agent team grows.