docs: adding agent pattern example notebooks (Langgraph) #7510

Merged · 11 commits · May 16, 2025
360 changes: 360 additions & 0 deletions tutorials/agents/langgraph/langgraph_evaluator.ipynb
@@ -0,0 +1,360 @@
{
Review comment from @Jgilhuly (Contributor), May 12, 2025, via ReviewNB: typo
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "jXCU375BitSy"
},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n",
" <br>\n",
" <a href=\"https://docs.arize.com/phoenix/\">Docs</a>\n",
" |\n",
" <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
" |\n",
" <a href=\"https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q\">Community</a>\n",
" </p>\n",
"</center>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bB9zTru-iiQS"
},
"source": [
"# LangGraph: Evaluator–Optimizer Loop\n",
"In this tutorial, we’ll build a code generation feedback loop using LangGraph — where a generator LLM writes code and an evaluator LLM provides structured reviews. This iterative pattern is useful for refining outputs over multiple steps until they meet a defined success criterion.\n",
"\n",
"The workflow consists of:\n",
"\n",
"A **generator LLM** that produces or revises code based on feedback.\n",
"\n",
"An **evaluator LLM** that assigns a grade (pass or fail) and gives feedback if needed.\n",
"\n",
"A **LangGraph state machine** that loops the generator until the evaluator approves the result.\n",
"\n",
"To make this fully observable and production-grade, we’ve instrumented the graph with Phoenix tracing. This enables you to inspect each generation and evaluation step, see what the model produced, and understand why it did (or didn’t) pass."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install langgraph langchain langchain_community \"arize-phoenix==9.0.1\" arize-phoenix-otel openinference-instrumentation-langchain"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install langchain_openai"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"from langgraph.graph import END, START, StateGraph"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kE0uzv1ei0Nc"
},
"source": [
"# Configure Phoenix Tracing\n",
"\n",
"Make sure you go to https://app.phoenix.arize.com/ and generate an API key. This will allow you to trace your Langgraph application with Phoenix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"PHOENIX_API_KEY = getpass.getpass(\"Phoenix API Key:\")\n",
"os.environ[\"PHOENIX_CLIENT_HEADERS\"] = f\"api_key={PHOENIX_API_KEY}\"\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"tracer_provider = register(project_name=\"Evaluator-Optimizer\", auto_instrument=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uCHt5qQzjQ84"
},
"source": [
"# Evaluator‑Optimizer • Code‑Writing Loop\n",
"---------------------------------------\n",
"Input : problem_spec (natural‑language description of the function/program)\n",
"\n",
"Output : refined, accepted code (Python string)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from typing import Literal, TypedDict\n",
"\n",
"from langchain_core.messages import HumanMessage, SystemMessage\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"from langchain_openai import ChatOpenAI"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Kfy5KexRjeKF"
},
"source": [
"LLMs\n",
"----\n",
"• generator_llm : writes / rewrites the code\n",
"\n",
"• evaluator_llm : grades the code via structured output"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"generator_llm = ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0.3)\n",
"evaluator_llm = ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "I3v2xmGCjlSB"
},
"source": [
"To enable structured, reliable feedback from the evaluator LLM, we define a Pydantic schema called Review. This schema ensures that all evaluations include both a binary grade (pass or fail) and feedback if the code needs improvement.\n",
"\n",
"By binding this schema to the evaluator LLM, we guarantee consistent output formatting and make it easier to route logic in the graph. This step is essential for closing the feedback loop and driving iterative optimization."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class Review(BaseModel):\n",
" grade: Literal[\"pass\", \"fail\"] = Field(description=\"Did the code fully solve the problem?\")\n",
" feedback: str = Field(description=\"If grade=='fail', give concrete, actionable feedback.\")\n",
"\n",
"\n",
"evaluator = evaluator_llm.with_structured_output(Review)"
]
},
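{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (an illustrative sketch, not part of the original pipeline), you can construct a `Review` by hand to see the exact shape the evaluator is constrained to. The commented-out call shows that the bound `evaluator` returns the same type; running it assumes a valid OpenAI API key."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Every evaluator call is constrained to this schema, so downstream routing\n",
"# can rely on `grade` being exactly \"pass\" or \"fail\".\n",
"sample = Review(grade=\"fail\", feedback=\"Handle the empty-input case before indexing.\")\n",
"print(sample.grade, \"-\", sample.feedback)\n",
"\n",
"# The bound evaluator returns a Review instance as well (requires an API key):\n",
"# review = evaluator.invoke(\"Spec: add two ints.\\nCode: def add(a, b): return a + b\")\n",
"# print(review.grade, review.feedback)"
]
},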
{
"cell_type": "markdown",
"metadata": {
"id": "fdfvi2tTjuTu"
},
"source": [
"# Langgraph Shared State\n",
"Defines the shared state for the evaluator-optimizer loop, tracking the problem description, generated code, feedback, and evaluation grade."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class State(TypedDict):\n",
" problem_spec: str\n",
" code: str\n",
" feedback: str\n",
" grade: str # pass / fail"
]
},
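{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition (an illustrative sketch, not something the graph requires): each node returns only the keys it changed, and LangGraph merges that partial update into the shared state between steps."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Conceptually, this is the merge LangGraph performs between node executions:\n",
"initial: State = {\"problem_spec\": \"Reverse a string.\", \"code\": \"\", \"feedback\": \"\", \"grade\": \"\"}\n",
"update = {\"grade\": \"fail\", \"feedback\": \"Use slicing instead of an explicit loop.\"}\n",
"merged: State = {**initial, **update}\n",
"print(merged[\"grade\"], \"-\", merged[\"feedback\"])"
]
},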
{
"cell_type": "markdown",
"metadata": {
"id": "RHpxqUrwj1PL"
},
"source": [
"# Node Functions: Generator & Evaluator\n",
"These nodes power the evaluator–optimizer loop. The code_generator node uses the generator LLM to produce or revise Python code based on the task and prior feedback. The code_evaluator node uses a structured evaluator LLM to simulate a code review — mentally testing the code against the spec and returning a binary grade along with constructive feedback if needed. This feedback is then looped back to the generator until the code passes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def code_generator(state: State):\n",
" \"\"\"Write or refine code based on feedback.\"\"\"\n",
" prompt = (\n",
" \"You are an expert Python developer.\\n\"\n",
" \"Write clear, efficient, PEP‑8 compliant code that solves the task below.\\n\"\n",
" \"If feedback is provided, revise the previous code accordingly.\\n\\n\"\n",
" f\"### Task\\n{state['problem_spec']}\\n\\n\"\n",
" )\n",
" if state.get(\"feedback\"):\n",
" prompt += f\"### Previous Reviewer Feedback\\n{state['feedback']}\\n\"\n",
"\n",
" msg = generator_llm.invoke(prompt)\n",
" return {\"code\": msg.content}\n",
"\n",
"\n",
"def code_evaluator(state: State):\n",
" \"\"\"LLM reviews the code solution.\"\"\"\n",
" review = evaluator.invoke(\n",
" [\n",
" SystemMessage(\n",
" content=(\n",
" \"You are a strict code reviewer. \"\n",
" \"Run mental tests / reasoning to decide whether the code meets the spec. \"\n",
" \"If it fails, give concise actionable feedback.\"\n",
" )\n",
" ),\n",
" HumanMessage(\n",
" content=(\n",
" f\"### Problem\\n{state['problem_spec']}\\n\\n\"\n",
" f\"### Candidate Code\\n{state['code']}\"\n",
" )\n",
" ),\n",
" ]\n",
" )\n",
" return {\"grade\": review.grade, \"feedback\": review.feedback}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "azF5t0TYkFAi"
},
"source": [
"# Routing Logic\n",
"To support iterative refinement, we define a simple routing function called route. After the code is evaluated, this function checks the grade returned by the evaluator: if the code passes, the process ends; if it fails, control loops back to the generator for another revision. This logic ensures the LLM can continuously improve its output based on feedback until it meets the quality bar."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def route(state: State):\n",
" return \"Accept\" if state[\"grade\"] == \"pass\" else \"Revise\""
]
},
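{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `route` is a plain function over the state dict, it can be unit-tested without calling any LLM. The hand-built states below are illustrative:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick checks of the routing logic on hand-built states:\n",
"assert route({\"problem_spec\": \"\", \"code\": \"\", \"feedback\": \"\", \"grade\": \"pass\"}) == \"Accept\"\n",
"assert route({\"problem_spec\": \"\", \"code\": \"\", \"feedback\": \"\", \"grade\": \"fail\"}) == \"Revise\"\n",
"print(\"routing logic behaves as expected\")"
]
},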
{
"cell_type": "markdown",
"metadata": {
"id": "AALHddZ5kJLP"
},
"source": [
"# Building the LangGraph\n",
"We now define our LangGraph-based workflow. Each component — the generator and evaluator — is added as a node. Directed edges specify the flow: the graph starts with generation, then moves to evaluation. Conditional edges use the routing logic to either terminate the loop (if passed) or revisit the generator (if failed). This structure enables self-correcting behavior with every iteration traceable via Phoenix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"builder = StateGraph(State)\n",
"\n",
"builder.add_node(\"code_generator\", code_generator)\n",
"builder.add_node(\"code_evaluator\", code_evaluator)\n",
"\n",
"builder.add_edge(START, \"code_generator\")\n",
"builder.add_edge(\"code_generator\", \"code_evaluator\")\n",
"builder.add_conditional_edges(\n",
" \"code_evaluator\",\n",
" route,\n",
" {\n",
" \"Accept\": END,\n",
" \"Revise\": \"code_generator\",\n",
" },\n",
")\n",
"\n",
"workflow = builder.compile()"
]
},
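{
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat worth knowing: if the evaluator never returns \"pass\", the Accept/Revise cycle would loop indefinitely. LangGraph guards against this with a recursion limit (25 steps by default) and raises `GraphRecursionError` once it is exceeded. A minimal sketch of setting the budget explicitly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langgraph.errors import GraphRecursionError\n",
"\n",
"# Sketch: cap the generate -> evaluate loop so a stubborn failure cannot run forever.\n",
"try:\n",
"    capped_state = workflow.invoke(\n",
"        {\"problem_spec\": \"Reverse a string in Python.\"},\n",
"        config={\"recursion_limit\": 10},  # default is 25 steps\n",
"    )\n",
"except GraphRecursionError:\n",
"    print(\"Evaluator never accepted the code within the step budget.\")"
]
},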
{
"cell_type": "markdown",
"metadata": {
"id": "9JAMsA_8kMAe"
},
"source": [
"# Example Usage"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"problem = \"\"\"\n",
"Write code for a complicated website with javascript features.\n",
"\"\"\"\n",
"\n",
"result_state = workflow.invoke({\"problem_spec\": problem})\n",
"print(\"===== FINAL CODE =====\\n\")\n",
"print(result_state[\"code\"])"
]
},
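{
"cell_type": "markdown",
"metadata": {},
"source": [
"The final state also carries the last review, which is handy to log alongside the code (illustrative; `feedback` may be empty once the grade is pass):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect the final evaluation recorded in the shared state:\n",
"print(\"grade   :\", result_state[\"grade\"])\n",
"print(\"feedback:\", result_state[\"feedback\"] or \"(none - code was accepted)\")"
]
},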
{
"cell_type": "markdown",
"metadata": {
"id": "Dg72TE-ykNlN"
},
"source": [
"# Make sure to check our your traces in Phoenix!"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}