pytorch
diff --git a/_downloads/162cf335b789dd055d4192f77cb0251c/foreach_map.ipynb b/_downloads/162cf335b789dd055d4192f77cb0251c/foreach_map.ipynb
Lines changed: 306 additions & 0 deletions
diff --git a/_downloads/3195443a0ced3cabc0ad643537bdb5cd/introyt1_tutorial.ipynb b/_downloads/3195443a0ced3cabc0ad643537bdb5cd/introyt1_tutorial.ipynb
Lines changed: 2 additions & 2 deletions
diff --git a/_downloads/4355e2cef7d17548f1e25f97a62828c4/template_tutorial.ipynb b/_downloads/4355e2cef7d17548f1e25f97a62828c4/template_tutorial.ipynb
Lines changed: 2 additions & 2 deletions
diff --git a/_downloads/63a0f0fc7b3ffb15d3a5ac8db3d521ee/tensors_deeper_tutorial.ipynb b/_downloads/63a0f0fc7b3ffb15d3a5ac8db3d521ee/tensors_deeper_tutorial.ipynb
Lines changed: 2 additions & 2 deletions
diff --git a/_downloads/770632dd3941d2a51b831c52ded57aa2/trainingyt.ipynb b/_downloads/770632dd3941d2a51b831c52ded57aa2/trainingyt.ipynb
Lines changed: 2 additions & 2 deletions
@@ -0,0 +1,306 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# For tips on running notebooks in Google Colab, see\n",
+    "# https://pytorch.org/tutorials/beginner/colab\n",
+    "%matplotlib inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "(beta) Explicit horizontal fusion with foreach\\_map and torch.compile\n",
+    "============================================================\n",
+    "\n",
+    "**Author:** [Michael Lazos](https://github.com/mlazos)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Horizontal fusion is a key optimization in ML compilers. In eager mode,\n",
+    "this is typically expressed using the `torch._foreach_*` ops, which\n",
+    "parallelize an operation across a list of tensors. However, supporting\n",
+    "all possible permutations of arguments (e.g. mixtures of scalars and\n",
+    "lists) is quite difficult. `foreach_map` allows conversion of any\n",
+    "pointwise op in `torch` to a horizontally fused foreach variant. In this\n",
+    "tutorial, we will demonstrate how to implement the Adam optimizer with\n",
+    "`foreach_map` to generate a fully fused kernel.\n",
+    "\n",
+    "<div style=\"background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px\"><strong>NOTE:</strong></div>\n",
+    "\n",
+    "<div style=\"background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px\">\n",
+    "\n",
+    "<p>This tutorial requires PyTorch 2.7.0 or later.</p>\n",
+    "\n",
+    "</div>\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Model Setup\n",
+    "===========\n",
+    "\n",
+    "For this example, we\\'ll use a simple sequence of linear layers. We\n",
+    "instantiate an independent copy to compare the two optimizer\n",
+    "implementations.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "\n",
+    "# exit cleanly if we are on a device that doesn't support ``torch.compile``\n",
+    "if not torch.cuda.is_available() or torch.cuda.get_device_capability() < (7, 0):\n",
+    "    print(\"Exiting because torch.compile is not supported on this device.\")\n",
+    "    import sys\n",
+    "    sys.exit(0)\n",
+    "\n",
+    "# Create simple model\n",
+    "model = torch.nn.Sequential(\n",
+    "    *[torch.nn.Linear(1024, 1024, False, device=\"cuda\") for _ in range(10)]\n",
+    ")\n",
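+    "model_copy = torch.nn.Sequential(\n",
+    "    *[torch.nn.Linear(1024, 1024, False, device=\"cuda\") for _ in range(10)]\n",
+    ")\n",
+    "# start both models from identical weights so the two optimizers are comparable\n",
+    "model_copy.load_state_dict(model.state_dict())\n",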
+    "input = torch.rand(1024, device=\"cuda\")\n",
+    "\n",
+    "# run forward pass\n",
+    "output = model(input)\n",
+    "output_copy = model_copy(input)\n",
+    "\n",
+    "# run backward to populate the grads for our optimizer below\n",
+    "output.sum().backward()\n",
+    "output_copy.sum().backward()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Helper functions for foreach\\_map implementation\n",
+    "================================================\n",
+    "\n",
+    "In this section, we\\'ll begin our implementation of the Adam optimizer.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from torch._higher_order_ops.foreach_map import foreach_map\n",
+    "\n",
+    "# Helper function to extract optimizer states from a torch.optim.Adam instance.\n",
+    "# Gradients are not returned here; foreach_map_adam reads them from param.grad.\n",
+    "def get_inputs(optim):\n",
+    "    steps = []\n",
+    "    params = []\n",
+    "    exp_avgs = []\n",
+    "    exp_avg_sqs = []\n",
+    "    for group in optim.param_groups:\n",
+    "        for p in group[\"params\"]:\n",
+    "            params.append(p)\n",
+    "            state = optim.state[p]\n",
+    "            exp_avgs.append(state[\"exp_avg\"])\n",
+    "            exp_avg_sqs.append(state[\"exp_avg_sq\"])\n",
+    "            steps.append(state[\"step\"])\n",
+    "\n",
+    "    return steps, params, exp_avgs, exp_avg_sqs\n",
+    "\n",
+    "\n",
+    "# Functions to update the different optimizer states\n",
+    "def update_exp_avg_sq(exp_avg_sq, grad, beta2):\n",
+    "    return exp_avg_sq.mul(beta2).addcmul(grad, grad, value=1 - beta2)\n",
+    "\n",
+    "def update_param(param, step, exp_avg, exp_avg_sq, beta1, beta2, lr, eps):\n",
+    "    bias_correction1 = 1 - torch.pow(beta1, step)\n",
+    "    bias_correction2 = (1 - torch.pow(beta2, step)).sqrt()\n",
+    "    # step_size is negated so the update can be applied with a single add\n",
+    "    step_size = (lr / bias_correction1).neg()\n",
+    "    # dividing by step_size here folds the step size into the final division\n",
+    "    denom = (exp_avg_sq.sqrt() / (bias_correction2 * step_size)).add(eps / step_size)\n",
+    "    return torch.add(param, torch.div(exp_avg, denom))\n",
+    "\n",
+    "# Our full Adam implementation\n",
+    "def foreach_map_adam(\n",
+    "    steps,\n",
+    "    params,\n",
+    "    exp_avgs,\n",
+    "    exp_avg_sqs,\n",
+    "    weight_decay=0,\n",
+    "    beta1=0.9,\n",
+    "    beta2=0.999,\n",
+    "    lr=1e-3,\n",
+    "    eps=1e-8,\n",
+    "):\n",
+    "    with torch.no_grad():\n",
+    "        grads = [param.grad for param in params]\n",
+    "        # update step\n",
+    "        updated_steps = foreach_map(lambda x: x + 1, steps)\n",
+    "        torch._foreach_copy_(steps, updated_steps)\n",
+    "\n",
+    "        if weight_decay != 0:\n",
+    "            # add weight_decay * param to each gradient and keep the result\n",
+    "            grads = foreach_map(lambda grad, param, wd: grad + wd * param, grads, params, weight_decay)\n",
+    "\n",
+    "        # Higher-order operators (HOPs) cannot have multiple outputs at the moment,\n",
+    "        # so we need to call foreach_map once for each output\n",
+    "        exp_avgs_updated = foreach_map(torch.lerp, exp_avgs, grads, 1 - beta1)\n",
+    "        exp_avgs_sq_updated = foreach_map(update_exp_avg_sq, exp_avg_sqs, grads, beta2)\n",
+    "        params_updated = foreach_map(\n",
+    "            update_param,\n",
+    "            params,\n",
+    "            steps,\n",
+    "            exp_avgs_updated,\n",
+    "            exp_avgs_sq_updated,\n",
+    "            beta1,\n",
+    "            beta2,\n",
+    "            lr,\n",
+    "            eps,\n",
+    "        )\n",
+    "        # Higher-order operators (HOPs) don't support input mutation today,\n",
+    "        # so we manually update the states in-place\n",
+    "        torch._foreach_copy_(exp_avgs, exp_avgs_updated)\n",
+    "        torch._foreach_copy_(exp_avg_sqs, exp_avgs_sq_updated)\n",
+    "        torch._foreach_copy_(params, params_updated)\n",
+    "    return"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Setting up and running the compiled kernel\n",
+    "==========================================\n",
+    "\n",
+    "In this section, we\\'ll run our Adam optimizer and compare the results with eager mode.\n",
+    "\n",
+    "<div style=\"background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px\"><strong>NOTE:</strong></div>\n",
+    "\n",
+    "<div style=\"background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px\">\n",
+    "\n",
+    "<p><code>torch.compile</code> is only supported on CUDA devices that have a compute capability of 7.0 or higher.</p>\n",
+    "\n",
+    "</div>\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "opt_eager = torch.optim.Adam(model.parameters(), lr=torch.tensor(0.01))\n",
+    "opt_eager_copy = torch.optim.Adam(model_copy.parameters(), lr=torch.tensor(0.01))\n",
+    "\n",
+    "# warm up the optimizer state dict\n",
+    "opt_eager.step()\n",
+    "opt_eager_copy.step()\n",
+    "\n",
+    "inputs = get_inputs(opt_eager_copy)\n",
+    "compiled_adam = torch.compile(foreach_map_adam)\n",
+    "\n",
+    "# optionally view the output code\n",
+    "torch._logging.set_logs(output_code=True)\n",
+    "\n",
+    "# Warmup runs to compile the function\n",
+    "for _ in range(5):\n",
+    "    opt_eager.step()\n",
+    "    compiled_adam(*inputs)\n",
+    "\n",
+    "for eager_p, compile_p in zip(opt_eager.param_groups[0][\"params\"], opt_eager_copy.param_groups[0][\"params\"]):\n",
+    "    torch.allclose(eager_p, compile_p)\n",
+    "\n",
+    "# Benchmark performance\n",
+    "\n",
+    "# Let's define a helpful benchmarking function:\n",
+    "import torch.utils.benchmark as benchmark\n",
+    "\n",
+    "def benchmark_torch_function_in_microseconds(f, *args, **kwargs):\n",
+    "    t0 = benchmark.Timer(\n",
+    "        stmt=\"f(*args, **kwargs)\", globals={\"args\": args, \"kwargs\": kwargs, \"f\": f}\n",
+    "    )\n",
+    "    return t0.blocked_autorange().mean * 1e6\n",
+    "\n",
+    "eager_runtime = benchmark_torch_function_in_microseconds(opt_eager.step)\n",
+    "compiled_runtime = benchmark_torch_function_in_microseconds(lambda: compiled_adam(*inputs))\n",
+    "\n",
+    "assert eager_runtime > compiled_runtime\n",
+    "\n",
+    "print(f\"eager runtime: {eager_runtime}us\")\n",
+    "print(f\"compiled runtime: {compiled_runtime}us\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Conclusion\n",
+    "==========\n",
+    "\n",
+    "In this tutorial, we implemented a custom fully fused Adam optimizer\n",
+    "using `foreach_map`. Expressing each state update as a `foreach_map`\n",
+    "call and compiling the result with `torch.compile` lets the compiler\n",
+    "fuse the updates horizontally across all parameters. The same pattern\n",
+    "applies to other pointwise ops in `torch` that need a horizontally\n",
+    "fused foreach variant.\n",
+    "\n",
+    "See also:\n",
+    "\n",
+    "-   [Compiled optimizer\n",
+    "    tutorial](https://pytorch.org/tutorials/recipes/compiling_optimizer.html) -\n",
+    "    an introduction to the compiled optimizer.\n",
+    "-   [Compiling the optimizer with\n",
+    "    PT2](https://dev-discuss.pytorch.org/t/compiling-the-optimizer-with-pt2/1669) -\n",
+    "    deeper technical details on the compiled optimizer.\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
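The `update_param` helper in the notebook above folds both bias corrections into `step_size` and `denom` instead of computing bias-corrected moments directly. As a rough sanity check, the sketch below compares that rearranged arithmetic with the usual Adam update on plain Python scalars. It does not use PyTorch or `foreach_map`. The function names `textbook_adam_step` and `rearranged_adam_step` and the sample values are illustrative assumptions, not part of the tutorial.

```python
import math


def textbook_adam_step(param, exp_avg, exp_avg_sq, step, beta1, beta2, lr, eps):
    """Usual Adam update: bias-correct both moments, then take the step."""
    m_hat = exp_avg / (1 - beta1 ** step)
    v_hat = exp_avg_sq / (1 - beta2 ** step)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


def rearranged_adam_step(param, exp_avg, exp_avg_sq, step, beta1, beta2, lr, eps):
    """Scalar mirror of the arithmetic in the notebook's update_param (hypothetical helper)."""
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = math.sqrt(1 - beta2 ** step)
    # Negated so the final update is an addition, as in the notebook.
    step_size = -(lr / bias_correction1)
    # Dividing by step_size here folds it into the final division below.
    denom = math.sqrt(exp_avg_sq) / (bias_correction2 * step_size) + eps / step_size
    return param + exp_avg / denom


# Arbitrary illustrative values, not taken from the tutorial's model.
hyper = dict(step=3, beta1=0.9, beta2=0.999, lr=1e-3, eps=1e-8)
expected = textbook_adam_step(0.5, 0.02, 0.0004, **hyper)
actual = rearranged_adam_step(0.5, 0.02, 0.0004, **hyper)
print(abs(expected - actual))  # difference between the two formulations
```

The two forms are algebraically the same, so any difference comes from floating-point rounding only. Writing the update as a single add of `exp_avg / denom` keeps the whole step expressible as pointwise ops, which is what `foreach_map` requires.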
@@ -34,7 +34,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "625eef0e",
+   "id": "2257e813",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -50,7 +50,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "40e708e3",
+   "id": "4229a50a",
    "metadata": {},
    "source": [
     "\n",
 
@@ -31,7 +31,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "62f602a3",
+   "id": "7d03b8e9",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -47,7 +47,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "119971fe",
+   "id": "bb240c29",
    "metadata": {},
    "source": [
     "\n",
 
@@ -34,7 +34,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "30c66a0a",
+   "id": "b39ee222",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -50,7 +50,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "47a69ea4",
+   "id": "b3c105e6",
    "metadata": {},
    "source": [
     "\n",
 
@@ -35,7 +35,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "b617591b",
+   "id": "beda72ee",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -51,7 +51,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "088f2a19",
+   "id": "e850402a",
    "metadata": {},
    "source": [
     "\n",