stefmolin
diff --git a/.github/workflows/env-checks.yml b/.github/workflows/env-checks.yml
Lines changed: 4 additions & 0 deletions
diff --git a/.github/workflows/stale.yml b/.github/workflows/stale.yml
Lines changed: 23 additions & 0 deletions
diff --git a/notebooks/1-getting_started_with_pandas.ipynb b/notebooks/1-getting_started_with_pandas.ipynb
Lines changed: 68 additions & 20 deletions
diff --git a/notebooks/3-data_visualization.ipynb b/notebooks/3-data_visualization.ipynb
Lines changed: 20 additions & 0 deletions
@@ -21,6 +21,10 @@ on:
   schedule:
     - cron: "44 22 11 * *"
 
+concurrency:
+  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
+  cancel-in-progress: true
+
 # A workflow run is made up of one or more jobs that can run sequentially or in parallel
 jobs:
   # This workflow contains a single job called "build"
 
@@ -0,0 +1,23 @@
+# This workflow warns and then closes issues and PRs that have had no activity for a specified amount of time.
+#
+# You can adjust the behavior by modifying this file.
+# For more information, see:
+# https://github.com/actions/stale
+name: Mark stale issues and pull requests
+
+on:
+  schedule:
+    - cron: '27 20 * * *'
+
+jobs:
+  stale:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/stale@v4
+        with:
+          days-before-stale: 30
+          days-before-close: 7
+          stale-issue-message: 'This issue has been marked as stale due to lack of recent activity. It will be closed if no further activity occurs.'
+          stale-pr-message: ''
+          stale-issue-label: 'stale'
+          stale-pr-label: 'stale'
@@ -2057,7 +2057,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "1dd8c102-fe33-4376-98f8-0b8c9c5b8384",
+   "id": "ef8f5c35-57e4-4347-845e-39fa4de28075",
    "metadata": {
     "slideshow": {
      "slide_type": "subslide"
@@ -2071,13 +2071,13 @@
   {
    "cell_type": "code",
    "execution_count": 25,
-   "id": "3d666cd2-eca0-4634-86a5-aebbcda99522",
+   "id": "b998653e-02dd-4540-ba66-dbc4b458b558",
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "32.6"
+       "13278.078548601512"
       ]
      },
      "execution_count": 25,
@@ -2086,33 +2086,47 @@
     }
    ],
    "source": [
-    "meteorites['mass (g)'].median()"
+    "meteorites['mass (g)'].mean()"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "f322c0f3-a057-4193-9f7f-78c9828d6197",
+   "id": "a398ecbe-10cc-4498-a7f1-91ea0bc736d2",
    "metadata": {
     "slideshow": {
      "slide_type": "fragment"
     },
     "tags": []
    },
    "source": [
-    "We can take this a step further and look at quantiles:"
+    "**Important**: The mean isn't always the best measure of central tendency. If there are outliers in the distribution, the mean will be skewed. Here, the mean is being pulled higher by some very heavy meteorites &ndash; the distribution is [right-skewed](https://www.analyticsvidhya.com/blog/2020/07/what-is-skewness-statistics/)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7b0162c6-f48f-4687-9902-72325ebecc0d",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "subslide"
+    },
+    "tags": []
+   },
+   "source": [
+    "Taking a look at some quantiles at the extremes of the distribution shows that the mean is between the 95th and 99th percentile of the distribution, so it isn't a good measure of central tendency here:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 26,
-   "id": "5d97fd11-12eb-4970-b042-6cbbd35a3a23",
+   "id": "b7379492-da17-4358-b357-2ae6e1a26e67",
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "0.01        0.44\n",
        "0.05        1.10\n",
+       "0.50       32.60\n",
        "0.95     4000.00\n",
        "0.99    50600.00\n",
        "Name: mass (g), dtype: float64"
@@ -2124,7 +2138,41 @@
     }
    ],
    "source": [
-    "meteorites['mass (g)'].quantile([0.01, 0.05, 0.95, 0.99])"
+    "meteorites['mass (g)'].quantile([0.01, 0.05, 0.5, 0.95, 0.99])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2ca1c739-cf2b-4000-bedb-b66a3d11f071",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    },
+    "tags": []
+   },
+   "source": [
+    "A better measure in this case is the median (50th percentile), since it is robust to outliers:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "id": "bc2e62f3-899d-4a50-a2f4-8b2e73e1bc2f",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "32.6"
+      ]
+     },
+     "execution_count": 27,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "meteorites['mass (g)'].median()"
    ]
   },
   {
@@ -2142,7 +2190,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 27,
+   "execution_count": 28,
    "id": "585af605-e601-49b6-bd1f-4838ab993302",
    "metadata": {},
    "outputs": [
@@ -2152,7 +2200,7 @@
        "60000000.0"
       ]
      },
-     "execution_count": 27,
+     "execution_count": 28,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -2176,7 +2224,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 28,
+   "execution_count": 29,
    "id": "29720ccc-3855-42f7-a0d0-e41a83cf1bef",
    "metadata": {},
    "outputs": [
@@ -2196,7 +2244,7 @@
        "Name: 16392, dtype: object"
       ]
      },
-     "execution_count": 28,
+     "execution_count": 29,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -2220,7 +2268,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 29,
+   "execution_count": 30,
    "id": "79c2a1db-0eeb-4173-964a-a38741c059ba",
    "metadata": {},
    "outputs": [
@@ -2230,7 +2278,7 @@
        "466"
       ]
      },
-     "execution_count": 29,
+     "execution_count": 30,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -2254,7 +2302,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 30,
+   "execution_count": 31,
    "id": "3ac57de5-7734-478a-9772-feb82890d5ef",
    "metadata": {},
    "outputs": [
@@ -2266,7 +2314,7 @@
        "      dtype=object)"
       ]
      },
-     "execution_count": 30,
+     "execution_count": 31,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -2299,7 +2347,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 31,
+   "execution_count": 32,
    "id": "f0297d45-1d86-411f-ad8e-74cfaa3b2389",
    "metadata": {},
    "outputs": [
@@ -2512,7 +2560,7 @@
        "max                        NaN     81.166670    354.473330         NaN  "
       ]
      },
-     "execution_count": 31,
+     "execution_count": 32,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -2557,7 +2605,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 32,
+   "execution_count": 33,
    "id": "876cafcb-00ab-4f5a-8b3c-bfead4f0b14c",
    "metadata": {},
    "outputs": [],
@@ -2578,7 +2626,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 33,
+   "execution_count": 34,
    "id": "6402bb24-3da9-48e5-bde1-0b8a9576f00d",
    "metadata": {},
    "outputs": [],
 
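The notebook change above swaps the median for the mean and then explains why the mean misleads on right-skewed data. A minimal sketch of that effect follows. It uses made-up values rather than the meteorites dataset, and Python's standard-library `statistics` module instead of the pandas `.mean()` and `.median()` calls shown in the diff; the variable names are hypothetical.

```python
import statistics

# Hypothetical masses in grams (made-up values, NOT the meteorites dataset):
# nine small values plus one very heavy outlier.
masses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]

mean_mass = statistics.mean(masses)      # pulled upward by the single outlier
median_mass = statistics.median(masses)  # middle of the sorted values

# Fraction of observations that fall below the mean
share_below_mean = sum(m < mean_mass for m in masses) / len(masses)

print(f"mean={mean_mass}, median={median_mass}, below mean={share_below_mean:.0%}")
# prints: mean=104.5, median=5.5, below mean=90%
```

In this toy sample the mean sits above 90% of the observations, which parallels the notebook's point that the meteorite mean lands between the 95th and 99th percentiles while the median stays near the bulk of the data.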
@@ -32,6 +32,26 @@
     "The human brain excels at finding patterns in visual representations of the data; so in this section, we will learn how to visualize data using pandas along with the Matplotlib and Seaborn libraries for additional features. We will create a variety of visualizations that will help us better understand our data."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "267d9762-d012-43d1-82b0-02b37e110de8",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    },
+    "tags": []
+   },
+   "source": [
+    "## Why is data visualization necessary?\n",
+    "\n",
+    "So far, we have focused a lot on summarizing the data using statistics. However, summary statistics are not enough to understand the distribution &ndash; there are many possible distributions for a given set of summary statistics. Data visualization is necessary to truly understand the distribution:\n",
+    "\n",
+    "<div style=\"text-align: center; margin-top: -10px;\">\n",
+    "<img width=\"50%\" src=\"https://raw.githubusercontent.com/stefmolin/data-morph/main/docs/_static/panda-to-star-eased.gif\" alt=\"Data Morph: panda to star\" style=\"min-width: 300px; margin-bottom: -10px;\"/>\n",
+    "<div style=\"margin: auto 26%;\"><small><em>A set of points forming a panda can also form a star without any significant changes to the summary statistics displayed above. (source: <a href=\"https://github.com/stefmolin/data-morph\">Data Morph</a>)</em></small></div>\n",
+    "</div>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "e58aeca3-c71e-4b42-9ece-4eaa30ea0382",