astrojuanlu
diff --git a/‎img/red-green-refactor.png
34.8 KB b/‎img/red-green-refactor.png
34.8 KB
diff --git a/‎sessions/02_layout-unit-tests.ipynb
Lines changed: 356 additions & 0 deletions b/‎sessions/02_layout-unit-tests.ipynb
Lines changed: 356 additions & 0 deletions
@@ -0,0 +1,356 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "![IE](../img/ie.png)\n",
+    "\n",
+    "# Sessions 2 & 3: Project layout and unit tests\n",
+    "\n",
+    "### Juan Luis Cano Rodríguez <jcano@faculty.ie.edu> - Master in Business Analytics and Big Data (2019-04-03)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Triangular workflows in git\n",
+    "\n",
+    "When collaborating with a project hosted online on GitHub or GitLab, the most common setup is having a central repository, one remote fork per user, and local clones/checkouts:\n",
+    "\n",
+    "![Triangular workflow](https://github.blog/wp-content/uploads/2015/07/5dcdcae4-354a-11e5-9f82-915914fad4f7.png?resize=2000%2C951)\n",
+    "\n",
+    "(Source: https://github.blog/2015-07-29-git-2-5-including-multiple-worktrees-and-triangular-workflows/)\n",
+    "\n",
+    "Following this workflow requires discipline and sticking to a subset of actions and git commands to avoid common mistakes. This website contains all you need to know to setup your triangular workflow and we don't need to reproduce it here:\n",
+    "\n",
+    "https://www.asmeurer.com/git-workflow/\n",
+    "\n",
+    "*Notice* the different naming conventions between this website and the first image:\n",
+    "\n",
+    "1. **Convention 1**: upstream/origin/local\n",
+    "2. **Convention 2**: origin/&#x3C;username&#x3E;/local\n",
+    "\n",
+    "We will be consistent with the Aaron Meurer guide and therefore use Convention 2 all the time.\n",
+    "\n",
+    "### ⚠ After creating a pull request ⚠\n",
+    "\n",
+    "After your pull request has been merged to `master`, your local `master` and `<username>/master` will be outdated with respect to `origin/master`. On the other hand, **you should avoid working on this branch anymore in the future**: remember branches should be ephemeral and short-lived.\n",
+    "\n",
+    "To put yourself in a clean state again, you have to:\n",
+    "\n",
+    "1. Click \"remove branch\" in the pull request (don't click \"remove fork\"!)\n",
+    "2. `git checkout master` (go back to `master`)\n",
+    "3. `git fetch origin` (**never, ever use `git pull` unless you know exactly what you're doing**)\n",
+    "4. `git merge --ff-only origin master` (update your local `master` with `origin/master`, and fail if you accidentally made any commit in `master`)\n",
+    "5. `git fetch -p <username>` (✨ acknowledge the removal of the remote branch ✨)\n",
+    "6. `git branch -d old-branch` (remove the old branch)\n",
+    "7. `git push <username> master` (update your fork with respect to `origin`)\n",
+    "8. `git checkout -b new-branch` (start working in the new feature!)\n",
+    "\n",
+    "This process **has to be repeated after every pull request**.\n",
+    "\n",
+    "🌈 "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Project layout\n",
+    "\n",
+    "Most data science projects will start with a bunch of notebooks. However, at some point we will want to reuse code between them, and eventually put our models in production without the need to use the notebooks themselves ([unless you are Netflix](https://medium.com/netflix-techblog/notebook-innovation-591ee3221233)). Choosing a good project layout is extremely important to organize the code, avoid common pitfalls and be predictable (i.e. imitate the rest of the ecosystem to minimize surprise). On the other hand, there is lots (**lots**) of outdated, bad or wrong advice on the Internet about this topic, so here we will present The Truth™.\n",
+    "\n",
+    "### References\n",
+    "\n",
+    "* Packaging a Python library https://blog.ionelmc.ro/2014/05/25/python-packaging/\n",
+    "* Less known packaging features and tricks https://blog.ionelmc.ro/presentations/packaging/\n",
+    "* setuptools documentation https://setuptools.readthedocs.io/en/stable/setuptools.html"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### The `src` layout\n",
+    "\n",
+    "```\n",
+    "├─ src\n",
+    "│  └─ packagename\n",
+    "│     ├─ __init__.py\n",
+    "│     └─ ...\n",
+    "├─ tests\n",
+    "│  └─ ...\n",
+    "├─ README.txt\n",
+    "├─ setup.py\n",
+    "└─ setup.cfg\n",
+    "```\n",
+    "\n",
+    "* The `src/packagename` contains the source code of the library.\n",
+    "  - The `packagename` is what users type after `import` in a Python script, and therefore should not contain special characters.\n",
+    "  - It should contain a `__init__.py` that can be empty (more on that below).\n",
+    "  - The `src` segment prevents you from *shooting yourself in the foot*, because it's common to do `import packagename` when you are developing, and this will import the code from the directory, not from your `sys.path`. Always include it.\n",
+    "\n",
+    "* The `tests` directory contains the tests. It **must not** contain any `__init__.py` because it's not meant to be imported as a package. In very specific cases it's included inside `src/packagename`.\n",
+    "\n",
+    "* Every project contains a `README.txt` that at least explains what the project is.\n",
+    "\n",
+    "* `setup.py` can be an extremely simple file containing only this:\n",
+    "\n",
+    "```\n",
+    "from setuptools import setup\n",
+    "\n",
+    "setup()\n",
+    "```\n",
+    "\n",
+    "(This requires `setuptools > 30.3.0`, released 8 Dec 2016)\n",
+    "\n",
+    "* `setup.cfg` contains the metadata of the project. The absolutely required fields are `name`, `version`, and `packages`, therefore you will need something like this:\n",
+    "\n",
+    "```\n",
+    "[metadata]\n",
+    "name = my_package\n",
+    "version = 0.1.dev0\n",
+    "\n",
+    "# Magic! Don't touch below this line\n",
+    "[options]\n",
+    "package_dir=\n",
+    "    =src\n",
+    "packages=find:\n",
+    "\n",
+    "[options.packages.find]\n",
+    "where=src\n",
+    "```\n",
+    "\n",
+    "The `name` is what users will have to type after `pip install` and therefore can contain hyphens. **Do not confuse this** with what users have to type on `import` (see above).\n",
+    "\n",
+    "With this layout, **you can `pip install` your code** in your Python environment:\n",
+    "\n",
+    "```\n",
+    "$ pip install --editable .\n",
+    "$ python\n",
+    ">>> import packagename\n",
+    ">>>\n",
+    "```\n",
+    "\n",
+    "This is an alternative to modifying the `PYTHONPATH` environment variable (see first session)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Version numbers\n",
+    "\n",
+    "* Version numbers for Python packages are explained in [PEP 440](https://www.python.org/dev/peps/pep-0440/)\n",
+    "* For libraries, the most widely used convention is [semantic versioning](https://semver.org/): `X.Y.Z`\n",
+    "  - `Z` **must** be incremented if only backwards compatible bug fixes are introduced (a bug fix is defined as an internal change that fixes incorrect behavior)\n",
+    "  - `Y` **must** be incremented every time there is new, backwards-compatible functionality\n",
+    "  - `X` **must** be incremented every time there are backwards-incompatible changes\n",
+    "* Between releases, the version should have the `.dev0` suffix\n",
+    "* Recommendation: start with `0.1.dev0` (development version), then make a `0.1.0` release, then progress to `0.1.1` for quick fixes and `0.2.0` for new functionality, and when you want to make a promise of *relative* stability jump to `1.0.0`\n",
+    "* For applications, other conventions are more appropriate, like [calendar versioning](http://calver.org/): `[YY]YY.MM.??"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Exercise\n",
+    "\n",
+    "1. Create a directory called `nlp-utils`\n",
+    "2. `git init` inside it\n",
+    "3. Create a basic `src` layout in it, with `name = nlp-utils` and the source in `src/nlp_utils`\n",
+    "4. Create a `src/test_package/__init__.py` with a `print(\"Hello, world!\")`\n",
+    "5. Install it in editable mode using `pip` and test that `>>> import nlp_utils` prints `Hello, world!`\n",
+    "6. Include a `README.txt` and an appropriate `.gitignore` from http://gitignore.io/\n",
+    "7. Commit the changes\n",
+    "8. Create a new GitHub project and push the repository there\n",
+    "\n",
+    "🎉"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Testing\n",
+    "\n",
+    "Testing is **essential**. Many developers get along without testing their software, but as common wisdom says:\n",
+    "\n",
+    "<blockquote class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">If you use software that lacks automated tests, you are the tests.</p>&mdash; Jenny Bryan (@JennyBryan) <a href=\"https://twitter.com/JennyBryan/status/1043307291909316609?ref_src=twsrc%5Etfw\">September 22, 2018</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\" charset=\"utf-8\"></script>\n",
+    "\n",
+    "Computers excel at doing repetitive tasks: they basically never make mistakes (the mistake might be in what we told the computer to do). Humans, on the other hand, fail more often, especially under pressure, or on Friday afternoons and Monday mornings. Therefore, instead of letting the humans be the tests, we will use the computer to **frequently verify that our software works as specified**.\n",
+    "\n",
+    "### References\n",
+    "\n",
+    "* pytest documentation https://docs.pytest.org\n",
+    "\n",
+    "### Further reading\n",
+    "\n",
+    "* Extreme Programming https://www.wikiwand.com/en/Extreme_programming\n",
+    "* Obey the Testing Goat! http://www.obeythetestinggoat.com/pages/book.html#toc\n",
+    "* (Shameless self-plug) Testing and validation approaches for scientific software https://nbviewer.jupyter.org/format/slides/github/poliastro/oscw2018-talk/blob/master/Talk.ipynb"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Test-Driven Development\n",
+    "\n",
+    "> Make it work. Make it right. Make it fast.\n",
+    "\n",
+    "Test-Driven Development shifts the focus of software development to writing tests. The \"practice of test-first development, planning and writing tests before each micro-increment\" is not new: it was in use at NASA in the early 1960s ([source](https://www.wikiwand.com/en/Extreme_programming)). In the 1990s, Extreme Programming took this concept to the extreme by the use of **small, automated** tests.\n",
+    "\n",
+    "The \"test-driven development mantra\" is <span style=\"color: red\">**Red**</span> - <span style=\"color: green\">**Green**</span> - **Refactor**:\n",
+    "\n",
+    "![The mantra](../img/red-green-refactor.png)\n",
+    "\n",
+    "1. Write a test. <span style=\"color: red\">**Watch it fail**</span>.\n",
+    "2. Write just enough code to <span style=\"color: green\">**pass the test**</span>.\n",
+    "3. Improve the code without breaking the test.\n",
+    "4. Repeat."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Testing in Python\n",
+    "\n",
+    "Summary: **use pytest**. Everybody does. It rocks.\n",
+    "\n",
+    "[pytest](https://docs.pytest.org/) is a testing framework for Python that makes writing tests extremely easy. It is much more powerful than the standard library equivalent, `unittest`. To use it, you need to install it first:\n",
+    "\n",
+    "```\n",
+    "$ pip install pytest\n",
+    "```\n",
+    "\n",
+    "The simplest test is **a function with an `assert`**. The `assert` statement just fails if the contents are not `True`, and else it does nothing. *It should only be used for testing*."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "assert True  # Does nothing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "AssertionError",
+     "evalue": "",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mAssertionError\u001b[0m                            Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-2-40f67ddecc26>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0;32mFalse\u001b[0m  \u001b[0;31m# Fails!\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+      "\u001b[0;31mAssertionError\u001b[0m: "
+     ]
+    }
+   ],
+   "source": [
+    "assert False  # Fails!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "AssertionError",
+     "evalue": "Math is wrong",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mAssertionError\u001b[0m                            Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-3-c2bde6a3219c>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0;36m2\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;36m2\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;36m5\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"Math is wrong\"\u001b[0m  \u001b[0;31m# Fails with a message\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+      "\u001b[0;31mAssertionError\u001b[0m: Math is wrong"
+     ]
+    }
+   ],
+   "source": [
+    "assert 2 + 2 == 5, \"Math is wrong\"  # Fails with a message"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Example\n",
+    "\n",
+    "> Write a function that **tokenizes a sentence** (i.e. splits it into a list of words)\n",
+    "\n",
+    "First, we write a (failing) test:\n",
+    "\n",
+    "```python\n",
+    "# tests/test_tokenize.py\n",
+    "from nlp_utils import tokenize  # This will fail right away!\n",
+    "\n",
+    "def test_tokenize_returns_expected_list():\n",
+    "    sentence = \"This is a sentence\"\n",
+    "    expected_tokens = [\"This\", \"is\", \"a\", \"sentence\"]\n",
+    "\n",
+    "    tokens = tokenize(sentence)\n",
+    "\n",
+    "    assert tokens == expected_tokens\n",
+    "```\n",
+    "\n",
+    "and we run it from the command line:\n",
+    "\n",
+    "```\n",
+    "$ pytest\n",
+    "...\n",
+    "```\n",
+    "\n",
+    "Then we fix the test in the simplest way:\n",
+    "\n",
+    "```python\n",
+    "# src/nlp_utils/__init__.py\n",
+    "def tokenize(sentence):\n",
+    "    return sentence.split()\n",
+    "```\n",
+    "\n",
+    "And we watch it pass!\n",
+    "\n",
+    "```\n",
+    "$ pytest\n",
+    "...\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Exercise\n",
+    "\n",
+    "1. Add a new test that checks that `tokenize(sentence, lower=True)` returns a list of *lowercase* tokens.\n",
+    "2. Fix the test *in a way the first one doesn't break*.\n",
+    "3. *Extra*: Use `@pytest.mark.parametrize` to pass two different sentences to the new test https://docs.pytest.org/en/latest/example/parametrize.html"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}