|
4 | 4 | "cell_type": "markdown",
|
5 | 5 | "metadata": {},
|
6 | 6 | "source": [
|
7 |
| - "# 1. Validating restructured data against a schema using a spreadsheet\n", |
| 7 | + "# _Module 1 lesson 2_: Validating restructured data against a schema using a spreadsheet\n", |
8 | 8 | "\n",
|
9 | 9 | "<div class=\"alert alert-block alert-warning\">\n",
|
10 | 10 | " <b>Learning outcomes:</b>\n",
|
|
23 | 23 | "cell_type": "markdown",
|
24 | 24 | "metadata": {},
|
25 | 25 | "source": [
|
26 |
| - "## 1.1 Creating a JSON schema\n", |
| 26 | + "## 2.1 Creating a JSON schema\n", |
27 | 27 | "\n",
|
28 | 28 | "When you produced your machine-readable file in Lesson 1, you came up with your own approach to how to structure the header row and the data. You named the columns yourself, and decided on how many, and what data should be in them. You did so by reviewing the source data.\n",
|
29 | 29 | "\n",
|
|
45 | 45 | "\n",
|
46 | 46 | "In simple terms, you need to specify the columns which an input CSV or Excel-file will be restructured into. The new columns are defined by the fields in your schema. These target fields are likely to be those in your database, or in your analytical software. Until your input data conform to this structure, your data will not validate.\n",
|
47 | 47 | "\n",
|
48 |
| - "### 1.1.1 Minimum valid requirements\n", |
| 48 | + "### 2.1.1 Minimum valid requirements\n", |
49 | 49 | "\n",
|
50 | 50 | "A minimum valid schema requires a `name` to identify the schema, and a single, minimally-valid `field` containing a `name` and `type`:\n",
|
51 | 51 | "\n",
|
|
68 | 68 | "\n",
|
69 | 69 | "The `fields` value is a list, or - in JSON terminology - an `array` of dictionary `objects`. Each field, unsurprisingly, has a `name`, `title` and `description`, of which only the `name` is required. \n",
|
70 | 70 | "\n",
|
71 |
| - "### 1.1.2 Types\n", |
| 71 | + "### 2.1.2 Types\n", |
72 | 72 | "\n",
|
73 | 73 | "Fields also have a `type`. This describes the data expected and limits the actions which can be performed during the wrangling process:\n",
|
74 | 74 | "\n",
|
|
84 | 84 | "\n",
|
85 | 85 | "There are more [types and formats](https://specs.frictionlessdata.io/table-schema/#types-and-formats), such as `geojson`, `geopoint` and variations on dates.\n",
|
86 | 86 | "\n",
|
87 |
| - "### 1.1.3 Constraints\n", |
| 87 | + "### 2.1.3 Constraints\n", |
88 | 88 | "\n",
|
89 | 89 | "In addition, these data can be `constrained`:\n",
|
90 | 90 | "\n",
|
|
111 | 111 | "\n",
|
112 | 112 | "Again, there are other [constraints](https://specs.frictionlessdata.io/table-schema/#constraints), such as `pattern`, `maxLength` and `minLength`, that you can use as well.\n",
|
113 | 113 | "\n",
|
114 |
| - "### 1.1.4 Other properties\n", |
| 114 | + "### 2.1.4 Other properties\n", |
115 | 115 | "\n",
|
116 | 116 | "There are also special properties you can add to your schema that are not part of the `fields` definitions:\n",
|
117 | 117 | "\n",
|
118 | 118 | "* `missingValues`: defines which terms in your data should be treated as missing values, e.g. `-`, `NaN`, `..`, etc. This must be presented as a list, with terms defined as strings, e.g. `[\"NaN\", \"..\"]`\n",
|
119 | 119 | "\n",
|
120 |
| - "### 1.1.5 Example schema\n", |
| 120 | + "### 2.1.5 Example schema\n", |
121 | 121 | "\n",
|
122 | 122 | "As an example, let's imagine we want our destination data to conform to the following structure:\n",
|
123 | 123 | "\n",
|
124 |
| - " ========= ============ ============= ======== ================ ===================== ============= ========================\n", |
125 |
| - " la_code ba_ref occupant_name postcode occupation_state occupation_state_date prop_ba_rates occupation_state_reliefs\n", |
126 |
| - " ========= ============ ============= ======== ================ ===================== ============= ========================\n", |
127 |
| - " E06000044 177500080710 A company PO5 2SE True 2019-04-01 98530 [small_business, retail]\n", |
128 |
| - " ========= ============ ============= ======== ================ ===================== ============= ========================\n", |
| 124 | + "| la_code | ba_ref | occupant_name | postcode | occupation_state | occupation_state_date | prop_ba_rates | occupation_state_reliefs |\n", |
| 125 | + "|---------|--------|---------------|----------|------------------|-----------------------|---------------|-------------------------|\n", |
| 126 | + "| E06000044 | 177500080710 | A company | PO5 2SE | True | 2019-04-01 | 98530 | [small_business, retail] |\n", |
129 | 127 | "\n",
|
130 | 128 | "The complete schema for this example is then:"
|
131 | 129 | ]
|
|
243 | 241 | "cell_type": "markdown",
|
244 | 242 | "metadata": {},
|
245 | 243 | "source": [
|
246 |
| - "## 1.2 Apply data validation to cells in a spreadsheet\n", |
| 244 | + "## 2.2 Apply data validation to cells in a spreadsheet\n", |
247 | 245 | "\n",
|
248 | 246 | "Your `types` - at this stage - are only a guide. You will have no feedback, or error messages like you get when running Python code, if any of the data types in your field columns are wrong. There are a few ways to get that feedback so you can correct things, but we'll start with data validation in spreadsheet cells.\n",
|
249 | 247 | "\n",
|
250 | 248 | "The following is adapted from a [Microsoft Office tutorial](https://support.office.com/en-gb/article/apply-data-validation-to-cells-29fecbcc-d1b9-42c1-9d76-eff3ce5f7249). This approach will work in OpenOffice as well as Google Sheets, although the specific steps are different.\n",
|
251 | 249 | "\n",
|
252 | 250 | "Microsoft has an example file you can [download](http://download.microsoft.com/download/9/6/8/968A9140-2E13-4FDC-B62C-C1D98D2B0FE6/Data%20Validation%20Examples.xlsx).\n",
|
253 | 251 | "\n",
|
254 |
| - "### 1.2.1 Specify validation for data types\n", |
| 252 | + "### 2.2.1 Specify validation for data types\n", |
255 | 253 | "\n",
|
256 | 254 | "The process is straightforward:\n",
|
257 | 255 | "\n",
|
|
289 | 287 | "\n",
|
290 | 288 | "Now - only for new data - if a user tries to enter a value that is not valid, a pop-up appears with the message, \"This value doesn’t match the data validation restrictions for this cell.\" We'll run validation on your existing data shortly, but first a detour into `lists`.\n",
|
291 | 289 | "\n",
|
292 |
| - "### 1.2.2 Lists are a special type\n", |
| 290 | + "### 2.2.2 Lists are a special type\n", |
293 | 291 | "\n",
|
294 | 292 | "Before you can validate a `list` type, you need to specify valid terms. In Excel, this requires an [extra set of steps](https://support.office.com/en-us/article/create-a-drop-down-list-7693307a-59ef-400a-b769-c5402dce407b).\n",
|
295 | 293 | "\n",
|
|
308 | 306 | " - Convert your list to a table with __Ctrl+T__, then from the __Table Design__ tab give your table a name, permitting you to reference the table name and column (e.g. `=CityTable[City]`)\n",
|
309 | 307 | " - From the __Formulas__ tab select __Name Manager__, create a __New__ item with an appropriate name (e.g. `CityList`), and reference the cells (e.g. `=Sheet1!A4:A10`), which then lets you reference your list anywhere (e.g. `=CityList`)\n",
|
310 | 308 | "\n",
|
311 |
| - "### 1.2.3 Validate and get error messages for your existing data\n", |
| 309 | + "### 2.2.3 Validate and get error messages for your existing data\n", |
312 | 310 | "\n",
|
313 | 311 | "After you've specified validation rules on your existing data, you might be disappointed. Excel does not automatically notify you whether these cells contain invalid data. Here's a quick way to [highlight existing invalid cells](https://support.office.com/en-us/article/more-on-data-validation-f38dee73-9900-4ca6-9301-8a5f6e1f0c4c) by circling the values:\n",
|
314 | 312 | "\n",
|
|
350 | 348 | "cell_type": "markdown",
|
351 | 349 | "metadata": {},
|
352 | 350 | "source": [
|
353 |
| - "## 1.3 Saving your validated file as a comma-separated-value\n", |
| 351 | + "## 2.3 Saving your validated file as a comma-separated-value\n", |
354 | 352 | "\n",
|
355 | 353 | "Comma-separated value files (`.csv`) are text files in which the comma character `,` separates each field of text. Where a comma appears in the value - whether a `string` or `number` - the value is then surrounded by quotation marks, e.g. `100,200,\"20,000\"` indicates three values in three separate fields.\n",
|
356 | 354 | "\n",
|
|
374 | 372 | "cell_type": "markdown",
|
375 | 373 | "metadata": {},
|
376 | 374 | "source": [
|
377 |
| - "## 1.4 Validating your data and JSON schema using CSVLint\n", |
| 375 | + "## 2.4 Validating your data and JSON schema using CSVLint\n", |
378 | 376 | "\n",
|
379 | 377 | "In the next lesson, we'll learn how to validate your data using Python directly in a Jupyter Notebook. For now, we'll use an online resource provided by the Open Data Institute called [CSVLint](https://csvlint.io/).\n",
|
380 | 378 | "\n",
|
|
418 | 416 | "cell_type": "markdown",
|
419 | 417 | "metadata": {},
|
420 | 418 | "source": [
|
421 |
| - "## 1.5 Lesson tutorial\n", |
| 419 | + "## 2.5 Lesson tutorial\n", |
422 | 420 | "\n",
|
423 | 421 | "<div class=\"alert alert-block alert-success\">\n",
|
424 | 422 | " <p><b>Tutorial:</b></p>\n",
|
|
451 | 449 | "name": "python",
|
452 | 450 | "nbconvert_exporter": "python",
|
453 | 451 | "pygments_lexer": "ipython3",
|
454 |
| - "version": "3.7.7" |
| 452 | + "version": "3.8.5" |
| 453 | + }, |
| 454 | + "latex_envs": { |
| 455 | + "LaTeX_envs_menu_present": true, |
| 456 | + "autoclose": false, |
| 457 | + "autocomplete": true, |
| 458 | + "bibliofile": "biblio.bib", |
| 459 | + "cite_by": "apalike", |
| 460 | + "current_citInitial": 1, |
| 461 | + "eqLabelWithNumbers": true, |
| 462 | + "eqNumInitial": 1, |
| 463 | + "hotkeys": { |
| 464 | + "equation": "Ctrl-E", |
| 465 | + "itemize": "Ctrl-I" |
| 466 | + }, |
| 467 | + "labels_anchors": false, |
| 468 | + "latex_user_defs": false, |
| 469 | + "report_style_numbering": false, |
| 470 | + "user_envs_cfg": false |
455 | 471 | }
|
456 | 472 | },
|
457 | 473 | "nbformat": 4,
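
Section 2.1.1 in this diff states that a minimum valid schema needs a `name` plus at least one field with a `name` and `type`. A minimal sketch of that rule in Python may help readers check a schema before uploading it anywhere. The helper `check_minimum_schema` and the sample schema name `example_schema` are hypothetical, introduced only for illustration; they are not part of the notebook or of any Frictionless Data library.

```python
def check_minimum_schema(schema):
    """Return a list of problems; an empty list means the minimum is met.

    Hypothetical helper: it only checks the rule stated in section 2.1.1,
    not the full Table Schema specification.
    """
    problems = []

    # The schema itself needs a non-empty string `name`.
    name = schema.get("name")
    if not isinstance(name, str) or not name:
        problems.append("schema needs a non-empty `name`")

    # `fields` must be a non-empty list (a JSON array of objects).
    fields = schema.get("fields")
    if not isinstance(fields, list) or not fields:
        problems.append("schema needs at least one entry in `fields`")
        return problems

    # Each field needs at least a `name` and a `type`.
    for position, field in enumerate(fields):
        if not isinstance(field, dict):
            problems.append(f"field {position} is not an object")
            continue
        for key in ("name", "type"):
            if not field.get(key):
                problems.append(f"field {position} is missing `{key}`")

    return problems


# Assumed sample data, using a column name from the example table.
minimal = {
    "name": "example_schema",
    "fields": [{"name": "la_code", "type": "string"}],
}
incomplete = {"fields": [{"name": "la_code"}]}

print(check_minimum_schema(minimal))
print(check_minimum_schema(incomplete))
```

The function returns messages rather than raising, so every problem in a draft schema can be reported in a single pass.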
|
|