|
2032 | 2032 | "- __Data approximations__: Sometimes data are collected as ranges, e.g. `20-30`, or `<2`. This may be for data anonymisation (we deliberately bucket people into ranges to ensure that\n",
|
2033 | 2033 | " individuals can't be identified), or because our data are simply not sufficiently accurate (limits of the mechanism by which we recorded the data). This, also, is information. Forcing\n",
|
2034 | 2034 | " these ranges to an absolute point may satisfy a data check, but at the risk of losing information on the limits of the data gathering process.\n",
|
2035 |
| - "- __Date formats__: When is 10 April also 4 April? When you're trying to figure out whether you're working with American or global date formats, e.g. `10-4-1990` vs `4-10-1990`. Date ranges\n", |
| 2035 | + "- __Date formats__: When is 10 April also 4 October? When you're trying to figure out whether you're working with American or global date formats, e.g. `10-4-1990` vs `4-10-1990`. Date ranges\n", |
2036 | 2036 | " can also be a problem. Should you force `2007-2008` to be a specific year?\n",
|
2037 | 2037 | "\n",
|
2038 | 2038 | "This type of data validation is certainly critical _at the point of use_, but is it important _at the point of publication_? How far should you go in validating data for publication?\n",
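The two bullets on data approximations and date formats can be illustrated with a minimal Python sketch. The helper names `candidate_dates` and `parse_range` are hypothetical (they do not come from the notebook), and the sketch assumes dates written as `N-N-YYYY` and non-negative numeric ranges. Rather than forcing a value to a single point, it reports every reading a value could have and keeps range bounds intact.

```python
from datetime import datetime


def candidate_dates(text):
    """Return every calendar date `text` could mean (hypothetical helper).

    Tries both day-first and month-first readings of an N-N-YYYY string.
    """
    found = set()
    for fmt in ("%d-%m-%Y", "%m-%d-%Y"):
        try:
            found.add(datetime.strptime(text, fmt).date())
        except ValueError:
            # This reading is impossible (e.g. month 25), so skip it.
            pass
    return sorted(found)


def parse_range(text):
    """Keep a bucketed value as (low, high) bounds (hypothetical helper).

    Assumes non-negative numbers; None marks an open-ended bound.
    """
    text = text.strip()
    if text.startswith("<"):
        return (None, float(text[1:]))
    if text.startswith(">"):
        return (float(text[1:]), None)
    if "-" in text:
        low, high = text.split("-", 1)
        return (float(low), float(high))
    value = float(text)
    return (value, value)


# An ambiguous date yields two candidates; an unambiguous one yields one.
print(candidate_dates("10-4-1990"))
print(candidate_dates("25-4-1990"))

# Ranges keep their limits instead of collapsing to a point.
print(parse_range("20-30"))
print(parse_range("<2"))
```

When `candidate_dates` returns more than one date, the value cannot be resolved from the string alone, and the decision needs metadata from the publisher about which date convention was used.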
|
|
2281 | 2281 | "\n",
|
2282 | 2282 | "However, the hard work is getting data into a place where a few simple programmatic fixes can manipulate our data into any format that works for the user.\n",
|
2283 | 2283 | "\n",
|
| 2284 | + "<div class=\"alert alert-block alert-warning\">\n", |
| 2285 | + " <p><b>Never trust source data:</b> any data that comes from outside your work environment cannot be trusted until proven otherwise. It does not matter if the publisher is <i>trustworthy</i> or claims their data <i>validates</i>. Until your own systems have shown that the data validates, do not import it untested.</p>\n", |
| 2286 | + " <p>A publisher supports their users' workflow by ensuring data is machine-readable and well-structured, that all terms are clearly defined, and that there is metadata for everything. After that, trust, but verify.</p>\n", |
| 2287 | + "</div>\n", |
| 2288 | + "\n", |
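One way to act on the warning above is to check incoming records before they reach your own systems. The following is a minimal sketch, not a full validation framework: the function `validate_rows` and the field names `id` and `date` are assumptions made for illustration, and the only check performed is that required fields are present and non-empty.

```python
def validate_rows(rows, required_fields):
    """Split rows into (accepted, rejected) before import (hypothetical helper).

    A row is rejected when any required field is missing or empty.
    Rejected entries record the row index and the offending fields.
    """
    accepted, rejected = [], []
    for index, row in enumerate(rows):
        missing = [field for field in required_fields
                   if row.get(field) in (None, "")]
        if missing:
            rejected.append((index, missing))
        else:
            accepted.append(row)
    return accepted, rejected


# Illustrative records only; the field names are assumptions.
rows = [
    {"id": "1", "date": "10-4-1990"},
    {"id": "", "date": "4-10-1990"},
    {"id": "3"},
]
accepted, rejected = validate_rows(rows, ["id", "date"])
print(len(accepted), "accepted;", "rejected:", rejected)
```

Keeping the rejected rows, rather than silently dropping or repairing them, preserves the evidence you need when deciding whether to fix the data or leave it as is.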
2284 | 2289 | "As an exercise, describe what decisions you should make to either fix or leave these data as is.\n",
|
2285 | 2290 | "\n",
|
2286 | 2291 | "### 2.3.2 Data publication and citation\n",
|
|
2387 | 2392 | "name": "python",
|
2388 | 2393 | "nbconvert_exporter": "python",
|
2389 | 2394 | "pygments_lexer": "ipython3",
|
2390 |
| - "version": "3.7.7" |
| 2395 | + "version": "3.6.8" |
2391 | 2396 | }
|
2392 | 2397 | },
|
2393 | 2398 | "nbformat": 4,
|
|