Commit 678f9f4

Revisions and translation edits
1 parent dadace7 commit 678f9f4

2 files changed, with 75 additions and 72 deletions

Lesson 2-2 - Restructuring and validating data against a schema using Python and whyqd.ipynb

Lines changed: 7 additions & 2 deletions
@@ -2032,7 +2032,7 @@
20322032
"- __Data approximations__: Sometimes data are collected as ranges, e.g. `20-30`, or `<2`. This may be for data anonymisation (we deliberately bucket people into ranges to ensure that\n",
20332033
" individuals can't be identified), or because our data are simply not sufficiently accurate (limits of the mechanism by which we recorded the data). This, also, is information. Forcing\n",
20342034
" these ranges to an absolute point may satisfy a data check, but at the risk of losing information on the limits of the data gathering process.\n",
2035-
"- __Date formats__: When is 10 April also 4 April? When you're trying to figure out whether you're working with American or global date formats, e.g. `10-4-1990` vs `4-10-1990`. Date ranges\n",
2035+
"- __Date formats__: When is 10 April also 4 October? When you're trying to figure out whether you're working with American or global date formats, e.g. `10-4-1990` vs `4-10-1990`. Date ranges\n",
20362036
" can also be a problem. Should you force `2007-2008` to be a specific year?\n",
20372037
"\n",
20382038
"This type of data validation is certainly critical _at the point of use_, but is it important _at the point of publication_? How far should you go in validating data for publication?\n",
@@ -2281,6 +2281,11 @@
22812281
"\n",
22822282
"However, the hard work is getting data into a place where a few simple programmatic fixes can manipulate our data into any format that works for the user.\n",
22832283
"\n",
2284+
"<div class=\"alert alert-block alert-warning\">\n",
2285+
" <p><b>Never trust source data:</b> any data that comes from outside your work environment cannot be trusted until proven otherwise. It does not matter if the publisher is <i>trustworthy</i> or claims their data <i>validates</i>. Until definitively proven to validate by your own systems you can't simply import it into your systems untested.</p>\n",
2286+
" <p>A publisher supports their user's workflow by ensuring data is machine-readable, well-structured, that all terms are clearly defined, and there is metadata for everything. After that, trust, but verify.</p>\n",
2287+
"</div>\n",
2288+
"\n",
22842289
"As an exercise, describe what decisions you should make to either fix or leave these data as is.\n",
22852290
"\n",
22862291
"### 2.3.2 Data publication and citation\n",
@@ -2387,7 +2392,7 @@
23872392
"name": "python",
23882393
"nbconvert_exporter": "python",
23892394
"pygments_lexer": "ipython3",
2390-
"version": "3.7.7"
2395+
"version": "3.6.8"
23912396
}
23922397
},
23932398
"nbformat": 4,

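Review note on the date-format bullet and the "Never trust source data" warning in this commit: the point about `10-4-1990` vs `4-10-1990` can be checked mechanically before any import. The following is a minimal sketch using only the Python standard library, not whyqd's API. The names `DATE_FORMATS`, `date_readings` and `check_date` are hypothetical and introduced here for illustration. It assumes dates arrive as hyphen-separated strings with a four-digit year, and it flags problems instead of coercing a value to a single date.

```python
from datetime import datetime

# Assumed candidate layouts: "global" day-first and American month-first.
DATE_FORMATS = {"day-first": "%d-%m-%Y", "month-first": "%m-%d-%Y"}


def date_readings(text):
    """Return every calendar date the string can mean, keyed by layout."""
    readings = {}
    for label, fmt in DATE_FORMATS.items():
        try:
            readings[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            # The string is not a valid date under this layout.
            continue
    return readings


def check_date(text):
    """Classify a date string as 'ok', 'ambiguous' or 'invalid' without altering it."""
    distinct = set(date_readings(text).values())
    if not distinct:
        return "invalid"
    return "ambiguous" if len(distinct) > 1 else "ok"


for value in ["10-4-1990", "25-4-1990", "2007-2008"]:
    print(value, check_date(value))
# 10-4-1990 ambiguous
# 25-4-1990 ok
# 2007-2008 invalid
```

A value such as `2007-2008` is reported as `invalid` and left untouched, so the decision about whether to force a range to a specific year stays with the person using the data.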