Commit 4cb5c72

Lesson outcomes
1 parent 285e2f6 commit 4cb5c72

File tree

1 file changed

+72
-2
lines changed

README.md

Lines changed: 72 additions & 2 deletions
@@ -1,2 +1,72 @@
1-
# data-wrangling-and-validation
2-
Syllabus for Data Wrangling and Validation for Open Data
1+
# Data Wrangling and Validation for Open Data
2+
3+
The term _Open Data_ is generally understood to be data that are made available to the public free of charge, without registration or restrictive licenses, for any purpose whatsoever (including commercial purposes), in electronic, machine-readable formats that ensure data are easy to find, download and use.
4+
5+
Open data initiatives by public institutions, such as governments and intergovernmental organisations, recognise that such data are produced with public funds and so, with few exceptions, should be treated as public goods.
6+
7+
Data reuse, both by data experts and the public at large, is key to creating new opportunities and benefits from government data. For open data to be reused, it must meet two basic criteria:
8+
9+
1. Data must be legally open, meaning that it is placed in the public domain or under liberal terms of use with minimal restrictions. This ensures that government policies do not create barriers or ambiguities concerning how the data may be used.
10+
2. Data must be technically open, meaning that it is published in electronic formats that are machine-readable and non-proprietary. This ensures that ordinary citizens can access and use the data with little or no cost using common software tools.
11+
12+
The purpose of this syllabus in _Data Wrangling and Validation for Open Data_ is to give learners the confidence to deliver technically open data: well-structured, machine-readable data, validated against a defined, standard metadata schema.
13+
14+
## Lesson 1: Wrangling messy data
15+
16+
_Learning outcomes_:
17+
18+
- Understand, and have practical experience with, the structure and design of machine-readable data files.
19+
- Use Excel to investigate and manipulate source data to learn its metadata, shape and robustness, and employ these methods to develop a structural metadata schema.
20+
- Learn and apply a basic set of methods to restructure messy source data into machine-readable CSV files using Microsoft Excel.
21+
- Learn the basic syntax of, and approach to, coding in Python, and gain practical experience in writing it.
22+
- Integrate and apply methods from the core data analysis libraries NumPy, pandas and Matplotlib to investigate and manipulate source data.
23+
- Apply analysis and coding techniques, using the Whyqd data wrangling package, to create a structured, JSON-formatted method for restructuring data into a standard schema.
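
As a rough illustration of what "restructuring messy source data into machine-readable CSV" can mean, here is a minimal sketch that reshapes a wide table (one column per year) into one observation per row. It uses only the Python standard library so that it runs anywhere. The sample data, the column names and the helper names `wide_to_long` and `to_csv` are all hypothetical, and this is not the Whyqd method itself.

```python
import csv
import io

# Hypothetical messy source: one row per district, one column per year,
# with some blank cells.
MESSY = """District,2018,2019,2020
North,120,135,
South,98,,110
"""


def wide_to_long(text, id_column, var_name, value_name):
    """Reshape a wide table into one observation per row, skipping blank cells."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        for column, value in record.items():
            # Skip the identifier column itself and any missing or blank cell.
            if column == id_column or value is None or value.strip() == "":
                continue
            rows.append(
                {id_column: record[id_column], var_name: column, value_name: value.strip()}
            )
    return rows


def to_csv(rows, fieldnames):
    """Serialise a list of dictionaries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


long_rows = wide_to_long(MESSY, "District", "year", "population")
print(to_csv(long_rows, ["District", "year", "population"]))
```

In pandas, `DataFrame.melt` performs an equivalent reshape; the standard-library version is shown only to make each step explicit.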
24+
25+
_Project_:
26+
27+
Each participant will be assigned a spreadsheet from [training data](https://drive.google.com/open?id=0B8eZRkdFGaEHfnlwU25vdVRUOFNOdnNfWnMwb3IwYXJ3QU9BeTU0ZmlTNlpaRmZFZE5iM28) and expected to restructure it both in Excel and using Python/Whyqd.
28+
29+
## Lesson 2: Validating restructured data against a schema
30+
31+
_Learning outcomes_:
32+
33+
- Understand, and employ, standard definitions to write a JSON schema for data validation.
34+
- Perform data validation using Microsoft Excel.
35+
- Learn how to validate machine-readable data in online applications against a defined schema.
36+
- Write and use modular functions to automatically validate data files using Python.
37+
- Apply techniques in open data publication to prepare data, schemas and validation outputs for publication.
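
The idea of writing modular functions to validate data files against a schema can be sketched as follows. The schema below is a hypothetical example, loosely modelled on a Table Schema descriptor (`fields`, `type`, `constraints`). The validator is a deliberately small standard-library stand-in that handles only a `required` constraint and integer fields with a `minimum`; it is not the Frictionless Data implementation and covers only a fraction of what a real validator checks.

```python
import csv
import io

# Hypothetical schema, loosely modelled on a Table Schema descriptor.
SCHEMA = {
    "fields": [
        {"name": "District", "type": "string", "constraints": {"required": True}},
        {"name": "year", "type": "integer", "constraints": {"minimum": 1900}},
        {"name": "population", "type": "integer", "constraints": {"minimum": 0}},
    ]
}


def check_value(value, field):
    """Return an error message for one cell, or None if the cell is valid."""
    constraints = field.get("constraints", {})
    if value == "":
        return "missing required value" if constraints.get("required") else None
    if field["type"] == "integer":
        try:
            number = int(value)
        except ValueError:
            return f"{value!r} is not an integer"
        minimum = constraints.get("minimum")
        if minimum is not None and number < minimum:
            return f"{number} is below minimum {minimum}"
    return None


def validate_csv(text, schema):
    """Return a list of (row_number, field_name, message) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    expected = [field["name"] for field in schema["fields"]]
    if reader.fieldnames != expected:
        return [(1, None, f"header {reader.fieldnames} does not match {expected}")]
    errors = []
    # Data rows start on line 2 of the file, after the header.
    for row_number, record in enumerate(reader, start=2):
        for field in schema["fields"]:
            message = check_value((record[field["name"]] or "").strip(), field)
            if message:
                errors.append((row_number, field["name"], message))
    return errors


SAMPLE = """District,year,population
North,2018,120
,2019,135
South,20x0,-5
"""

errors = validate_csv(SAMPLE, SCHEMA)
for error in errors:
    print(error)
```

Returning the errors as plain tuples, rather than raising on the first failure, makes it straightforward to publish a full validation report alongside the data and schema.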
38+
39+
_Project_:
40+
41+
Using the machine-readable spreadsheet created in Lesson 1, develop a JSON schema, and validate the data using this schema on [CSV Lint](https://csvlint.io/). Then import [Frictionless Data](https://github.com/frictionlessdata/tableschema-py) and perform the same task in Python.
42+
43+
## Lesson 3: Anonymising personal data prior to publication
44+
45+
_Learning outcomes_:
46+
47+
- Acknowledge the privacy and confidentiality issues in data storage and security of personal data.
48+
- Recognise responsibilities and mechanisms for securing data-at-rest and data-in-motion.
49+
- Employ methods for anonymising data, including geospatial jitter, address and name redaction, and field obfuscation.
50+
- Investigate and apply appropriate aggregation techniques to anonymise personal data that is otherwise immune to field redaction approaches.
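
Three of the techniques named above (geospatial jitter, field obfuscation and aggregation) can be sketched in a few lines of standard-library Python. The function names, the offset of 0.01 degrees, the 16-character hash length and the minimum group size of 3 are illustrative assumptions, not recommended values. Whether any of these techniques gives adequate protection depends on the dataset and should be assessed case by case.

```python
import hashlib
import hmac
import random
from collections import Counter


def jitter(lat, lon, max_offset_deg=0.01, rng=random):
    """Shift a coordinate by a random offset of up to max_offset_deg per axis.

    Note: a degree of longitude covers less distance nearer the poles,
    so the offset in metres is not uniform across latitudes.
    """
    return (
        lat + rng.uniform(-max_offset_deg, max_offset_deg),
        lon + rng.uniform(-max_offset_deg, max_offset_deg),
    )


def obfuscate(value, secret_key):
    """Replace an identifier with a keyed hash so records stay linkable."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def aggregate(records, group_field, min_count=3):
    """Count records per group, suppressing groups smaller than min_count."""
    counts = Counter(record[group_field] for record in records)
    return {
        group: (count if count >= min_count else None)
        for group, count in counts.items()
    }


# Manufactured sample records (entirely fictional).
PEOPLE = [
    {"name": "Person A", "district": "North", "lat": 10.0, "lon": 20.0},
    {"name": "Person B", "district": "North", "lat": 10.1, "lon": 20.1},
    {"name": "Person C", "district": "North", "lat": 10.2, "lon": 20.2},
    {"name": "Person D", "district": "South", "lat": 11.0, "lon": 21.0},
]

KEY = b"hypothetical-secret-key"  # in practice, keep the key out of the published data
released = []
for person in PEOPLE:
    lat, lon = jitter(person["lat"], person["lon"])
    released.append(
        {"id": obfuscate(person["name"], KEY), "district": person["district"], "lat": lat, "lon": lon}
    )

summary = aggregate(PEOPLE, "district")
print(summary)
```

In this sample the `South` group contains a single record, so its count is suppressed (`None`) rather than published.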
51+
52+
_Project_:
53+
54+
Use a manufactured sample data file containing personal information and redact these data to prevent de-anonymisation.
55+
56+
## Project: COVID-19 data preparation for release
57+
58+
As a topical subject, this project is a general discussion of likely data sources for release, and of how to wrangle them. We may need manufactured patient records for aggregation, wrangling and release.
59+
60+
---
61+
62+
## Whois
63+
64+
My name is [Gavin Chait](https://gavinchait.com), and I am an independent data scientist specialising in economic development and data curation. I spent more than a decade in economic and development initiatives in South Africa. I was the commercial lead of open data projects at the Open Knowledge Foundation, leading the open source CKAN development team, and led the implementation of numerous open data technical and research projects around the world. Recently, I have developed [Sqwyre.com](https://sqwyre.com), an initiative to develop a comprehensive business intelligence search engine for entrepreneurs. Data are based on open data and Freedom of Information requests.
65+
66+
I have extensive experience in leading research projects, implementing open source software initiatives, and developing and leading seminars and workshops. I have taught for 25 years, including undergraduate courses, adult education, and technical and analytical teaching at all levels.
67+
68+
## Licensing and release
69+
70+
Course content, materials and approach are copyright Gavin Chait, and released under both the [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/) and the [MIT](https://opensource.org/licenses/MIT) licences.
71+
72+
The objective is to ensure reuse, and I recommend - but do not require - that any modifications or adaptations of the source material should be released under an equivalent licence.
