
01. Data standardization

Richard Zijdeman edited this page Jun 13, 2022 · 10 revisions

Data standardization is required to get as many correct links as possible. If you are not linking to the Civil Registry, only step 1 is required, but the other steps are still advised because they make it easier to link entities to other datasets later.

Standardization can be done in any programming language (for example R or Python). Experienced users can also do some of it in the metadata file using Jinja templating. See here for examples.

1. Date variables

Dates of vital events and registration (if applicable) need to be in YYYY-MM-DD format. Here's an example file in R and here's one in Python.
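
As a rough illustration of this step (not the linked example file), the sketch below converts date strings to YYYY-MM-DD in Python. The function name `to_iso_date` and the list of accepted input formats are assumptions; replace the formats with the ones that actually occur in your data.

```python
from datetime import datetime

# Assumed input formats, tried in order; adjust to your own source data.
INPUT_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%d/%m/%Y")


def to_iso_date(value):
    """Return the date as YYYY-MM-DD, or None if no assumed format matches."""
    text = value.strip()
    for fmt in INPUT_FORMATS:
        try:
            # isoformat() always yields zero-padded YYYY-MM-DD.
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(to_iso_date("13-06-1885"))
print(to_iso_date("13/06/1885"))
```

Returning None for unparseable values, rather than guessing, keeps incomplete or malformed dates visible so they can be checked by hand.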

2. Surname prefixes

Prefixes ("van", "de", etc.) need to be separated from the surname. You can use this R or Python script to do so.
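
A minimal sketch of what separating prefixes could look like in Python is given below. It is not the linked script: the function name `split_prefix` and the prefix list are illustrative assumptions, and the list is deliberately incomplete.

```python
# Illustrative, incomplete prefix list (an assumption, not the list
# used by the linked scripts).
PREFIXES = {"van", "de", "der", "den", "het", "ter", "ten", "te", "'t"}


def split_prefix(full_surname):
    """Split leading prefix tokens from a surname; returns (prefix, surname)."""
    tokens = full_surname.split()
    i = 0
    # Always leave at least one token as the surname itself.
    while i < len(tokens) - 1 and tokens[i].lower() in PREFIXES:
        i += 1
    return " ".join(tokens[:i]), " ".join(tokens[i:])


print(split_prefix("van der Berg"))
print(split_prefix("Jansen"))
```

The two parts can then be stored in separate columns, so that only the surname without its prefix is used for matching.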

3. Name standardization

To reduce mismatches caused by name variants, person names in the Civil Registry are standardized.

3.1 Character (combination) reduction

Change these characters in all person names:

'ch' to 'g'
'c' to 'k'
'z' to 's'
'ph' to 'f'
'ij' to 'y'
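
The replacements listed above could be applied in Python roughly as follows. This is a sketch under two assumptions: names are lowercased first (the list only gives lowercase characters), and the rules are applied in the order listed, so that 'ch' is handled before 'c'. The function name `reduce_characters` is hypothetical.

```python
# Replacement rules in the order listed above; 'ch' must run before 'c',
# otherwise 'ch' would already have been turned into 'kh'.
RULES = (("ch", "g"), ("c", "k"), ("z", "s"), ("ph", "f"), ("ij", "y"))


def reduce_characters(name):
    """Apply the character reductions to a lowercased copy of the name."""
    result = name.lower()  # assumption: case is not preserved
    for old, new in RULES:
        result = result.replace(old, new)
    return result


print(reduce_characters("Christina"))
print(reduce_characters("Zijlstra"))
```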

3.2 Diacritic removal

Replace characters with diacritics in all person names with the closest unaccented character ('ä' becomes 'a', etc.).
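
One common way to do this in Python is to decompose each character with the standard `unicodedata` module and drop the combining marks. The sketch below assumes that approach; the function name `strip_diacritics` is hypothetical. Characters that have no decomposition, such as 'ø' or 'ß', are left unchanged and would need an explicit mapping.

```python
import unicodedata


def strip_diacritics(name):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Zoë"))
print(strip_diacritics("André"))
```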
