01. Data standardization
Data standardization is required to get as many correct links as possible. If you are not linking to the Civil Registry, only step 1 is required, but the other steps are still advised because they make it easier to link entities to other datasets later on.
Standardization can be done in any programming language (for example R or Python). Experienced users can also do some of this in the metadata file using Jinja templating. See here for examples.
Dates of vital events and registration (if applicable) need to be in YYYY-MM-DD format. Here's an example file in R and here's one in Python.
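As a rough illustration of the date step, the sketch below converts a single date string to YYYY-MM-DD using only the Python standard library. The function name `to_iso_date` and the default input format (day-month-year, as in `03-11-1885`) are assumptions made for this example; adjust the format string to whatever your source data actually uses.

```python
from datetime import datetime


def to_iso_date(value, input_format="%d-%m-%Y"):
    """Convert a date string to YYYY-MM-DD.

    `input_format` is an assumption about the source data
    (day-month-year here); change it to match your own files.
    Raises ValueError if the value does not match the format.
    """
    parsed = datetime.strptime(value.strip(), input_format)
    return parsed.date().isoformat()


if __name__ == "__main__":
    print(to_iso_date("03-11-1885"))
    print(to_iso_date("1885/11/03", input_format="%Y/%m/%d"))
```

Values that do not match the expected format raise a `ValueError`, so incomplete or malformed dates surface early instead of being silently converted.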
Prefixes ("van", "de", etc.) need to be separated from the surname. You can use this R or Python script to do so.
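The prefix step could look roughly like the following sketch. It is not the linked script: the function name `split_prefix` and the short `PREFIXES` list are hypothetical and only illustrate the idea, so extend the list to cover the prefixes that occur in your data.

```python
# Illustrative, incomplete list of surname prefixes (an assumption).
PREFIXES = ("van der", "van den", "van de", "van", "de", "den", "der", "ter", "ten", "te")


def split_prefix(full_surname):
    """Return (prefix, surname); prefix is "" when none is found."""
    # Collapse repeated whitespace so "van  der Berg" is handled too.
    name = " ".join(full_surname.split())
    lowered = name.lower()
    # Try the longest prefixes first so "van der" wins over "van".
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        # Require a following space so "Dekker" is not split on "de".
        if lowered.startswith(prefix + " "):
            return name[:len(prefix)], name[len(prefix) + 1:]
    return "", name


if __name__ == "__main__":
    for example in ("van der Berg", "de Vries", "Dekker"):
        print(split_prefix(example))
```

The prefix and the surname can then be stored in separate columns.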
To reduce mismatches on name variants, person names in the Civil Registry are standardized.
Change these characters in all person names:
'ch' to 'g'
'c' to 'k'
'z' to 's'
'ph' to 'f'
'ij' to 'y'
Replace diacritics in all person names with the closest unaccented character (‘ä’ becomes ‘a’, etc.).
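The name standardization above can be sketched in Python as follows. Several details are assumptions, because this page does not specify them: the function name `standardize_name` is hypothetical, names are lowercased first, the replacements are applied in the listed order (so 'ch' is handled before 'c'), and diacritics are removed afterwards. Check these choices against the standardization actually used in the Civil Registry before relying on them.

```python
import unicodedata

# Replacements in the order listed above; order matters because
# 'ch' must be handled before the single-character rule for 'c'.
REPLACEMENTS = (("ch", "g"), ("c", "k"), ("z", "s"), ("ph", "f"), ("ij", "y"))


def strip_diacritics(text):
    """Decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def standardize_name(name):
    """Apply the character replacements, then remove diacritics."""
    # Lowercasing is an assumption; the page does not mention case.
    result = name.lower()
    for old, new in REPLACEMENTS:
        result = result.replace(old, new)
    return strip_diacritics(result)


if __name__ == "__main__":
    for example in ("Christina", "Philip", "Dijkstra", "Jürgen"):
        print(example, "->", standardize_name(example))
```

Characters that have no Unicode decomposition, such as 'ø' or 'ß', pass through this sketch unchanged and would need an explicit mapping.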