01. Data standardization
Data standardization is needed to obtain as many correct links as possible. If you are not linking to the Civil Registry, only step 1 is required, but the other steps are still advised because they make it easier to link entities to other datasets later.
Standardization can be done in any programming environment (e.g. R or Python). Experienced users can also do some of this in the metadata file using Jinja templating. See here for examples.
Dates of vital events and registration (if applicable) need to be in YYYY-MM-DD format.
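As an illustration of the date requirement, here is a minimal Python sketch. It assumes the source dates are written as DD-MM-YYYY; the function name `to_iso_date` and the `input_format` parameter are hypothetical and should be adapted to your own data.

```python
from datetime import datetime


def to_iso_date(value, input_format="%d-%m-%Y"):
    """Return the date as YYYY-MM-DD, or None if it cannot be parsed.

    The default input format (DD-MM-YYYY) is an assumption about the
    source data, not something prescribed by the pipeline.
    """
    try:
        parsed = datetime.strptime(value.strip(), input_format)
    except ValueError:
        # Unparseable or impossible dates (e.g. 31 February) are flagged
        # as None so they can be inspected by hand.
        return None
    return parsed.date().isoformat()


print(to_iso_date("03-11-1885"))
```

Returning `None` instead of raising keeps a batch conversion running while still making bad dates easy to find afterwards.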
Prefixes ("van", "de", etc.) need to be separated from the surname. You can use this R script for this.
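For users working in Python instead of R, prefix separation could be sketched roughly as follows. This is not a port of the R script mentioned above: the prefix list `PREFIX_WORDS` and the function `split_prefix` are illustrative assumptions, and the list will likely need extending for your data.

```python
# Illustrative, incomplete set of Dutch surname prefix words (assumption).
PREFIX_WORDS = {"van", "de", "den", "der", "het", "'t", "ter", "ten", "te", "op", "in"}


def split_prefix(full_surname):
    """Split a surname into (prefix, surname).

    Leading words are treated as prefix words as long as they appear in
    PREFIX_WORDS and at least one word remains for the surname itself.
    """
    tokens = full_surname.split()
    i = 0
    while i < len(tokens) - 1 and tokens[i].lower() in PREFIX_WORDS:
        i += 1
    return " ".join(tokens[:i]), " ".join(tokens[i:])


print(split_prefix("van der Meer"))
```

Keeping the last word as the surname prevents names such as "De" or "Ten" from being reduced to an empty surname.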
To reduce mismatching on name variants, several characters of person names in the Civil Registry are standardized.
Change these characters in all person names:
'ch' to 'g'
'c' to 'k'
'z' to 's'
'ph' to 'f'
'ij' to 'y'
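The replacements above could be applied in Python along these lines. The sketch makes two assumptions that the list itself does not state: names are lowercased first, and the replacements are applied in the listed order, so that 'ch' is handled before the single 'c'. The names `REPLACEMENTS` and `standardize_characters` are hypothetical.

```python
# Replacements in the order listed above; 'ch' must come before 'c',
# otherwise 'ch' would already have been turned into 'kh'.
REPLACEMENTS = [("ch", "g"), ("c", "k"), ("z", "s"), ("ph", "f"), ("ij", "y")]


def standardize_characters(name):
    """Apply the character replacements to a single name.

    Lowercasing first is an assumption of this sketch, made so that
    'Ch' and 'ch' are treated alike.
    """
    result = name.lower()
    for old, new in REPLACEMENTS:
        result = result.replace(old, new)
    return result


print(standardize_characters("Zijlstra"))
```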
Replace diacritics in all person names with the closest unaccented character (‘ä’ becomes ‘a’, etc.).
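One common way to strip diacritics in Python is Unicode decomposition with the standard `unicodedata` module, sketched below. The function name `strip_diacritics` is hypothetical. Letters that have no decomposed form (for example 'ø' or 'ß') are left unchanged by this approach and would need their own replacement rules.

```python
import unicodedata


def strip_diacritics(name):
    """Replace accented letters with their unaccented base letter.

    NFD normalization splits a character such as 'ä' into 'a' plus a
    combining mark; the combining marks are then dropped.
    """
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Zoë"))
```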