01. Data standardization

Data standardization is required to get as many correct links as possible. If you are not linking to the Civil Registry, only step 1 is required, but the other steps are still advised because they make it easier to link entities to other datasets later on.

Standardization can be done in any programming language (for example R or Python). Experienced users can also do some of it in the metadata file using Jinja templating. See here for examples.

1. Date variables

Dates of vital events and registration (if applicable) need to be in YYYY-MM-DD format. Here's an example file in R and here's one in Python.
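As an illustration of the target format, here is a minimal Python sketch that converts date strings to YYYY-MM-DD. The function name `to_iso_date` and the list of input formats are assumptions made for this example, not part of the linked example files. The sketch assumes day-first notation for ambiguous dates, so adjust the formats to your own source data.

```python
from datetime import datetime

# Hypothetical list of input formats, tried in order.
# Assumption: ambiguous dates such as 03-04-1871 are day-first.
INPUT_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%d/%m/%Y", "%Y%m%d")


def to_iso_date(value):
    """Return the date as a YYYY-MM-DD string, or None if no format matches."""
    text = value.strip()
    for fmt in INPUT_FORMATS:
        try:
            # isoformat() always yields zero-padded YYYY-MM-DD.
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(to_iso_date("23-04-1871"))
```

Values that match none of the formats come back as `None`, so they can be inspected by hand instead of being silently altered.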

2. Surname prefixes

Prefixes ("van", "de", etc.) need to be separated from the surname. You can use this R or Python script to do so.
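The idea can be sketched in a few lines of Python. This is not the linked script. The function name `split_prefix` and the prefix list are hypothetical, the list is deliberately incomplete, and the sketch assumes that prefixes appear at the start of the surname field.

```python
# Hypothetical, incomplete prefix list; extend it for your own data.
PREFIXES = {"van", "de", "der", "den", "het", "ten", "ter", "te", "'t"}


def split_prefix(surname):
    """Split leading prefix words from a surname; return (prefix, surname)."""
    words = surname.strip().split()
    count = 0
    # Never consume the last word, so a surname always remains.
    while count < len(words) - 1 and words[count].lower() in PREFIXES:
        count += 1
    return " ".join(words[:count]), " ".join(words[count:])


print(split_prefix("van der Berg"))
```

A surname without a prefix is returned with an empty prefix string, so the result can always be written to two separate columns.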

3. Name standardization

To reduce mismatches on name variants, person names in the Civil Registry are standardized, so the names in your own data need the same treatment.

3.1 Character (combination) reduction

Change these characters in all person names:

'ch' to 'g'
'c' to 'k'
'z' to 's'
'ph' to 'f'
'ij' to 'y'
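The replacements above could be applied as in the following Python sketch. The function name `reduce_characters` is hypothetical. Two assumptions go beyond the list itself: names are lowercased first, and the rules are applied in the order listed, which matters because 'ch' has to be handled before 'c'.

```python
# Replacement rules from the list above, applied in the order given.
# Order matters: 'ch' must be replaced before the single 'c'.
REPLACEMENTS = [("ch", "g"), ("c", "k"), ("z", "s"), ("ph", "f"), ("ij", "y")]


def reduce_characters(name):
    """Apply the character (combination) replacements to one name."""
    result = name.lower()  # assumption: case is not significant for linking
    for old, new in REPLACEMENTS:
        result = result.replace(old, new)
    return result


print(reduce_characters("Zijlstra"))
```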

3.2 Diacritic removal

Replace diacritics in all person names with the closest unaccented character (‘ä’ becomes ‘a’, etc.).
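One common way to do this in Python is Unicode decomposition with the standard `unicodedata` module, sketched below. The function name `remove_diacritics` is hypothetical, and this is one possible approach rather than the method used for the Civil Registry itself.

```python
import unicodedata


def remove_diacritics(name):
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", name)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(remove_diacritics("Jürgen"))
```

Letters that have no decomposition, such as 'ø' or 'ł', are left unchanged by this approach and would need an explicit mapping.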
