OpenRefine Workflow (Workshop 25.11.2022)
The following steps are the ones presented and worked through during the workshop on November 25th, 2022. This list covers only part of the steps needed to upload data to Wikidata via OpenRefine.
Part A
Part B
The first step in our workflow is the import of the project file, which contains the data to be uploaded to Wikidata: datasets_notReconciled.tar.gz

After importing your data, your screen should look like this (if necessary, open the image in a separate tab to see it at full size):
To get a better overview of your data, it is also suggested that you switch your view to show all datasets on a single screen. To do this, just click on "show 50 rows" at the top of your window:

As you now examine your data, you will notice that some columns include linked values (in blue), whereas other columns contain only plain text. The linked values are reconciled values: they have been matched to Wikidata items, a prerequisite for uploading our datasets to Wikidata.
A short hint on reconciliation: reconciliation is the process of matching a dataset against an external source. We will work on it in one of the following steps.
Before we reconcile our data (= match them with Wikidata items) and upload them to Wikidata, we first need to clean and edit them. This includes, among other things, the editing of columns, as we will see in the following exercise.
For this part we will work with the column Objekttypen. This column contains the types of objects that a CdV dataset may have: text, image, audio recording, video recording, software, or model. As this is a piece of information that we would like to upload to Wikidata, we will first need to reconcile it. However, some datasets (rows) include more than one object type, e.g. the dataset of the Herzog Anton Ulrich-Museum (row 11), which has objects of type Bild and Video. In order to reconcile both values, we first need to get them into separate columns.
This is where the function of splitting a column into several columns comes in handy. You can find this operation under each column at Edit column → Split into several columns..., in this case under Objekttypen:

In the dialog box that opens, you can define the separator (in this case: ,) and choose whether to remove the original column after splitting. It is suggested that you uncheck the box Remove this column, so that you can better follow the splitting process:

If you check the newly created column Objekttypen 2, you will notice that the filled cells (rows 11, 27, 37, etc.) have a whitespace before their values. In order to reconcile these values, though, you will first need to trim the whitespace. For this we will use the operation Trim leading and trailing whitespace, provided under the Edit cells menu:

After having executed this operation, there should be no whitespace before any values in this column.
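The split-and-trim steps above can be sketched in plain Python. This is only an illustration of what OpenRefine does, not OpenRefine's own code; the sample cell value is taken from the Objekttypen column described above.

```python
# Sketch of the split-and-trim logic applied by OpenRefine.
# Sample cell value from the Objekttypen column (row 11).
cell = "Bild, Video"

# "Split into several columns..." with "," as the separator:
parts = cell.split(",")

# Note the leading space in the second part, which is why the
# "Trim leading and trailing whitespace" step is needed:
trimmed = [p.strip() for p in parts]

print(parts)    # ['Bild', ' Video']
print(trimmed)  # ['Bild', 'Video']
```

The same effect could be achieved inside OpenRefine with a GREL transform, but the menu operations used in the workshop require no expressions at all.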
After having cleaned our data, it is now time to match them to existing Wikidata entities before uploading them. This is what is known as reconciliation, and in this workshop it will be applied to three columns: Objekttypen 1, Objekttypen 2 and Lizenz der Metadaten. The rest of the columns have either already been reconciled (blue values) or will be used as literals/strings or be ignored (plain text).
For each of the three columns we will have to select the Reconcile option:

The first time you use OpenRefine, you will most likely have only the English Wikidata reconciliation service activated (by default). Although this could be enough for our purposes, we will also install the German Wikidata reconciliation service. The only difference between them is the language in which we receive the reconciliation results.
After clicking on Start reconciling... we get the following dialog box:

In order to add the German Wikidata reconciliation service, we have to click on Add standard service... and provide https://wikidata.reconci.link/de/api as the service's URL value. The service should now appear in our list of reconciliation services:
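Behind the scenes, OpenRefine talks to such a service via the Reconciliation Service API: it sends a batch of named queries as a JSON object in a `queries` parameter, and the service answers with candidate items (id, name, score) for each query. The following sketch only builds such a request URL for the service added above; it does not perform the network call, and the query value is just an example from our data.

```python
import json
from urllib.parse import urlencode

# The reconciliation service endpoint added above.
SERVICE_URL = "https://wikidata.reconci.link/de/api"

# A reconciliation request is a JSON object of named queries;
# here a single query "q0" for the cell value "Bild".
queries = {"q0": {"query": "Bild"}}
request_url = SERVICE_URL + "?" + urlencode({"queries": json.dumps(queries)})

print(request_url)
```

The service's JSON response contains, under `q0.result`, the same ranked candidate list that OpenRefine displays beneath each cell after reconciling.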

Now we are ready to proceed with the main reconciliation function, which is identical for all columns.
As a rule of thumb, it is suggested that you always select the option Reconcile against no particular type and uncheck the box Auto-match candidates with high confidence in the reconciliation window. You can also ignore the content in the grey area, as it may differ depending on the column you are reconciling:

By clicking on Start reconciling... you trigger the reconciliation process.
After the Wikidata reconciliation has been completed, each cell should show a list of possible matches, as in the following example:

According to the data model of the project, there are specific Wikidata items to match the cell values with. These items can be found in the mapping tables below and determine which value among the reconciliation results should be chosen. Note that a desired Wikidata item may not be listed in the results. If this is the case, we can select the Search for match option, in order to search for the desired Wikidata item ourselves. If Wikidata does not have a matching item at all (which is not the case in this project), one can create a new item by clicking on Create new item. If the item is indeed listed in the results (e.g. Abbild in the example above), we can click on its double tick symbol, in order to match the item to this cell and all identical ones.
| Cell value | Wikidata item ID | Wikidata label [de] | Wikidata label [en] |
|---|---|---|---|
| Text | Q234460 | Text | text |
| Bild | Q478798 | Abbild | image |
| Ton | Q3302947 | Tonaufnahme | audio recording |
| Video | Q34508 | Videotechnik | video recording |
| Software | Q7397 | Software | software |
| Modell | Q1979154 | Modell | model |
| Cell value | Wikidata item ID | Wikidata label [de] | Wikidata label [en] |
|---|---|---|---|
| public domain | Q7257361 | [no label] | Creative Commons Public Domain Mark |
| CC0 1.0 | Q6938433 | CC0 | CC0 |
| CC BY | Q6905323 | Creative Commons Namensnennung | Creative Commons Attribution |
| CC BY 3.0 | Q14947546 | Creative Commons Namensnennung 3.0 Unported | Creative Commons Attribution 3.0 Unported |
| CC BY 3.0 DE | Q62619894 | Creative-Commons-Lizenz „Namensnennung 3.0 Deutschland“ | Creative Commons Attribution 3.0 Germany |
| CC BY 4.0 | Q20007257 | Creative Commons Namensnennung 4.0 International | Creative Commons Attribution 4.0 International |
| CC BY-SA | Q6905942 | Creative Commons Namensnennung – Weitergabe unter gleichen Bedingungen | Creative Commons Attribution-ShareAlike |
| CC BY-SA 3.0 | Q14946043 | Creative Commons Namensnennung – Weitergabe unter gleichen Bedingungen 3.0 Generisch | Creative Commons Attribution-ShareAlike 3.0 Unported |
| CC BY-SA 3.0 DE | Q42716613 | Creative Commons Namensnennung – Weitergabe unter gleichen Bedingungen 3.0 Deutschland | Creative Commons Attribution-ShareAlike 3.0 Germany |
| CC BY-SA 4.0 | Q18199165 | Creative Commons Namensnennung – Weitergabe unter gleichen Bedingungen 4.0 international | Creative Commons Attribution-ShareAlike 4.0 International |
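The mapping tables above amount to a simple lookup from cell value to Wikidata QID. As a sketch, here is the Objekttypen mapping expressed in Python (QIDs copied from the table; the license table could be handled in exactly the same way):

```python
# QIDs copied from the Objekttypen mapping table above.
OBJEKTTYP_TO_QID = {
    "Text": "Q234460",
    "Bild": "Q478798",
    "Ton": "Q3302947",
    "Video": "Q34508",
    "Software": "Q7397",
    "Modell": "Q1979154",
}

# During reconciliation, each cell value should be matched to
# exactly this item among the candidates:
print(OBJEKTTYP_TO_QID["Bild"])  # Q478798
```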
After the reconciliation process has been completed, you should now see a green bar on top of all reconciled columns. If you haven't reached this stage yet, or you would like to keep working from this stage on, you can import the project at its current status here: datasets_Reconciled_noSchema.tar.gz
Note: In the following image, the columns we worked with (Objekttypen 1, Objekttypen 2 and Lizenz der Metadaten) have been reconciled against the English Wikidata reconciliation service, which is why the results are shown in English (rather than German).

Our data is now ready to be uploaded; however, we first need to create a schema for mapping our data to the Wikidata ontology/model (as already defined by the project team).
In order to create the Wikidata schema, we first need to select the option Edit Wikibase schema in the Extensions drop-down menu:

You should now be able to see something similar to the image below. What you see are the column names (the reconciled ones have a green bar underneath them):

We are now ready to start creating the schema; for this we first need to select the option + add item. After clicking on + add item, we see the main skeleton of the schema, which includes the three main components of a Wikidata item:
- Wikidata instance (title)
- Terms
- Statements

These areas correspond to the main areas of a Wikidata item, as we can see in the example below:

By expanding the Terms section of the schema, we can see that we have the possibility to add values for Label, Description and Alias.

These terms, together with lang, correspond to the respective terms section of a Wikidata item:

The Statements section corresponds to the statements that constitute the core of a Wikidata item's description and follow the triple principle of Subject (Item) → Predicate (Property) → Object (Item/Value). Many such statements can be refined with qualifiers, which add information and refine the stated concepts. The following example on the English writer Douglas Adams is indicative of this structure:

The schema is used to describe how the data in each row is mapped to Wikidata property values.
First, add an item in the schema and drag & drop the reconciled column Datenset (reconciled) there (since we are creating Wikidata instances/items for each dataset).
Now add terms to the schema to specify the Wikidata item label and description for the different languages.
| Schema terms | Language | Term values |
|---|---|---|
| Label | en | column: Datenset [en] |
| Label | de | column: Datenset [de] |
| Description | en | column: Beschreibung [en] |
| Description | de | column: Beschreibung [de] |
Now add statements to the schema to specify which Wikidata properties or qualifiers the values in the columns should be assigned to for each row. You can also specify fixed values for Wikidata properties and qualifiers that all items should have in common.
| Property | Property values | Qualifier | Qualifier values |
|---|---|---|---|
| instance of (P31) | fixed value: open data (Q309901) | - | - |
| ~ | fixed value: data set (Q1172284) | - | - |
| creator (P170) | column: Institution (reconciled) | - | - |
| described at URL (P973) | column: URL (CdV) | language of work or name (P407) | German (Q188) |
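The statements table above describes, for each row, which property gets a fixed value and which property takes its value from a column. The following sketch shows this mapping applied to one hypothetical row; the row values (`"QID_OF_INSTITUTION"`, the example URL) are placeholders, while the property and item IDs come from the table.

```python
# Hypothetical row; column names taken from the tables above,
# values are placeholders.
row = {
    "Institution (reconciled)": "QID_OF_INSTITUTION",
    "URL (CdV)": "https://example.org/dataset",
}

# Each statement: (property, value, qualifiers).
statements = [
    # Fixed values shared by all items:
    ("P31", "Q309901", {}),                        # instance of: open data
    ("P31", "Q1172284", {}),                       # instance of: data set
    # Values taken from the row's columns:
    ("P170", row["Institution (reconciled)"], {}), # creator
    ("P973", row["URL (CdV)"],                     # described at URL
     {"P407": "Q188"}),                            # qualifier: language German
]

print(len(statements))  # 4
```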
After the schema has been completed, your screen should look like this (if necessary, open the image in a separate tab to see it at full size):
Note: The above schema is only part of the whole schema to be used when uploading the CdV datasets to Wikidata. If you want to work on the rest of the schema, you are advised to use the data model described in the WikiProject as your basis for creating statements and adding qualifiers, where necessary. You can also check the final version of the project file (datasets_Reconciled_withSchema.tar.gz) to see how the final schema* should look in OpenRefine.
*The workshop data has actually already been uploaded to Wikidata. In case you want to re-upload the data (see the section below) with the complete schema, you are welcome to do so. However, you are kindly asked to remove the Titelbild (Filename) column from the project as well as the image (P18) statement from the schema, because of a buggy operation when re-uploading images to Wikidata. Thank you!
In order to upload edits to Wikidata, we first need to select the option Upload edits to Wikibase... in the Extensions drop-down menu:
If you are not logged in to Wikidata yet, a login popup will open.
Enter your Wikidata account login data at Username and Password and then click Log in.
An info popup will be displayed; skip it for now by clicking Ok.
Before uploading edits to Wikidata, the number of edits and the warnings ignored so far are displayed as a precaution.
If you are really sure that the edits can be uploaded, include a meaningful description of the edit at summary, e.g. "Upload Coding da Vinci datasets", and then click Upload edits. That's all.