OpenRefine Workflow (Workshop 25.11.2022)

The following steps were presented and practiced during the workshop on November 25th, 2022. The list is indicative only: it covers part of the steps needed to upload data to Wikidata via OpenRefine.

Part A

Import data in OpenRefine
Prepare data for reconciliation and uploading
- Splitting a column into several columns
- Trim leading whitespaces
Reconcile data

Part B

Create Wikidata schema in OpenRefine
Upload data

Part A

Import data in OpenRefine

The first step in our workflow is to import the project file, which contains the data to be uploaded to Wikidata: datasets_notReconciled.tar.gz

After importing your data, your screen should look like this (if needed, open the image in a separate tab to see it at a larger size):

To get a better overview of your data, we suggest displaying all datasets on a single screen. To do this, click on "show 50 rows" at the top of the window:

As you examine your data, you will notice that some columns contain linked values (in blue), whereas others contain only plain text. The linked values are reconciled values, i.e. they have been matched to Wikidata items, which is a prerequisite for uploading our datasets to Wikidata.

A short hint on reconciliation: Reconciliation is the process of matching a dataset with that of an external source. We will work on it in one of the following steps.

Prepare data for reconciliation and uploading

Before we reconcile our data (= match them with Wikidata items) and upload them to Wikidata, we first need to clean or edit them. Among other things, this includes editing columns, as we will see in the following exercise.

Splitting a column into several columns

For this part we will work with the column Objekttypen. This column lists the types of objects that a CdV dataset may contain: text, image, audio recording, video recording, software, or model. As we would like to upload this information to Wikidata, we first need to reconcile it. However, some datasets (rows) include more than one object type, e.g. the dataset of the Herzog Anton Ulrich-Museum (row 11), which has objects of type Bild and Video. In order to reconcile both values, we first need to place them in separate columns.

This is where splitting a column into several columns comes in handy. You can find this operation under Edit column → Split into several columns... in each column's menu, in this case that of Objekttypen:

In the opened-up dialog box you get the possibility to define the separator (in this case: ,) and to remove (or not) the original column after splitting. It is suggested that you uncheck the box at Remove this column, so that you can have a better overview of the splitting process:
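
The effect of this operation can be sketched outside OpenRefine. The following Python snippet is only an illustration under stated assumptions: the helper `split_column` and the sample rows are hypothetical, and it does not show how OpenRefine implements the operation internally. It splits on the separator `,` and keeps the original column, as suggested above.

```python
def split_column(rows, column, separator=","):
    """Split one column into numbered columns, keeping the original column."""
    # The row with the most parts decides how many new columns are needed.
    width = max(len(row[column].split(separator)) for row in rows)
    result = []
    for row in rows:
        parts = row[column].split(separator)
        new_row = dict(row)
        for i in range(width):
            # Rows with fewer parts get an empty (None) cell.
            new_row[f"{column} {i + 1}"] = parts[i] if i < len(parts) else None
        result.append(new_row)
    return result


# Hypothetical sample rows modelled on the Objekttypen column.
rows = [{"Objekttypen": "Bild, Video"}, {"Objekttypen": "Text"}]
split_rows = split_column(rows, "Objekttypen")
print(split_rows[0])
```

In this sketch the second part keeps the space that followed the comma, which is why the split values still need trimming before they can be reconciled.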

Trim leading whitespaces

If you check the newly created column Objekttypen 2, you will notice that the filled cells (rows 11, 27, 37, etc.) have a whitespace before their values. Before these values can be reconciled, the whitespace has to be trimmed. For this we use the operation Trim leading and trailing whitespace, which is found in the Edit cells menu under Common transforms:

After having executed this operation, there should be no whitespace before any values in this column.
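
In plain Python, the same clean-up corresponds to `str.strip()`. The following is a minimal sketch with a hypothetical helper name (`trim_column`) and hypothetical sample rows; empty cells are left untouched.

```python
def trim_column(rows, column):
    """Strip leading and trailing whitespace from every filled cell in a column."""
    for row in rows:
        value = row.get(column)
        # Only string cells are trimmed; empty (None) cells stay as they are.
        if isinstance(value, str):
            row[column] = value.strip()
    return rows


# Hypothetical sample rows: one filled cell with a leading space, one empty cell.
rows = [{"Objekttypen 2": " Video"}, {"Objekttypen 2": None}]
trimmed = trim_column(rows, "Objekttypen 2")
print(trimmed)
```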

Reconcile data

Now that we have cleaned our data, it is time to match them to existing Wikidata entities before uploading them. This is known as reconciliation, and in this workshop it is applied to three columns: Objekttypen 1, Objekttypen 2 and Lizenz der Metadaten. The remaining columns have either already been reconciled (blue values), or they will be used as literals/strings or ignored (plain text).

For each of the three columns we will have to select the Reconcile option:

Adding Wikidata reconciliation services

The first time you use OpenRefine, you will most likely have only the English Wikidata reconciliation service activated (by default). Although this would be enough for our purposes, we will also add the German Wikidata reconciliation service. The two services differ only in the language in which the reconciliation results are returned.

After having clicked on Start reconciling... we get the following dialog box:

In order to add the German Wikidata reconciliation service, click on Add standard service... and provide https://wikidata.reconci.link/de/api as the service's URL. The service should now appear in the list of reconciliation services:

Now we are ready to proceed with the main reconciliation function, which is identical for all columns. As a rule of thumb it is suggested that you always select the option Reconcile against no particular type and uncheck the box at Auto-match candidates with high confidence in the reconciliation window. You can also ignore the content in the grey area, as it may be different, depending on the column you are reconciling:

By clicking on Start reconciling... you will trigger the reconciliation process.
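
To give an idea of what is exchanged with the service, the sketch below builds a request body for two cell values. It assumes the service accepts a form parameter named `queries` containing a JSON object of queries, as described in the Reconciliation Service API; the helper name `build_reconciliation_body` and the `limit` value are illustrative, and no request is actually sent.

```python
import json
from urllib.parse import parse_qs, urlencode

# Service URL as given in this workflow; it is not contacted by this sketch.
SERVICE_URL = "https://wikidata.reconci.link/de/api"


def build_reconciliation_body(values, limit=5):
    """Build a form-encoded body holding one query per cell value."""
    # No "type" key is set, mirroring "Reconcile against no particular type".
    queries = {
        f"q{i}": {"query": value, "limit": limit}
        for i, value in enumerate(values)
    }
    return urlencode({"queries": json.dumps(queries)})


body = build_reconciliation_body(["Bild", "Video"])
# Decode the body again to inspect what would be sent.
decoded = json.loads(parse_qs(body)["queries"][0])
print(decoded)
```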

Matching cell values to reconciliation results

After the Wikidata reconciliation has been completed, each cell should show a list of possible matches, as in the following example:

According to the data model of the project, each cell value has a specific Wikidata item it should be matched with. These items are listed in the mapping tables below and determine which of the reconciliation results should be chosen. Note that the desired Wikidata item may not appear in the results list. If this is the case, we can select the Search for match option to search for the desired Wikidata item ourselves. If Wikidata doesn't have a matching item at all (which is not the case in this project), one can create a new item by clicking on Create new item. If the item is listed in the results (e.g. Abbild in the example above), we can click on its double tick symbol to match the item to this cell and all identical ones.

Mapping table for the `Objekttypen 1` and `Objekttypen 2` columns

Cell value	Wikidata item ID	Wikidata label [de]	Wikidata label [en]
Text	Q234460	Text	text
Bild	Q478798	Abbild	image
Ton	Q3302947	Tonaufnahme	audio recording
Video	Q34508	Videotechnik	video recording
Software	Q7397	Software	software
Modell	Q1979154	Modell	model
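
If you want to double-check your matches against this table, the mapping can be written down as a simple lookup. The snippet below is a sketch only: the item IDs are copied from the table above, while the helper names `expected_item` and `is_expected_match` are hypothetical and not part of OpenRefine.

```python
# Item IDs copied from the mapping table above.
OBJECT_TYPE_ITEMS = {
    "Text": "Q234460",
    "Bild": "Q478798",
    "Ton": "Q3302947",
    "Video": "Q34508",
    "Software": "Q7397",
    "Modell": "Q1979154",
}


def expected_item(cell_value):
    """Return the Wikidata item ID expected for a cell value, or None."""
    # Surrounding whitespace is ignored, as it is trimmed before reconciling.
    return OBJECT_TYPE_ITEMS.get(cell_value.strip())


def is_expected_match(cell_value, matched_item_id):
    """Check whether a chosen match agrees with the mapping table."""
    expected = expected_item(cell_value)
    return expected is not None and expected == matched_item_id


print(expected_item("Bild"))
print(is_expected_match("Video", "Q34508"))
```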

Mapping table for the `Lizenz der Metadaten` column

Cell value	Wikidata item ID	Wikidata label [de]	Wikidata label [en]
public domain	Q7257361	[no label]	Creative Commons Public Domain Mark
CC0 1.0	Q6938433	CC0	CC0
CC BY	Q6905323	Creative Commons Namensnennung	Creative Commons Attribution
CC BY 3.0	Q14947546	Creative Commons Namensnennung 3.0 Unported	Creative Commons Attribution 3.0 Unported
CC BY 3.0 DE	Q62619894	Creative-Commons-Lizenz „Namensnennung 3.0 Deutschland“	Creative Commons Attribution 3.0 Germany
CC BY 4.0	Q20007257	Creative Commons Namensnennung 4.0 International	Creative Commons Attribution 4.0 International
CC BY-SA	Q6905942	Creative Commons Namensnennung – Weitergabe unter gleichen Bedingungen	Creative Commons Attribution-ShareAlike
CC BY-SA 3.0	Q14946043	Creative Commons Namensnennung – Weitergabe unter gleichen Bedingungen 3.0 Generisch	Creative Commons Attribution-ShareAlike 3.0 Unported
CC BY-SA 3.0 DE	Q42716613	Creative Commons Namensnennung – Weitergabe unter gleichen Bedingungen 3.0 Deutschland	Creative Commons Attribution-ShareAlike 3.0 Germany
CC BY-SA 4.0	Q18199165	Creative Commons Namensnennung – Weitergabe unter gleichen Bedingungen 4.0 international	Creative Commons Attribution-ShareAlike 4.0 International

Part B

Create Wikidata schema in OpenRefine

After the reconciliation process has been completed, you should now see a green bar on top of all reconciled columns. If you haven't reached this stage yet, or you would like to keep working from this stage on, you can import the project at its current status here: datasets_Reconciled_noSchema.tar.gz

Note: In the following image, the columns we worked with (Objekttypen 1, Objekttypen 2 and Lizenz der Metadaten) have been reconciled against the English Wikidata reconciliation service, which is why you see the results in English (rather than in German).

Our data is now ready to be uploaded. However, we first need to create a schema that maps our data to the Wikidata ontology/model (as already defined by the project team).

Create a new schema

In order to create the Wikidata schema, we first need to select the option Edit Wikibase schema in the Extensions drop-down menu:

You should now be able to see something similar to the image below. What you see are the column names (the reconciled ones have a green bar underneath them):

We are now ready to start creating the schema. To do so, we first need to select the option + add item.

Understanding the Wikidata schema architecture

After clicking on + add item, we see the main skeleton of the schema, which includes the three main components of a Wikidata item:

Wikidata instance (title)
Terms
Statements

These areas correspond to the main areas of a Wikidata item, as we can see in the example below:

Terms

By expanding the Terms section of the schema, we can see that we have the possibility to add values for Label, Description and Alias.

These terms, together with lang, correspond to the respective terms section of a Wikidata item:

Statements

The Statements section corresponds to the statements that form the core of a Wikidata item's description. They follow the triple principle of Subject (Item) → Predicate (Property) → Object (Item/Value). Many statements can be refined with qualifiers, which add information and narrow down the stated concepts. The following example on the English writer Douglas Adams illustrates this structure:

Fill the schema

The schema is used to describe how the data in each row is mapped to Wikidata property values.

Reconciled items

First, add an item to the schema and drag and drop the reconciled column Datenset (reconciled) onto it (since we are creating a Wikidata instance/item for each dataset).

Terms

Now add terms to the schema to specify the Wikidata item label and description for different languages.

Schema terms	Language	Term values
Label	en	column: `Datenset [en]`
Label	de	column: `Datenset [de]`
Description	en	column: `Beschreibung [en]`
Description	de	column: `Beschreibung [de]`
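
The mapping in this table can also be read as a small transformation from one row to the terms of one item. The sketch below assumes a row is a dictionary keyed by column name and produces a structure that loosely follows the Wikibase JSON layout for labels and descriptions; the helper `build_terms` and the sample values are hypothetical.

```python
# (term kind, language) -> source column, as listed in the table above.
TERM_COLUMNS = {
    ("labels", "en"): "Datenset [en]",
    ("labels", "de"): "Datenset [de]",
    ("descriptions", "en"): "Beschreibung [en]",
    ("descriptions", "de"): "Beschreibung [de]",
}


def build_terms(row):
    """Collect labels and descriptions for one row, skipping empty cells."""
    terms = {"labels": {}, "descriptions": {}}
    for (kind, lang), column in TERM_COLUMNS.items():
        value = row.get(column)
        if value:
            terms[kind][lang] = {"language": lang, "value": value}
    return terms


# Hypothetical sample row; the German description is left empty on purpose.
row = {
    "Datenset [en]": "Example dataset",
    "Datenset [de]": "Beispieldatensatz",
    "Beschreibung [en]": "hypothetical dataset used for illustration",
    "Beschreibung [de]": "",
}
terms = build_terms(row)
print(terms)
```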

Statements

Now add statements to the schema to specify which Wikidata properties or qualifiers the values in the columns should be assigned to for each row. You can also specify fixed values for Wikidata properties and qualifiers that all items should have in common.

Property	Property values	Qualifier	Qualifier values
instance of (P31)	fixed value: open data (Q309901)	-	-
~	fixed value: data set (Q1172284)	-	-
creator (P170)	column: `Institution (reconciled)`	-	-
described at URL (P973)	column: `URL (CdV)`	language of work or name (P407)	German (Q188)
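
The table above can be read as a list of property/value pairs per row, some with qualifiers. The following sketch models this with a small data class; the property and item IDs are taken from the table, whereas the `Statement` class, the `build_statements` helper and the sample row values are hypothetical and do not reflect OpenRefine's internal schema format.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One statement: a property, its value, and optional qualifiers."""
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)


def build_statements(row):
    """Build the statements from the table above for a single row."""
    return [
        # Fixed values shared by all items.
        Statement("P31", "Q309901"),   # instance of: open data
        Statement("P31", "Q1172284"),  # instance of: data set
        # Values taken from the row's columns.
        Statement("P170", row["Institution (reconciled)"]),  # creator
        Statement(
            "P973",                    # described at URL
            row["URL (CdV)"],
            qualifiers={"P407": "Q188"},  # language of work or name: German
        ),
    ]


# Hypothetical sample row; the institution ID is a placeholder, not a real item.
row = {
    "Institution (reconciled)": "Q_PLACEHOLDER",
    "URL (CdV)": "https://example.org/dataset",
}
statements = build_statements(row)
for statement in statements:
    print(statement)
```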

After the schema has been completed, your screen should look like this (if needed, open the image in a separate tab to see it at a larger size):

Note: The above schema is only part of the whole schema to be used when uploading the CdV datasets to Wikidata. If you want to work on the rest of the schema, you are advised to use the data model described in the WikiProject as your basis for creating statements and adding qualifiers where necessary. You can also check the final version of the project file (datasets_Reconciled_withSchema.tar.gz) to see what the final schema* should look like in OpenRefine.

*The workshop data has in fact already been uploaded to Wikidata. If you want to re-upload the data (see section below) with the complete schema, you are welcome to do so. However, you are kindly asked to remove the Titelbild(Filename) column from the project as well as the image (P18) statement from the schema, because re-uploading images to Wikidata is buggy. Thank you!

Upload data

In order to upload edits to Wikidata, we will need first to select the option Upload edits to Wikibase... in the Extensions drop-down menu:

If you are not logged in to Wikidata yet, a login popup will open.

Enter your Wikidata account login data at Username and Password and then click Log in.

An info popup will be displayed, which you can skip for now by clicking Ok.

Before uploading edits to Wikidata, the number of edits and the warnings ignored so far are displayed as a precaution.

If you are sure that the edits can be uploaded, enter a meaningful description of the edit under summary, e.g. "Upload Coding da Vinci datasets", and then click Upload edits. That's all.
