
OpenRefine Workflow (Workshop 25.11.2022)

Ilias Kyriazis edited this page Nov 25, 2022 · 109 revisions

The following steps are those presented and practised during the workshop on November 25th, 2022. The list covers only part of the steps needed to upload data to Wikidata via OpenRefine.

Part A

Part B


Part A

Import data in OpenRefine

The first step in our workflow is to import the project file, which contains the data to be uploaded to Wikidata: datasets_notReconciled.tar.gz

importData

After importing your data, your screen should look like this (if necessary, open the image in a separate tab to view it at a larger size):

importData2

To get a better overview of your data, we suggest switching the view so that all datasets are shown on a single screen. To do this, click on "show 50 rows" at the top of your window:

rows

As you now examine your data, you will notice that some columns contain linked values (in blue), whereas other columns contain only plain text. The linked values are reconciled values, i.e. values that have been matched to Wikidata items, which is a prerequisite for uploading our datasets to Wikidata.

A short hint on reconciliation: Reconciliation is the process of matching a dataset with that of an external source. We will work on it in one of the following steps.


Prepare data for reconciliation and uploading

Before we reconcile our data (= match it with Wikidata items) and upload it to Wikidata, we first need to clean or edit it. This includes, among other things, editing columns, as we will see in the following exercise.

Splitting a column into several columns

For this part we will work with the column Objekttypen. This column lists the types of objects that a CdV dataset may contain: text, image, audio recording, video recording, software, or model. As we would like to upload this information to Wikidata, we first need to reconcile it. However, some datasets (rows) include more than one object type, e.g. the dataset of the Herzog Anton Ulrich-Museum (row 11), which has objects of type Bild and Video. In order to reconcile both values, we first need to put them in separate columns.

This is where splitting a column into several columns comes in handy. You can find this operation at Edit column → Split into several columns... in the menu of each column, in this case under Objekttypen:

splittingColumns

In the dialog box that opens, you can define the separator (in this case: ,) and choose whether to remove the original column after splitting. We suggest unchecking the box at Remove this column, so that you have a better overview of the splitting process:

splitColumns_dialogBox
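To make the effect of the split concrete, here is a minimal Python sketch of what happens to a single cell. It is only an illustration and not OpenRefine's implementation; the function name `split_cell` and the sample value are assumptions made for this example.

```python
def split_cell(value, separator=","):
    """Split one cell value into parts, as a column split on a separator would."""
    if value is None or value == "":
        return []  # empty cells produce no parts
    return value.split(separator)


# Hypothetical cell value with two object types
parts = split_cell("Bild, Video")
print(parts)  # ['Bild', ' Video']
```

Note that the second part keeps the space that followed the comma, which is why the new column needs the clean-up described in the next step.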

Trim leading whitespaces

If you check the newly created column Objekttypen 2, you will notice that the filled cells (rows 11, 27, 37, etc.) have a whitespace character before their values. In order to reconcile these values, you first need to trim the whitespace. For this we use the operation Trim leading and trailing whitespace, which is provided under the Edit cells menu:

trimWhitespace

After having executed this operation, there should be no whitespace before any values in this column.
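As a rough illustration of what the trim operation does to a column, consider the following Python sketch. The helper `trim_cells` and the sample column are assumptions for this example; empty cells are represented here as `None`.

```python
def trim_cells(column):
    """Strip leading and trailing whitespace from every text cell; leave empty cells alone."""
    return [cell.strip() if isinstance(cell, str) else cell for cell in column]


# Hypothetical content of Objekttypen 2 after the split
column = [" Video", None, " Ton", "Text"]
print(trim_cells(column))  # ['Video', None, 'Ton', 'Text']
```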


Reconcile data

After having cleaned our data, it is now time to match it to existing Wikidata entities before uploading it. This is what is known as reconciliation, and in this workshop it will be applied to three columns: Objekttypen 1, Objekttypen 2 and Lizenz der Metadaten. The rest of the columns have either already been reconciled (blue values), or they will be used as literals/strings or ignored (plain text).

For each of the three columns we will have to select the Reconcile option:

startReconciling

Adding Wikidata reconciliation services

The first time you use OpenRefine, it is very likely that only the English Wikidata reconciliation service will be activated (by default). Although this could be enough for our purposes, we will also add the German Wikidata reconciliation service. The only difference between the two is the language in which we receive the reconciliation results.

After having clicked on Start reconciling... we get the following dialog box:

reconcileDialogBox

In order to add the German Wikidata reconciliation service, click on Add standard service... and provide https://wikidata.reconci.link/de/api as the service's URL. The service should now appear in our list of reconciliation services:

reconciliationServices

Now we are ready to proceed with the main reconciliation function, which is identical for all columns. As a rule of thumb it is suggested that you always select the option Reconcile against no particular type and uncheck the box at Auto-match candidates with high confidence in the reconciliation window. You can also ignore the content in the grey area, as it may be different, depending on the column you are reconciling:

reconciliationWindow

By clicking on Start reconciling... you will trigger the reconciliation process.
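For readers curious about what a reconciliation client sends to such a service, the following Python sketch builds a batch of queries for the column values. It assumes the service accepts a form-encoded `queries` parameter containing a JSON object of named queries, as described in the Reconciliation Service API; this is an assumption to verify against the service's own documentation, not a description of OpenRefine's internals. The helper name `build_reconciliation_body` is hypothetical, and no request is actually sent.

```python
import json
from urllib.parse import parse_qs, urlencode

# Service URL taken from the step above
SERVICE_URL = "https://wikidata.reconci.link/de/api"


def build_reconciliation_body(values, limit=5):
    """Build a form-encoded body with one query per cell value.

    No "type" key is set, mirroring "Reconcile against no particular type".
    """
    queries = {
        f"q{i}": {"query": value, "limit": limit}
        for i, value in enumerate(values)
    }
    return urlencode({"queries": json.dumps(queries)})


body = build_reconciliation_body(["Bild", "Video"])

# Round-trip check: decode the body again to inspect the first query
decoded = json.loads(parse_qs(body)["queries"][0])
print(decoded["q0"])  # {'query': 'Bild', 'limit': 5}
```

The body could then be posted to `SERVICE_URL` with any HTTP client; the service would be expected to return a list of candidate items per query.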

Matching cell values to reconciliation results

After the Wikidata reconciliation has been completed, each cell should show a list of possible matches, as in the following example:

reconciliation example

According to the data model of the project, there are specific Wikidata items to match the cell values with. These items are listed in the mapping tables below and determine which of the reconciliation results should be chosen. Note that the desired Wikidata item may not appear in the results list. If this is the case, we can select the Search for match option in order to search for the desired Wikidata item ourselves. If Wikidata doesn't have a matching item at all (which is not the case in this project), one can create a new item by clicking on Create new item. If the item is listed in the results (e.g. Abbild in the example above), we can click on its double tick symbol in order to match the item to this cell and all identical ones.

Mapping table for the Objekttypen 1 and Objekttypen 2 columns

| Cell value | Wikidata item ID | Wikidata label [de] | Wikidata label [en] |
| --- | --- | --- | --- |
| Text | Q234460 | Text | text |
| Bild | Q478798 | Abbild | image |
| Ton | Q3302947 | Tonaufnahme | audio recording |
| Video | Q34508 | Videotechnik | video recording |
| Software | Q7397 | Software | software |
| Modell | Q1979154 | Modell | model |
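The mapping for the object types can also be written down as a simple lookup, which is handy for checking your matches afterwards. The sketch below is only an aid for this tutorial; the function name `expected_item` is an assumption, while the item IDs are copied from the mapping table for Objekttypen 1 and Objekttypen 2.

```python
# Cell value -> Wikidata item ID, copied from the mapping table
OBJECT_TYPE_TO_QID = {
    "Text": "Q234460",
    "Bild": "Q478798",
    "Ton": "Q3302947",
    "Video": "Q34508",
    "Software": "Q7397",
    "Modell": "Q1979154",
}


def expected_item(cell_value):
    """Return the item ID the data model expects, or None if the value is unmapped."""
    return OBJECT_TYPE_TO_QID.get(cell_value.strip())


print(expected_item("Bild"))    # Q478798
print(expected_item(" Video"))  # Q34508
```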

Mapping table for the Lizenz der Metadaten column

| Cell value | Wikidata item ID | Wikidata label [de] | Wikidata label [en] |
| --- | --- | --- | --- |
| public domain | Q7257361 | [no label] | Creative Commons Public Domain Mark |
| CC0 1.0 | Q6938433 | CC0 | CC0 |
| CC BY | Q6905323 | Creative Commons Namensnennung | Creative Commons Attribution |
| CC BY 3.0 | Q14947546 | Creative Commons Namensnennung 3.0 Unported | Creative Commons Attribution 3.0 Unported |
| CC BY 3.0 DE | Q62619894 | Creative-Commons-Lizenz „Namensnennung 3.0 Deutschland“ | Creative Commons Attribution 3.0 Germany |
| CC BY 4.0 | Q20007257 | Creative Commons Namensnennung 4.0 International | Creative Commons Attribution 4.0 International |
| CC BY-SA | Q6905942 | Creative Commons Namensnennung – Weitergabe unter gleichen Bedingungen | Creative Commons Attribution-ShareAlike |
| CC BY-SA 3.0 | Q14946043 | Creative Commons Namensnennung – Weitergabe unter gleichen Bedingungen 3.0 Generisch | Creative Commons Attribution-ShareAlike 3.0 Unported |
| CC BY-SA 3.0 DE | Q42716613 | Creative Commons Namensnennung – Weitergabe unter gleichen Bedingungen 3.0 Deutschland | Creative Commons Attribution-ShareAlike 3.0 Germany |
| CC BY-SA 4.0 | Q18199165 | Creative Commons Namensnennung – Weitergabe unter gleichen Bedingungen 4.0 international | Creative Commons Attribution-ShareAlike 4.0 International |



Part B

Create Wikidata schema in OpenRefine

After the reconciliation process has been completed, you should now see a green bar on top of all reconciled columns. If you haven't reached this stage yet, or you would like to keep working from this stage on, you can import the project at its current status here: datasets_Reconciled_noSchema.tar.gz

Note: In the following image, the columns we worked with (Objekttypen 1, Objekttypen 2 and Lizenz der Metadaten) have been reconciled against the English Wikidata reconciliation service. This is why you see the results in English (rather than in German).

reconciledData

Our data is now ready to be uploaded. However, we first need to create a schema that maps our data to the Wikidata ontology/model (as already defined by the project team).

Create a new schema

In order to create the Wikidata schema, we first need to select the option Edit Wikibase schema in the Extensions drop-down menu:

schemaMenu

You should now be able to see something similar to the image below. What you see are the column names (the reconciled ones have a green bar underneath them):

emptySchema

We are now ready to start creating the schema, and for this we need first to select the option + add item.

Understanding the Wikidata schema architecture

After clicking on + add item, we see the main skeleton of the schema, which includes the three main components of a Wikidata item:

  • Wikidata instance (title)
  • Terms
  • Statements

emptySchema2

These areas correspond to the main areas of a Wikidata item, as we can see in the example below:

itemExample

Terms

By expanding the Terms section of the schema, we can see that we have the possibility to add values for Label, Description and Alias.

emptySchema3

These terms, together with their language (lang), correspond to the respective terms section of a Wikidata item:

itemExample_2

Statements

The Statements section corresponds to the statements that constitute the core of a Wikidata item's description. They follow the triple principle of Subject (Item) → Predicate (Property) → Object (Item/Value). Many such statements can be refined with qualifiers, which add information and narrow down the stated concepts. The following example on the English writer Douglas Adams illustrates this structure:

statements
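The triple-plus-qualifiers structure can be sketched as a small data type. This is a simplified illustration and not Wikidata's actual data model; the class name `Statement`, the placeholder subject `Q_PLACEHOLDER` and the example URL are assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One statement: subject (item), property, value, plus optional qualifiers."""
    subject: str
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)  # qualifier property -> value


# Douglas Adams (Q42) is an instance of (P31) human (Q5)
plain = Statement(subject="Q42", prop="P31", value="Q5")

# A statement refined by a qualifier; subject and URL are placeholders
qualified = Statement(
    subject="Q_PLACEHOLDER",
    prop="P973",                  # described at URL
    value="https://example.org/dataset",
    qualifiers={"P407": "Q188"},  # language of work or name: German
)

print(plain)
print(qualified.qualifiers)
```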

Fill the schema

The schema is used to describe how the data in each row is mapped to Wikidata property values.

Reconciled items

First, add an item to the schema and drag and drop the reconciled column Datenset (reconciled) onto it (since we are creating a Wikidata item for each dataset).

create-wikidata-schema-part-1

Terms

Now add terms to the schema to specify the Wikidata item label and description for different languages.

| Schema terms | Language | Term values |
| --- | --- | --- |
| Label | en | column: Datenset [en] |
| Label | de | column: Datenset [de] |
| Description | en | column: Beschreibung [en] |
| Description | de | column: Beschreibung [de] |

create-wikidata-schema-part-2

Statements

Now add statements to the schema to specify which Wikidata properties or qualifiers the column values of each row should be assigned to. You can also specify fixed values for Wikidata properties and qualifiers that all items should have in common.

| Property | Property values | Qualifier | Qualifier values |
| --- | --- | --- | --- |
| instance of (P31) | fixed value: open data (Q309901) | - | - |
| ~ | fixed value: data set (Q1172284) | - | - |
| creator (P170) | column: Institution (reconciled) | - | - |
| described at URL (P973) | column: URL (CdV) | language of work or name (P407) | German (Q188) |

create-wikidata-schema-part-3
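To see how such a schema turns one row into terms and statements, consider the following Python sketch. It mimics the idea only and is not how OpenRefine evaluates a schema internally; the function `apply_schema`, the sample row, the placeholder ID `Q_PLACEHOLDER` and the example URL are assumptions, while the column names and property IDs are taken from the terms and statements tables in this section.

```python
# (term kind, language, source column)
SCHEMA_TERMS = [
    ("label", "en", "Datenset [en]"),
    ("label", "de", "Datenset [de]"),
    ("description", "en", "Beschreibung [en]"),
    ("description", "de", "Beschreibung [de]"),
]

# (property, fixed value): identical for every row
FIXED_STATEMENTS = [("P31", "Q309901"), ("P31", "Q1172284")]

# (property, source column, qualifiers)
COLUMN_STATEMENTS = [
    ("P170", "Institution (reconciled)", {}),
    ("P973", "URL (CdV)", {"P407": "Q188"}),
]


def apply_schema(row):
    """Map one row (column name -> cell value) to terms and statements; skip empty cells."""
    terms = [(kind, lang, row[col]) for kind, lang, col in SCHEMA_TERMS if row.get(col)]
    statements = [(prop, value, {}) for prop, value in FIXED_STATEMENTS]
    statements += [
        (prop, row[col], quals)
        for prop, col, quals in COLUMN_STATEMENTS
        if row.get(col)
    ]
    return {"terms": terms, "statements": statements}


# Hypothetical row; the German description is deliberately missing
row = {
    "Datenset [en]": "Example dataset",
    "Datenset [de]": "Beispieldatensatz",
    "Beschreibung [en]": "An example description",
    "Institution (reconciled)": "Q_PLACEHOLDER",
    "URL (CdV)": "https://example.org/dataset",
}
result = apply_schema(row)
print(len(result["terms"]), len(result["statements"]))  # 3 4
```

Fixed values are emitted for every row, whereas column-based values are only emitted when the cell is filled, which matches the distinction between "fixed value" and "column" in the table.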

After the schema has been completed, your screen should look like this (if necessary, open the image in a separate tab to view it at a larger size):

image

Note: The above schema is only part of the whole schema to be used when uploading the CdV datasets to Wikidata. If you want to work on the rest of the schema, use the data model described in the WikiProject as your basis for creating statements and adding qualifiers where necessary. You can also check the final version of the project file (datasets_Reconciled_withSchema.tar.gz) to see what the final schema* should look like in OpenRefine.

*The workshop data has in fact already been uploaded to Wikidata. If you want to re-upload the data (see section below) with the complete schema, you are welcome to do so. However, you are kindly asked to remove the Titelbild(Filename) column from the project as well as the image (P18) statement from the schema, because re-uploading images to Wikidata is currently buggy. Thank you!


Upload data

In order to upload edits to Wikidata, we first need to select the option Upload edits to Wikibase... in the Extensions drop-down menu:

image

If you are not logged in to Wikidata yet, a login popup will open.

image

Enter your Wikidata account login data at Username and Password and then click Log in.

An info popup will be displayed, which you can skip for now by clicking Ok.

image

Before uploading edits to Wikidata, the number of edits and the warnings ignored so far are displayed as a precaution.

image

If you are sure that the edits can be uploaded, enter a meaningful description of the edit in the summary field, e.g. "Upload Coding da Vinci datasets", and then click Upload edits. That's all.