exploring.qmd (9 additions & 9 deletions)
@@ -2,15 +2,15 @@
2
2
title: "Exploring Data Types & Facets"
3
3
---
4
4
5
-
Now, let’s dive into one of the most powerful and defining features of OpenRefine—facets—which also explains why its logo is shaped like a diamond.
5
+
Now, let’s dive into one of the most powerful and defining features of OpenRefine—**facets**—which also explains why its logo is shaped like a diamond.
6
6
7
-
*Faceting* is a method of exploring and filtering data to better understand its structure and content. It allows us to more easily spots errors and outliers in the data. By applying multiple filters, you can quickly uncover patterns, spot inconsistencies, and isolate specific subsets of data for closer inspection or bulk editing. A facet groups together all the similar values within a column, allowing you to easily filter and refine your dataset. It’s also incredibly useful for editing values across many records at once.
7
+
*Faceting* is a method of exploring and filtering data to better understand its structure and content. It allows us to more easily spot errors and outliers in the data. By applying multiple filters, you can quickly uncover patterns and inconsistencies, and isolate specific subsets of data for closer inspection or bulk editing. A facet groups together all the similar values within a column, allowing you to easily filter and refine your dataset. It’s also incredibly useful for editing values across many records at once.
8
8
9
9
## Text Facets
10
10
11
11
One type of facet is called a ‘Text facet’. This groups all the identical text values in a column and lists each value with the number of records it appears in. The facet information always appears in the left-hand panel of the OpenRefine interface.
12
12
13
-
After applying a text facet to the `title` column, you'll notice that the number of unique choices differs from the total number of rows. This happens because a single title may appear across multiple streaming platforms, and also because different productions can share the same name. Let’s sort the facet by count to explore how often titles like `Cinderella` appear. Just a heads-up: we don’t have any duplicate title entries to clean up in this dataset.
13
+
After applying a text facet to the **title** column, you'll notice that the number of unique choices differs from the total number of rows. This happens because a single title may appear across multiple streaming platforms, and also because different productions can share the same name. Let’s sort the facet by count to explore how often titles like **Cinderella** appear. Just a heads-up: we don’t have any duplicate title entries to clean up in this dataset.
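
As a rough analogy outside OpenRefine, a text facet behaves like a frequency count over one column. The sketch below uses Python's `collections.Counter` on a small list of titles; the sample values are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical sample of a "title" column; the values are invented for illustration.
titles = ["Cinderella", "Heidi", "Cinderella", "Robin Hood", "Cinderella", "Heidi"]

# A text facet lists each distinct value with the number of rows it appears in.
facet = Counter(titles)

# Sorting by count mirrors the "sort by count" option in the facet panel.
for value, count in facet.most_common():
    print(f"{value}: {count}")

# The number of choices (distinct values) can be smaller than the number of rows.
print(len(facet), "choices across", len(titles), "rows")
```

Here six rows collapse into three choices, which is the same effect you see when the facet reports fewer choices than the total row count.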
14
14
15
15
::: columns
16
16
::: {.column width="50%"}
@@ -26,7 +26,7 @@ As you can see, a text facet can come in handy for providing some quick insights
26
26
27
27
### Checking For Errors
28
28
29
-
Let’s now turn our attention to the `classification` column. This column should represent the ratings from [The Classification and Rating Administration (CARA)](https://www.filmratings.com). How many unique values are represented in the dataset? You should find 18 distinct entries—correct? It appears that the original data collector could have enforced validation rules to ensure consistency in this field, but since that wasn’t done, we can use faceted analysis to identify inconsistencies or errors. How many could you identify? Have you noticed the question marks and NAs? What approach would you take to clean and standardize these entries?
29
+
Let’s now turn our attention to the **classification** column. This column should represent the ratings from [The Classification and Rating Administration (CARA)](https://www.filmratings.com). How many unique values are represented in the dataset? You should find 18 distinct entries—correct? It appears that the original data collector could have enforced validation rules to ensure consistency in this field, but since that wasn’t done, we can use faceted analysis to identify inconsistencies or errors. How many could you identify? Have you noticed the question marks and NAs? What approach would you take to clean and standardize these entries?
30
30
31
31
Since those entries represent either non-existent data or uncertainty—and given that we already have many productions with unknown classifications (13,132)—we might consider setting those values to blank as well. To do this, we can first run a text facet and select the entries we want to amend:
32
32
@@ -38,11 +38,11 @@ Now that we have included those six choices and eight entries, we can apply a \`
38
38
39
39
Now, let's run another text facet for the same column and check how many choices we have. But wait, have you noticed the empty string option? An empty string will work the same as null for most purposes. Can you guess when it will not be the case? An empty string (`""`) will not behave the same as null when you use functions or filters that specifically distinguish between null and empty. Setting the value to `null` represents "no data", making it more explicit that the data is missing, unknown, or not applicable. But let's not get too sidetracked. We will cover more about transformations in the next episodes.
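
The same distinction exists in most programming languages. The sketch below is a Python analogy, with `None` standing in for null; it does not reproduce OpenRefine's own behavior, and the cell values are hypothetical.

```python
# Hypothetical "classification" cells: a real value, an empty string, and a missing value.
cells = ["PG-13", "", None]

# Both "" and None are falsy, so a simple truthiness check treats them alike.
blank_like = [not cell for cell in cells]

# An explicit identity test separates "no data" (None) from an empty string.
is_null = [cell is None for cell in cells]

print(blank_like)  # [False, True, True]
print(is_null)     # [False, False, True]
```

A loose check lumps the two together, while a check that asks specifically for null tells them apart.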
40
40
41
-
Now we will use faceting to look for potential errors in data entry in the `streaming` column. First, Scroll over to the `streaming` column and then, click the down arrow and choose `Facet`\>`Text facet`.
41
+
Now we will use faceting to look for potential errors in data entry in the **streaming** column. First, scroll over to the **streaming** column, then click the down arrow and choose `Facet`\>`Text facet`.
42
42
43
43

44
44
45
-
Alright, in the left panel, you should now see a box containing every unique value in the `streaming` column along with a number representing how many times that value occurs in the column.
45
+
Alright, in the left panel, you should now see a box containing every unique value in the **streaming** column along with a number representing how many times that value occurs in the column.
46
46
47
47

48
48
@@ -82,7 +82,7 @@ Only observations that include only numerals (0-9) can be transformed to numbers
82
82
83
83
For columns with numeric values, we can convert them using **`Edit cells > Common transforms`**. In this section, we’ll experiment with converting columns to numbers and explore the additional features and functionality this unlocks.
84
84
85
-
Sometimes there are non-number values or blanks in a column which may represent errors in data entry and we want to find them. We can do that with a `Numeric facet`. So if you try to create a `numeric facet` for the column `release_year`. The facet will be empty because OpenRefine sees those values as text strings.
85
+
Sometimes there are non-number values or blanks in a column which may represent errors in data entry and we want to find them. We can do that with a `Numeric facet`. If you try to create a numeric facet for the column **release_year**, the facet will be empty because OpenRefine sees those values as text strings.
86
86
87
87
To transform cells into numbers, click the down arrow for that column, then `Edit cells`\>`Common transforms…`\>`To number`. You will notice the values change from left-justified to right-justified, and from black to green.
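
To illustrate the idea of converting only the cells that look like numbers, here is a minimal Python sketch. The `to_number` helper and the sample values are hypothetical, and the digits-only rule is a simplification; OpenRefine's own conversion logic is not reproduced here.

```python
def to_number(cell: str):
    """Return an int when the cell is made of ASCII digits only; otherwise keep the text."""
    stripped = cell.strip()
    if stripped.isascii() and stripped.isdigit():
        return int(stripped)
    return cell


# Hypothetical "release_year" cells, all stored as text before the transform.
release_year = ["1999", "2005", "unknown", "", "2021"]

converted = [to_number(cell) for cell in release_year]

# Cells that stayed as text are the ones worth inspecting for data entry errors.
leftover = [cell for cell in converted if not isinstance(cell, int)]

print(converted)
print(leftover)
```

The cells left over after conversion play the same role as the non-numeric and blank groups you would look for in a numeric facet.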
88
88
@@ -102,7 +102,7 @@ Performing a numeric facet will display a histogram of the number of entries in
102
102
103
103
The "date" type in OpenRefine is created when a column is explicitly transformed into dates—either by applying a built-in expression, using `Edit cells → Common transforms → To date`, or manually setting individual cells to the "date" data type.
104
104
105
-
Let’s take the release_year column as an example. When you apply the To date transformation, OpenRefine attempts to convert each cell into a standardized date format. It uses the ISO 8601 extended format with time in UTC, which looks like this: `YYYY-MM-DDTHH:MM:SSZ`
105
+
Let’s take the release_year column as an example. When you apply the To date transformation, OpenRefine attempts to convert each cell into a standardized date format. It uses the ISO 8601 extended format with time in UTC, which looks like this: **YYYY-MM-DDTHH:MM:SSZ**
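
To make the pattern concrete, the sketch below builds a string in that format with Python's standard `datetime` module. The `year_to_iso` helper is hypothetical, and it assumes that a bare year is read as January 1 at midnight UTC.

```python
from datetime import datetime, timezone


def year_to_iso(year_text: str) -> str:
    # Assumption: a bare year maps to January 1 at midnight UTC.
    moment = datetime(int(year_text), 1, 1, tzinfo=timezone.utc)
    # ISO 8601 extended format with a trailing "Z" for UTC.
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


print(year_to_iso("2019"))  # 2019-01-01T00:00:00Z
```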
106
106
107
107
If the original values are just four-digit years, like:
108
108
@@ -135,6 +135,6 @@ We've covered the most commonly used facets in OpenRefine, but the platform also
135
135
136
136
## Facets for Subsetting Working Dataset
137
137
138
-
Facets can be also very handy to subset the dataset and make it more easily manageable. Let's say you want to focus only on the shows for a while, or even export this subset. How would you do that? You can perform a text facet for the column `type` and select *show*, you will notice the number of matching rows in the grid header will change accordingly. From now on, you will be only working with those rows, unless you click revert or close the facet panel.
138
+
Facets can also be very handy for subsetting the dataset and making it more manageable. Let's say you want to focus only on the shows for a while, or even export this subset. How would you do that? You can perform a text facet for the column **type** and select **show**; you will notice the number of matching rows in the grid header will change accordingly. From now on, you will only be working with those rows, unless you click revert or close the facet panel.
139
139
140
140
If you want to export that subset of shows, click export on the right side of the project bar and select your preferred format. As a reminder, the **permalink** will save all active facets for your project!
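
For readers who think in code, subsetting with a facet and then exporting is comparable to filtering rows and writing out only the matches. The Python sketch below uses invented rows and an in-memory buffer in place of a file; the titles and the `movie` value are hypothetical.

```python
import csv
import io

# Hypothetical rows; titles and type values are invented for illustration.
rows = [
    {"title": "Example Show A", "type": "show"},
    {"title": "Example Movie B", "type": "movie"},
    {"title": "Example Show C", "type": "show"},
]

# Selecting "show" in a text facet on "type" keeps only the matching rows.
shows = [row for row in rows if row["type"] == "show"]

# Exporting writes just the rows that currently match.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "type"])
writer.writeheader()
writer.writerows(shows)

print(buffer.getvalue())
```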
features.qmd (1 addition & 1 deletion)
@@ -50,7 +50,7 @@ Titles that meet *both* criteria (e.g. popular but missing age ratings) will now
50
50
51
51
OpenRefine is primarily designed for data cleaning rather than adding new columns or rows. However, it does provide essential tools for reorganizing your dataset, including renaming, reordering, and deleting columns.
52
52
53
-
For example, we want to remove the `imdb_votes` and `tmdb_votes` columns. This can be done in one of two ways: by deleting each column individually or by using the "All" column dropdown to manage multiple columns at once. Choose the method that works best for you to exclude them:
53
+
For example, we want to remove the **imdb_votes** and **tmdb_votes** columns. This can be done in one of two ways: by deleting each column individually or by using the "All" column dropdown to manage multiple columns at once. Choose the method that works best for you to exclude them:
Keep in mind that OpenRefine **does not** retain any original formatting from your file. Elements like cell colors, font styles, or background shading will be lost during import. Hyperlinked text will appear as plain text, though OpenRefine will detect any URLs and make them clickable within the project interface.
31
+
32
+
Keep in mind that OpenRefine **does not** retain any original formatting from your file. Elements like cell colors, font styles, or background shading will be lost during import. Hyperlinked text will appear as plain text, though OpenRefine will detect any URLs and make them clickable within the project interface.
32
33
33
34
That said, relying on visual formatting or emphasis—like colors or bold text—to convey important meaning isn’t recommended in data management. These elements aren't machine-readable and can lead to inconsistencies or misinterpretation during analysis. It's always better to encode meaning directly in the data using clearly labeled columns or consistent values.