Commit 223bac8

some formatting
1 parent 3fb9316 commit 223bac8

3 files changed

+12
-11
lines changed

exploring.qmd

Lines changed: 9 additions & 9 deletions
@@ -2,15 +2,15 @@
 title: "Exploring Data Types & Facets"
 ---
 
-Now, let’s dive into one of the most powerful and defining features of OpenRefine—facets—which also explains why its logo is shaped like a diamond.
+Now, let’s dive into one of the most powerful and defining features of OpenRefine—**facets**—which also explains why its logo is shaped like a diamond.
 
-*Faceting* is a method of exploring and filtering data to better understand its structure and content. It allows us to more easily spots errors and outliers in the data. By applying multiple filters, you can quickly uncover patterns, spot inconsistencies, and isolate specific subsets of data for closer inspection or bulk editing. A facet groups together all the similar values within a column, allowing you to easily filter and refine your dataset. It’s also incredibly useful for editing values across many records at once.
+*Faceting* is a method of exploring and filtering data to better understand its structure and content. It allows us to more easily spots errors and outliers in the data. By applying multiple filters, you can quickly uncover patterns and inconsistencies, and isolate specific subsets of data for closer inspection or bulk editing. A facet groups together all the similar values within a column, allowing you to easily filter and refine your dataset. It’s also incredibly useful for editing values across many records at once.
 
 ## Text Facets
 
 One type of facet is called a ‘Text facet’. This groups all the identical text values in a column and lists each value with the number of records it appears in. The facet information always appears in the left hand panel in the OpenRefine interface.
 
-After applying a text facet to the `title` column, you'll notice that the number of unique choices differs from the total number of rows. This happens because a single title may appear across multiple streaming platforms, and also because different productions can share the same name. Let’s sort the facet by count to explore how often titles like `Cinderella` appear. Just a heads-up: we don’t have any duplicate title entries to clean up in this dataset.
+After applying a text facet to the **title** column, you'll notice that the number of unique choices differs from the total number of rows. This happens because a single title may appear across multiple streaming platforms, and also because different productions can share the same name. Let’s sort the facet by count to explore how often titles like **Cinderella** appear. Just a heads-up: we don’t have any duplicate title entries to clean up in this dataset.
 
 ::: columns
 ::: {.column width="50%"}
@@ -26,7 +26,7 @@ As you can see, a text facet can come in handy for providing some quick insights
 
 ### Checking For Errors
 
-Let’s now turn our attention to the `classification` column. This column should represent the ratings from [The Classification and Rating Administration (CARA)](https://www.filmratings.com). How many unique values are represented in the dataset? You should find 18 distinct entries—correct? It appears that the original data collector could have enforced validation rules to ensure consistency in this field, but since that wasn’t done, we can use faceted analysis to identify inconsistencies or errors. How many could you identify? Have you noticed the question marks and NAs? What approach would you take to clean and standardize these entries?
+Let’s now turn our attention to the **classification** column. This column should represent the ratings from [The Classification and Rating Administration (CARA)](https://www.filmratings.com). How many unique values are represented in the dataset? You should find 18 distinct entries—correct? It appears that the original data collector could have enforced validation rules to ensure consistency in this field, but since that wasn’t done, we can use faceted analysis to identify inconsistencies or errors. How many could you identify? Have you noticed the question marks and NAs? What approach would you take to clean and standardize these entries?
 
 Since those entries represent either non-existent data or uncertainty—and given that we already have many productions with unknown classifications (13,132)—we might consider setting those values to blank as well. To do this, we can first run a text facet and select the entries we want to amend:
 
@@ -38,11 +38,11 @@ Now that we have included those six choices and eight entries, we can apply a \`
 
 Now, let's run another text facet for the same column and check how many choices we have. But wait, have you noticed the empty string option? An empty string will work the same as null for most purposes. Can you guess when it will not be the case? An empty string (`""`) will not behave the same as null when you use functions or filters that specifically distinguish between null and empty. Setting it as `null` represent "no data", making it more explicit the data is missing, unknown or not applicable. But let's not get too sidetracked. We will cover more about transformations in the next episodes.
 
-Now we will use faceting to look for potential errors in data entry in the `streaming` column. First, Scroll over to the `streaming` column and then, click the down arrow and choose `Facet` \> `Text facet`.
+Now we will use faceting to look for potential errors in data entry in the **streaming** column. First, Scroll over to the **streaming** column and then, click the down arrow and choose `Facet` \> `Text facet`.
 
 ![](images/text-facet-menu.png)
 
-Alright, in the left panel, you should now see a box containing every unique value in the `streaming` column along with a number representing how many times that value occurs in the column.
+Alright, in the left panel, you should now see a box containing every unique value in the **streaming** column along with a number representing how many times that value occurs in the column.
 
 ![](images/streaming-facet.png)
 
@@ -82,7 +82,7 @@ Only observations that include only numerals (0-9) can be transformed to numbers
 
 For columns with numeric values we can convert it using the **`Edit cells > Common transforms`**. In this section, we’ll experiment with converting columns to numbers and explore the additional features and functionality this unlocks.
 
-Sometimes there are non-number values or blanks in a column which may represent errors in data entry and we want to find them. We can do that with a `Numeric facet`. So if you try to create a `numeric facet` for the column `release_year`. The facet will be empty because OpenRefine sees those values as text strings.
+Sometimes there are non-number values or blanks in a column which may represent errors in data entry and we want to find them. We can do that with a `Numeric facet`. So if you try to create a numeric facet for the column **release_year**. The facet will be empty because OpenRefine sees those values as text strings.
 
 To transform cells into numbers, click the down arrow for that column, then `Edit cells` \> `Common transforms…` \> `To number`. You will notice the values will change from left-justified to right-justified, and black to green color.
 
@@ -102,7 +102,7 @@ Performing a numeric facet will display a histogram of the number of entries in
 
 The "date" type in OpenRefine is created when a column is explicitly transformed into dates—either by applying a built-in expression, using `Edit cells → Common transforms → To date`, or manually setting individual cells to the "date" data type.
 
-Let’s take the release_year column as an example. When you apply the To date transformation, OpenRefine attempts to convert each cell into a standardized date format. It uses the ISO 8601 extended format with time in UTC, which looks like this: `YYYY-MM-DDTHH:MM:SSZ`
+Let’s take the release_year column as an example. When you apply the To date transformation, OpenRefine attempts to convert each cell into a standardized date format. It uses the ISO 8601 extended format with time in UTC, which looks like this: **YYYY-MM-DDTHH:MM:SSZ**
 
 If the original values are just four-digit years, like:
 
@@ -135,6 +135,6 @@ We've covered the most commonly used facets in OpenRefine, but the platform also
 
 ## Facets for Subsetting Working Dataset
 
-Facets can be also very handy to subset the dataset and make it more easily manageable. Let's say you want to focus only on the shows for a while, or even export this subset. How would you do that? You can perform a text facet for the column `type` and select *show*, you will notice the number of matching rows in the grid header will change accordingly. From now on, you will be only working with those rows, unless you click revert or close the facet panel.
+Facets can be also very handy to subset the dataset and make it more easily manageable. Let's say you want to focus only on the shows for a while, or even export this subset. How would you do that? You can perform a text facet for the column **type** and select **show**, you will notice the number of matching rows in the grid header will change accordingly. From now on, you will be only working with those rows, unless you click revert or close the facet panel.
 
 If you want to export that subset of shows click export on the right side of the project bar and select your preferred format. As a reminder, the **permalink** will save all active facets for your project!

features.qmd

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ Titles that meet *both* criteria (e.g. popular but missing age ratings) will now
 
 OpenRefine is primarily designed for data cleaning rather than adding new columns or rows. However, it does provide essential tools for reorganizing your dataset, including renaming, reordering, and deleting columns.
 
-For example, we want to remove the `imdb_votes` and `tmdb_votes` columns. This can be done in one of two ways: by deleting each column individually or by using the "All" column dropdown to manage multiple columns at once. Choose the method that works best for you to exclude them:
+For example, we want to remove the **imdb_votes** and **tmdb_votes** columns. This can be done in one of two ways: by deleting each column individually or by using the "All" column dropdown to manage multiple columns at once. Choose the method that works best for you to exclude them:
 
 ::: columns
 ::: {.column width="50%"}

project.qmd

Lines changed: 2 additions & 1 deletion
@@ -28,7 +28,8 @@ We’ll stick with the default settings for now—they typically work well for m
 
 ::: {.callout-important collapse="true" icon="true"}
 ## Color and Other Formatting
-Keep in mind that OpenRefine **does not** retain any original formatting from your file. Elements like cell colors, font styles, or background shading will be lost during import. Hyperlinked text will appear as plain text, though OpenRefine will detect any URLs and make them clickable within the project interface.
+
+Keep in mind that OpenRefine **does not** retain any original formatting from your file. Elements like cell colors, font styles, or background shading will be lost during import. Hyperlinked text will appear as plain text, though OpenRefine will detect any URLs and make them clickable within the project interface.
 
 That said, relying on visual formatting or emphasis—like colors or bold text—to convey important meaning isn’t recommended in data management. These elements aren't machine-readable and can lead to inconsistencies or misinterpretation during analysis. It's always better to encode meaning directly in the data using clearly labeled columns or consistent values.
 :::
