dataset.qmd
Lines changed: 9 additions & 8 deletions
@@ -6,11 +6,11 @@ title: "Our Running Example"
 
 {width="370"}
 
-This workshop utilizes the **streaming-master-messy** comma-separated value (CSV) file which is derived from the movies and TV shows featured by major streaming services and distributed in Kaggle Project under a CC0 Public License:
+This workshop utilizes the **streaming-master-messy** comma-separated value (CSV) file which is derived from the movies and TV shows featured by major streaming services and distributed in Kaggle Project under a CC0 Public License[^1].
 
-Henrique, D. (2020). *A simple movie & TV show recommendation system*. Kaggle. <https://www.kaggle.com/code/dgoenrique/a-simple-movie-tv-show-recommendation-system?select=credits.csv>
+[^1]: Henrique, D. (2020). *A simple movie & TV show recommendation system*. Kaggle. <https://www.kaggle.com/code/dgoenrique/a-simple-movie-tv-show-recommendation-system?select=credits.csv>
 
-We have merged six `titles.csv` files—each representing one of the streaming services featured in this project (Amazon Prime Video, Apple TV+, Disney+, HBO Max, Netflix, and Paramount)—into a single master dataset.
+We have merged six `titles.csv` files—each representing one of the streaming services featured in this project (Amazon Prime Video, Apple TV+, Disney+, HBO Max, Netflix, and Paramount)—into a single master spreadsheet.
 
 The dataset contains 25,223 rows with movies and TV series titles along with the following variables as described in the data dictionary:
 
@@ -33,11 +33,6 @@ The dataset contains 25,223 rows with movies and TV series titles along with the
 - tmdb_popularity: Votes on The Movie Database (TMDB).
 - tmdb_score: Score on The Movie Database (TMDB).
 
-::: {.callout-important collapse="true"}
-## Disclaimer
-
-Please note that, for the purposes of this lesson, the data has been intentionally modified to support the associated exercises. Therefore, we do not vouch for the use of this dataset for actual research. The data has been specifically edited and curated for instructional purposes and may not represent a fully accurate or comprehensive source of data for formal analysis.
-:::
 
 ## Downloading the Dataset
 
@@ -49,6 +44,12 @@ Now that we have a clearer understanding of the data we'll be working with, plea
 
 Let's open the file and check what the data looks like. Also, can you spot your favorite movie or TV series in it?
 
+::: {.callout-important collapse="true"}
+## Disclaimer
+
+Please note that, for the purposes of this lesson, the data has been intentionally modified to support the associated exercises. Therefore, we do not vouch for the use of this dataset for actual research. The data has been specifically edited and curated for instructional purposes and may not represent a fully accurate or comprehensive source of data for formal analysis.
+:::
+
 ## Our Challenge
 
 In this workshop, we will explore how OpenRefine can support data organization and preparation for analysis. For instance, you might want to compare scores across genres, plot the most common age classifications over the years, or investigate whether the country of origin affects popularity. These are just a few examples of the kinds of insights you could uncover once your data is properly cleaned and organized. Before any of that, though, the data has to be cleaned and prepared. Ready?
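
Reviewer note: the changed text says six `titles.csv` files were merged into one master spreadsheet, but the diff does not show how. The sketch below is one possible way to do such a merge with Python's standard `csv` module. It is not the authors' actual procedure. The function name `merge_titles`, the added `service` column, and any file paths are hypothetical and used only for illustration.

```python
import csv
from pathlib import Path


def merge_titles(sources: dict[str, Path], out_path: Path) -> int:
    """Concatenate per-service CSV files into one CSV; return the data-row count."""
    rows: list[dict[str, str]] = []
    fieldnames: list[str] = []
    for service, path in sources.items():
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Keep the union of all column names, in first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                # Hypothetical provenance column; not part of the source data.
                row["service"] = service
                rows.append(row)
    if "service" not in fieldnames:
        fieldnames.append("service")
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        # Columns missing from a given file are written as empty strings;
        # stray extra fields on malformed rows are ignored.
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The returned row count could be compared with the 25,223 rows stated in the file as a sanity check, assuming the merge involved no filtering or deduplication, which the diff does not confirm.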