@@ -10,12 +10,25 @@ This repository contains the methods used to identify over 1.4 billion Creative
Commons licensed works. The challenge is that these works are dispersed
throughout the web and identifying them requires a combination of techniques.

- Two approaches are currently in use:
+ Currently, we only pull data from APIs which serve Creative Commons licensed media.
+ In the past, we have also used web crawl data as a source.

- 1. Web crawl data
- 2. Application Programming Interfaces (API Data)
+ ## API Data
+
+ [Apache Airflow](https://airflow.apache.org/) is used to manage the workflow for
+ various API ETL jobs which pull and process data from a number of open APIs on
+ the internet.
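In practice, each of these workflows is an Airflow DAG wrapping a provider's pull-and-process logic. The snippet below is a minimal illustrative sketch only, not code from this repository; the DAG id, schedule, and `pull_provider_records` helper are hypothetical stand-ins (the real DAG definitions live under `openverse_catalog/dags/providers/`):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def pull_provider_records(**context):
    """Hypothetical ETL step: call a provider's open API, normalize the
    returned metadata, and hand the records off to the loader."""
    ...


# Minimal sketch of a provider ingestion workflow.
with DAG(
    dag_id="example_provider_workflow",  # hypothetical provider
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="pull_provider_records",
        python_callable=pull_provider_records,
    )
```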
- ## Web Crawl Data
+ ### API Workflows
+
+ To view more information about all the available workflows (DAGs) within the project,
+ see [DAGs.md](DAGs.md).
+
+ See each provider API script's notes in their respective [handbook][ov-handbook] entry.
+
+ [ov-handbook]: https://make.wordpress.org/openverse/handbook/openverse-handbook/
+
+ ## Web Crawl Data (retired)

The Common Crawl Foundation provides an open repository of petabyte-scale web
crawl data. A new dataset is published at the end of each month comprising over
@@ -31,10 +44,10 @@ The data is available in three file formats:
For more information about these formats, please see the
[Common Crawl documentation][ccrawl_doc].

- Openverse Catalog uses AWS Data Pipeline service to automatically create an Amazon EMR
- cluster of 100 c4.8xlarge instances that will parse the WAT archives to identify
+ Openverse Catalog used the AWS Data Pipeline service to automatically create an Amazon EMR
+ cluster of 100 c4.8xlarge instances that parsed the WAT archives to identify
all domains that link to creativecommons.org. Due to the volume of data, Apache
- Spark is used to streamline the processing. The output of this methodology is a
+ Spark was also used to streamline the processing. The output of this methodology was a
series of parquet files that contain:

- the domain and its respective content path and query string (i.e. the exact
@@ -45,26 +58,13 @@ series of parquet files that contain:
- the location of the webpage in the WARC file so that the page contents can be
  found.

- The steps above are performed in [`ExtractCCLinks.py`][ex_cc_links].
+ The steps above were performed in [`ExtractCCLinks.py`][ex_cc_links].
+
+ This method was retired in 2021.
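For illustration only (the bucket path and column names below are hypothetical, not the actual schema produced by `ExtractCCLinks.py`), the parquet output described above could be inspected with PySpark along these lines:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect_cc_links").getOrCreate()

# Hypothetical location of the parquet files described above.
links = spark.read.parquet("s3://example-bucket/commoncrawl/cc_links/")

# Hypothetical column names: one row per page that links to creativecommons.org,
# carrying the source domain, its content path and query string, and the WARC
# location at which the page contents can be found.
links.printSchema()
links.select("provider_domain", "content_path", "content_query_string").show(
    5, truncate=False
)
```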
[ccrawl_doc]: https://commoncrawl.org/the-data/get-started/
[ex_cc_links]: archive/ExtractCCLinks.py

- ## API Data
-
- [Apache Airflow](https://airflow.apache.org/) is used to manage the workflow for
- various API ETL jobs which pull and process data from a number of open APIs on
- the internet.
-
- ### API Workflows
-
- To view more information about all the available workflows (DAGs) within the project,
- see [DAGs.md](DAGs.md).
-
- See each provider API script's notes in their respective [handbook][ov-handbook] entry.
-
- [ov-handbook]: https://make.wordpress.org/openverse/handbook/openverse-handbook/
-
## Development setup for Airflow and API puller scripts

There are a number of scripts in the directory
@@ -224,12 +224,13 @@ openverse-catalog
├── openverse_catalog/ # Primary code directory
│ ├── dags/ # DAGs & DAG support code
│ │ ├── common/ # - Shared modules used across DAGs
- │ │ ├── commoncrawl/ # - DAGs & scripts for commoncrawl parsing
+ │ │ ├── data_refresh/ # - DAGs & code related to the data refresh process
│ │ ├── database/ # - DAGs related to database actions (matview refresh, cleaning, etc.)
│ │ ├── maintenance/ # - DAGs related to airflow/infrastructure maintenance
│ │ ├── oauth2/ # - DAGs & code for Oauth2 key management
│ │ ├── providers/ # - DAGs & code for provider ingestion
│ │ │ ├── provider_api_scripts/ # - API access code specific to providers
+ │ │ │ ├── provider_csv_load_scripts/ # - Schema initialization SQL definitions for SQL-based providers
│ │ │ └── *.py # - DAG definition files for providers
│ │ └── retired/ # - DAGs & code that is no longer needed but might be a useful guide for the future
│ └── templates/ # Templates for generating new provider code