
Commit dad3cb4

AetherUnbound, Meet Parekh, and obulat authored
Retire Common Crawl module & DAGs (#870)
* Retired module commoncrawl and retired the commoncrawl_utils test
* updated DAGs.md and test_dag_parsing.py as suggested in #861
* Remove ETL test module, additional documentation cleanup
* Delete more unused test files
* Remove unused testing buckets
* Update README.md

Co-authored-by: Olga Bulat <obulat@gmail.com>
Co-authored-by: Meet Parekh <meetparekh@192.168.1.16>
Co-authored-by: Meet Parekh <meetparekh@192.168.1.9>
1 parent a6f4eab commit dad3cb4

17 files changed (+26 / −89 lines)

DAGs.md

Lines changed: 0 additions & 10 deletions
@@ -14,23 +14,13 @@ The DAGs are shown in two forms:
 
 The following are DAGs grouped by their primary tag:
 
-1. [Commoncrawl](#commoncrawl)
 1. [Data Refresh](#data_refresh)
 1. [Database](#database)
 1. [Maintenance](#maintenance)
 1. [Oauth](#oauth)
 1. [Provider](#provider)
 1. [Provider Reingestion](#provider-reingestion)
 
-## Commoncrawl
-
-| DAG ID | Schedule Interval |
-| --- | --- |
-| `commoncrawl_etl_workflow` | `0 0 * * 1` |
-| `sync_commoncrawl_workflow` | `0 16 15 * *` |
-
-
-
 ## Data Refresh
 
 | DAG ID | Schedule Interval |
README.md

Lines changed: 25 additions & 24 deletions
@@ -10,12 +10,25 @@ This repository contains the methods used to identify over 1.4 billion Creative
 Commons licensed works. The challenge is that these works are dispersed
 throughout the web and identifying them requires a combination of techniques.
 
-Two approaches are currently in use:
+Currently, we only pull data from APIs which serve Creative Commons licensed media.
+In the past, we have also used web crawl data as a source.
 
-1. Web crawl data
-2. Application Programming Interfaces (API Data)
+## API Data
+
+[Apache Airflow](https://airflow.apache.org/) is used to manage the workflow for
+various API ETL jobs which pull and process data from a number of open APIs on
+the internet.
 
-## Web Crawl Data
+### API Workflows
+
+To view more information about all the available workflows (DAGs) within the project,
+see [DAGs.md](DAGs.md).
+
+See each provider API script's notes in their respective [handbook][ov-handbook] entry.
+
+[ov-handbook]: https://make.wordpress.org/openverse/handbook/openverse-handbook/
+
+## Web Crawl Data (retired)
 
 The Common Crawl Foundation provides an open repository of petabyte-scale web
 crawl data. A new dataset is published at the end of each month comprising over
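As a rough illustration of what an API ETL job's pull step looks like, here is a minimal sketch; the endpoint, parameters, and response shape are hypothetical, not taken from any provider script in the repository.

```python
import requests


def pull_page(endpoint: str, page: int, per_page: int = 100) -> list[dict]:
    """Fetch one page of CC-licensed records from a hypothetical provider API."""
    response = requests.get(
        endpoint,
        params={"page": page, "per_page": per_page, "license_type": "all-cc"},
        timeout=30,
    )
    response.raise_for_status()
    # Assumed response shape: {"results": [...]}
    return response.json().get("results", [])


if __name__ == "__main__":
    # Hypothetical endpoint; the real provider scripts live under
    # openverse_catalog/dags/providers/provider_api_scripts/.
    records = pull_page("https://api.example.com/v1/images", page=1)
    print(f"pulled {len(records)} records")
```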
@@ -31,10 +44,10 @@ The data is available in three file formats:
 For more information about these formats, please see the
 [Common Crawl documentation][ccrawl_doc].
 
-Openverse Catalog uses AWS Data Pipeline service to automatically create an Amazon EMR
-cluster of 100 c4.8xlarge instances that will parse the WAT archives to identify
+Openverse Catalog used AWS Data Pipeline service to automatically create an Amazon EMR
+cluster of 100 c4.8xlarge instances that parsed the WAT archives to identify
 all domains that link to creativecommons.org. Due to the volume of data, Apache
-Spark is used to streamline the processing. The output of this methodology is a
+Spark was also used to streamline the processing. The output of this methodology was a
 series of parquet files that contain:
 
 - the domains and its respective content path and query string (i.e. the exact
@@ -45,26 +58,13 @@ series of parquet files that contain:
 - the location of the webpage in the WARC file so that the page contents can be
   found.
 
-The steps above are performed in [`ExtractCCLinks.py`][ex_cc_links].
+The steps above were performed in [`ExtractCCLinks.py`][ex_cc_links].
+
+This method was retired in 2021.
 
 [ccrawl_doc]: https://commoncrawl.org/the-data/get-started/
 [ex_cc_links]: archive/ExtractCCLinks.py
 
-## API Data
-
-[Apache Airflow](https://airflow.apache.org/) is used to manage the workflow for
-various API ETL jobs which pull and process data from a number of open APIs on
-the internet.
-
-### API Workflows
-
-To view more information about all the available workflows (DAGs) within the project,
-see [DAGs.md](DAGs.md).
-
-See each provider API script's notes in their respective [handbook][ov-handbook] entry.
-
-[ov-handbook]: https://make.wordpress.org/openverse/handbook/openverse-handbook/
-
 ## Development setup for Airflow and API puller scripts
 
 There are a number of scripts in the directory
@@ -224,12 +224,13 @@ openverse-catalog
 ├── openverse_catalog/                     # Primary code directory
 │   ├── dags/                              # DAGs & DAG support code
 │   │   ├── common/                        # - Shared modules used across DAGs
-│   │   ├── commoncrawl/                   # - DAGs & scripts for commoncrawl parsing
+│   │   ├── data_refresh/                  # - DAGs & code related to the data refresh process
 │   │   ├── database/                      # - DAGs related to database actions (matview refresh, cleaning, etc.)
 │   │   ├── maintenance/                   # - DAGs related to airflow/infrastructure maintenance
 │   │   ├── oauth2/                        # - DAGs & code for Oauth2 key management
 │   │   ├── providers/                     # - DAGs & code for provider ingestion
 │   │   │   ├── provider_api_scripts/      # - API access code specific to providers
+│   │   │   ├── provider_csv_load_scripts/ # - Schema initialization SQL definitions for SQL-based providers
 │   │   │   └── *.py                       # - DAG definition files for providers
 │   │   └── retired/                       # - DAGs & code that is no longer needed but might be a useful guide for the future
 │   └── templates/                         # Templates for generating new provider code
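The retired pipeline described in the Web Crawl Data hunks above (Spark on EMR over Common Crawl WAT archives, writing parquet) can be approximated by the sketch below. It is illustrative only, not the logic of `ExtractCCLinks.py`; the input/output paths and the flattened-link record shape are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cc-links-sketch").getOrCreate()

# Assume the WAT metadata has already been flattened into JSON lines, one
# outgoing link per record: {"domain", "path", "query", "href", "warc_location"}.
links = spark.read.json("s3://example-bucket/wat-links/")  # placeholder path

# Keep only links that point at creativecommons.org (i.e. license references).
cc_links = links.filter(F.col("href").contains("creativecommons.org"))

# Persist the result as parquet, mirroring the shape of the retired output:
# domain, content path, query string, and the WARC location of the page.
cc_links.select("domain", "path", "query", "href", "warc_location").write.mode(
    "overwrite"
).parquet("s3://example-bucket/cc-links-parquet/")  # placeholder path
```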

docker-compose.override.yml

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ services:
       MINIO_ROOT_USER: ${AWS_ACCESS_KEY}
       MINIO_ROOT_PASSWORD: ${AWS_SECRET_KEY}
       # Comma separated list of buckets to create on startup
-      BUCKETS_TO_CREATE: ${OPENVERSE_BUCKET},openverse-airflow-logs,commonsmapper-v2,commonsmapper
+      BUCKETS_TO_CREATE: ${OPENVERSE_BUCKET},openverse-airflow-logs
       # Create empty buckets on every container startup
       # Note: $0 is included in the exec because "/bin/bash -c" swallows the first
       # argument, so it must be re-added at the beginning of the exec call
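`BUCKETS_TO_CREATE` is consumed by the local MinIO container at startup to pre-create the listed buckets. Conceptually that amounts to something like the sketch below, written with boto3 against a local MinIO endpoint; the endpoint URL is an assumption (MinIO's default port), while the credential defaults mirror env.template.

```python
import os

import boto3

# Endpoint is an assumption (MinIO's default port); credentials default to the
# test values from env.template.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY", "test_key"),
    aws_secret_access_key=os.getenv("AWS_SECRET_KEY", "test_secret"),
)

buckets = os.getenv(
    "BUCKETS_TO_CREATE", "openverse-storage,openverse-airflow-logs"
).split(",")

for bucket in buckets:
    try:
        s3.create_bucket(Bucket=bucket)
    except s3.exceptions.BucketAlreadyOwnedByYou:
        pass  # bucket already exists from a previous startup
```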

env.template

Lines changed: 0 additions & 3 deletions
@@ -100,9 +100,6 @@ AWS_ACCESS_KEY=test_key
 AWS_SECRET_KEY=test_secret
 # General bucket used for TSV->DB ingestion and logging
 OPENVERSE_BUCKET=openverse-storage
-# Used only for commoncrawl parsing
-S3_BUCKET=not_set
-COMMONCRAWL_BUCKET=not_set
 # Seconds to wait before poking for availability of the data refresh pool when running a data_refresh
 # DAG. Used to shorten the time for testing purposes.
 DATA_REFRESH_POKE_INTERVAL=5
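`DATA_REFRESH_POKE_INTERVAL` feeds a sensor's poke interval so local test runs do not wait long between availability checks. A minimal sketch of that wiring, assuming Airflow 2.x; the DAG id and the availability check below are placeholders, not the actual data_refresh code.

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.sensors.python import PythonSensor


def _pool_available() -> bool:
    # Placeholder check; the real DAGs inspect the data refresh pool in Airflow.
    return True


with DAG(
    dag_id="example_data_refresh_wait",  # illustrative id only
    schedule_interval=None,
    start_date=datetime(2021, 1, 1),
    catchup=False,
):
    PythonSensor(
        task_id="wait_for_data_refresh_pool",
        python_callable=_pool_available,
        poke_interval=int(os.getenv("DATA_REFRESH_POKE_INTERVAL", "5")),
    )
```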

openverse_catalog/dags/.airflowignore

Lines changed: 0 additions & 1 deletion
@@ -1,5 +1,4 @@
 # Ignore all non-DAG files
 common/
-commoncrawl/commoncrawl_scripts
 providers/provider_api_scripts
 retired

tests/dags/common/etl/__init__.py

Whitespace-only changes.

tests/dags/common/etl/test_commoncrawl_utils.py

Lines changed: 0 additions & 40 deletions
This file was deleted.

tests/dags/common/loader/test_resources/new_columns_crawl.tsv

Lines changed: 0 additions & 2 deletions
This file was deleted.

tests/dags/common/loader/test_resources/new_columns_papis.tsv

Lines changed: 0 additions & 2 deletions
This file was deleted.

tests/dags/common/loader/test_resources/old_columns_crawl.tsv

Lines changed: 0 additions & 2 deletions
This file was deleted.

0 commit comments
