
Commit dad3cb4

AetherUnbound, Meet Parekh, and obulat authored
Retire Common Crawl module & DAGs (#870)
* Retired module commoncrawl and retired the commoncrawl_utils test
* updated DAGs.md and test_dag_parsing.py as suggested in #861
* Remove ETL test module, additional documentation cleanup
* Delete more unused test files
* Remove unused testing buckets
* Update README.md

Co-authored-by: Olga Bulat <obulat@gmail.com>
Co-authored-by: Meet Parekh <meetparekh@192.168.1.16>
Co-authored-by: Meet Parekh <meetparekh@192.168.1.9>
1 parent a6f4eab commit dad3cb4

17 files changed (+26 / −89 lines)

DAGs.md

Lines changed: 0 additions & 10 deletions
@@ -14,23 +14,13 @@ The DAGs are shown in two forms:
 
 The following are DAGs grouped by their primary tag:
 
-1. [Commoncrawl](#commoncrawl)
 1. [Data Refresh](#data_refresh)
 1. [Database](#database)
 1. [Maintenance](#maintenance)
 1. [Oauth](#oauth)
 1. [Provider](#provider)
 1. [Provider Reingestion](#provider-reingestion)
 
-## Commoncrawl
-
-| DAG ID | Schedule Interval |
-| --- | --- |
-| `commoncrawl_etl_workflow` | `0 0 * * 1` |
-| `sync_commoncrawl_workflow` | `0 16 15 * *` |
-
-
-
 ## Data Refresh
 
 | DAG ID | Schedule Interval |
README.md

Lines changed: 25 additions & 24 deletions
@@ -10,12 +10,25 @@ This repository contains the methods used to identify over 1.4 billion Creative
 Commons licensed works. The challenge is that these works are dispersed
 throughout the web and identifying them requires a combination of techniques.
 
-Two approaches are currently in use:
+Currently, we only pull data from APIs which serve Creative Commons licensed media.
+In the past, we have also used web crawl data as a source.
 
-1. Web crawl data
-2. Application Programming Interfaces (API Data)
+## API Data
+
+[Apache Airflow](https://airflow.apache.org/) is used to manage the workflow for
+various API ETL jobs which pull and process data from a number of open APIs on
+the internet.
 
-## Web Crawl Data
+### API Workflows
+
+To view more information about all the available workflows (DAGs) within the project,
+see [DAGs.md](DAGs.md).
+
+See each provider API script's notes in their respective [handbook][ov-handbook] entry.
+
+[ov-handbook]: https://make.wordpress.org/openverse/handbook/openverse-handbook/
+
+## Web Crawl Data (retired)
 
 The Common Crawl Foundation provides an open repository of petabyte-scale web
 crawl data. A new dataset is published at the end of each month comprising over
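As a rough illustration of what an API ETL job's pull step looks like, here is a minimal sketch; the endpoint, parameters, and response shape are hypothetical, not taken from any provider script in the repository.

```python
import requests


def pull_page(endpoint: str, page: int, per_page: int = 100) -> list[dict]:
    """Fetch one page of CC-licensed records from a hypothetical provider API."""
    response = requests.get(
        endpoint,
        params={"page": page, "per_page": per_page, "license_type": "all-cc"},
        timeout=30,
    )
    response.raise_for_status()
    # Assumed response shape: {"results": [...]}
    return response.json().get("results", [])


if __name__ == "__main__":
    # Hypothetical endpoint; the real provider scripts live under
    # openverse_catalog/dags/providers/provider_api_scripts/.
    records = pull_page("https://api.example.com/v1/images", page=1)
    print(f"pulled {len(records)} records")
```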
@@ -31,10 +44,10 @@ The data is available in three file formats:
 For more information about these formats, please see the
 [Common Crawl documentation][ccrawl_doc].
 
-Openverse Catalog uses AWS Data Pipeline service to automatically create an Amazon EMR
-cluster of 100 c4.8xlarge instances that will parse the WAT archives to identify
+Openverse Catalog used AWS Data Pipeline service to automatically create an Amazon EMR
+cluster of 100 c4.8xlarge instances that parsed the WAT archives to identify
 all domains that link to creativecommons.org. Due to the volume of data, Apache
-Spark is used to streamline the processing. The output of this methodology is a
+Spark was also used to streamline the processing. The output of this methodology was a
 series of parquet files that contain:
 
 - the domains and its respective content path and query string (i.e. the exact
@@ -45,26 +58,13 @@ series of parquet files that contain:
 - the location of the webpage in the WARC file so that the page contents can be
   found.
 
-The steps above are performed in [`ExtractCCLinks.py`][ex_cc_links].
+The steps above were performed in [`ExtractCCLinks.py`][ex_cc_links].
+
+This method was retired in 2021.
 
 [ccrawl_doc]: https://commoncrawl.org/the-data/get-started/
 [ex_cc_links]: archive/ExtractCCLinks.py
 
-## API Data
-
-[Apache Airflow](https://airflow.apache.org/) is used to manage the workflow for
-various API ETL jobs which pull and process data from a number of open APIs on
-the internet.
-
-### API Workflows
-
-To view more information about all the available workflows (DAGs) within the project,
-see [DAGs.md](DAGs.md).
-
-See each provider API script's notes in their respective [handbook][ov-handbook] entry.
-
-[ov-handbook]: https://make.wordpress.org/openverse/handbook/openverse-handbook/
-
 ## Development setup for Airflow and API puller scripts
 
 There are a number of scripts in the directory
@@ -224,12 +224,13 @@ openverse-catalog
 ├── openverse_catalog/                     # Primary code directory
 │   ├── dags/                              # DAGs & DAG support code
 │   │   ├── common/                        # - Shared modules used across DAGs
-│   │   ├── commoncrawl/                   # - DAGs & scripts for commoncrawl parsing
+│   │   ├── data_refresh/                  # - DAGs & code related to the data refresh process
 │   │   ├── database/                      # - DAGs related to database actions (matview refresh, cleaning, etc.)
 │   │   ├── maintenance/                   # - DAGs related to airflow/infrastructure maintenance
 │   │   ├── oauth2/                        # - DAGs & code for Oauth2 key management
 │   │   ├── providers/                     # - DAGs & code for provider ingestion
 │   │   │   ├── provider_api_scripts/      # - API access code specific to providers
+│   │   │   ├── provider_csv_load_scripts/ # - Schema initialization SQL definitions for SQL-based providers
 │   │   │   └── *.py                       # - DAG definition files for providers
 │   │   └── retired/                       # - DAGs & code that is no longer needed but might be a useful guide for the future
 │   └── templates/                         # Templates for generating new provider code
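The retired pipeline described in the Web Crawl Data hunks above (Spark on EMR over Common Crawl WAT archives, writing parquet) can be approximated by the sketch below. It is illustrative only, not the logic of `ExtractCCLinks.py`; the input/output paths and the flattened-link record shape are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cc-links-sketch").getOrCreate()

# Assume the WAT metadata has already been flattened into JSON lines, one
# outgoing link per record: {"domain", "path", "query", "href", "warc_location"}.
links = spark.read.json("s3://example-bucket/wat-links/")  # placeholder path

# Keep only links that point at creativecommons.org (i.e. license references).
cc_links = links.filter(F.col("href").contains("creativecommons.org"))

# Persist the result as parquet, mirroring the shape of the retired output:
# domain, content path, query string, and the WARC location of the page.
cc_links.select("domain", "path", "query", "href", "warc_location").write.mode(
    "overwrite"
).parquet("s3://example-bucket/cc-links-parquet/")  # placeholder path
```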

docker-compose.override.yml

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ services:
       MINIO_ROOT_USER: ${AWS_ACCESS_KEY}
       MINIO_ROOT_PASSWORD: ${AWS_SECRET_KEY}
       # Comma separated list of buckets to create on startup
-      BUCKETS_TO_CREATE: ${OPENVERSE_BUCKET},openverse-airflow-logs,commonsmapper-v2,commonsmapper
+      BUCKETS_TO_CREATE: ${OPENVERSE_BUCKET},openverse-airflow-logs
       # Create empty buckets on every container startup
       # Note: $0 is included in the exec because "/bin/bash -c" swallows the first
       # argument, so it must be re-added at the beginning of the exec call
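`BUCKETS_TO_CREATE` is consumed by the local MinIO container at startup to pre-create the listed buckets. Conceptually that amounts to something like the sketch below, written with boto3 against a local MinIO endpoint; the endpoint URL is an assumption (MinIO's default port), while the credential defaults mirror env.template.

```python
import os

import boto3

# Endpoint is an assumption (MinIO's default port); credentials default to the
# test values from env.template.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY", "test_key"),
    aws_secret_access_key=os.getenv("AWS_SECRET_KEY", "test_secret"),
)

buckets = os.getenv(
    "BUCKETS_TO_CREATE", "openverse-storage,openverse-airflow-logs"
).split(",")

for bucket in buckets:
    try:
        s3.create_bucket(Bucket=bucket)
    except s3.exceptions.BucketAlreadyOwnedByYou:
        pass  # bucket already exists from a previous startup
```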

env.template

Lines changed: 0 additions & 3 deletions
@@ -100,9 +100,6 @@ AWS_ACCESS_KEY=test_key
 AWS_SECRET_KEY=test_secret
 # General bucket used for TSV->DB ingestion and logging
 OPENVERSE_BUCKET=openverse-storage
-# Used only for commoncrawl parsing
-S3_BUCKET=not_set
-COMMONCRAWL_BUCKET=not_set
 # Seconds to wait before poking for availability of the data refresh pool when running a data_refresh
 # DAG. Used to shorten the time for testing purposes.
 DATA_REFRESH_POKE_INTERVAL=5
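`DATA_REFRESH_POKE_INTERVAL` feeds a sensor's poke interval so local test runs do not wait long between availability checks. A minimal sketch of that wiring, assuming Airflow 2.x; the DAG id and the availability check below are placeholders, not the actual data_refresh code.

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.sensors.python import PythonSensor


def _pool_available() -> bool:
    # Placeholder check; the real DAGs inspect the data refresh pool in Airflow.
    return True


with DAG(
    dag_id="example_data_refresh_wait",  # illustrative id only
    schedule_interval=None,
    start_date=datetime(2021, 1, 1),
    catchup=False,
):
    PythonSensor(
        task_id="wait_for_data_refresh_pool",
        python_callable=_pool_available,
        poke_interval=int(os.getenv("DATA_REFRESH_POKE_INTERVAL", "5")),
    )
```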

openverse_catalog/dags/.airflowignore

Lines changed: 0 additions & 1 deletion
@@ -1,5 +1,4 @@
 # Ignore all non-DAG files
 common/
-commoncrawl/commoncrawl_scripts
 providers/provider_api_scripts
 retired

tests/dags/common/etl/__init__.py

Whitespace-only changes.

tests/dags/common/etl/test_commoncrawl_utils.py

Lines changed: 0 additions & 40 deletions
This file was deleted.

tests/dags/common/loader/test_resources/new_columns_crawl.tsv

Lines changed: 0 additions & 2 deletions
This file was deleted.

tests/dags/common/loader/test_resources/new_columns_papis.tsv

Lines changed: 0 additions & 2 deletions
This file was deleted.

tests/dags/common/loader/test_resources/old_columns_crawl.tsv

Lines changed: 0 additions & 2 deletions
This file was deleted.

0 commit comments
