Instructions for Lab Members Performing Crawls
This page provides guidance for running and saving data from a crawl of our complete 11,708-site dataset. Performing a full crawl involves three main stages:
- Crawling `crawl-set-pt1.csv` through `crawl-set-pt8.csv` (our crawl set divided into 8 batches)
- Creating and crawling `redo-sites.csv` (which you'll generate based on the initial crawl results)
- Parsing the crawled data and saving it to Google Drive
Below, we outline the steps involved in each stage.
Before starting the crawl, ensure you've set up Docker and cloned the crawler repository.
- If you haven't already, follow steps 1–4 in the README to install and initialize Docker and set up the `gpc-web-crawler` repository:
  - Install Docker Desktop.
  - Authenticate Docker.
  - Clone the repository.
- In your Terminal, navigate to the `gpc-web-crawler` directory.
- Remove previous crawl outputs if present: `rm -rf crawl_results`
- Run the following command to verify that the Docker compose stack (`gpc-web-crawler`) isn't already running: `make check-if-up`
  - If the command prints `true` (the stack is running), run `make stop` to shut down the compose stack.
  - If it prints `false` (the stack is not running), proceed to the next step.
- For each batch number `n` (where `n` is from 1 to 8), repeat these steps:
  - Run `make start-debug`.
  - When prompted for a number between 1 and 8, enter your chosen batch number (`n`).
  - Once the crawl is complete, shut down the compose stack: `make stop`
- With all 8 batches completed, generate the merged `well-known-data.csv` file by running `./merge_well_known_data.sh`.
- Rename `crawl_results` to `Crawl_Data_Mon_Year`, where Mon and Year are the month and year, respectively, of the crawl.
- Upload the renamed directory to the Web Crawler folder in Google Drive (see the consolidated command sketch after this list).
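For reference, here is a minimal command sketch of one pass through the steps above. It assumes you start in the directory containing the cloned `gpc-web-crawler` repository and enter the batch number interactively when prompted; `Crawl_Data_Mon_Year` stands for the naming pattern described above.

```sh
cd gpc-web-crawler               # cloned crawler repository (path assumed)
rm -rf crawl_results             # remove outputs from any previous crawl

make check-if-up                 # prints true/false; if true, run `make stop` first

# Repeat for each batch number n = 1..8:
make start-debug                 # enter the batch number (1-8) at the prompt
make stop                        # shut down the compose stack once the batch finishes

# After all 8 batches are complete:
./merge_well_known_data.sh                # generates the merged well-known-data.csv
mv crawl_results Crawl_Data_Mon_Year      # e.g., Crawl_Data_April_2024; then upload to Google Drive
```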
After completing the initial crawl of the 8 batches, the next stage involves identifying and crawling the "redo sites." Redo sites are those that failed during the initial crawl due to issues with their subdomains. These sites will be recrawled without their subdomains.
To complete this stage:
- Open the following Colab notebook.
- Update the variable `path1` in the notebook to reference your most recent crawl dataset.
- Execute the notebook to produce the following files:
  - `redo-sites.csv`: Lists sites requiring a second crawl (with subdomains removed).
  - `redo-original.csv`: Stores the original URLs (including subdomains) for reference.
- Upload both `redo-sites.csv` and `redo-original.csv` into a folder named `Crawl_Data_Mon_Year` (corresponding to the month and year of your crawl) within the designated Google Drive directory.
- In your Terminal, navigate to the `gpc-web-crawler` directory.
- Run the following command to verify that the Docker compose stack (`gpc-web-crawler`) isn't already running: `make check-if-up`
  - If the command prints `true` (the stack is running), run `make stop` to shut down the compose stack.
  - If it prints `false` (the stack is not running), proceed to the next step.
- Update the contents of `selenium-optmeowt-crawler/crawl-sets/sites.csv` with the contents of `redo-sites.csv`.
- Run a custom batch on the list of redo sites:
  - Run `make custom`.
  - Once the crawl is complete, shut down the compose stack: `make stop`
- In the newly created `crawl_results` folder, find the folder prefixed `CUSTOMCRAWL` and rename it to `redo`.
- Upload the newly renamed `redo` folder to `Crawl_Data_Mon_Year` (see the command sketch after this list).
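For reference, a minimal sketch of this stage's commands is below. It assumes you've downloaded `redo-sites.csv` into the repository root and run everything from the `gpc-web-crawler` directory; the exact name of the `CUSTOMCRAWL` output folder varies per crawl, so adjust the final rename as needed.

```sh
cd gpc-web-crawler               # cloned crawler repository (path assumed)

make check-if-up                 # should print false; if it prints true, run `make stop` first

# Replace the crawl set with the redo sites (redo-sites.csv location assumed)
cp redo-sites.csv selenium-optmeowt-crawler/crawl-sets/sites.csv

make custom                      # crawl the redo sites without their subdomains
make stop                        # shut down the compose stack once the crawl finishes

# Rename the CUSTOMCRAWL output folder to "redo" before uploading it to Crawl_Data_Mon_Year
mv crawl_results/CUSTOMCRAWL* crawl_results/redo
```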
After the full crawl is done and the data is saved in the correct format, parse the data using this Colab. The parsed data will appear in this Google Sheet. Graphs for that month can be created by running this Colab. Graphs comparing data from multiple crawls can be created using this Colab. Figures are automatically saved to this folder. This Colab serves as a library for the other Colabs.
Google Drive Web_Crawler directories and files:
- `crawl-set-pt1.csv` through `crawl-set-pt8.csv`: Files of the 8 batches of the crawl set.
- `Crawl_Data_Month_Year` (e.g., `Crawl_Data_April_2024`): Folders with the results of our past crawls.
- `Crawl_Data`: A file that compiles all the crawl data accumulated over the series of crawls (a compiled version of the `Crawl_Data_Month_Year` folders).
- `sites_with_GPP`: A file that collates all the sites with GPP (as of December 2023); this analysis is now reflected as statistics in a figure in `Processing_Analysis_Data`.
- `Ad_Network_Analysis`: A file with the results of the manual analysis of up to 70 ad networks' privacy policies.
- `Web_Crawl_Domains_2023`: A file that collates detailed information about the sites in our crawl set (i.e., their ad networks, contact information, and Tranco ranks).
- `Collecting_Sites_To_Crawl`: A folder with files that explain and justify our methodology and process for collecting the sites to crawl (ReadMe and Methodology).
- `similarweb`: A folder with our analysis that processes the SimilarWeb data and determines what Tranco rank would have sufficient traffic to be subject to the CCPA.
- `GPC_Detection_Performance`: A folder of ground truth data collected on validation sets of sites for verifying the USPS and GPP strings via the USPAPI value, OptanonConsent cookie, and GPP string value, each before and after sending a GPC signal.
- `Processing_Analysis_Data`: A folder with all the Colabs for parsing, processing, and analyzing the crawl results, along with the figures created from the analysis.
The GPP String encoding/decoding process is described by the IAB here. The IAB has a website to decode and encode GPP strings. This is helpful for spot checking and is the quickest way to encode/decode single GPP strings. They also have a JS library to encode and decode GPP strings on websites. Because we cannot directly use this library to decode GPP strings in Python, we converted the JS library to Python and use that for decoding (Python library found here). The Python library will need to be updated when the IAB adds more sections to the GPP string. More information on updating the Python library and why we use it can be found in issue 89. GPP strings are automatically decoded using the Python library in the colabs.