
Instructions for Lab Members Performing Crawls


This page provides guidance for running and saving data from a crawl of our complete 11,708-site dataset. Performing a full crawl involves three main stages:

  • Crawling crawl-set-pt1.csv through crawl-set-pt8.csv (our crawl set divided into 8 batches)
  • Creating and crawling redo-sites.csv (which you'll generate based on the initial crawl results)
  • Parsing the crawled data and saving it to Google Drive

Below, we outline the steps involved in each stage.

Crawling the First 8 Batches

  1. Before starting the crawl, ensure you've set up Docker and cloned the crawler repository.

    • If you haven't already, follow steps 1–4 in the README to install and initialize Docker and set up the gpc-web-crawler repository:

      • Install Docker Desktop.
      • Authenticate Docker.
      • Clone the repository.
  2. In your Terminal, navigate to the gpc-web-crawler directory.

  3. Remove previous crawl outputs if present:

    rm -rf crawl_results
  4. Run the following command to verify that the Docker compose stack (gpc-web-crawler) isn't already running:

    make check-if-up
    • If the command prints true (the stack is running), run make stop to shut down the compose stack.
    • If it prints false (the stack is not running), proceed to the next step.
  5. For each batch number n (where n is from 1 to 8), repeat these steps (a scripted version of this loop is sketched after this list):

    1. Run:
      make start-debug
    2. When prompted for a number between 1 and 8, enter your chosen batch number (n).
    3. Once the crawl is complete, shut down the compose stack:
      make stop
  6. With all 8 batches completed, generate the merged well-known-data.csv file by running the following command:

./merge_well_known_data.sh
  7. Rename crawl_results to Crawl_Data_Mon_Year, where Mon and Year are the month and year of the crawl, respectively.
  8. Upload the renamed Crawl_Data_Mon_Year directory to the Web Crawler folder in Google Drive.
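For convenience, the batch loop above can be scripted from the gpc-web-crawler directory. The following is a minimal sketch rather than an official lab script: it assumes that make start-debug reads the batch number from standard input and that make check-if-up prints true or false as described above; renaming and uploading still happen per steps 7 and 8.

    #!/usr/bin/env bash
    # Sketch: run crawl batches 1 through 8 back to back.
    # Assumes `make start-debug` accepts the batch number on stdin and
    # `make check-if-up` prints true/false as described above.
    set -euo pipefail

    rm -rf crawl_results

    # Make sure the compose stack is not already running.
    if make check-if-up | grep -q true; then
      make stop
    fi

    for n in 1 2 3 4 5 6 7 8; do
      echo "Starting batch $n"
      printf '%s\n' "$n" | make start-debug
      make stop
    done

    # Merge the well-known data, then rename and upload per steps 7 and 8.
    ./merge_well_known_data.sh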

Identifying and Crawling Redo Sites

After completing the initial crawl of the 8 batches, the next stage involves identifying and crawling the "redo sites." Redo sites are those that failed during the initial crawl due to issues with their subdomains. These sites will be recrawled without their subdomains.

To complete this stage:

  1. Open the following Colab notebook.

  2. Update the variable path1 in the notebook to reference your most recent crawl dataset.

  3. Execute the notebook to produce the following files:

    • redo-sites.csv: Lists sites requiring a second crawl (with subdomains removed).
    • redo-original.csv: Stores the original URLs (including subdomains) for reference.
  4. Upload both redo-sites.csv and redo-original.csv to the Crawl_Data_Mon_Year folder (named for the month and year of your crawl) within the designated Google Drive directory.

  5. In your Terminal, navigate to the gpc-web-crawler directory.

  6. Run the following command to verify that the Docker compose stack (gpc-web-crawler) isn't already running:

    make check-if-up
    • If the command prints true (the stack is running), run make stop to shut down the compose stack.
    • If it prints false (the stack is not running), proceed to the next step.
  7. Replace the contents of selenium-optmeowt-crawler/crawl-sets/sites.csv with the contents of redo-sites.csv.

  8. Run a custom batch on the list of redo sites (steps 7–10 are also sketched as terminal commands after this list):

    1. Run:
      make custom
    2. Once the crawl is complete, shut down the compose stack:
      make stop
  9. In the newly created crawl_results folder, find the folder prefixed CUSTOMCRAWL and rename it to redo.

  10. Upload the newly renamed folder redo to Crawl_Data_Mon_Year.
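For reference, steps 7 through 10 can also be run from the terminal. This is a minimal sketch, assuming redo-sites.csv has been downloaded into the gpc-web-crawler directory and that crawl_results contains a single CUSTOMCRAWL folder after the crawl:

    # Replace the crawl set with the redo sites and run the custom crawl.
    cp redo-sites.csv selenium-optmeowt-crawler/crawl-sets/sites.csv
    make custom
    make stop

    # Rename the CUSTOMCRAWL output folder to "redo" before uploading it.
    mv crawl_results/CUSTOMCRAWL* crawl_results/redo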

Parsing/analyzing crawl data:

After the full crawl is done and the data is saved in the correct format, parse the data using this Colab. The parsed data will appear in this Google Sheet. Graphs for that month can be created by running this Colab. Graphs comparing data from multiple crawls can be created using this Colab. Figures are automatically saved to this folder. This Colab serves as a library for the other Colabs.

Creating a new release

After a full crawl, we also want to publish new images to the container registry and create a new release for the crawl. To do so, follow these steps:

  1. If you haven't already, create a Personal Access Token (PAT) with write:packages permission here: https://github.com/settings/tokens/new?type=classic. Make sure to save the token when you create it; you can only view it once.

  2. Open a new terminal.

  3. In the gpc-web-crawler root directory, run the following command, replacing YOUR_PAT with the access token from step 1 and YOUR_GITHUB_USERNAME with your GitHub username:

    echo YOUR_PAT | docker login ghcr.io -u YOUR_GITHUB_USERNAME --password-stdin
    
  4. Run the following commands:

    docker compose build
    docker compose push
    

    This will push the images to the GitHub Container Registry with a tag value of "latest".

  5. In your browser, navigate to https://github.com/orgs/privacy-tech-lab/packages.

  6. For every package, click the three buttons circled in the following screenshot. [Screenshot 2025-04-17 184758]

  7. Copy the value of sha256 and save it somewhere (a command-line alternative is sketched after this list).

  8. Write the changelog of the release normally. At the end, include the following lines:

    To pull the exact image versions used in this release:
    
    docker pull ghcr.io/privacy-tech-lab/crawl-driver@sha256:<SHA256 PLACEHOLDER>
    docker pull ghcr.io/privacy-tech-lab/well-known-crawl@sha256:<SHA256 PLACEHOLDER>
    docker pull ghcr.io/privacy-tech-lab/rest-api@sha256:<SHA256 PLACEHOLDER>
    docker pull ghcr.io/privacy-tech-lab/mariadb-custom@sha256:<SHA256 PLACEHOLDER>
    

    Make sure to replace the SHA256 placeholder with the value you found in step 7 for the specific image.

  9. Publish the release.
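If you prefer the command line to the packages page, the digest of each pushed image can also be read locally with docker inspect. This is a minimal sketch, assuming the images were just built and pushed with the latest tag as in step 4 (the exact entries in RepoDigests can vary with your Docker setup):

    # Print the sha256 repo digest recorded for each image after the push.
    for img in crawl-driver well-known-crawl rest-api mariadb-custom; do
      docker inspect --format '{{join .RepoDigests "\n"}}' "ghcr.io/privacy-tech-lab/$img:latest"
    done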

How to see the crawler browser

  1. Delete lines 115 and 116 in compose.yaml. These lines start the crawl_browser with VNC mode off, effectively mirroring a headless crawler.
  2. Start the crawler.
  3. In your browser, navigate to the Selenium Grid Hub located at localhost:4444 (a quick reachability check is sketched after this list). You should see the Selenium Grid overview page.
  4. Hit Sessions on the left of the screen.
  5. On the single running session, hit the camera icon all the way on the left.
  6. When prompted for a password, enter secret.
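As a quick sanity check before opening the browser UI, the Grid can also be queried from the terminal; Selenium Grid's /status endpoint reports whether it is ready to accept sessions:

    # Returns JSON describing the Grid; "ready": true means it is up and accepting sessions.
    curl -s http://localhost:4444/status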

Accuracy Check Protocol

High-level accuracy check system:

We will conduct an accuracy check once every few months, alternating between our three crawl locations (Colorado, Connecticut, and California). Since we crawl all three locations with the same crawler and the same methodology, checking one location at a time suffices: our general goal is to confirm that the crawler is working as expected.

Additional note: While our goal is to confirm the validity and consistency of the data acquired from the crawler and the manual check at the same time, should we find an error that falls under any of the categories here, we should note down the error statement and include it in our codebase so that it is flagged as an error appropriately in the following automated crawl.

Random Selection Sample of Sites

We choose a random sample of sites from our batches using a Google Apps Script. The script focuses on the relevant columns of interest (uspapi, usp cookies, OptanonConsent, gpp, usps, Well-known, gpp_version, etc.) and checks whether a row has valid non-null data in at least one of these columns. For each column of interest, 5 random rows with valid data are selected and added to an output sheet titled "Accuracy Review" for manual ground-truth analysis. To generate a random sample of sites, follow the steps below:

  1. Navigate to the relevant Google Sheet in the Overall Crawl Results Folder of the Google Drive for the desired crawl location.
  2. Select the sheet tab corresponding to the month you want to evaluate.
  3. Run the attached Google Apps Script function called selectRandomSites on that month's tab to generate a random sample of sites.
    • To run the script, navigate to Extensions > Apps Script in the toolbar at the top of the sheet and then select Run.
  4. The script outputs the list of sampled sites to the Web Crawler Accuracy Overtime sheet in Google Drive.

Setting up VNC for accuracy check:

  1. To allow a manual check while the crawl is running, we need a smooth VNC interface. While Selenium Grid provides an accessible and friendly VNC view of the crawler, we found TigerVNC to be the better option for manual verification that involves accessing cookie storage and entering commands in the console.
  2. Download the self-contained binaries for TigerVNC appropriate to your local device.
  3. Update the compose.yaml file in the codebase to add the port mapping 5900:5900 for the crawl_browser and delete the line environment: - SE_START_VNC=false. These changes make the VNC server accessible and visible; the file will look like this in the end (a command-line connection example follows the screenshot):
[Screenshot: compose.yaml after the VNC changes (2025-03-21)]
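Once the stack is up with these changes, you can also connect from the TigerVNC viewer on the command line. This is a minimal sketch, assuming the vncviewer binary from the TigerVNC download is on your PATH and that the VNC password is the default secret noted above:

    # Connect TigerVNC to the crawl_browser's VNC server exposed on port 5900.
    # The host::port form addresses the port directly; enter "secret" when prompted.
    vncviewer localhost::5900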

Accuracy Check Methodology

  1. Increase the timeout in line 63 of webcrawler.js to 120000 or 180000 to allow time for manually entering commands in the console before and after the GPC signal is detected.
  2. Load the set of custom sites to crawl into sites.csv. Tip: do the accuracy check in batches of 5 sites, given the relatively slow interface of the VNC.
  3. Start the crawler following the steps outlined in the README.
  4. Open the TigerVNC app on your local device and follow the steps to see the crawler browser. Note that instead of localhost:4444, the VNC server entered in TigerVNC should be localhost:5900.
  5. Determine the US Privacy String value by (1) checking the site's cookies via the Network Monitor and (2) calling the USPAPI from the Web Console.
  6. Determine the GPP String value by calling the GPP CMP API from the Web Console.
  7. Determine OneTrust's OptanonConsent cookie value by checking the site's cookies via the Network Monitor.
  8. After the GPC signal is detected, repeat steps 4–7 to record the post-GPC values.
  9. Determine the .well-known value by appending /.well-known/gpc.json to the URL path (see the curl sketch below).
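For step 9, the .well-known value can also be fetched outside the crawler browser as a cross-check. A minimal sketch with curl, where example.com stands in for the site under review:

    # Fetch the site's GPC support resource; -s silences progress output, -L follows redirects.
    curl -sL https://example.com/.well-known/gpc.json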

Google Drive directories:

For detailed information about the folders and files, please check out the Google Drive README.

General info about data analysis in the Colabs:

GPP String decoding:

The GPP String encoding/decoding process is described by the IAB here. The IAB has a website to decode and encode GPP strings. This is helpful for spot-checking and is the quickest way to encode/decode single GPP strings. They also have a JS library to encode and decode GPP strings on websites. Because we cannot directly use this library to decode GPP strings in Python, we converted the JS library to Python and use that for decoding (Python library found here). The Python library will need to be updated when the IAB adds more sections to the GPP string. More information on updating the Python library and why we use it can be found in issue 89. GPP strings are automatically decoded using the Python library in the Colabs.