guardian/google-search-indexing-observatory

Google Search Indexing Observatory

How long does it take for content published by news organisations to be available in Google search?

This broadens Ophan's Google Search Index Checker to check for content published by many news organisations, not just the Guardian. We're trying to work out if the intermittent multi-hour delays we've seen for some Guardian articles to be available in Google Search are typical for other news organisations too, or if there's actually something particular to the Guardian that needs to be fixed.

It's an 'observatory' in the same way that the EFF SSL Observatory is - creating and collating observations of distant sites and processes that are visible to us but beyond our control.

Steps performed by the Observatory

  1. Fetch the Sitemap XML for a news site.
  2. Query the Discovery Engine API (the service is named Google Vertex Agent Builder) to check whether the content listed is available in Google Search. API consumption and cost 💰💰💰 can be monitored in the Google Cloud console.
  3. Store whether each article is available (or not) in an AWS DynamoDB table.
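
As an illustration of step 1, here is a minimal sketch of pulling article URLs out of a sitemap, using only the XML parser that ships with the JDK. It is not taken from this repository: the object name `SitemapSketch`, the method `articleUrls`, and the `Observation` record (a guess at the kind of row step 3 might store) are all hypothetical.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory

object SitemapSketch {
  // Namespace defined by the sitemaps.org protocol for <urlset>, <url> and <loc>.
  private val SitemapNs = "http://www.sitemaps.org/schemas/sitemap/0.9"

  // Hypothetical shape of one stored observation; the real table schema may differ.
  final case class Observation(url: String, checkedAtEpochSeconds: Long, foundInSearch: Boolean)

  /** Returns the trimmed text of every sitemap <loc> element, in document order. */
  def articleUrls(sitemapXml: String): List[String] = {
    val factory = DocumentBuilderFactory.newInstance()
    factory.setNamespaceAware(true)
    // Sitemaps come from third-party sites, so refuse DOCTYPE declarations (XXE hardening).
    factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true)
    val doc = factory.newDocumentBuilder()
      .parse(new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)))
    val locs = doc.getElementsByTagNameNS(SitemapNs, "loc")
    (0 until locs.getLength).map(i => locs.item(i).getTextContent.trim).toList
  }
}
```

Matching on the sitemap namespace rather than on the bare tag name means that `<image:loc>` entries, which some news sitemaps include, are not mistaken for article URLs.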

Vertex Agent Builder

When setting up search in the GCP Agent Builder, we need to create both an app and a dataStore for each website we want to search (in this case BBC, DailyMail, and NYT). GCP's interface suggests that this creates a new search engine with its own database, but it doesn't: it creates a filtered view of Google Search results, limited to the website URL we specify.

Note: even though our code doesn't directly reference the App ID, you must still create both the app and the dataStore for each website. Creating just the dataStore isn't sufficient and leads to API errors.
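
To show how a search request can be addressed by dataStore alone, here is a hedged sketch that builds a Discovery Engine serving-config resource name. It assumes the dataStore-scoped resource name format (`.../dataStores/{id}/servingConfigs/default_search`) and the `global` location. The project ID, the dataStore IDs, and the names `ServingConfigSketch` and `servingConfigPath` are placeholders, not values from this repository. Check the format against the Discovery Engine documentation before relying on it.

```scala
object ServingConfigSketch {
  // Hypothetical dataStore IDs, one per site; real IDs come from the Agent Builder console.
  val dataStoreIdBySite: Map[String, String] = Map(
    "BBC"       -> "bbc-datastore-id",
    "DailyMail" -> "dailymail-datastore-id",
    "NYT"       -> "nyt-datastore-id"
  )

  /** Builds the serving-config resource name for one dataStore (assumed format). */
  def servingConfigPath(projectId: String, dataStoreId: String, location: String = "global"): String =
    s"projects/$projectId/locations/$location/collections/default_collection" +
      s"/dataStores/$dataStoreId/servingConfigs/default_search"
}
```

Nothing in this path mentions an app, which is consistent with the note above: the app still has to exist in Agent Builder, even though a request built this way names only the dataStore.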

Running the Checker locally

Pre-requisites

These mostly match the pre-requisites for running Ophan locally: Java 11 and sbt, and in particular ophan AWS credentials from Janus.

Running the Lambda locally

Execute this on the command line:

$ sbt run
