Google Search Indexing Observatory

How long does it take for content published by news organisations to be available in Google search?

This broadens Ophan's Google Search Index Checker to check for content published by many news organisations, not just the Guardian. We're trying to work out if the intermittent multi-hour delays we've seen for some Guardian articles to be available in Google Search are typical for other news organisations too, or if there's actually something particular to the Guardian that needs to be fixed.

It's an 'observatory' in the same way that the EFF SSL Observatory is - creating and collating observations of distant sites and processes that are visible to us but beyond our control.

Steps performed by the Observatory

Fetch the Sitemap XML for a news site.
Hit the Discovery Engine API (the service is named Google Vertex Agent Builder) to check whether the content listed is available in Google search. API consumption and cost 💰💰💰 for this can be monitored in the Google Cloud console.
Store whether each article is available (or not) in an AWS DynamoDB table.
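
The steps above can be sketched roughly as follows. This is a minimal illustration, not the Observatory's actual code: the names ObservatorySketch, Observation, sitemapUrls and observe are hypothetical, the search check and the DynamoDB write are passed in as plain functions so nothing here calls Google or AWS, and it assumes the sitemap uses the standard sitemaps.org namespace.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory

object ObservatorySketch {
  // Hypothetical record of one observation; not the repo's actual table schema.
  final case class Observation(url: String, foundInSearch: Boolean)

  // Namespace defined by the sitemaps.org protocol for <urlset>, <url> and <loc>.
  private val SitemapNs = "http://www.sitemaps.org/schemas/sitemap/0.9"

  // Step 1: extract every <loc> URL from a sitemap XML document.
  def sitemapUrls(sitemapXml: String): List[String] = {
    val factory = DocumentBuilderFactory.newInstance()
    factory.setNamespaceAware(true)
    val bytes = sitemapXml.getBytes(StandardCharsets.UTF_8)
    val doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(bytes))
    val locs = doc.getElementsByTagNameNS(SitemapNs, "loc")
    (0 until locs.getLength).map(i => locs.item(i).getTextContent.trim).toList
  }

  // Steps 2 and 3: check each URL with the supplied function, then pass the
  // result to the supplied store (assumed to wrap the search API and DynamoDB).
  def observe(
      sitemapXml: String,
      isIndexed: String => Boolean,
      store: Observation => Unit
  ): List[Observation] = {
    val observations = sitemapUrls(sitemapXml).map(url => Observation(url, isIndexed(url)))
    observations.foreach(store)
    observations
  }
}
```

Keeping the check and the store as parameters means the flow can be exercised locally with stubs before any real credentials are involved.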

Vertex Agent Builder

When setting up search functionality in the GCP Agent Builder, we need to create both an app and a dataStore for each website we want to search (in this case BBC, DailyMail, and NYT). While GCP's interface suggests this process creates a new search engine with its own database, this isn't actually what happens. Instead, it creates a filtered view of Google Search results, limited to the specific website URL we specify. Note: even though our code doesn't directly reference the App ID, you must still create both the app and the dataStore for each website. Creating just the dataStore isn't sufficient and leads to API errors.
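
To show why only the dataStore ID appears in a request, here is a rough sketch of how a search call against a single dataStore could be addressed. It follows our reading of the public Discovery Engine REST reference (the servingConfigs search method, with the default_collection and default_search segments); the names DiscoveryEngineRequestSketch, DataStoreRef, searchEndpoint and searchBody are hypothetical, and the Observatory's real code may build its requests differently, for example through a Google client library.

```scala
object DiscoveryEngineRequestSketch {
  // Hypothetical holder for the identifiers of one per-website dataStore.
  final case class DataStoreRef(projectId: String, location: String, dataStoreId: String)

  // URL to POST a search request to. Note that it names the dataStore,
  // not the app, even though the app must exist too.
  def searchEndpoint(ref: DataStoreRef): String =
    s"https://discoveryengine.googleapis.com/v1/projects/${ref.projectId}" +
      s"/locations/${ref.location}/collections/default_collection" +
      s"/dataStores/${ref.dataStoreId}/servingConfigs/default_search:search"

  // Minimal JSON body using the query and pageSize request fields.
  // Only backslashes and double quotes are escaped; a real client
  // should use a JSON library instead.
  def searchBody(query: String, pageSize: Int = 10): String = {
    val escaped = query.replace("\\", "\\\\").replace("\"", "\\\"")
    s"""{"query":"$escaped","pageSize":$pageSize}"""
  }
}
```

What to put in the query (an article URL, a headline, or something else) is left open here, as is authentication, which would normally be a bearer token for the GCP project.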

Running the Checker locally

Pre-requisites

These mostly match the pre-requisites for running Ophan locally: Java 11 and sbt, and in particular ophan AWS credentials from Janus.

Running the Lambda locally

Execute this on the command line:

$ sbt run
