How long does it take for content published by news organisations to be available in Google search?
This broadens Ophan's Google Search Index Checker to check for content published by many news organisations, not just the Guardian. We're trying to work out if the intermittent multi-hour delays we've seen for some Guardian articles to be available in Google Search are typical for other news organisations too, or if there's actually something particular to the Guardian that needs to be fixed.
It's an 'observatory' in the same way that the EFF SSL Observatory is - creating and collating observations of distant sites and processes that are visible to us but beyond our control.
- Fetch the Sitemap XML for a news site
- Query the Discovery Engine API (the service is named Google Vertex Agent Builder) to check whether each article listed in the sitemap is available in Google Search. API consumption and cost 💰💰💰 can be monitored in the Google Cloud console.
- Store whether each article is available (or not) in an AWS DynamoDB table.
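The first and last of these steps can be sketched as follows. This is a minimal illustration in Python, not the project's actual code: the sample sitemap, the `SitemapEntry` type, the `availability_record` helper and its attribute names are all hypothetical, and the example assumes the site publishes a Google News style sitemap with a `news:publication_date` element.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

# Standard XML namespaces for sitemaps and the Google News sitemap extension.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}

# Hypothetical sample data; a real run would fetch this over HTTP.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://news.example.com/articles/one</loc>
    <news:news>
      <news:publication_date>2024-05-01T09:00:00Z</news:publication_date>
    </news:news>
  </url>
  <url>
    <loc>https://news.example.com/articles/two</loc>
  </url>
</urlset>
"""


@dataclass
class SitemapEntry:
    url: str
    published: Optional[str]  # ISO-8601 string, or None if the sitemap omits it


def parse_news_sitemap(xml_text: str) -> List[SitemapEntry]:
    """Extract article URLs (and publication dates, where present)."""
    root = ET.fromstring(xml_text)
    entries = []
    for url_el in root.findall("sm:url", NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=NS).strip()
        published = url_el.findtext(
            "news:news/news:publication_date", default=None, namespaces=NS
        )
        if loc:
            entries.append(
                SitemapEntry(loc, published.strip() if published else None)
            )
    return entries


def availability_record(entry: SitemapEntry, found: bool, checked_at: str) -> dict:
    """Shape one observation for storage; attribute names are assumptions."""
    return {
        "url": entry.url,
        "published": entry.published,
        "checkedAt": checked_at,
        "foundInSearch": found,
    }


if __name__ == "__main__":
    for entry in parse_news_sitemap(SAMPLE_SITEMAP):
        print(entry)
```

Keeping the publication date alongside the time of each check is what would let the delay between publishing and appearing in search be measured later.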
When setting up search functionality in the GCP Agent Builder, we need to create both an app and a dataStore in the Agent Builder for each website we want to search (in this case BBC, DailyMail, and NYT). While GCP's interface suggests this process creates a new search engine with its own database, this isn't actually what happens. Instead, it creates a filtered view of Google Search results, limited to the specific website URL we specify. Note: Even though our code doesn't directly reference the App ID, you must still create both the app and the dataStore for each website - creating just the dataStore isn't sufficient and leads to API errors.
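To illustrate why only the dataStore appears in code, here is a rough Python sketch of how a search request against one site's dataStore might be assembled for the Discovery Engine REST API. The project ID, the dataStore IDs, the `default_search` serving config name and the idea of using the article URL as the query are all assumptions made for illustration; check them against your own Agent Builder setup.

```python
import json
from typing import Tuple

API_ROOT = "https://discoveryengine.googleapis.com/v1"

# Hypothetical dataStore IDs, one per website; the real IDs are whatever
# was chosen when each dataStore was created in the Agent Builder.
DATA_STORES = {
    "bbc": "example-bbc-datastore",
    "dailymail": "example-dailymail-datastore",
    "nyt": "example-nyt-datastore",
}


def serving_config_path(
    project_id: str, data_store_id: str, location: str = "global"
) -> str:
    """Resource path of a dataStore's serving config (no App ID involved)."""
    return (
        f"projects/{project_id}/locations/{location}"
        f"/collections/default_collection/dataStores/{data_store_id}"
        f"/servingConfigs/default_search"
    )


def search_request(
    project_id: str, data_store_id: str, article_url: str
) -> Tuple[str, str]:
    """Return the endpoint URL and JSON body for a single search call."""
    endpoint = f"{API_ROOT}/{serving_config_path(project_id, data_store_id)}:search"
    # Assumption: querying with the article URL and asking for one result
    # is enough to tell whether the article is visible in search.
    body = json.dumps({"query": article_url, "pageSize": 1})
    return endpoint, body


if __name__ == "__main__":
    endpoint, body = search_request(
        "example-project",
        DATA_STORES["bbc"],
        "https://news.example.com/articles/one",
    )
    print(endpoint)
    print(body)
```

The request only names the dataStore, which is consistent with the note above: the app still has to exist in the Agent Builder, even though nothing here refers to it.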
These mostly match the pre-requisites for running Ophan locally - specifically Java 11 & sbt, but also especially the requirement to have ophan AWS credentials from Janus.
Execute this on the command line:
$ sbt run