This is related to #173. We’d like to have ways to keep an eye on new and removed pages more broadly/at scale. In #173, we’re proposing to keep a running list of linked URLs from all the pages we monitor.
Another approach would be to regularly query the sitemap.xml for .gov sites that have it. Not all sites have one, and its location can also be inconsistent, but it’s usually at https://<domain>/sitemap.xml. It’s important to keep in mind some caveats here:
- Not all sites have an XML sitemap.
- The XML sitemaps do not always list all pages.
- When a page’s URL changes (i.e. it’s moved or renamed; we frequently see page URLs change when the title changes), that will look like one page being added and another being removed when using this approach. This can lead to a lot of false positives. (One mitigation here might be to send an HTTP request for any removed pages and see if they redirect to any new pages, but the scale of this operation could potentially get big and slow; see the sketch below.)
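A rough sketch of that redirect-check mitigation, assuming Python with the requests library (the function name and arguments here are just illustrative, not an existing part of this project):

```python
import requests

def find_redirect_target(removed_url, new_urls, timeout=10):
    """Check whether a URL that disappeared from the sitemap now
    redirects to one of the newly added URLs (i.e. the page was
    renamed/moved rather than actually removed)."""
    try:
        # HEAD keeps the request cheap; follow redirects to the final URL.
        response = requests.head(removed_url, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        return None
    final_url = response.url
    return final_url if final_url in new_urls else None
```

Even with cheap HEAD requests, running this for every removed URL on every check is where the “big and slow” concern comes in.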
Sitemaps can also be broken down recursively into sub-sitemaps, so this tool would need to make sure it follows and keeps track of all of them.
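A minimal sketch of what fetching a sitemap and recursively flattening any sub-sitemaps could look like, assuming Python with requests and the standard-library XML parser (the function name is hypothetical; the namespace is the standard sitemaps.org one):

```python
import requests
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def collect_sitemap_urls(sitemap_url, seen=None):
    """Return every page URL reachable from a sitemap, following
    nested <sitemapindex> entries recursively."""
    if seen is None:
        seen = set()
    if sitemap_url in seen:
        return []  # guard against sitemap loops
    seen.add(sitemap_url)

    response = requests.get(sitemap_url, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.content)

    urls = []
    if root.tag == f"{SITEMAP_NS}sitemapindex":
        # A sitemap index: recurse into each child sitemap.
        for loc in root.iter(f"{SITEMAP_NS}loc"):
            urls.extend(collect_sitemap_urls(loc.text.strip(), seen))
    else:
        # A regular <urlset>: collect page URLs directly.
        for loc in root.iter(f"{SITEMAP_NS}loc"):
            urls.append(loc.text.strip())
    return urls
```

Something like `collect_sitemap_urls("https://<domain>/sitemap.xml")` would then give the flattened list of pages for a site, whether or not it splits its sitemap into sub-sitemaps.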
My main clever idea here is that we could scrape sitemaps and commit them to a git repo, so the repo history would mirror the sitemap history. That might be a nice way to browse changes. We should probably commit both raw copies of all the sitemaps and sub-sitemaps alongside a single file listing all URLs found (this is potentially huge! Adding all of epa.gov’s sitemaps together gets you ~70k pages. But it’s useful for analysis).
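A sketch of how the commit step might look, assuming an existing local git clone and driving git via subprocess (the repo layout, file names, and function signature are just one possible choice, not a settled design):

```python
import subprocess
from pathlib import Path

def commit_snapshot(repo_dir, sitemaps, all_urls):
    """Write raw copies of every sitemap plus a single flattened URL
    list, then commit so the repo history mirrors sitemap history.

    `sitemaps` maps a sitemap URL to its raw XML bytes;
    `all_urls` is the flattened list of page URLs found.
    """
    repo = Path(repo_dir)
    raw_dir = repo / "raw"
    raw_dir.mkdir(exist_ok=True)

    # Raw copies of all sitemaps and sub-sitemaps.
    for sitemap_url, xml_bytes in sitemaps.items():
        filename = sitemap_url.replace("://", "_").replace("/", "_")
        (raw_dir / filename).write_bytes(xml_bytes)

    # One big file listing every URL found (sorted for stable diffs).
    (repo / "all-urls.txt").write_text("\n".join(sorted(all_urls)) + "\n")

    subprocess.run(["git", "add", "-A"], cwd=repo, check=True)

    # Only commit if something actually changed since the last snapshot.
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo, check=True, capture_output=True, text=True,
    )
    if status.stdout.strip():
        subprocess.run(
            ["git", "commit", "-m", "Update sitemap snapshot"],
            cwd=repo, check=True,
        )
```

Sorting the combined URL list keeps diffs stable between runs, so added/removed pages show up as clean one-line additions and deletions in the git history.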