
Explore tooling for monitoring XML sitemaps #181

@Mr0grog

Description

This is related to #173. We’d like to have ways to keep an eye on new and removed pages more broadly/at scale. In #173, we’re proposing to keep a running list of linked URLs from all the pages we monitor.

Another approach would be to regularly query the XML sitemap for .gov sites that have one. Its location can be inconsistent, but usually it’s at https://<domain>/sitemap.xml. It’s important to keep in mind some caveats here:

  • Not all sites have an XML sitemap.

  • The XML sitemaps do not always list all pages.

  • When a page’s URL changes (i.e. it’s moved or renamed; we frequently see page URLs change when the title changes), this approach will see one page being removed and another being added. That can lead to a lot of false positives.

    (One mitigation here might be to send an HTTP request for each removed page and see if it redirects to one of the new pages, but at scale this operation could get big and slow. A rough sketch follows this list.)
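
To make the mitigation concrete, here’s a minimal sketch using the `requests` library. The function name and the example URL are made up for illustration; a real tool would probably need a GET fallback for servers that mishandle HEAD, plus rate limiting.

```python
# Hypothetical sketch: check whether a "removed" URL actually redirects
# to a new location, so we can pair removals with additions instead of
# reporting both as changes.
import requests

def find_redirect_target(removed_url, timeout=10):
    """Return the final URL a removed page redirects to, or None."""
    try:
        # HEAD keeps the request cheap; allow_redirects follows the chain.
        response = requests.head(removed_url, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        return None
    if response.history and response.url != removed_url:
        return response.url
    return None

# e.g. target = find_redirect_target('https://www.epa.gov/some-removed-page')
```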

Sitemaps can also be broken down recursively into sub-sitemaps (a sitemap index that points to other sitemaps), so this tool would need to make sure it keeps track of all of them.
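
As a rough sketch of what that recursion looks like (using `requests` and the standard library’s ElementTree; the element names and namespace come from the sitemaps.org spec, and the epa.gov URL at the bottom is just an example — real sitemaps may also be gzipped or malformed, which this ignores):

```python
# Hypothetical sketch: yield every page URL reachable from a sitemap,
# recursing into <sitemapindex> documents that point at sub-sitemaps.
import requests
import xml.etree.ElementTree as ET

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def walk_sitemap(url, seen=None):
    """Yield every page URL reachable from `url`."""
    seen = seen if seen is not None else set()
    if url in seen:  # guard against sitemaps that reference each other
        return
    seen.add(url)
    root = ET.fromstring(requests.get(url, timeout=30).content)
    if root.tag.endswith('sitemapindex'):
        # A sitemap index lists other sitemaps; recurse into each one.
        for loc in root.findall('sm:sitemap/sm:loc', NS):
            yield from walk_sitemap(loc.text.strip(), seen)
    else:
        # A plain <urlset> lists actual pages.
        for loc in root.findall('sm:url/sm:loc', NS):
            yield loc.text.strip()

# e.g. urls = set(walk_sitemap('https://www.epa.gov/sitemap.xml'))
```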

My main clever idea here is that we could scrape sitemaps and commit them to a git repo, so the repo history would mirror the sitemap history. That might be a nice way to browse changes. We should probably commit raw copies of all the sitemaps and sub-sitemaps alongside a single file listing all URLs found (this is potentially huge! Adding all of epa.gov’s sitemaps together gets you ~70k pages. But it’s useful for analysis).
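
A sketch of what a snapshot commit could look like. The repo layout, filenames, and function are all made up here; the point is just that `git log` and `git diff` become the change history for free.

```python
# Hypothetical sketch: write raw sitemap XML plus one combined URL list
# into a local repo, then commit the snapshot.
import subprocess
from pathlib import Path
from urllib.parse import urlparse

def commit_snapshot(repo_dir, sitemaps, all_urls):
    """`sitemaps` maps sitemap URL -> raw XML bytes; `all_urls` is every page URL found."""
    repo = Path(repo_dir)
    for sitemap_url, xml_bytes in sitemaps.items():
        # Derive a stable filename from the sitemap's host and path.
        parsed = urlparse(sitemap_url)
        name = (parsed.netloc + parsed.path).replace('/', '__')
        (repo / name).write_bytes(xml_bytes)
    # One sorted file of all URLs keeps diffs readable, even at ~70k lines.
    (repo / 'all-urls.txt').write_text('\n'.join(sorted(all_urls)) + '\n')
    subprocess.run(['git', 'add', '--all'], cwd=repo, check=True)
    # Note: `git commit` exits nonzero when nothing changed; a real tool
    # should detect that case instead of treating it as an error.
    subprocess.run(['git', 'commit', '-m', 'Sitemap snapshot'], cwd=repo, check=True)
```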
