feat: Add sitemap.xml support for efficient site discovery #19

@hayatosc

I love using this MCP, but I sometimes find that it doesn't retrieve all of the docs from a site.

Problem

Currently, this tool discovers pages by extracting links from HTML <a> tags during the crawling process. This approach works well, but it can be inefficient for large sites and may miss pages that aren't linked from other pages.

Some sites (like https://nextjs.org) publish a sitemap.xml, so I think using it would make site crawling more efficient.

Proposed Enhancement

  1. Automatically detect sitemap.xml at common locations ( /sitemap.xml , or the location referenced in /robots.txt )
  2. Add a configuration option: allow users to enable or disable sitemap usage via a CLI flag, --sitemap=/sitemap.xml
  3. Parse XML structure to extract all URLs listed in the sitemap
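
To make steps 1 and 3 concrete, here is a rough sketch in Python. It is not this project's code, only an illustration of the idea. The function names `candidate_sitemaps` and `parse_sitemap` are hypothetical, and fetching the files over HTTP is left out so the logic stays self-contained. It assumes the standard sitemap format, where a `<urlset>` lists pages and a `<sitemapindex>` lists further sitemaps.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def candidate_sitemaps(base_url: str, robots_txt: str = "") -> list[str]:
    """Hypothetical helper: sitemap URLs to try, robots.txt entries first."""
    found = []
    for line in robots_txt.splitlines():
        # "Sitemap: https://..." -> split on the first colon only
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    # Fall back to the conventional location at the site root
    default = urljoin(base_url, "/sitemap.xml")
    if default not in found:
        found.append(default)
    return found


def _local_name(tag: str) -> str:
    # Strip the "{namespace}" prefix that ElementTree adds to tag names
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text: str) -> tuple[list[str], list[str]]:
    """Hypothetical helper: return (page URLs, nested sitemap URLs)."""
    root = ET.fromstring(xml_text)
    pages, nested = [], []
    for entry in root:
        kind = _local_name(entry.tag)
        for child in entry:
            if _local_name(child.tag) != "loc" or not (child.text or "").strip():
                continue
            if kind == "url":
                pages.append(child.text.strip())
            elif kind == "sitemap":
                nested.append(child.text.strip())
    return pages, nested
```

A crawler could try each candidate in order, call `parse_sitemap` on the first one that loads, and repeat for any nested sitemaps it returns. Pages found this way could simply be added to the existing crawl queue alongside the links extracted from `<a>` tags.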

If you like this proposal, I'd be happy to work on it. What do you think?
