
[web-crawler] robots.txt Politeness #6

@CSenshi

Summary

Implement a feature in the web crawler that automatically discovers, fetches, parses, and enforces the rules specified in a website’s robots.txt file before crawling any URLs from that domain.

This includes respecting Disallow, Allow, and Crawl-delay directives, and ensuring that the crawler does not access or queue URLs that are forbidden by the site's robots.txt policy. The crawler should cache robots.txt files per host so the same file is not re-fetched for every URL.
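
A minimal sketch of how this could fit together, assuming a TypeScript codebase on Node 18+ (for the built-in `fetch`); the names used here (`USER_AGENT`, `RobotsPolicy`, `RobotsCache`) are illustrative and not taken from the repository. The parser is intentionally simplified: prefix matching only (no `*` wildcards or `$` anchors), and all groups matching our user agent or `*` are merged.

```typescript
// Illustrative sketch only: per-host robots.txt caching with a simplified parser.
// USER_AGENT, RobotsPolicy and RobotsCache are hypothetical names, not existing code.

const USER_AGENT = 'web-crawler'; // assumed bot token; use the crawler's real UA string

interface RobotsPolicy {
  allow: string[];      // path prefixes explicitly allowed
  disallow: string[];   // path prefixes disallowed
  crawlDelayMs: number; // 0 when no Crawl-delay directive applies
}

// Parse only the directive groups that apply to our user agent (or `*`).
// Simplified: prefix rules only, and all matching groups are merged.
function parseRobots(body: string): RobotsPolicy {
  const policy: RobotsPolicy = { allow: [], disallow: [], crawlDelayMs: 0 };
  let applies = false;    // current group applies to our user agent
  let inUaHeader = false; // still reading consecutive User-agent lines of a group
  for (const rawLine of body.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // strip comments
    const sep = line.indexOf(':');
    if (sep < 0) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === 'user-agent') {
      const matches = value === '*' || value.toLowerCase().includes(USER_AGENT);
      applies = inUaHeader ? applies || matches : matches; // consecutive UA lines share one group
      inUaHeader = true;
    } else {
      inUaHeader = false;
      if (!applies) continue;
      if (field === 'disallow' && value) policy.disallow.push(value);
      else if (field === 'allow' && value) policy.allow.push(value);
      else if (field === 'crawl-delay') policy.crawlDelayMs = (Number(value) || 0) * 1000;
    }
  }
  return policy;
}

class RobotsCache {
  private policies = new Map<string, Promise<RobotsPolicy>>();

  // Fetch and parse robots.txt once per origin; concurrent callers share the same promise.
  private policyFor(origin: string): Promise<RobotsPolicy> {
    let cached = this.policies.get(origin);
    if (!cached) {
      cached = fetch(`${origin}/robots.txt`)
        .then((res) => (res.ok ? res.text() : '')) // missing/unreadable file => allow everything
        .catch(() => '')
        .then(parseRobots);
      this.policies.set(origin, cached);
    }
    return cached;
  }

  // A URL is allowed unless a Disallow rule matches it more specifically than any Allow rule.
  async isAllowed(url: string): Promise<boolean> {
    const { origin, pathname } = new URL(url);
    const policy = await this.policyFor(origin);
    const longestMatch = (rules: string[]) =>
      rules.filter((r) => pathname.startsWith(r)).reduce((max, r) => Math.max(max, r.length), 0);
    return longestMatch(policy.allow) >= longestMatch(policy.disallow);
  }

  // Milliseconds to wait between requests to this URL's host; 0 if no Crawl-delay is set.
  async crawlDelayMs(url: string): Promise<number> {
    return (await this.policyFor(new URL(url).origin)).crawlDelayMs;
  }
}
```

A crawler worker could call `isAllowed()` right before queueing or fetching a URL and use `crawlDelayMs()` to space out requests to the same host. A real implementation would also need cache expiry and more careful handling of 4xx/5xx responses for robots.txt, as described in RFC 9309, which this sketch glosses over.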

Affected Area(s)

Apps:

  • Url Shortener (apps/url-shortener)
  • Web Crawler (apps/web-crawler)

Libraries:

  • Shared (libs/shared)


Motivation

Respecting robots.txt prevents overloading servers and avoids crawling restricted areas, aligning with industry best practices and ethical standards.


Labels

enhancement (New feature or request)
