Summary
Implement a feature in the web crawler that automatically discovers, fetches, parses, and enforces the rules specified in a website's robots.txt file before crawling any URLs from that domain.
This includes respecting Disallow, Allow, and Crawl-delay directives, and ensuring that the crawler does not access or queue URLs that are forbidden by the site's robots.txt policy. The crawler should cache robots.txt files per domain so they are not re-fetched for every URL.
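As a rough illustration of one possible shape for this feature (a minimal sketch only: it assumes a TypeScript/Node 18+ codebase with global fetch, and names such as RobotsCache, isAllowed, and crawlDelayMs are hypothetical, not existing code in apps/web-crawler or libs/shared; path wildcards and multi-agent groups are not handled):

```ts
// Minimal sketch: fetch, parse, cache, and enforce robots.txt per origin.
// Assumes Node 18+ (global fetch). Names are illustrative, not existing code.

interface RobotsRules {
  allow: string[];       // path prefixes from Allow directives
  disallow: string[];    // path prefixes from Disallow directives
  crawlDelayMs?: number; // from Crawl-delay, converted to milliseconds
}

export class RobotsCache {
  private cache = new Map<string, RobotsRules>(); // keyed by origin

  constructor(private userAgent = "web-crawler") {}

  /** True if the crawler may fetch the given URL. */
  async isAllowed(url: string): Promise<boolean> {
    const { origin, pathname } = new URL(url);
    const rules = await this.rulesFor(origin);
    // Longest matching rule wins; Allow wins ties (RFC 9309 precedence).
    return longestMatch(rules.allow, pathname) >= longestMatch(rules.disallow, pathname);
  }

  /** Crawl-delay for the URL's origin, in milliseconds (0 if unspecified). */
  async crawlDelayMs(url: string): Promise<number> {
    const rules = await this.rulesFor(new URL(url).origin);
    return rules.crawlDelayMs ?? 0;
  }

  private async rulesFor(origin: string): Promise<RobotsRules> {
    const cached = this.cache.get(origin);
    if (cached) return cached;

    let body = "";
    try {
      const res = await fetch(`${origin}/robots.txt`);
      if (res.ok) body = await res.text();
      // Missing or errored robots.txt: fall through with empty rules (allow all).
    } catch {
      /* network error: treat as no robots.txt */
    }

    const rules = parseRobots(body, this.userAgent);
    this.cache.set(origin, rules);
    return rules;
  }
}

// Simplified parser: ignores path wildcards (*, $) and multi-line User-agent groups.
function parseRobots(text: string, userAgent: string): RobotsRules {
  const rules: RobotsRules = { allow: [], disallow: [] };
  let applies = false; // inside a group that applies to our user agent?

  for (const rawLine of text.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // strip comments
    const [field, ...rest] = line.split(":");
    if (!field || rest.length === 0) continue;
    const value = rest.join(":").trim();

    switch (field.trim().toLowerCase()) {
      case "user-agent":
        applies = value === "*" || userAgent.toLowerCase().includes(value.toLowerCase());
        break;
      case "disallow":
        if (applies && value) rules.disallow.push(value);
        break;
      case "allow":
        if (applies && value) rules.allow.push(value);
        break;
      case "crawl-delay":
        if (applies && !Number.isNaN(Number(value))) rules.crawlDelayMs = Number(value) * 1000;
        break;
    }
  }
  return rules;
}

/** Length of the longest rule prefix matching the path, or -1 if none match. */
function longestMatch(prefixes: string[], path: string): number {
  let best = -1;
  for (const p of prefixes) {
    if (path.startsWith(p) && p.length > best) best = p.length;
  }
  return best;
}
```

With something along these lines, the crawler's queueing step could call `await robots.isAllowed(url)` before enqueueing a URL and `await robots.crawlDelayMs(url)` to throttle per-domain fetches, with the cache ensuring each origin's robots.txt is fetched only once.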
Affected Area(s)
Apps:
- URL Shortener (apps/url-shortener)
- Web Crawler (apps/web-crawler)
Libraries:
- Shared (libs/shared)
Motivation
Respecting robots.txt prevents overloading servers and avoids crawling restricted areas, aligning with industry best practices and ethical standards.