Open
Labels
bug (Something isn't working), good first issue (Good for newcomers)
Description
Problem:
The matching loop breaks on the first matching rule, but RFC 9309 requires that the most specific (longest) matching rule be applied. The code sorts rules by priority but then ignores that ordering by breaking on the first match.
protego/src/protego/_ruleset.py, line 96 in 23f56ef:
    break
Impact: incorrect allow/disallow decisions when multiple rules match the same URL.
Example:
Rules: Disallow: /admin and Allow: /admin/public
URL: /admin/public/page
Current behavior: Incorrectly blocked (matches /admin first)
Correct behavior: Should be allowed (longer /admin/public rule wins)
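To make the example concrete, here is a minimal standalone sketch that compares the two strategies on the rules above. It does not use Protego's internals: the `(field, prefix)` tuples and the names `first_match_allows` and `longest_match_allows` are hypothetical, it only handles plain prefix rules (no `*` or `$` wildcards), and it assumes the first-match variant sees rules in file order.

```python
from typing import List, Tuple

# Hypothetical rule representation: (field, path_prefix).
# This is an illustration only, not Protego's internal rule type.
Rule = Tuple[str, str]


def first_match_allows(rules: List[Rule], path: str) -> bool:
    """Stop at the first rule whose prefix matches, in the order given."""
    for field, prefix in rules:
        if path.startswith(prefix):
            return field == "allow"
    return True  # no rule matched: allowed by default


def longest_match_allows(rules: List[Rule], path: str) -> bool:
    """Apply the matching rule with the longest prefix; Allow wins ties."""
    matching = [
        (len(prefix), field == "allow")
        for field, prefix in rules
        if path.startswith(prefix)
    ]
    if not matching:
        return True  # no rule matched: allowed by default
    # Tuples compare by prefix length first, then True (allow) over False.
    _, allowed = max(matching)
    return allowed


rules = [("disallow", "/admin"), ("allow", "/admin/public")]
print(first_match_allows(rules, "/admin/public/page"))    # False
print(longest_match_allows(rules, "/admin/public/page"))  # True
```

The tie-break toward Allow follows RFC 9309, which says the Allow rule should be used when an Allow and a Disallow rule are equally specific.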
Possible fix:
def can_fetch(self, url: str) -> bool:
    """Return if the url can be fetched."""
    url = quote_path(url)
    most_specific_rule = None
    longest_match = -1
    for rule in self._rules:
        match = rule.value.match(url)
        if match:
            match_length = len(match.group(0)) if hasattr(match, 'group') else len(rule.value.pattern)
            if match_length > longest_match:
                most_specific_rule = rule
                longest_match = match_length
    if most_specific_rule:
        return most_specific_rule.field.lower() == "allow"
    return True  # Default allow if no matching rule