
Possible URL pattern matching bug #64

@shivanshuzyte

Description


Problem:

The logic breaks on the first matching rule, but robots.txt (RFC 9309) requires applying the most specific, i.e. longest, matching rule. The code sorts rules by priority but then defeats that sorting by returning on the first match.

The impact is incorrect allow/disallow decisions whenever multiple rules match the same URL.

Example:

Rules: Disallow: /admin and Allow: /admin/public
URL: /admin/public/page

Current behavior: Incorrectly blocked (the shorter /admin rule matches first)
Correct behavior: Should be allowed (the longer /admin/public rule wins)

Possible fix:

def can_fetch(self, url: str) -> bool:
    """Return whether the given URL may be fetched."""
    url = quote_path(url)
    most_specific_rule = None
    longest_match = -1

    for rule in self._rules:
        match = rule.value.match(url)
        if match:
            # A re.Match always has group(); the length of the matched
            # span is the rule's specificity.
            match_length = len(match.group(0))
            if match_length > longest_match:
                most_specific_rule = rule
                longest_match = match_length

    if most_specific_rule:
        return most_specific_rule.field.lower() == "allow"

    return True  # Default: allow when no rule matches
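A self-contained sketch of the longest-match semantics, using a hypothetical `Rule` stand-in for the project's rule objects (the real parser presumably also translates robots.txt wildcards into the compiled patterns; plain escaped prefixes are used here just to reproduce the example above):

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    field: str        # "allow" or "disallow"
    value: re.Pattern  # compiled pattern anchored at the path start

def can_fetch(rules, url):
    """Pick the longest-matching rule, per RFC 9309 precedence."""
    most_specific_rule = None
    longest_match = -1
    for rule in rules:
        match = rule.value.match(url)
        if match:
            match_length = len(match.group(0))
            if match_length > longest_match:
                most_specific_rule = rule
                longest_match = match_length
    if most_specific_rule:
        return most_specific_rule.field.lower() == "allow"
    return True  # no matching rule: default allow

rules = [
    Rule("disallow", re.compile(re.escape("/admin"))),
    Rule("allow", re.compile(re.escape("/admin/public"))),
]
print(can_fetch(rules, "/admin/public/page"))  # True: /admin/public wins
print(can_fetch(rules, "/admin/secret"))       # False: only /admin matches
```

Note that the result no longer depends on the order of `rules`, which is the point of the fix: the break-on-first-match version returns whatever the sort happens to put first.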


Labels: bug (Something isn't working), good first issue (Good for newcomers)
