Skip to content

Conversation

brndnprog
Copy link

This PR implements support for ua-synonyms in robots.json to address #144.

What's Included

  • Adds optional "ua-synonyms" array to represent alternate spellings of user agents
  • Updates all output formats to account for synonyms:
    • robots.txt, .htaccess, nginx, haproxy, and Caddyfile
  • Normalizes user-agent expansion across formats
  • Adds test coverage via TestUASynonymsSupport in tests.py

Why It Matters

This change eliminates the need to duplicate entire entries for alternate spellings of bots, reducing maintenance overhead while preserving full coverage across output files.

Copy link
Contributor

@glyn glyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate you taking on this feature, but please could you reduce the scope of this PR so that it no longer deletes existing comments or diagnostics. Quite apart from these changes being irrelevant to the feature, they make reviewing the change much harder than it should be. Thanks.

@brndnprog
Copy link
Author

Thanks for the feedback, Glyn. I’ve reduced the scope of this PR to only include what’s necessary for "ua-synonyms" support:

  • No formatting, comment, or diagnostic deletions
  • Changes are limited to json_to_* functions for robots.txt, .htaccess, nginx, caddy, and haproxy
  • Added 2 targeted tests to confirm "ua-synonyms" are processed correctly
    Please let me know if anything else needs adjusting.

Copy link
Contributor

@glyn glyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, the PR is still littered with changes that are out of scope. This makes it difficult to focus on the changes that are strictly relevant to the PR. I pointed out some of these, but didn't point them all out - you'll get the picture.

Perhaps it would be worth starting a new branch and copying across only the code that is required.

If you think some of the other changes are warranted, they should be the subject of other PR(s).

# "Search Engine Crawlers",
# "SEO Crawlers",
# "Uncategorized",
"Undocumented AI Agents",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is out of scope for the PR.

category = section.find("h2").get_text()
if category not in to_include:
continue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is out of scope for the PR.

if field not in existing_content[name]:
return value
# Unclear value
# Replace unclear value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is out of scope for the PR.

code/robots.py Outdated
and value not in default_values
):
return value
# Existing value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is out of scope for the PR.

"operator": consolidate("operator", operator),
"respect": consolidate("respect", default_value),
"function": consolidate("function", f"{category}"),
"function": consolidate("function", category),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems out of scope for the PR.



def escape_md(s):
"""Escape markdown special characters in bot names."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is out of scope for the PR.


def json_to_table(robots_json):
"""Compose a markdown table with the information in robots.json"""
"""Compose a markdown table with the information in robots.json."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is out of scope for the PR.

table += (
f"| {escape_md(name)} | {robot['operator']} | {robot['respect']} | "
f"{robot['function']} | {robot['frequency']} | {robot['description']} |\n"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is out of scope for the PR.

formatted = "|".join(map(re.escape, lst))
return f"({formatted})"
"""Convert a list of user agents into a regex pattern."""
return f"({'|'.join(map(re.escape, lst))})"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is out of scope for the PR.

code/robots.py Outdated
def json_to_htaccess(robot_json):
# Creates a .htaccess filter file. It uses a regular expression to filter out
# User agents that contain any of the blocked values.
htaccess = "RewriteEngine On\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is out of scope for the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants