-
Notifications
You must be signed in to change notification settings - Fork 126
feat: add ua-synonyms support in robots.json to reduce duplication (#144) #151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: add ua-synonyms support in robots.json to reduce duplication (#144) #151
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate you taking on this feature, but please could you reduce the scope of this PR so that it no longer deletes existing comments or diagnostics. Quite apart from these changes being irrelevant to the feature, they make reviewing the change much harder than it should be. Thanks.
Thanks for the feedback, Glyn. I’ve reduced the scope of this PR to only include what’s necessary for "ua-synonyms" support:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, the PR is still littered with changes that are out of scope. This makes it difficult to focus on the changes that are strictly relevant to the PR. I pointed out some of these, but didn't point them all out - you'll get the picture.
Perhaps it would be worth starting a new branch and copying across only the code that is required.
If you think some of the other changes are warranted, they should be the subject of other PR(s).
# "Search Engine Crawlers", | ||
# "SEO Crawlers", | ||
# "Uncategorized", | ||
"Undocumented AI Agents", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is out of scope for the PR.
category = section.find("h2").get_text() | ||
if category not in to_include: | ||
continue | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is out of scope for the PR.
if field not in existing_content[name]: | ||
return value | ||
# Unclear value | ||
# Replace unclear value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is out of scope for the PR.
code/robots.py
Outdated
and value not in default_values | ||
): | ||
return value | ||
# Existing value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is out of scope for the PR.
"operator": consolidate("operator", operator), | ||
"respect": consolidate("respect", default_value), | ||
"function": consolidate("function", f"{category}"), | ||
"function": consolidate("function", category), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change seems out of scope for the PR.
|
||
|
||
def escape_md(s): | ||
"""Escape markdown special characters in bot names.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is out of scope for the PR.
|
||
def json_to_table(robots_json): | ||
"""Compose a markdown table with the information in robots.json""" | ||
"""Compose a markdown table with the information in robots.json.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is out of scope for the PR.
table += ( | ||
f"| {escape_md(name)} | {robot['operator']} | {robot['respect']} | " | ||
f"{robot['function']} | {robot['frequency']} | {robot['description']} |\n" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is out of scope for the PR.
formatted = "|".join(map(re.escape, lst)) | ||
return f"({formatted})" | ||
"""Convert a list of user agents into a regex pattern.""" | ||
return f"({'|'.join(map(re.escape, lst))})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is out of scope for the PR.
code/robots.py
Outdated
def json_to_htaccess(robot_json): | ||
# Creates a .htaccess filter file. It uses a regular expression to filter out | ||
# User agents that contain any of the blocked values. | ||
htaccess = "RewriteEngine On\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is out of scope for the PR.
This PR implements support for
ua-synonyms
inrobots.json
to address #144.What's Included
"ua-synonyms"
array to represent alternate spellings of user agentsrobots.txt
,.htaccess
,nginx
,haproxy
, andCaddyfile
TestUASynonymsSupport
intests.py
Why It Matters
This change eliminates the need to duplicate entire entries for alternate spellings of bots, reducing maintenance overhead while preserving full coverage across output files.