Indexing: robots.txt blocks access to most but not all of the site

Some sites have been flagged as having content in the index but no home page. See https://github.com/searchmysite/searchmysite.net/issues/150 and https://github.com/searchmysite/searchmysite.net/issues/102 for why this is a problem.

There are at least two sites where the robots.txt blocks access to almost all of the site, but not quite all of it. As a result the site isn't automatically deindexed, but the index doesn't contain any useful content:
- https://mike-burns.com/robots.txt disallows indexing, but https://keys1.mike-burns.com/keys.atom is allowed and is the only document in the index.
- https://lostletters.neocities.org/robots.txt allows indexing the feed at https://lostletters.neocities.org/feed.xml but disallows everything else, including the home page and the pages the feed points to, so https://lostletters.neocities.org/feed.xml becomes the only document in the index.

The workaround is to identify these sites manually and then manually disable indexing for them. Need to think about whether there is a better way of handling this. No concrete ideas at this stage.
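One possible direction, as a rough sketch only: the manual identification step could perhaps be automated by checking whether a site's robots.txt blocks its home page even though some documents were indexed. The function names, the `*` user agent, and the example.com robots.txt below are all hypothetical and are not taken from the two sites above or from the searchmysite code. The sketch uses Python's standard-library `urllib.robotparser`, which applies the first matching rule and so may not behave exactly like the parser the indexer uses.

```python
from urllib.robotparser import RobotFileParser


def home_page_blocked(robots_txt, home_url, user_agent="*"):
    """Return True if the given robots.txt text disallows fetching home_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, home_url)


def should_flag(robots_txt, home_url, indexed_urls, user_agent="*"):
    """Flag a site that has indexed documents but a blocked home page.

    indexed_urls is assumed to be the list of URLs currently in the
    index for this site (hypothetical input, however it is obtained).
    """
    return bool(indexed_urls) and home_page_blocked(robots_txt, home_url, user_agent)


# Hypothetical robots.txt that allows only the feed and blocks the rest.
EXAMPLE_ROBOTS = """\
User-agent: *
Allow: /feed.xml
Disallow: /
"""

if __name__ == "__main__":
    flagged = should_flag(
        EXAMPLE_ROBOTS,
        "https://example.com/",
        ["https://example.com/feed.xml"],
    )
    print("flag for review:", flagged)
```

A flagged site would still need a decision on what to do with it (for example, disabling indexing as in the current workaround); the sketch only covers detection.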

Indexing: robots.txt blocks access to most but not all of the site #151
