
Link crawling might be necessary after all #18

@justinccdev

Originally, I hoped we could require sites to link all marked-up pages directly from their sitemap.xml, as Biosamples does. This may be another position that needs revision, though I wouldn't rule it out just yet. The alternative is to also crawl by following webpage links, though I expect this to be slower (I could be wrong; the difference might not be significant).
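To make the sitemap-only approach concrete, here is a minimal sketch of pulling page URLs out of a sitemap document. It assumes the site follows the standard sitemaps.org schema; the function name `urls_from_sitemap` is hypothetical and not part of any existing crawler code.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol (assumed; non-standard sitemaps will not match).
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return (page_urls, child_sitemap_urls) for a sitemap or sitemap index.

    Hypothetical helper: a <urlset> yields page URLs, a <sitemapindex>
    yields the URLs of further sitemaps that would need fetching in turn.
    """
    root = ET.fromstring(xml_text)
    pages = [el.text.strip() for el in root.findall(f"{NS}url/{NS}loc") if el.text]
    children = [el.text.strip() for el in root.findall(f"{NS}sitemap/{NS}loc") if el.text]
    return pages, children
```

With this approach the crawler only fetches the pages the site explicitly lists, which is why it depends on sites keeping their sitemap complete.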

PDBe is our example for this: another large website not dissimilar to Biosamples. PDBe does not appear to list its pages in its sitemap. In fact, even with link crawling there's no obvious way to reach all their data, as it sits behind a search interface. @ricardoaat is going to investigate whether there is a way of crawling that site. If not, the best case is that they make their sitemap.xml link to all their entries (though this may just push the problem back until we encounter a site that will not do this, or one that is marked up but has little technical capacity to respond to requests). Another possibility is that PDBe starts providing links to all their entries, but not through their sitemap.xml.

What we want to avoid, if at all possible, is having custom code to crawl certain sites (e.g. by entering *.* in the search form at PDBe). This will not scale when we try to crawl many different sites and is very sensitive to changes in the target site.
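For comparison, a generic (site-independent) link crawl could look roughly like the sketch below. It is only an illustration under stated assumptions: `extract_links` and `crawl` are hypothetical names, `fetch` is a caller-supplied function returning a page's HTML or `None`, and the crawl is restricted to the starting host with a page cap.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collects the raw href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html_text, base_url):
    """Return absolute, fragment-free URLs for every <a href> in the page."""
    parser = _LinkParser()
    parser.feed(html_text)
    return [urldefrag(urljoin(base_url, href))[0] for href in parser.hrefs]


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of same-host pages; returns URLs in visit order.

    `fetch(url)` is an assumed callable returning HTML text, or None on failure.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html_text = fetch(url)
        if html_text is None:
            continue  # skip pages that could not be retrieved
        visited.append(url)
        for link in extract_links(html_text, url):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Note that a crawl like this only discovers pages reachable through `<a href>` links, so entries that sit behind a search form would still be missed.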
