
Transition to an established crawler package


It's been fun hacking together a crawler, but now we're facing scalability challenges in doing things like crawling all of Biosamples, and even that is relatively small in the grand scheme of things.

So rather than write more custom code (though we might still maintain this crawler for a while to meet upcoming hackathons), it's time to look seriously at transitioning to an established crawler package. Highly desirable features are:

  • Open-source with an OSI-approved license
  • Recent non-trivial commits by more than one person
  • Reasonably active mailing list
  • Recent releases
  • Ideally, an easy-ish way to extract JSON-LD that is inserted into pages by JavaScript and so isn't in the original HTML, using Selenium or similar (see the sketch after this list).
  • Not written in Perl
  • Preference towards Python but other languages fine
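To illustrate the Selenium point above, here is a minimal sketch of pulling JSON-LD out of a page after JavaScript has run. The URL and browser choice are placeholders, not part of any existing crawler code:

```python
import json

from selenium import webdriver
from selenium.webdriver.common.by import By

# Placeholder URL; any page that injects JSON-LD via JavaScript would do.
url = "https://example.org/some-sample-page"

driver = webdriver.Firefox()  # or webdriver.Chrome()
try:
    driver.get(url)
    # JSON-LD blocks added by JavaScript now appear in the live DOM,
    # even though they were not in the original HTML source.
    blocks = driver.find_elements(By.XPATH, '//script[@type="application/ld+json"]')
    for block in blocks:
        data = json.loads(block.get_attribute("innerHTML"))
        print(json.dumps(data, indent=2))
finally:
    driver.quit()
```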

Here are the candidates:

| Project | Language | Tried? | Comments |
|---------|----------|--------|----------|
| frontera | Python 2/3 | n | By Scrapinghub, who also make Scrapy, and it can optionally use Scrapy. This is a crawl frontier rather than a full crawling solution. Scrapinghub has a JSON-LD extraction library, though this is not obviously related to Frontera. Somewhat active mailing list; parallel processes. |
| heritrix | ? | n | The Internet Archive's crawler. Last release was 3.2.0 in 2014. |
| norconex | Java | n | Recent releases; single committer; license unclear. |
| nutch | Java | y | Well-established Apache project; the 1.x branch is apparently more active than 2.x; recent releases; recent commits by more than one person; documentation is incomplete and sometimes inaccurate, with TODO placeholders. |
| scrapy | Python 2/3 | n | Backed by Scrapinghub. No easy support for Selenium? Single process. |
| stormcrawler | Java | y | Recent releases; fairly active mailing list; vast majority of commits by one person; has a JSON-LD extraction filter. The tutorial leaves something to be desired. |
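The Scrapinghub JSON-LD extraction library mentioned in the table is presumably extruct. As a rough sketch of what using it might look like (the HTML string here is an invented example):

```python
import extruct

# Invented example page containing an embedded JSON-LD block.
html = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Dataset", "name": "Example sample"}
</script>
</head><body></body></html>
"""

# extruct returns a dict keyed by syntax (json-ld, microdata, rdfa, ...).
data = extruct.extract(html)
print(data["json-ld"])
```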

Conclusion

Tentatively, I'm finding Scrapy/Frontera to be the best of the bunch. The documentation for Scrapy seems very good (Frontera's slightly less so, but still okay); it's written in Python, so we can reuse parts of the existing crawler as necessary; it's a relatively new project, so there's little cruft by the looks of it; and it's company-backed rather than maintained by a single individual, which I think should speak well to its longevity and community.
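To make that direction a bit more concrete, here is a minimal sketch of what a Scrapy spider extracting embedded JSON-LD might look like. The spider name, start URL, and link-following rule are placeholders, not design decisions:

```python
import json

import scrapy


class JsonLdSpider(scrapy.Spider):
    """Minimal sketch: yield every JSON-LD block found on crawled pages."""

    name = "jsonld"
    # Placeholder seed; the real start URLs would come from configuration.
    start_urls = ["https://example.org/"]

    def parse(self, response):
        # JSON-LD embedded directly in the HTML source.
        for raw in response.xpath('//script[@type="application/ld+json"]/text()').extract():
            try:
                yield json.loads(raw)
            except json.JSONDecodeError:
                self.logger.warning("Malformed JSON-LD on %s", response.url)

        # Follow links to keep crawling.
        for href in response.css("a::attr(href)").extract():
            yield response.follow(href, callback=self.parse)
```

Something like this could be run standalone with `scrapy runspider jsonld_spider.py -o output.jl` while exploring, before wiring Frontera in for frontier management.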

On the downside, it doesn't build on well-tested scalability projects like Apache Cassandra or Storm. But right now, I think these concerns are outweighed by the upsides above.

If we get some funding, it may even be convenient to run scrapers in the scrapinghub cloud.

An immediate next step is to start a new repository for exploring a re-implementation of this crawler using Scrapy/Frontera.
