Transition to an established crawler package
It's been fun hacking together a crawler, but now we're facing scalability challenges in doing things like crawling all of Biosamples, and even that is relatively small in the grand scheme of things.
So rather than write more custom code (though we might still maintain this for a while to meet upcoming hackathons), it's time to look seriously at transitioning to an established crawler package. Highly desirable features are:
- Open source with an OSI-approved license
- Recent non-trivial commits by more than one person
- Reasonably active mailing list
- Recent releases
- Ideally, an easy-ish way to extract JSON-LD that is inserted into pages after load (i.e. not present in the original HTML), using Selenium or similar (see the sketch after this list)
- Not written in Perl
- A preference for Python, but other languages are fine
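For the JSON-LD requirement above, here is a rough sketch of the kind of Selenium-based extraction I have in mind. The function name, headless-Chrome setup, and error handling are my own placeholders, not code from any of the candidate packages:

```python
import json

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


def extract_jsonld(url):
    """Render a page in headless Chrome and return its JSON-LD blocks."""
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        blocks = []
        # JSON-LD lives in <script type="application/ld+json"> elements;
        # reading them after rendering also captures script-inserted markup
        # that wasn't in the original HTML.
        for el in driver.find_elements(
                By.CSS_SELECTOR, 'script[type="application/ld+json"]'):
            text = el.get_attribute("innerHTML")
            try:
                blocks.append(json.loads(text))
            except ValueError:
                continue  # skip malformed blocks
        return blocks
    finally:
        driver.quit()
```

Whatever package we pick would need to let us plug something like this into its fetch/parse pipeline.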
Here are the candidates:
| project | language | tried? | comments |
|---|---|---|---|
| frontera | Python 2/3 | n | By Scrapinghub, who also maintain Scrapy; Frontera can optionally use Scrapy. This is a crawl frontier rather than a full crawling solution. Scrapinghub has a JSON-LD extraction library, though it is not obviously related to Frontera. Somewhat active mailing list; supports parallel processes |
| heritrix | Java | n | The Internet Archive's crawler. Last release was 3.2.0 in 2014 |
| norconex | Java | n | Recent releases, single committer, license unclear |
| nutch | Java | y | Well-established Apache project; the 1.x branch is apparently more active than 2.x; recent releases; recent commits by more than one person; documentation is incomplete and sometimes inaccurate, with TODO placeholders |
| scrapy | Python 2/3 | n | Backed by Scrapinghub. No easy Selenium support(?). Single process |
| stormcrawler | Java | y | Recent releases; fairly active mailing list; the vast majority of commits are by one person; has a JSON-LD extraction filter. The tutorial leaves something to be desired |
Tentatively, I'm finding Scrapy/Frontera to be the best of the bunch. The documentation for Scrapy seems very good (Frontera's slightly less so, but still okay), it's written in Python so we can reuse parts of the existing crawler as necessary, it's a relatively new project with little cruft by the looks of it, and it's backed by a company rather than a single individual, which I think speaks well to its longevity and community.
On the downside, it doesn't build on well-tested scalability projects like Apache Cassandra or Storm. But right now, I think that is outweighed by the upsides above.
If we get some funding, it may even be convenient to run the scrapers in the Scrapinghub cloud.
An immediate next step is to start a new repository for exploring a re-implementation of this crawler using Scrapy/Frontera.
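For reference, a minimal Scrapy spider for that exploration might look something like the sketch below. The spider name, seed URL, and item shape are placeholders rather than settled design decisions; pages that only inject JSON-LD via JavaScript would still need Selenium or a similar renderer wired into the download pipeline, and Frontera would take over frontier management in a scaled-out deployment:

```python
import json

import scrapy


class JsonLdSpider(scrapy.Spider):
    name = "jsonld"
    # Placeholder seed; the real crawl would start from the Biosamples pages.
    start_urls = ["https://www.ebi.ac.uk/biosamples/"]

    def parse(self, response):
        # Yield one item per JSON-LD block found in the served HTML.
        for block in response.xpath(
                '//script[@type="application/ld+json"]/text()').getall():
            try:
                yield {"url": response.url, "jsonld": json.loads(block)}
            except ValueError:
                continue  # skip malformed blocks
        # Follow links to keep the crawl going; dedup and politeness are
        # handled by Scrapy's scheduler and settings.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```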