Transition to an established crawler package
It's been fun hacking together a crawler, but now we're facing scalability challenges in doing things like crawling all of Biosamples, and even that is relatively small in the grand scheme of things.
So rather than write more custom code (though we might still maintain this for a while to meet upcoming hackathons), it's time to look seriously at transitioning to an established crawler package. Highly desirable features are:
- Open source with an OSI-approved license
- Recent non-trivial commits by more than one person
- Reasonably active mailing list
- Recent releases
- Ideally, an easy-ish way to extract JSON-LD that is inserted into pages after load (i.e. not present in the original HTML), using Selenium or similar (see the sketch after this list)
- Not written in Perl
- A preference for Python, but other languages are fine
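For the JSON-LD requirement above, here is a rough sketch of the kind of Selenium-based extraction I have in mind. The function name, headless-Chrome setup, and error handling are my own placeholders, not code from any of the candidate packages:

```python
import json

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


def extract_jsonld(url):
    """Render a page in headless Chrome and return its JSON-LD blocks."""
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        blocks = []
        # JSON-LD lives in <script type="application/ld+json"> elements;
        # reading them after rendering also captures script-inserted markup
        # that wasn't in the original HTML.
        for el in driver.find_elements(
                By.CSS_SELECTOR, 'script[type="application/ld+json"]'):
            text = el.get_attribute("innerHTML")
            try:
                blocks.append(json.loads(text))
            except ValueError:
                continue  # skip malformed blocks
        return blocks
    finally:
        driver.quit()
```

Whatever package we pick would need to let us plug something like this into its fetch/parse pipeline.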
Here are the candidates:
| project | language | tried? | comments |
|---|---|---|---|
| frontera | Python 2/3 | n | By Scrapinghub, who also maintain Scrapy; Frontera can optionally use Scrapy. This is a crawl frontier rather than a full crawling solution. Scrapinghub has a JSON-LD extraction library, though it is not obviously related to Frontera. Somewhat active mailing list; supports parallel processes |
| heritrix | Java | n | The Internet Archive's crawler. Last release was 3.2.0 in 2014 |
| norconex | Java | n | Recent releases, single committer, license unclear |
| nutch | Java | y | Well-established Apache project; the 1.x branch is apparently more active than 2.x; recent releases; recent commits by more than one person; documentation is incomplete and sometimes inaccurate, with TODO placeholders |
| scrapy | Python 2/3 | n | Backed by Scrapinghub. No easy Selenium support(?). Single process |
| stormcrawler | Java | y | Recent releases; fairly active mailing list; the vast majority of commits are by one person; has a JSON-LD extraction filter. The tutorial leaves something to be desired |
Tentatively, I'm finding Scrapy/Frontera to be the best of the bunch. The documentation for Scrapy seems very good (Frontera's slightly less so, but still okay), it's written in Python so we can reuse parts of the existing crawler as necessary, it's a relatively new project with little cruft by the looks of it, and it's backed by a company rather than a single individual, which I think speaks well to its longevity and community.
On the downside, it doesn't build on well-tested scalability projects like Apache Cassandra or Storm. But right now, I think that is outweighed by the upsides above.
If we get some funding, it may even be convenient to run the scrapers in the Scrapinghub cloud.
An immediate next step is to start a new repository for exploring a re-implementation of this crawler using Scrapy/Frontera.
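For reference, a minimal Scrapy spider for that exploration might look something like the sketch below. The spider name, seed URL, and item shape are placeholders rather than settled design decisions; pages that only inject JSON-LD via JavaScript would still need Selenium or a similar renderer wired into the download pipeline, and Frontera would take over frontier management in a scaled-out deployment:

```python
import json

import scrapy


class JsonLdSpider(scrapy.Spider):
    name = "jsonld"
    # Placeholder seed; the real crawl would start from the Biosamples pages.
    start_urls = ["https://www.ebi.ac.uk/biosamples/"]

    def parse(self, response):
        # Yield one item per JSON-LD block found in the served HTML.
        for block in response.xpath(
                '//script[@type="application/ld+json"]/text()').getall():
            try:
                yield {"url": response.url, "jsonld": json.loads(block)}
            except ValueError:
                continue  # skip malformed blocks
        # Follow links to keep the crawl going; dedup and politeness are
        # handled by Scrapy's scheduler and settings.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```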