This is a big one, but it's possible that most of this crawler should be replaced with Apache Nutch or similar. I originally hacked this out as a proof-of-concept but, as usual, it grew a bit from there. However, we are now running into scalability issues (parallel crawling, possibly on multiple machines, crawling to a large database, etc.), so we need to take a serious look at a well-established alternative like Nutch.
Some questions:
- Is Nutch suitable? If so, 1.x or 2.x?