Look at replacing most of crawler with an external crawling package #5

@justinccdev

Description

This is a big one, but it's possible that most of this crawler should be replaced with Apache Nutch or similar. I originally hacked this out as a proof of concept but, as usual, it grew a bit from there. However, we are now running into scalability issues (parallel crawling, possibly on multiple machines, crawling to a large database, etc.), so we need to take a serious look at a well-established alternative like Nutch.

Some questions

  • Is Nutch suitable? If so, 1.x or 2.x?
