forked from Yongyao/nutch

A domain-specific Web crawler for planetary defense based on Apache Nutch, funded by NASA

supwar1/nutch

Planetary defense (PD) Web crawler

Most open source Web crawlers (e.g. Apache Nutch) support focused crawling by relying on a keyword or document list compiled by subject matter experts, together with relevance measures such as cosine similarity or a Naïve Bayes classifier. This work extends Nutch by developing a semi-supervised method for creating the keyword list and by considering both text content and hyperlink structure. It was carried out in the Planetary Defense Framework Gateway project, a NASA-funded effort to develop a cyberinfrastructure for scientific collaboration across different organizations. Please refer to the slides here for more detail.
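
As a rough illustration of the keyword-list baseline mentioned above, the following minimal sketch scores a page against a keyword list using cosine similarity over raw term counts. It is not the method implemented in this repository: the class name `KeywordRelevance`, its methods, and the sample keywords are hypothetical, and the semi-supervised keyword generation and hyperlink analysis are not shown.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch or of this project.
class KeywordRelevance {

    // Count lowercase alphanumeric tokens in the text (bag-of-words model).
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a sparse term-count vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity of two sparse vectors; 0.0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += (double) e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Relevance of a page to the keyword list, in the range [0, 1].
    static double score(String pageText, List<String> keywords) {
        return cosine(termFrequencies(pageText),
                      termFrequencies(String.join(" ", keywords)));
    }

    public static void main(String[] args) {
        // Assumed example keywords; a real list would come from domain experts.
        List<String> keywords = Arrays.asList("asteroid", "impact", "deflection");
        System.out.println(score("Asteroid deflection missions reduce impact risk", keywords));
        System.out.println(score("Recipes for sourdough bread", keywords));
    }
}
```

A crawler built this way would typically follow only links from pages whose score exceeds some threshold; choosing that threshold, and weighting terms (for example with TF-IDF) instead of using raw counts, are design decisions this sketch leaves open.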

Apache Nutch

For the latest information about Nutch, please visit our website at:

http://nutch.apache.org

and our wiki, at:

http://wiki.apache.org/nutch/

To get started using Nutch, read the tutorial:

http://wiki.apache.org/nutch/NutchTutorial
