
Web-Crawler

Project for Winter of Code 3.0

IIT (ISM) Dhanbad

The crawler takes the following command-line arguments: '-u' is the URL to be crawled, '-o' is the name of the directory in which the results will be stored, and '-b' is the number of threads. It only crawls links within the given domain; links pointing to other domains are collected and stored in a separate file in the same directory.
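
A minimal sketch of how these flags could be parsed, assuming the crawler is written in Python and uses the standard argparse module (the flag names follow the description above; the attribute names are illustrative, not taken from the actual script):

```python
import argparse

parser = argparse.ArgumentParser(description="Domain-restricted web crawler")
parser.add_argument('-u', dest='url', required=True,
                    help='URL to start crawling from')
parser.add_argument('-o', dest='output_dir', required=True,
                    help='directory where the result files are stored')
parser.add_argument('-b', dest='threads', type=int, default=8,
                    help='number of crawler threads')
args = parser.parse_args()
```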

The basic idea is that all the links extracted in the first crawl are stored in a file called queue.txt. The links needed by each thread are then taken from this file. Once these links have been crawled, they are stored in another file called crawled.txt and removed from queue.txt. The links obtained in the next crawl are then added to queue.txt, provided they are not already in crawled.txt. This ensures that the same links are not crawled again. All of the above-mentioned files live in the directory whose name you specify on the command line.
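
A rough sketch of the queue.txt/crawled.txt bookkeeping described above, assuming Python; the file names match the description, while the helper function names are illustrative and not taken from the actual script:

```python
import os

def file_to_set(path):
    """Read a file of newline-separated links into a set (empty if the file is missing)."""
    if not os.path.isfile(path):
        return set()
    with open(path, 'r') as f:
        return set(line.strip() for line in f if line.strip())

def set_to_file(links, path):
    """Write a set of links back to disk, one per line."""
    with open(path, 'w') as f:
        for link in sorted(links):
            f.write(link + '\n')

def add_to_queue(new_links, project_dir):
    """Add newly found links to queue.txt unless they have already been crawled."""
    queue_path = os.path.join(project_dir, 'queue.txt')
    crawled_path = os.path.join(project_dir, 'crawled.txt')
    queue = file_to_set(queue_path)
    crawled = file_to_set(crawled_path)
    queue |= (set(new_links) - crawled)   # never re-queue an already crawled link
    set_to_file(queue, queue_path)

def mark_crawled(link, project_dir):
    """Move a link from queue.txt to crawled.txt after it has been fetched."""
    queue_path = os.path.join(project_dir, 'queue.txt')
    crawled_path = os.path.join(project_dir, 'crawled.txt')
    queue = file_to_set(queue_path)
    crawled = file_to_set(crawled_path)
    queue.discard(link)
    crawled.add(link)
    set_to_file(queue, queue_path)
    set_to_file(crawled, crawled_path)
```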

Acknowledgement: The idea and the script are inspired by various online sources.
