Project for Winter of Code 3.0
IIT (ISM) Dhanbad
The crawler takes the following command-line arguments: '-u' is the URL to be crawled, '-o' is the name of the directory in which the results will be stored, and '-b' is the number of threads. It only crawls links within the given domain; when it encounters links from other domains, they are collected and stored in a separate file in the same directory.
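
The argument handling and domain check might look roughly like the sketch below. The flag names (-u, -o, -b) follow this README, while the helper names and the default thread count are illustrative assumptions, not the project's actual code:

```python
# Sketch of the CLI and domain filtering described above.
# Flag names follow the README; helper names are illustrative.
import argparse
from urllib.parse import urlparse

def parse_args():
    parser = argparse.ArgumentParser(description="Simple domain-restricted crawler")
    parser.add_argument("-u", required=True, help="URL to be crawled")
    parser.add_argument("-o", required=True, help="directory in which results are stored")
    parser.add_argument("-b", type=int, default=4, help="number of threads (assumed default)")
    return parser.parse_args()

def same_domain(link, base_url):
    # Links outside the base domain are set aside in a separate file
    # instead of being crawled further.
    return urlparse(link).netloc == urlparse(base_url).netloc
```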
The basic idea is that all the links extracted in the first crawl are stored in a file called queue.txt. The links needed by each thread are then taken from this file. Once these links are crawled, they are stored in another file called crawled.txt and removed from queue.txt. The links found in the next crawl are added to queue.txt only if they are not already in crawled.txt, which ensures that the same links are not crawled twice. All of the above-mentioned files live in the directory whose name you specify on the command line.
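
A minimal sketch of this queue.txt/crawled.txt bookkeeping, assuming plain-text files with one link per line, is shown below. The file names match the README; the helper functions are illustrative, not the project's actual implementation:

```python
# Sketch of the queue.txt / crawled.txt bookkeeping described above.
import os

def load_set(path):
    # Read a file of links (one per line) into a set; missing file -> empty set.
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def mark_crawled(directory, link, new_links):
    queue_path = os.path.join(directory, "queue.txt")
    crawled_path = os.path.join(directory, "crawled.txt")

    queue = load_set(queue_path)
    crawled = load_set(crawled_path)

    # Move the crawled link from queue.txt to crawled.txt ...
    queue.discard(link)
    crawled.add(link)

    # ... and add newly found links only if they are not already crawled
    # or queued, so the same link is never crawled twice.
    queue.update(l for l in new_links if l not in crawled and l not in queue)

    with open(queue_path, "w") as f:
        f.write("\n".join(queue) + "\n")
    with open(crawled_path, "w") as f:
        f.write("\n".join(crawled) + "\n")
```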
Acknowledgement: The idea and the script are inspired by various online sources.