# web-crawler

A simple and extensible web crawler written in Go.
This project is a basic web crawler designed to fetch, parse, and archive web pages into a MongoDB database. It demonstrates core crawling techniques such as queue-based URL management, polite crawling, parallel fetching, HTML parsing, and duplicate avoidance.
## Features

- Fetches and parses HTML from web pages
- Extracts page titles, main body content, and hyperlinks
- Enqueues new links for further crawling (breadth-first)
- Polite crawling with delays between requests
- Prevents duplicate crawling using hashed URL tracking (see the sketch below)
- Stores page data (URL, title, content) in MongoDB
- Modular structure (crawler, queue, db, main)
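The crawl loop behind these features can be sketched roughly as follows. This is a minimal, self-contained illustration under stated assumptions, not the project's actual code: it uses `golang.org/x/net/html` for link extraction, and names such as `hashURL`, `extractLinks`, `maxPages`, and `delay` are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/url"
	"time"

	"golang.org/x/net/html"
)

// hashURL returns a fixed-size key for the visited set, regardless of URL length.
func hashURL(u string) string {
	sum := sha256.Sum256([]byte(u))
	return hex.EncodeToString(sum[:])
}

// extractLinks fetches a page and returns the absolute URLs of its <a href> links.
func extractLinks(pageURL string) ([]string, error) {
	resp, err := http.Get(pageURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		return nil, err
	}
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, err
	}

	var links []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, attr := range n.Attr {
				if attr.Key == "href" {
					if ref, err := url.Parse(attr.Val); err == nil {
						links = append(links, base.ResolveReference(ref).String())
					}
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return links, nil
}

func main() {
	start := "https://example.com/"
	maxPages := 500 // matches the default mentioned below
	delay := 500 * time.Millisecond

	queue := []string{start}     // FIFO slice gives breadth-first order
	visited := map[string]bool{} // hashed URLs that have already been crawled

	for crawled := 0; len(queue) > 0 && crawled < maxPages; {
		u := queue[0]
		queue = queue[1:]

		key := hashURL(u)
		if visited[key] {
			continue // duplicate: skip it
		}
		visited[key] = true

		links, err := extractLinks(u)
		if err != nil {
			fmt.Println("error fetching", u, ":", err)
			continue
		}
		queue = append(queue, links...)
		crawled++
		fmt.Println("Crawled:", u)

		time.Sleep(delay) // polite delay between requests
	}
}
```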
## Prerequisites

- Go 1.24+
- A running MongoDB instance (local or remote)
- `git` for cloning the repository
## Getting Started

1. Clone the repository:

   ```bash
   git clone https://github.com/sahitya-chandra/web-crawler.git
   cd web-crawler
   ```

2. Install dependencies:

   ```bash
   go mod download
   ```

3. Set up environment variables by creating a `.env` file in the root directory (the storage sketch after these steps shows one way the URI can be read in code):

   ```
   MONGODB_URI=mongodb://localhost:27017
   ```

4. Run the crawler:

   ```bash
   go run main.go
   ```

   By default, the crawler starts at `https://example.com/` and archives up to 500 pages (this can be changed in `main.go`).
5. Database output:

   - Crawled web pages are stored in the `crawlerArchive.webpages` collection in MongoDB.
   - Each document contains (see the sketch below):
     - `url`: The crawled URL
     - `title`: Page title
     - `content`: Main body content (first 500 words)
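To illustrate steps 3 and 5, here is a rough sketch of reading `MONGODB_URI` and writing one page document with the official MongoDB Go driver (v1 API, `go.mongodb.org/mongo-driver`). The `Page` struct and the assumption that the URI is already present in the process environment (for example, loaded from `.env` by a package such as `github.com/joho/godotenv`) are illustrative and may not match the project's code; the `crawlerArchive.webpages` collection and the `url`/`title`/`content` fields follow the description above.

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// Page mirrors the document shape described above.
type Page struct {
	URL     string `bson:"url"`
	Title   string `bson:"title"`
	Content string `bson:"content"`
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// MONGODB_URI is assumed to be in the environment (e.g. loaded from .env).
	client, err := mongo.Connect(ctx, options.Client().ApplyURI(os.Getenv("MONGODB_URI")))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// Store one crawled page in crawlerArchive.webpages.
	coll := client.Database("crawlerArchive").Collection("webpages")
	_, err = coll.InsertOne(ctx, Page{
		URL:     "https://example.com/",
		Title:   "Example Domain",
		Content: "first 500 words of the page body go here",
	})
	if err != nil {
		log.Fatal(err)
	}
}
```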
## Project Structure

- `main.go` — Entry point; orchestrates queueing, crawling, and database storage
- `crawler/` — Fetches and parses HTML, extracts links and content
- `queue/` — Thread-safe queue implementation for URLs (see the sketch below)
- `db/` — MongoDB connection and basic storage helpers
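The `queue/` package is described as a thread-safe queue for URLs; a minimal mutex-guarded FIFO along these lines would satisfy that description. This is illustrative only, and the method names and exact implementation in the repository may differ.

```go
package queue

import "sync"

// Queue is a simple FIFO of URLs that is safe for concurrent use.
type Queue struct {
	mu    sync.Mutex
	items []string
}

// Enqueue appends a URL to the back of the queue.
func (q *Queue) Enqueue(u string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, u)
}

// Dequeue removes and returns the URL at the front of the queue.
// The second return value is false when the queue is empty.
func (q *Queue) Dequeue() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return "", false
	}
	u := q.items[0]
	q.items = q.items[1:]
	return u, true
}

// Len reports the current number of queued URLs.
func (q *Queue) Len() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.items)
}
```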
## Example Output

```text
Crawled: https://example.com/, Title: Example Domain
```
## Notes

- Adjust the starting URL, maximum pages, or crawling logic in `main.go` as needed.
- The MongoDB URI and other secrets are managed via the `.env` file.
## License

This project is for educational/demo purposes. No license is specified.
## Contributing

Pull requests and suggestions are welcome!