A high-performance, concurrent web crawler written in Go that extracts URLs, downloads content, and converts web pages to markdown format.
> [!IMPORTANT]
> ⭐ Please star this repository if you find it useful!
GoSpider is a multi-threaded web crawler that leverages Go's concurrency primitives to crawl websites at scale. It implements a producer-consumer architecture with configurable worker pools, allowing it to process thousands of URLs simultaneously while maintaining memory efficiency and system stability.
The crawler is designed for various use cases including web archiving, content analysis, search engine development, and data mining. It handles cross-domain crawling, maintains URL deduplication, and provides real-time progress monitoring.
> [!NOTE]
> Screenshot below: processing 100k URLs, parsing them as Markdown, and downloading the .md files and other assets, all in under half an hour.
- Concurrent Architecture: Producer-consumer pattern with configurable worker pools (1-1000+ workers)
- Performance Optimizations:
  - Connection pooling with up to 500 connections per host
  - 10,000 URL buffer queue for smooth operation
  - Parallel file writing with 16 dedicated writers and 1MB buffers
  - Directory caching to minimize filesystem operations
- Smart Crawling:
  - Domain-based limiting to prevent overwhelming single hosts
  - URL deduplication using in-memory hash maps
  - Graceful queue management with consecutive empty checks
  - Relative to absolute URL conversion
- Proxy Management:
  - Random proxy rotation from configurable proxy list
  - Automatic proxy testing and validation
  - Fallback to direct connection on proxy failures
  - Support for HTTP/HTTPS proxies
- Content Processing:
  - HTML to Markdown conversion preserving structure and links
  - URL extraction from both raw HTML and converted markdown
  - Content-type detection and appropriate handling
  - Optional image downloading with async processing
- Monitoring & Statistics:
  - Real-time progress updates (URLs/second, completion rate)
  - Domain coverage tracking
  - Queue depth monitoring
  - Success/failure rate calculation
- Resilience Features:
  - 30-second timeout for slow servers
  - Proper resource cleanup with defer statements
  - Non-blocking operations for auxiliary tasks
  - Error handling with graceful degradation
GoSpider uses a sophisticated producer-consumer architecture optimized for high throughput:
```
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   Main Thread   │─────▶│  Buffered Queue  │─────▶│   Worker Pool   │
│   (Producer)    │      │  (10K capacity)  │      │  (N Consumers)  │
└─────────────────┘      └──────────────────┘      └─────────────────┘
         │                                                  │
         │                                                  ▼
         ▼                                         ┌─────────────────┐
┌─────────────────┐                                │   HTTP Client   │
│  URL Discovery  │                                │(Connection Pool)│
│ & Deduplication │                                └─────────────────┘
└─────────────────┘                                         │
                                                            ▼
                                           ┌─────────────────────────────────┐
                                           │       Content Processing        │
                                           ├────────────────┬────────────────┤
                                           │ HTML→Markdown  │ URL Extraction │
                                           └────────────────┴────────────────┘
                                                            │
                                           ┌────────────────┴────────────────┐
                                           ▼                                 ▼
                                  ┌─────────────────┐               ┌──────────────────┐
                                  │  File Writers   │               │ Image Downloader │
                                  │  (16 parallel)  │               │     (Async)      │
                                  └─────────────────┘               └──────────────────┘
```
- Buffered Channel: 10,000 capacity URL queue prevents blocking
- Thread-Safe Maps: Concurrent-safe visited URL tracking and domain counting
- Smart Distribution: Main thread monitors queue state and distributes work efficiently
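A minimal sketch of how these pieces can fit together; the type and function names here (`crawlState`, `enqueue`) are illustrative, not the crawler's actual identifiers:

```go
package crawler

import "sync"

// crawlState sketches the queue-and-dedup wiring described above: a buffered
// channel as the URL queue plus mutex-protected maps for visited URLs and
// per-domain counts.
type crawlState struct {
	queue   chan string         // buffered URL queue (10,000 capacity)
	mu      sync.Mutex          // guards the maps below
	visited map[string]struct{} // URLs already seen
	domains map[string]int      // pages crawled per domain
}

func newCrawlState() *crawlState {
	return &crawlState{
		queue:   make(chan string, 10000),
		visited: make(map[string]struct{}),
		domains: make(map[string]int),
	}
}

// enqueue adds a URL only if it has not been seen before and the queue has
// room; returning false lets the caller skip duplicates or full-queue drops.
func (s *crawlState) enqueue(u string) bool {
	s.mu.Lock()
	if _, seen := s.visited[u]; seen {
		s.mu.Unlock()
		return false
	}
	s.visited[u] = struct{}{}
	s.mu.Unlock()

	select {
	case s.queue <- u: // non-blocking send keeps the producer responsive
		return true
	default:
		return false // queue full; caller may retry or drop
	}
}
```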
- Singleton Pattern: Single optimized client instance using `sync.Once`
- Connection Pooling:
  - MaxIdleConns: 2000
  - MaxIdleConnsPerHost: 500
  - MaxConnsPerHost: 500
- Proxy Support: Round-robin proxy selection with automatic fallback
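The client setup might look roughly like this; a sketch rather than the project's actual code, with the transport limits mirroring the values listed above:

```go
package crawler

import (
	"net/http"
	"sync"
	"time"
)

var (
	clientOnce sync.Once
	client     *http.Client
)

// getClient returns one shared HTTP client, initialized exactly once.
func getClient() *http.Client {
	clientOnce.Do(func() {
		transport := &http.Transport{
			MaxIdleConns:        2000,
			MaxIdleConnsPerHost: 500,
			MaxConnsPerHost:     500,
			IdleConnTimeout:     90 * time.Second, // illustrative idle timeout
		}
		client = &http.Client{
			Transport: transport,
			Timeout:   30 * time.Second, // the 30-second cap noted under resilience
		}
	})
	return client
}
```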
- Parallel Operations: URL extraction runs concurrently with markdown conversion
- Link Resolution: Converts relative URLs to absolute using domain context
- Dual Extraction: Extracts URLs from both HTML source and markdown output
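Relative links can be resolved against the page's own URL with the standard `net/url` package; a sketch (the `resolveURL` helper is illustrative):

```go
package crawler

import "net/url"

// resolveURL converts a possibly-relative href found on a page into an
// absolute URL, using the page's address as the base. It returns "" for
// links that cannot be parsed.
func resolveURL(pageURL, href string) string {
	base, err := url.Parse(pageURL)
	if err != nil {
		return ""
	}
	ref, err := url.Parse(href)
	if err != nil {
		return ""
	}
	return base.ResolveReference(ref).String()
}
```

For example, `resolveURL("https://example.com/docs/", "../about")` yields `https://example.com/about`.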
- Writer Pool: 16 dedicated goroutines for file writing
- Buffered Writing: 1MB buffers reduce system calls
- Directory Cache: Avoids repeated directory existence checks
- Fallback Mode: Synchronous writing when async queue is full
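A sketch of the async-with-fallback write path; the `writeRequest` type, queue size, and function names are illustrative:

```go
package crawler

import (
	"bufio"
	"os"
	"path/filepath"
)

type writeRequest struct {
	path string
	data []byte
}

// writeQueue feeds the pool of writer goroutines; its size is illustrative.
var writeQueue = make(chan writeRequest, 1024)

// startWriters launches n goroutines (16 in the crawler) that drain
// writeQueue using 1 MB buffered writers to cut down on system calls.
func startWriters(n int) {
	for i := 0; i < n; i++ {
		go func() {
			for req := range writeQueue {
				_ = writeFile(req) // errors would be logged in a real implementation
			}
		}()
	}
}

func writeFile(req writeRequest) error {
	if err := os.MkdirAll(filepath.Dir(req.path), 0o755); err != nil {
		return err
	}
	f, err := os.Create(req.path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := bufio.NewWriterSize(f, 1<<20) // 1 MB buffer
	if _, err := w.Write(req.data); err != nil {
		return err
	}
	return w.Flush()
}

// saveAsync queues a write, falling back to a synchronous write when the
// queue is full (the "fallback mode" described above).
func saveAsync(path string, data []byte) error {
	select {
	case writeQueue <- writeRequest{path: path, data: data}:
		return nil
	default:
		return writeFile(writeRequest{path: path, data: data})
	}
}
```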
- Real-time Updates: Per-second statistics refresh
- Metrics Tracked:
  - Processing rate (URLs/second)
  - Queue depth
  - Domain coverage
  - Success/failure rates
  - Time elapsed
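Per-second reporting can be built from atomic counters and a `time.Ticker`; a minimal sketch with illustrative variable names:

```go
package crawler

import (
	"fmt"
	"sync/atomic"
	"time"
)

var (
	processed atomic.Int64 // URLs fetched so far
	failed    atomic.Int64 // URLs that returned an error
)

// reportStats prints progress once per second until done is closed.
func reportStats(queue chan string, done <-chan struct{}) {
	start := time.Now()
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			total := processed.Load()
			elapsed := time.Since(start).Seconds()
			fmt.Printf("processed=%d failed=%d rate=%.1f/s queue=%d elapsed=%s\n",
				total, failed.Load(), float64(total)/elapsed,
				len(queue), time.Since(start).Round(time.Second))
		}
	}
}
```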
GoSpider's concurrency model is built around Go's CSP (Communicating Sequential Processes) paradigm:
- Main Goroutine (Producer):
  - Initializes the URL queue with the seed URL
  - Monitors queue state and worker completion
  - Implements graceful shutdown with consecutive empty checks
- Worker Goroutines (Consumers):
  - Each worker runs in its own goroutine
  - Pulls URLs from the shared channel
  - Processes content independently
  - Sends discovered URLs back to the queue
- File Writer Goroutines:
  - Separate pool of 16 writers
  - Receives file write requests via dedicated channel
  - Buffers writes to reduce I/O overhead
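Putting the producer-consumer pieces together, the worker side might look like this sketch; `processURL` is a stand-in for the fetch/convert/extract steps:

```go
package crawler

import "sync"

// startWorkers launches n consumers that pull URLs from queue until it is
// closed, signalling completion through the WaitGroup.
func startWorkers(n int, queue chan string, wg *sync.WaitGroup) {
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range queue {
				for _, found := range processURL(u) {
					// Non-blocking send: if the queue is full, drop the URL
					// rather than stall the worker. The main goroutine only
					// closes the queue after its consecutive-empty checks
					// confirm no worker will send again.
					select {
					case queue <- found:
					default:
					}
				}
			}
		}()
	}
}

// processURL stands in for fetching the page, converting it to Markdown, and
// extracting links; it returns the URLs discovered on the page.
func processURL(u string) []string {
	// ... fetch, convert, extract ...
	return nil
}
```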
- URL Deduplication: Uses Go's native map with string keys for O(1) lookup
- Domain Tracking: Separate map tracks unique domains encountered
- Buffer Reuse: HTTP response bodies are properly closed to allow buffer reuse
- Goroutine Lifecycle: Workers use WaitGroup for proper cleanup
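Draining and closing each response body is what lets pooled connections (and their buffers) be reused; a sketch of a fetch helper built on that idiom:

```go
package crawler

import (
	"io"
	"net/http"
)

// fetch downloads a page and guarantees the response body is fully read and
// closed, which lets the underlying connection return to the pool for reuse.
func fetch(client *http.Client, u string) ([]byte, error) {
	resp, err := client.Get(u)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	return io.ReadAll(resp.Body)
}
```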
| Metric | Value | Description |
|---|---|---|
| URL Processing Rate | 10-100/sec | Depends on network latency and content size |
| Memory Usage | ~500 MB | For 100,000 URLs with typical web content |
| Concurrent Connections | Up to 500/host | Configurable via HTTP client settings |
| File Write Throughput | 50-100 MB/s | With SSD and parallel writers |
| Startup Time | ~1-2 seconds | Including proxy validation |
- Network Errors: Logged but don't stop crawling
- Proxy Failures: Automatic fallback to next proxy or direct connection
- File System Errors: Fallback from async to sync writing
- Parse Errors: Skipped with logging, crawl continues
- Timeout Handling: 30-second timeout prevents hanging on slow servers
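The proxy fallback, for example, can be as simple as retrying with a direct client when the proxied attempt fails; a sketch whose retry policy may differ from the real code:

```go
package crawler

import (
	"net/http"
	"net/url"
	"time"
)

// fetchWithFallback tries the given proxy first and falls back to the shared
// direct client when the proxied request fails.
func fetchWithFallback(target string, proxy *url.URL, direct *http.Client) (*http.Response, error) {
	proxied := &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxy)},
		Timeout:   30 * time.Second,
	}
	if resp, err := proxied.Get(target); err == nil {
		return resp, nil
	}
	// Proxy failed: degrade gracefully to a direct connection.
	return direct.Get(target)
}
```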
- Go 1.24.4 or higher
- Git (for installation from source)
```bash
# Clone the repository
git clone https://github.com/aryanranderiya/gospider.git
cd gospider

# Install dependencies
go mod download

# Build the binary
go build -o gospider cmd/main.go

# Run directly
./gospider -url="https://example.com"
```
```bash
go install github.com/aryanranderiya/gospider/cmd@latest
```
Download the latest binary from the releases page.
- Initialization:
  - Loads configuration and validates proxies (if enabled)
  - Creates HTTP client with optimized settings
  - Initializes worker pool and file writers
  - Seeds queue with starting URL
- URL Processing Loop:

  ```
  For each URL in queue:
  ├── Check if already visited
  ├── Check domain limits
  ├── Fetch content (with proxy if configured)
  ├── Extract URLs from HTML
  ├── Convert to Markdown
  ├── Extract URLs from Markdown
  ├── Queue new URLs for processing
  └── Save files (if enabled)
  ```

- Graceful Shutdown:
  - Monitors queue emptiness
  - Waits for workers to finish
  - Closes file writers
  - Displays final statistics
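The consecutive-empty-check shutdown mentioned above can be sketched like this; the threshold and polling interval are illustrative, and the real crawler also confirms workers are idle before closing the queue:

```go
package crawler

import (
	"sync"
	"time"
)

// waitAndShutdown polls the queue; once it has been empty for several
// consecutive checks, it closes the queue so workers exit their range loops,
// then waits for them to finish.
func waitAndShutdown(queue chan string, wg *sync.WaitGroup) {
	const needed = 3 // consecutive empty checks before shutting down (illustrative)
	empty := 0
	for empty < needed {
		time.Sleep(500 * time.Millisecond)
		if len(queue) == 0 {
			empty++
		} else {
			empty = 0
		}
	}
	close(queue) // workers finish any in-flight pages and return
	wg.Wait()    // wait for every worker to call Done
}
```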
```bash
# Crawl a single website
./gospider -url="https://example.com"

# Crawl with custom limits
./gospider -url="https://example.com" -domains=50 -urls=500 -workers=10

# Enable verbose output
./gospider -url="https://example.com" -verbose

# Download images and save files
./gospider -url="https://example.com" -images -save

# Use proxy rotation
./gospider -url="https://example.com" -proxies -workers=20

# Large-scale crawling
./gospider -url="https://example.com" -domains=1000 -urls=50000 -workers=50

# Save everything with detailed logging
./gospider -url="https://example.com" -save -images -verbose -domains=100
```
```bash
go run main.go -url=https://en.wikipedia.org/wiki/Apple_Inc. -workers=1000 -domains=1 -urls=100000
go run main.go -url=https://books.toscrape.com/ -workers=1000 -domains=1 -urls=100000
go run main.go -url=https://paulgraham.com/ -workers=1000 -domains=1 -urls=100000
go run main.go -url=https://quotes.toscrape.com/ -workers=1000 -domains=1 -urls=100000
go run main.go -url=https://iep.utm.edu/ -workers=1000 -domains=1 -urls=100000
go run main.go -url=https://gobyexample.com/ -workers=1000 -domains=1 -urls=100000
go run main.go -url=https://aryanranderiya.com -workers=1000 -domains=1 -urls=100000
```
| Flag | Type | Default | Description |
|---|---|---|---|
| `-url` | string | required | Starting URL to crawl |
| `-domains` | int | 100 | Maximum number of domains to crawl |
| `-urls` | int | 1000 | Maximum number of URLs to process (0 = unlimited) |
| `-workers` | int | 5 | Number of concurrent workers |
| `-proxies` | bool | false | Use proxies from `proxies.txt` file |
| `-images` | bool | false | Download images found during crawling |
| `-save` | bool | false | Save markdown files to disk |
| `-verbose` | bool | false | Enable verbose output |
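These options map naturally onto Go's standard `flag` package; the sketch below mirrors the names and defaults in the table, but the actual wiring in `cmd/main.go` may differ:

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Flag names and defaults follow the table above.
	startURL := flag.String("url", "", "Starting URL to crawl (required)")
	maxDomains := flag.Int("domains", 100, "Maximum number of domains to crawl")
	maxURLs := flag.Int("urls", 1000, "Maximum number of URLs to process (0 = unlimited)")
	workers := flag.Int("workers", 5, "Number of concurrent workers")
	useProxies := flag.Bool("proxies", false, "Use proxies from proxies.txt")
	images := flag.Bool("images", false, "Download images found during crawling")
	save := flag.Bool("save", false, "Save markdown files to disk")
	verbose := flag.Bool("verbose", false, "Enable verbose output")
	flag.Parse()

	if *startURL == "" {
		flag.Usage()
		return
	}
	fmt.Printf("url=%s domains=%d urls=%d workers=%d proxies=%t images=%t save=%t verbose=%t\n",
		*startURL, *maxDomains, *maxURLs, *workers, *useProxies, *images, *save, *verbose)
}
```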
Create a `proxies.txt` file in the project root with one proxy per line:

```
http://proxy1.example.com:8080
http://proxy2.example.com:8080
```
When using the `-save` flag, GoSpider creates an organized directory structure:
```
output/
├── example.com/
│   ├── index.md
│   ├── about.md
│   └── images/
│       ├── logo.png
│       └── banner.jpg
├── blog.example.com/
│   ├── post-1.md
│   └── post-2.md
└── docs.example.com/
    └── api-reference.md
```
Create a `proxies.txt` file with one proxy per line:

```
http://proxy1.example.com:8080
http://user:pass@proxy2.example.com:3128
socks5://proxy3.example.com:1080
```
Proxy features:
- Automatic validation on startup
- Random selection per request
- Failure tracking and blacklisting
- Transparent fallback to direct connection
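Loading and rotating the proxy list can be sketched as follows; the function names are illustrative, and a nil result here stands for "use a direct connection":

```go
package crawler

import (
	"bufio"
	"math/rand"
	"net/url"
	"os"
	"strings"
)

// loadProxies reads proxies.txt, skipping blank lines and entries that fail
// to parse as URLs.
func loadProxies(path string) ([]*url.URL, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var proxies []*url.URL
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue
		}
		if p, err := url.Parse(line); err == nil {
			proxies = append(proxies, p)
		}
	}
	return proxies, scanner.Err()
}

// randomProxy picks one proxy at random; nil means connect directly.
func randomProxy(proxies []*url.URL) *url.URL {
	if len(proxies) == 0 {
		return nil
	}
	return proxies[rand.Intn(len(proxies))]
}
```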
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.