GoSpider πŸ•·οΈ Go Version License

A high-performance, concurrent web crawler written in Go that extracts URLs, downloads content, and converts web pages to markdown format.

Important

⭐ Please star this repository if you find it useful!


🎯 Overview

GoSpider is a multi-threaded web crawler that leverages Go's concurrency primitives to crawl websites at scale. It implements a producer-consumer architecture with configurable worker pools, allowing it to process thousands of URLs simultaneously while maintaining memory efficiency and system stability.

The crawler is designed for various use cases including web archiving, content analysis, search engine development, and data mining. It handles cross-domain crawling, maintains URL deduplication, and provides real-time progress monitoring.

Note

Screenshot below: processing 100,000 URLs, converting them to Markdown, and downloading .md files and other assets, all in under half an hour.

(Screenshot: terminal output of the crawl described above)

Key Features

  • Concurrent Architecture: Producer-consumer pattern with configurable worker pools (1-1000+ workers)
  • Performance Optimizations:
    • Connection pooling with up to 500 connections per host
    • 10,000 URL buffer queue for smooth operation
    • Parallel file writing with 16 dedicated writers and 1MB buffers
    • Directory caching to minimize filesystem operations
  • Smart Crawling:
    • Domain-based limiting to prevent overwhelming single hosts
    • URL deduplication using in-memory hash maps
    • Graceful queue management with consecutive empty checks
    • Relative to absolute URL conversion
  • Proxy Management:
    • Random proxy rotation from configurable proxy list
    • Automatic proxy testing and validation
    • Fallback to direct connection on proxy failures
    • Support for HTTP/HTTPS proxies
  • Content Processing:
    • HTML to Markdown conversion preserving structure and links
    • URL extraction from both raw HTML and converted markdown
    • Content-type detection and appropriate handling
    • Optional image downloading with async processing
  • Monitoring & Statistics:
    • Real-time progress updates (URLs/second, completion rate)
    • Domain coverage tracking
    • Queue depth monitoring
    • Success/failure rate calculation
  • Resilience Features:
    • 30-second timeout for slow servers
    • Proper resource cleanup with defer statements
    • Non-blocking operations for auxiliary tasks
    • Error handling with graceful degradation

πŸ—οΈ Architecture

GoSpider uses a sophisticated producer-consumer architecture optimized for high throughput:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Main Thread   │───▢│  Buffered Queue  │───▢│  Worker Pool    β”‚
β”‚   (Producer)    β”‚    β”‚  (10K capacity)  β”‚    β”‚  (N Consumers)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                                              β”‚
         β”‚                                              β–Ό
         β–Ό                                     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                           β”‚  HTTP Client    β”‚
β”‚ URL Discovery   β”‚                           β”‚ (Connection Pool)β”‚
β”‚ & Deduplication β”‚                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                    β”‚
                                                       β–Ό
                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                              β”‚     Content Processing          β”‚
                              β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
                              │ HTML→Markdown   │ URL Extraction│
                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                       β”‚
                                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                    β–Ό                                  β–Ό
                           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                           β”‚  File Writers   β”‚               β”‚ Image Downloaderβ”‚
                           β”‚ (16 parallel)   β”‚               β”‚   (Async)       β”‚
                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Core Components

Queue System

  • Buffered Channel: 10,000 capacity URL queue prevents blocking
  • Thread-Safe Maps: Concurrent-safe visited URL tracking and domain counting
  • Smart Distribution: Main thread monitors queue state and distributes work efficiently
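
A minimal sketch of this queue setup, assuming a mutex-guarded map for visited URLs and a buffered channel as the work queue (the names here are illustrative, not the project's actual identifiers):

package crawler

import "sync"

// crawlState holds the shared work queue and deduplication structures.
type crawlState struct {
    queue   chan string     // buffered URL queue; 10,000 slots keep the producer from blocking
    mu      sync.Mutex      // guards the maps below
    visited map[string]bool // URL deduplication with O(1) lookups
    domains map[string]int  // per-domain URL counts for domain limiting
}

func newCrawlState() *crawlState {
    return &crawlState{
        queue:   make(chan string, 10000),
        visited: make(map[string]bool),
        domains: make(map[string]int),
    }
}

// enqueue adds a URL to the queue only if it has not been seen before.
func (s *crawlState) enqueue(u string) {
    s.mu.Lock()
    seen := s.visited[u]
    s.visited[u] = true
    s.mu.Unlock()
    if seen {
        return
    }
    select {
    case s.queue <- u: // queued for a worker
    default: // queue full; the caller decides how to handle the overflow
    }
}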

HTTP Client

  • Singleton Pattern: Single optimized client instance using sync.Once
  • Connection Pooling:
    • MaxIdleConns: 2000
    • MaxIdleConnsPerHost: 500
    • MaxConnsPerHost: 500
  • Proxy Support: Round-robin proxy selection with automatic fallback
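
A sketch of how such a singleton client can be built with sync.Once and the pooling limits listed above (the limits are taken from this list; the function name is illustrative):

package crawler

import (
    "net/http"
    "sync"
    "time"
)

var (
    clientOnce sync.Once
    httpClient *http.Client
)

// getClient lazily builds and returns one shared HTTP client with a tuned connection pool.
func getClient() *http.Client {
    clientOnce.Do(func() {
        httpClient = &http.Client{
            Timeout: 30 * time.Second, // matches the crawler's slow-server timeout
            Transport: &http.Transport{
                MaxIdleConns:        2000,             // idle connections kept across all hosts
                MaxIdleConnsPerHost: 500,              // idle connections kept per host
                MaxConnsPerHost:     500,              // cap on concurrent connections per host
                IdleConnTimeout:     90 * time.Second, // recycle idle connections eventually
            },
        }
    })
    return httpClient
}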

Content Processing

  • Parallel Operations: URL extraction runs concurrently with markdown conversion
  • Link Resolution: Converts relative URLs to absolute using domain context
  • Dual Extraction: Extracts URLs from both HTML source and markdown output
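
Link resolution, for instance, can be done entirely with the standard library's net/url package; the following is a standalone sketch rather than the project's exact code:

package crawler

import "net/url"

// resolveURL converts a possibly relative href into an absolute URL,
// using the page it was found on as the base.
func resolveURL(base, href string) (string, error) {
    baseURL, err := url.Parse(base)
    if err != nil {
        return "", err
    }
    ref, err := url.Parse(href)
    if err != nil {
        return "", err
    }
    // ResolveReference handles "/path", "../path", "page.html", fragments, and so on.
    return baseURL.ResolveReference(ref).String(), nil
}

For example, resolveURL("https://example.com/docs/", "../about.html") yields "https://example.com/about.html".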

File System Operations

  • Writer Pool: 16 dedicated goroutines for file writing
  • Buffered Writing: 1MB buffers reduce system calls
  • Directory Cache: Avoids repeated directory existence checks
  • Fallback Mode: Synchronous writing when async queue is full
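
A simplified sketch of the writer-pool idea: a fixed number of goroutines drain a channel of write requests and push output through a 1MB bufio buffer, while callers can fall back to a direct synchronous write when the channel is full (all identifiers here are illustrative):

package crawler

import (
    "bufio"
    "os"
    "path/filepath"
    "sync"
)

type writeRequest struct {
    path string
    data []byte
}

// startWriters launches n writer goroutines that drain the request channel.
func startWriters(n int, requests <-chan writeRequest, wg *sync.WaitGroup) {
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for req := range requests {
                _ = writeFile(req) // errors would be logged by the real crawler
            }
        }()
    }
}

// writeFile creates parent directories and writes the file through a 1MB buffer.
func writeFile(req writeRequest) error {
    if err := os.MkdirAll(filepath.Dir(req.path), 0o755); err != nil {
        return err
    }
    f, err := os.Create(req.path)
    if err != nil {
        return err
    }
    defer f.Close()
    w := bufio.NewWriterSize(f, 1<<20) // 1MB buffer reduces write syscalls
    if _, err := w.Write(req.data); err != nil {
        return err
    }
    return w.Flush()
}

The async-to-sync fallback then amounts to sending on the channel inside a select and calling writeFile directly in the default branch.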

Monitoring System

  • Real-time Updates: Per-second statistics refresh
  • Metrics Tracked:
    • Processing rate (URLs/second)
    • Queue depth
    • Domain coverage
    • Success/failure rates
    • Time elapsed
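
A sketch of such a stats loop built on a time.Ticker and atomic counters (an assumed structure; the real crawler's metric names may differ):

package crawler

import (
    "log"
    "sync/atomic"
    "time"
)

type stats struct {
    processed atomic.Int64 // total URLs processed
    failed    atomic.Int64 // total failures
    queued    atomic.Int64 // approximate current queue depth
}

// report prints one progress line per second until stop is closed.
func (s *stats) report(stop <-chan struct{}) {
    start := time.Now()
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    var last int64
    for {
        select {
        case <-stop:
            return
        case <-ticker.C:
            total := s.processed.Load()
            log.Printf("processed=%d (%d/sec) failed=%d queue=%d elapsed=%s",
                total, total-last, s.failed.Load(), s.queued.Load(),
                time.Since(start).Round(time.Second))
            last = total
        }
    }
}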

πŸ”§ Technical Implementation Details

Concurrency Model

GoSpider's concurrency model is built around Go's CSP (Communicating Sequential Processes) paradigm:

  1. Main Goroutine (Producer):

    • Initializes the URL queue with the seed URL
    • Monitors queue state and worker completion
    • Implements graceful shutdown with consecutive empty checks
  2. Worker Goroutines (Consumers):

    • Each worker runs in its own goroutine
    • Pulls URLs from the shared channel
    • Processes content independently
    • Sends discovered URLs back to the queue
  3. File Writer Goroutines:

    • Separate pool of 16 writers
    • Receives file write requests via dedicated channel
    • Buffers writes to reduce I/O overhead
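
The three roles above fit together roughly as in the following self-contained toy; process is a stub standing in for fetching and parsing, and the real crawler's shutdown logic and limits are more involved:

package main

import (
    "fmt"
    "sync"
    "time"
)

// process stands in for fetch + parse; it returns URLs discovered on the page.
func process(u string) []string { return nil }

func main() {
    queue := make(chan string, 10000)
    visited := map[string]bool{"https://example.com": true}
    var mu sync.Mutex
    var wg sync.WaitGroup

    // Seed the queue with the starting URL (the producer's first job).
    queue <- "https://example.com"

    // Worker goroutines (consumers): pull URLs and feed discoveries back.
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for u := range queue {
                for _, found := range process(u) {
                    mu.Lock()
                    if !visited[found] {
                        visited[found] = true
                        select {
                        case queue <- found:
                        default: // queue full, drop
                        }
                    }
                    mu.Unlock()
                }
            }
        }()
    }

    // The main goroutine then monitors the queue and shuts down after
    // several consecutive checks find it empty.
    empty := 0
    for empty < 3 {
        time.Sleep(500 * time.Millisecond)
        if len(queue) == 0 {
            empty++
        } else {
            empty = 0
        }
    }
    // Closing is safe here because the stub never re-enqueues after the
    // drain; the real crawler coordinates shutdown more carefully.
    close(queue)
    wg.Wait()
    fmt.Println("crawl finished, unique URLs:", len(visited))
}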

Memory Management

  • URL Deduplication: Uses Go's native map with string keys for O(1) lookup
  • Domain Tracking: Separate map tracks unique domains encountered
  • Connection Reuse: HTTP response bodies are fully read and closed so that pooled connections can be reused
  • Goroutine Lifecycle: Workers use WaitGroup for proper cleanup

Performance Characteristics

| Metric | Value | Description |
|---|---|---|
| URL Processing Rate | 10-100/sec | Depends on network latency and content size |
| Memory Usage | ~500 MB | For 100,000 URLs with typical web content |
| Concurrent Connections | Up to 500/host | Configurable via HTTP client settings |
| File Write Throughput | 50-100 MB/s | With SSD and parallel writers |
| Startup Time | ~1-2 seconds | Including proxy validation |

Error Handling Strategy

  1. Network Errors: Logged but don't stop crawling
  2. Proxy Failures: Automatic fallback to next proxy or direct connection
  3. File System Errors: Fallback from async to sync writing
  4. Parse Errors: Skipped with logging, crawl continues
  5. Timeout Handling: 30-second timeout prevents hanging on slow servers
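
In code, most of this reduces to treating per-URL failures as non-fatal, roughly like the following sketch (illustrative, not the project's exact functions):

package crawler

import (
    "io"
    "log"
    "net/http"
    "time"
)

var client = &http.Client{Timeout: 30 * time.Second} // a slow server cannot hang a worker

// fetch returns the page body, or nil when the URL should simply be skipped.
func fetch(u string) []byte {
    resp, err := client.Get(u)
    if err != nil {
        log.Printf("skipping %s: %v", u, err) // logged; the crawl continues
        return nil
    }
    defer resp.Body.Close() // releases the connection back to the pool
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Printf("read error for %s: %v", u, err)
        return nil
    }
    return body
}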

πŸ“¦ Installation

Prerequisites

  • Go 1.24.4 or higher
  • Git (for installation from source)

Option 1: Install from Source

# Clone the repository
git clone https://github.com/aryanranderiya/gospider.git
cd gospider

# Install dependencies
go mod download

# Build the binary
go build -o gospider cmd/main.go

# Run directly
./gospider -url="https://example.com"

Option 2: Using Go Install

go install github.com/aryanranderiya/gospider/cmd@latest

Option 3: Download Binary

Download the latest binary from the releases page.

πŸ” How It Works

Crawling Process

  1. Initialization:

    • Loads configuration and validates proxies (if enabled)
    • Creates HTTP client with optimized settings
    • Initializes worker pool and file writers
    • Seeds queue with starting URL
  2. URL Processing Loop:

    For each URL in queue:
    β”œβ”€ Check if already visited
    β”œβ”€ Check domain limits
    β”œβ”€ Fetch content (with proxy if configured)
    β”œβ”€ Extract URLs from HTML
    β”œβ”€ Convert to Markdown
    β”œβ”€ Extract URLs from Markdown
    β”œβ”€ Queue new URLs for processing
    └─ Save files (if enabled)
    
  3. Graceful Shutdown:

    • Monitors queue emptiness
    • Waits for workers to finish
    • Closes file writers
    • Displays final statistics
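
The per-URL loop in step 2 can be read as a short pipeline; the sketch below spells it out in Go with stub helpers (every function here is a placeholder for the behaviour described elsewhere in this README):

package crawler

// Stub helpers standing in for the real implementations.
func alreadyVisited(u string) bool        { return false }
func domainLimitReached(u string) bool    { return false }
func fetchPage(u string) (string, error)  { return "", nil }
func extractURLs(content string) []string { return nil }
func toMarkdown(html string) string       { return html }
func enqueueURL(u string)                 {}
func saveMarkdown(u, markdown string)     {}

// processURL runs the pipeline for a single URL pulled from the queue.
func processURL(u string) {
    if alreadyVisited(u) || domainLimitReached(u) {
        return
    }
    html, err := fetchPage(u) // pooled client, optionally through a proxy
    if err != nil {
        return // logged and skipped in the real crawler
    }
    links := extractURLs(html)                      // URLs from the raw HTML
    markdown := toMarkdown(html)                    // HTML to Markdown conversion
    links = append(links, extractURLs(markdown)...) // plus URLs from the Markdown output
    for _, link := range links {
        enqueueURL(link) // deduplicated before entering the queue
    }
    saveMarkdown(u, markdown) // only when -save is enabled
}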

πŸš€ Usage

Basic Usage

# Crawl a single website
./gospider -url="https://example.com"

# Crawl with custom limits
./gospider -url="https://example.com" -domains=50 -urls=500 -workers=10

# Enable verbose output
./gospider -url="https://example.com" -verbose

# Download images and save files
./gospider -url="https://example.com" -images -save

Advanced Usage

# Use proxy rotation
./gospider -url="https://example.com" -proxies -workers=20

# Large-scale crawling
./gospider -url="https://example.com" -domains=1000 -urls=50000 -workers=50

# Save everything with detailed logging
./gospider -url="https://example.com" -save -images -verbose -domains=100

Examples of Usage

go run main.go -url=https://en.wikipedia.org/wiki/Apple_Inc. -workers=1000 -domains=1 -urls=100000
go run main.go -url=https://books.toscrape.com/ -workers=1000 -domains=1 -urls=100000
go run main.go -url=https://paulgraham.com/ -workers=1000 -domains=1 -urls=100000
go run main.go -url=https://quotes.toscrape.com/ -workers=1000 -domains=1 -urls=100000
go run main.go -url=https://iep.utm.edu/ -workers=1000 -domains=1 -urls=100000
go run main.go -url=https://gobyexample.com/ -workers=1000 -domains=1 -urls=100000
go run main.go -url=https://aryanranderiya.com -workers=1000 -domains=1 -urls=100000

Command Line Options

| Flag | Type | Default | Description |
|---|---|---|---|
| -url | string | required | Starting URL to crawl |
| -domains | int | 100 | Maximum number of domains to crawl |
| -urls | int | 1000 | Maximum number of URLs to process (0 = unlimited) |
| -workers | int | 5 | Number of concurrent workers |
| -proxies | bool | false | Use proxies from proxies.txt file |
| -images | bool | false | Download images found during crawling |
| -save | bool | false | Save markdown files to disk |
| -verbose | bool | false | Enable verbose output |

βš™οΈ Configuration

Output Structure

When using the -save flag, GoSpider creates an organized directory structure:

output/
β”œβ”€β”€ example.com/
β”‚   β”œβ”€β”€ index.md
β”‚   β”œβ”€β”€ about.md
β”‚   └── images/
β”‚       β”œβ”€β”€ logo.png
β”‚       └── banner.jpg
β”œβ”€β”€ blog.example.com/
β”‚   β”œβ”€β”€ post-1.md
β”‚   └── post-2.md
└── docs.example.com/
    └── api-reference.md
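
One way to derive this layout from a crawled URL with only the standard library (an assumption about how the paths could be produced, not necessarily the project's exact logic):

package crawler

import (
    "net/url"
    "path/filepath"
    "strings"
)

// outputPath maps a page URL to output/<host>/<page>.md.
func outputPath(rawURL string) (string, error) {
    u, err := url.Parse(rawURL)
    if err != nil {
        return "", err
    }
    name := strings.Trim(u.Path, "/")
    if name == "" {
        name = "index" // the site root becomes index.md
    }
    name = strings.ReplaceAll(name, "/", "-") // flatten nested paths into one filename
    return filepath.Join("output", u.Host, name+".md"), nil
}

For example, outputPath("https://blog.example.com/post-1") returns "output/blog.example.com/post-1.md".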

Proxy Configuration

Create a proxies.txt file in the project root with one proxy per line:

http://proxy1.example.com:8080
http://user:pass@proxy2.example.com:3128
socks5://proxy3.example.com:1080

Proxy features:

  • Automatic validation on startup
  • Random selection per request
  • Failure tracking and blacklisting
  • Transparent fallback to direct connection
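
Random per-request selection can be wired into the HTTP transport through its Proxy hook, roughly as follows (a sketch that assumes the validated entries from proxies.txt are already parsed into a slice):

package crawler

import (
    "math/rand"
    "net/http"
    "net/url"
)

// proxies holds the entries from proxies.txt that passed startup validation.
var proxies []*url.URL

// randomProxyTransport picks a random proxy for each request and falls back
// to a direct connection when no validated proxies are available.
func randomProxyTransport() *http.Transport {
    return &http.Transport{
        Proxy: func(req *http.Request) (*url.URL, error) {
            if len(proxies) == 0 {
                return nil, nil // a nil proxy URL means a direct connection
            }
            return proxies[rand.Intn(len(proxies))], nil
        },
    }
}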

πŸ“„ License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
