A blazing fast, concurrent web crawler built in Go for extracting product information from e-commerce websites. It detects sitemaps, crawls product pages, and extracts structured product data. Key features:
- Sitemap detection and parsing
- Smart URL filtering and pattern matching
- Product schema extraction (supports JSON-LD)
- Fallback to OpenGraph meta tags
- CSV export functionality
- Concurrent crawling with rate limiting
- Price normalization (handles IRR currency)
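The price normalization mentioned above isn't spelled out in this README; as a rough illustration of what IRR handling could look like, here is a minimal sketch (the helper name `normalizePrice` and its exact behavior are assumptions, not taken from the repo):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// normalizePrice is a hypothetical helper: it strips the currency label and
// thousands separators from a raw price string and parses the digits that remain.
// The real logic in pkg/utils may differ.
func normalizePrice(raw string) (int64, error) {
	cleaned := strings.NewReplacer("IRR", "", "ریال", "", ",", "", " ", "").Replace(raw)
	return strconv.ParseInt(strings.TrimSpace(cleaned), 10, 64)
}

func main() {
	price, err := normalizePrice("1,250,000 IRR")
	if err != nil {
		fmt.Println("could not parse price:", err)
		return
	}
	fmt.Println(price) // 1250000
}
```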
Built with:
- Colly - the workhorse powering the crawling
- httpx - robust HTTP interactions
- goflags - CLI flag parsing
- gologger - clean, leveled logging
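None of the project's crawler code is shown here, but for orientation, a minimal Colly collector with the kind of rate limiting listed in the features might look roughly like this (the Colly v2 import path, the domain, delays, and log messages are placeholders, not the project's actual configuration):

```go
package main

import (
	"time"

	"github.com/gocolly/colly/v2"
	"github.com/projectdiscovery/gologger"
)

func main() {
	// Placeholder domain; the real crawler targets a configurable base URL.
	c := colly.NewCollector(colly.AllowedDomains("example.com"))

	// Throttle requests so the target site is not hammered.
	if err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		Delay:       500 * time.Millisecond,
	}); err != nil {
		gologger.Fatal().Msgf("could not set limit rule: %s", err)
	}

	c.OnRequest(func(r *colly.Request) {
		gologger.Info().Msgf("visiting %s", r.URL)
	})

	if err := c.Visit("https://example.com/"); err != nil {
		gologger.Error().Msgf("crawl failed: %s", err)
	}
}
```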
To get it running:
- Clone this repo: `git clone https://github.com/yourusername/HyperHunt-GO-web-crawler.git`, then `cd HyperHunt-GO-web-crawler`
- Install dependencies: `go mod download`
- Run it: `go run main.go`
Here's how it works:
- First, it checks for a sitemap at common locations (`/sitemap.xml` or `/sitemap_index.xml`)
- If found, it parses the sitemap to extract all product URLs (a rough sketch follows this list)
- For each URL, it:
  - Attempts to extract product data from JSON-LD schema (also sketched below)
  - Falls back to OpenGraph meta tags if needed
  - Normalizes prices and data formats
- Exports results to CSV files (also sketched below):
  - `raw_links.csv`: All discovered URLs
  - `proper_urls.csv`: Filtered URLs matching product patterns
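Fetching and parsing a sitemap can be done with the standard library alone; a hedged sketch of that step (the URL and struct are illustrative, the real parsing lives in the project's packages):

```go
package main

import (
	"encoding/xml"
	"fmt"
	"net/http"
)

// urlset mirrors the minimal structure of a standard sitemap.xml.
type urlset struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

func main() {
	// Placeholder location; the crawler probes /sitemap.xml and /sitemap_index.xml.
	resp, err := http.Get("https://example.com/sitemap.xml")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	defer resp.Body.Close()

	var s urlset
	if err := xml.NewDecoder(resp.Body).Decode(&s); err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for _, u := range s.URLs {
		fmt.Println(u.Loc)
	}
}
```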
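The schema-first, OpenGraph-fallback extraction described above could be sketched with Colly callbacks along these lines (the selectors, struct, and URL are illustrative; the actual extraction lives in `pkg/crawler` and may look different):

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/gocolly/colly/v2"
)

// productSchema holds just the JSON-LD fields this sketch cares about.
type productSchema struct {
	Type string `json:"@type"`
	Name string `json:"name"`
}

func main() {
	c := colly.NewCollector()

	// Preferred source: JSON-LD <script> blocks describing a Product.
	c.OnHTML(`script[type="application/ld+json"]`, func(e *colly.HTMLElement) {
		var p productSchema
		if err := json.Unmarshal([]byte(e.Text), &p); err == nil && p.Type == "Product" {
			fmt.Println("JSON-LD product:", p.Name)
		}
	})

	// Fallback: OpenGraph meta tags for pages without a usable Product schema.
	c.OnHTML(`meta[property="og:title"]`, func(e *colly.HTMLElement) {
		fmt.Println("OpenGraph title:", e.Attr("content"))
	})

	// Placeholder URL; in the real crawler these come from the sitemap.
	if err := c.Visit("https://example.com/product/123"); err != nil {
		fmt.Println("visit failed:", err)
	}
}
```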
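Similarly, the CSV export boils down to writing the collected URLs out with the standard `encoding/csv` package; something along these lines (only the `raw_links.csv` file name comes from the description above, the rest is illustrative):

```go
package main

import (
	"encoding/csv"
	"log"
	"os"
)

// writeURLs dumps a slice of URLs into a single-column CSV file.
func writeURLs(path string, urls []string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := csv.NewWriter(f)
	defer w.Flush()

	if err := w.Write([]string{"url"}); err != nil {
		return err
	}
	for _, u := range urls {
		if err := w.Write([]string{u}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	urls := []string{"https://example.com/product/1", "https://example.com/product/2"}
	if err := writeURLs("raw_links.csv", urls); err != nil {
		log.Fatal(err)
	}
}
```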
Project layout:

```
.
├── main.go        # Entry point
└── pkg/
    ├── crawler/   # Core crawling logic
    ├── fileops/   # File operations (CSV handling)
    ├── models/    # Data models
    └── utils/     # Helper functions
```
The crawler is configured to work with specific e-commerce sites out of the box. You can modify the base URL in `main.go`:

```go
baseURL := "https://your-target-site.com/"
```
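If you change the base URL, you will likely also want the collector to stay on that host; a hedged sketch of that wiring (the real `main.go` may handle this differently):

```go
package main

import (
	"net/url"

	"github.com/gocolly/colly/v2"
	"github.com/projectdiscovery/gologger"
)

func main() {
	baseURL := "https://your-target-site.com/"

	// Derive the host from the base URL so the crawler does not wander off-site.
	u, err := url.Parse(baseURL)
	if err != nil {
		gologger.Fatal().Msgf("invalid base URL: %s", err)
	}

	c := colly.NewCollector(colly.AllowedDomains(u.Host))
	if err := c.Visit(baseURL); err != nil {
		gologger.Error().Msgf("crawl failed: %s", err)
	}
}
```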
PRs are welcome! Just:
- Fork it
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a PR