Skip to content

NamNV2496/go-crawler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Web Crawler

A simple web crawler written in Go that crawls websites and extracts information.

Main Features

  • Crawls web pages starting from a given URL with method GET, POST and CURL
  • Manage multiple queue with priority for crawlers
  • Run by cron job
  • Send message to Telegram

Technologies

- Go: 1.23.4
- Redis: 7-alpine
- Kafka: 3.7.1
- PostgreSQL: 17

- Ratelimit: "github.com/go-redis/redis_rate/v9"
- Cron job: "github.com/robfig/cron/v3"
- Gorm: "gorm.io/gorm"

Architecture

flowchart LR
    conjob[conjob] --> urlservice[url-service]
    urlservice --> |"get urls"| urldb[url DB]
    urlservice -- "priority queue" --> priorityq((Kafka))
    urlservice -- "normal queue" --> normalq((Kafka))
    priorityq -- "priority queue" --> worker
    normalq -- "normal queue" --> worker
    worker[worker<br>ratelimiter] -- "publish result" --> resultq((Kafka))
    resultq -- "consume" --> db[(DB)]
    worker -- "notify through tele if<br>match with rules" --> telegram[Telegram]
Loading

alt text

title Crawler
sequenceDiagram
    participant cron-job
    participant crawler-service
    participant url-DB
    participant Kafka
    participant crawler-worker
    participant result-DB

    cron-job ->> cron-job: normal queue run every 5m
    cron-job ->> cron-job: priority queue run every 15m
    cron-job ->> crawler-service: trigger to get job
    crawler-service ->> url-DB: GetUrlByDomainsAndQueues
    crawler-service <-- url-DB: get urls
    crawler-service ->> Kafka: public message to queue by topic
    Kafka -->> crawler-worker: consume message
    crawler-worker ->> crawler-worker: crawl by url
    crawler-worker ->> result-DB: write data to result DB

alt text

How to run

Setup your environment variables in .env.example and change it to .env

# Start docker
docker-compose up -d
cd crawler-service

# Terminal 1
go run main.go server

make service


# Terminal 2
go run main.go crawler-worker

make worker

1. Create new bot and get token

alt text alt text

2. Run command to get chat Id

curl -s https://api.telegram.org/bot${TOKEN}/getUpdates

alt text

3. Worker consumer message example

publish to crawler queue: normal, request: curl --location 'https://m.cafef.vn/du-lieu/Ajax/ajaxgoldprice.ashx?index=11' --header 'Accept: */*' --header 'Accept-Language: en-US,en;q=0.9,vi;q=0.8' --header 'Connection: keep-alive' --header 'Referer: https://m.cafef.vn/du-lieu/gia-vang-hom-nay/trong-nuoc.chn' --header 'Sec-Fetch-Dest: empty' --header 'Sec-Fetch-Mode: cors' --header 'Sec-Fetch-Site: same-origin' --header 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36 Edg/135.0.0.0' --header 'sec-ch-ua: "Microsoft Edge";v="135", "Not-A.Brand";v="8", "Chromium";v="135"' --header 'sec-ch-ua-mobile: ?0' --header 'sec-ch-ua-platform: "macOS"' --header 'Cookie: _ga=GA1.2.1174992577.1733489327; _ga_860L8F5EZP=GS1.1.1740282133.10.0.1740282328.0.0.0; ASP.NET_SessionId=wnors2tpgmcb0lwvqwebtsf5; favorite_stocks_state=1'

About

A Web crawler in Golang + SEO (Search engine optimize)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published