- A Node.js script that scrapes metadata and social links from public webpages.
- Node
- Supported Node versions: v14, v18, v20.10.0 (default)
- Node Version Manager (nvm)
- Puppeteer
- A Node library that provides a high-level API to control Chrome
- TypeScript
- TypeScript is JavaScript with syntax for types. Doc
- Node.js with TypeScript
- puppeteer
- puppeteer-extra
- puppeteer-extra-plugin-stealth
- sharp
- fs-extra
- temp
- rimraf (dev)
- nodemon (dev)
- ts-node (dev)
- typescript (dev)
Install all dependencies with:
npm install
build
└── index.js
└── ...
config
└── config.json
src
└── pages
    ├── index.ts
    ├── identifiers.ts
└── environment
    ├── config.ts
└── utils
    ├── index.ts
└── index.ts
types
└── index.d.ts
outputs
└── *.json
screenshots
└── *.jpg
- build: The latest generated JavaScript code.
- config: Deployment and proxy configuration.
- src: The main scraper source code, written in TypeScript.
- types: Type and interface definitions.
- outputs: Scraped data in JSON format.
- screenshots: Compressed screenshots in JPG format.
- DOMAINS (required): Comma-separated list of domains to scrape, e.g. github.com,ranbot.online
- HEADLESS (optional): Set to true or false to control browser mode (default from config).
- ENV (optional): Used in Docker, default is production.
- CONCURRENCY (optional): Used in Docker, default is 8.
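The environment variables above could be read along these lines. This is a minimal sketch, not the scraper's actual code: the `ScraperEnv` interface and the `readEnv` function are hypothetical names, and treating an unset `HEADLESS` as "use the config value" is an assumption based on the description above.

```typescript
// Hypothetical shape for the parsed settings (not taken from the project).
interface ScraperEnv {
  domains: string[];
  headless: boolean | undefined; // undefined means "fall back to config"
  env: string;
  concurrency: number;
}

// Hypothetical helper: parses the documented variables from a plain object.
function readEnv(vars: Record<string, string | undefined>): ScraperEnv {
  // DOMAINS is required: split on commas and drop empty entries.
  const domains = (vars.DOMAINS ?? "")
    .split(",")
    .map((d) => d.trim())
    .filter((d) => d.length > 0);
  if (domains.length === 0) {
    throw new Error("DOMAINS is required, e.g. DOMAINS=github.com,ranbot.online");
  }

  // HEADLESS is optional; only the literal strings "true" and "false" are recognised here.
  let headless: boolean | undefined;
  if (vars.HEADLESS === "true") headless = true;
  else if (vars.HEADLESS === "false") headless = false;

  // CONCURRENCY falls back to the documented default of 8 when missing or invalid.
  const parsed = parseInt(vars.CONCURRENCY ?? "", 10);
  const concurrency = parsed > 0 ? parsed : 8;

  // ENV falls back to the documented default of "production".
  return { domains, headless, env: vars.ENV ?? "production", concurrency };
}

const settings = readEnv({ DOMAINS: "github.com, ranbot.online", HEADLESS: "false" });
console.log(settings);
```

In a real entry point the same function would presumably be called as `readEnv(process.env)`; taking a plain object keeps the sketch easy to test.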
npm run start:dev
Starts the application in development mode, using nodemon and ts-node for cold reloading.
npm run build
Builds the app into build, cleaning the folder first.
npm run start
Starts the app in production by first building the project with npm run build, and then executing the compiled JavaScript at build/index.js.
env DOMAINS=github.com node build/index.js
Or with multiple domains:
env DOMAINS=github.com,ranbot.online node build/index.js
- Screenshots: ./screenshots/<domain>.jpg (compressed, 1024px wide)
- Data: ./outputs/<domain>_<timestamp>.json (pretty-printed)
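The output naming above can be sketched as follows. All function names here are hypothetical, and the timestamp format (an ISO 8601 string with colons replaced by hyphens) is inferred from the file names in the sample log, not confirmed from the source. The sample log also shows ./screenshots/github.jpg, so the real screenshot naming may shorten the domain; this sketch follows the documented pattern.

```typescript
// Hypothetical helper: ISO 8601 timestamps contain ":", which is not valid
// in file names on some systems, so colons are replaced with hyphens.
function timestampForFile(date: Date): string {
  return date.toISOString().replace(/:/g, "-");
}

// Builds the data file path following the ./outputs/<domain>_<timestamp>.json pattern.
function dataPath(domain: string, date: Date): string {
  return `./outputs/${domain}_${timestampForFile(date)}.json`;
}

// Builds the screenshot path following the ./screenshots/<domain>.jpg pattern.
function screenshotPath(domain: string): string {
  return `./screenshots/${domain}.jpg`;
}

// Pretty-prints scraped data; the 2-space indent is an assumption.
function serialize(data: unknown): string {
  return JSON.stringify(data, null, 2);
}

const when = new Date("2025-05-25T08:08:41.791Z");
console.log(dataPath("github.com", when));
// prints "./outputs/github.com_2025-05-25T08-08-41.791Z.json"
```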
Build and run the scraper in Docker:
docker build -t web-scraper .
docker run -e DOMAINS=github.com,ranbot.online web-scraper
➜ web-scraper git:(main) ✗ env DOMAINS=github.com,ranbot.online node build/index.js
[2025-05-25T08:08:26.742Z] >> Starting Web Scraper ......
[2025-05-25T08:08:26.974Z] ┌─────────┬───────┬────────────────────────────────────────┐
│ (index) │ tries │ identifier │
├─────────┼───────┼────────────────────────────────────────┤
│ 0 │ 0 │ { id: 0, identifier: 'github.com' } │
│ 1 │ 0 │ { id: 1, identifier: 'ranbot.online' } │
└─────────┴───────┴────────────────────────────────────────┘
[2025-05-25T08:08:26.974Z] >> Queue Size: 2
[2025-05-25T08:08:26.974Z] { tries: 0, identifier: { id: 0, identifier: 'github.com' } }
[2025-05-25T08:08:27.029Z] [github.com] -> visiting: https://github.com
[2025-05-25T08:08:35.876Z] [github.com] -> page loaded
[2025-05-25T08:08:41.790Z] [github.com] -> screenshot written to ./screenshots/github.jpg
[2025-05-25T08:08:41.792Z] [github.com] -> data written to ./outputs/github.com_2025-05-25T08-08-41.791Z.json
[2025-05-25T08:08:41.794Z] { tries: 0, identifier: { id: 1, identifier: 'ranbot.online' } }
[2025-05-25T08:08:41.864Z] [ranbot.online] -> visiting: https://ranbot.online
[2025-05-25T08:08:49.567Z] [ranbot.online] -> page loaded
[2025-05-25T08:08:52.675Z] [ranbot.online] -> screenshot written to ./screenshots/ranbot.jpg
[2025-05-25T08:08:52.675Z] [ranbot.online] -> data written to ./outputs/ranbot.online_2025-05-25T08-08-52.675Z.json
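The log above prints each queue entry as `{ tries, identifier: { id, identifier } }`. A minimal sketch of how such entries could be built from the DOMAINS list is shown below. The type and function names are hypothetical, and the retry policy in `requeueOnFailure` is purely an assumption, since the sample run only ever shows `tries: 0`.

```typescript
// Shapes mirror the entries printed in the log; the names are hypothetical.
interface Identifier {
  id: number;
  identifier: string;
}

interface QueueEntry {
  tries: number;
  identifier: Identifier;
}

// Builds one queue entry per domain, numbering them in input order.
function buildQueue(domains: string[]): QueueEntry[] {
  return domains.map((domain, index) => ({
    tries: 0,
    identifier: { id: index, identifier: domain },
  }));
}

// Assumed retry policy: push the entry back with tries + 1 until maxTries
// attempts have been made. Returns true if the entry was requeued.
function requeueOnFailure(queue: QueueEntry[], entry: QueueEntry, maxTries: number): boolean {
  if (entry.tries + 1 >= maxTries) return false;
  queue.push({ ...entry, tries: entry.tries + 1 });
  return true;
}

const queue = buildQueue(["github.com", "ranbot.online"]);
console.table(queue);
console.log(`>> Queue Size: ${queue.length}`);
```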