ranbot-ai/web-scraper

Web Scraper | 2025 Activity

  • A Node.js script that scrapes metadata and social links from public webpages.
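To illustrate what "social links" can mean in practice, here is a minimal sketch that classifies already-collected href values by hostname. The `SOCIAL_HOSTS` map and the `extractSocialLinks` function are hypothetical names chosen for this example; the project's actual list of networks and its extraction logic are not shown in this README.

```typescript
// Hypothetical hostname-to-network map (an assumption, not the project's real list).
const SOCIAL_HOSTS = new Map<string, string>([
  ["twitter.com", "twitter"],
  ["x.com", "twitter"],
  ["facebook.com", "facebook"],
  ["instagram.com", "instagram"],
  ["linkedin.com", "linkedin"],
  ["youtube.com", "youtube"],
  ["github.com", "github"],
]);

// Given href values collected from a page, keep the first link found per known network.
function extractSocialLinks(hrefs: string[]): Record<string, string> {
  const found: Record<string, string> = {};
  for (const href of hrefs) {
    let host: string;
    try {
      // Normalise the hostname: lower-case it and drop a leading "www.".
      host = new URL(href).hostname.toLowerCase().replace(/^www\./, "");
    } catch {
      continue; // relative or malformed URLs cannot be classified
    }
    const network = SOCIAL_HOSTS.get(host);
    if (network !== undefined && !(network in found)) {
      found[network] = href;
    }
  }
  return found;
}
```

In a Puppeteer-based scraper, the `hrefs` array would typically come from the anchor elements of the loaded page; how this project gathers them is not documented here.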

Technology

  • Node
    • The scraper runs on Node v14, v18, or v20.10.0 (default)
    • Node Version Manager (nvm)
  • Puppeteer
    • A Node library that provides a high-level API to control Chrome
  • TypeScript

Dependencies

  • puppeteer
  • puppeteer-extra
  • puppeteer-extra-plugin-stealth
  • sharp
  • fs-extra
  • temp
  • rimraf (dev)
  • nodemon (dev)
  • ts-node (dev)
  • typescript (dev)

Install all dependencies with:

npm install

Structure

  build
    ├── index.js
    └── ...
  config
    └── config.json
  src
    ├── pages
    │   ├── index.ts
    │   └── identifiers.ts
    ├── environment
    │   └── config.ts
    ├── utils
    │   └── index.ts
    └── index.ts
  types
    └── index.d.ts
  outputs
    └── *.json
  screenshots
    └── *.jpg
  • build: The most recently compiled JavaScript output.
  • config: Deployment and proxy configuration.
  • src: The scraper's main source code, written in TypeScript.
  • types: Type and interface definitions.
  • outputs: Scraped data in JSON format.
  • screenshots: Compressed screenshots in JPG format.

Environment Variables

  • DOMAINS (required): Comma-separated list of domains to scrape, e.g. github.com,ranbot.online
  • HEADLESS (optional): Set to true or false to control browser mode (defaults to the value in the config).
  • ENV (optional): Used in Docker, default is production.
  • CONCURRENCY (optional): Used in Docker, default is 8.
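The variables above could be read along these lines. This is a minimal sketch, not the project's actual loader: `ScraperEnv` and `parseEnv` are hypothetical names, and the fallbacks for ENV and CONCURRENCY simply mirror the Docker defaults listed above.

```typescript
// Hypothetical shape for the parsed settings (not taken from the project's types).
interface ScraperEnv {
  domains: string[];
  headless: boolean | undefined; // undefined means "fall back to the config value"
  env: string;
  concurrency: number;
}

function parseEnv(env: Record<string, string | undefined>): ScraperEnv {
  // DOMAINS is a comma-separated list; trim entries and drop empty ones.
  const domains = (env.DOMAINS ?? "")
    .split(",")
    .map((d) => d.trim())
    .filter((d) => d.length > 0);
  if (domains.length === 0) {
    throw new Error("DOMAINS is required, e.g. DOMAINS=github.com,ranbot.online");
  }

  // Only the literal strings "true" and "false" override the config default.
  let headless: boolean | undefined;
  if (env.HEADLESS === "true") headless = true;
  else if (env.HEADLESS === "false") headless = false;

  // Fall back to 8 when CONCURRENCY is missing or not a positive integer.
  const parsed = Number.parseInt(env.CONCURRENCY ?? "", 10);
  const concurrency = Number.isInteger(parsed) && parsed > 0 ? parsed : 8;

  return { domains, headless, env: env.ENV ?? "production", concurrency };
}
```

In a Node process this would be called as `parseEnv(process.env)`. Failing fast on a missing DOMAINS keeps a misconfigured run from starting with an empty queue.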

Scripts Overview

npm run start:dev

Starts the application in development mode using nodemon and ts-node, restarting the process whenever a source file changes.

npm run build

Compiles the app into the build folder, cleaning the folder first.

npm run start

Starts the app in production by first building the project with npm run build, and then executing the compiled JavaScript at build/index.js.

Usage Examples

env DOMAINS=github.com node build/index.js

Or with multiple domains:

env DOMAINS=github.com,ranbot.online node build/index.js

Output

  • Screenshots: ./screenshots/<name>.jpg (compressed, 1024px wide); in the sample run below, github.com is saved as github.jpg
  • Data: ./outputs/<domain>_<timestamp>.json (pretty-printed)
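The data file name can be reproduced with a few lines. This sketch assumes the timestamp is an ISO 8601 string with colons replaced by hyphens, which matches the file names in the sample log further down; `outputPath` and `serialize` are hypothetical helper names, not functions from the project.

```typescript
// Build ./outputs/<domain>_<timestamp>.json; colons are replaced so the
// timestamp is safe to use in file names.
function outputPath(domain: string, when: Date): string {
  const stamp = when.toISOString().replace(/:/g, "-");
  return `./outputs/${domain}_${stamp}.json`;
}

// Pretty-print with two-space indentation (the indent width is an assumption).
function serialize(data: unknown): string {
  return JSON.stringify(data, null, 2);
}
```

The resulting string would then be written to disk, for example with `fs.writeFile` or a helper from fs-extra.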

Docker Usage

Build and run the scraper in Docker:

docker build -t web-scraper .
docker run -e DOMAINS=github.com,ranbot.online web-scraper

Response Example

➜  web-scraper git:(main) ✗ env DOMAINS=github.com,ranbot.online node build/index.js
[2025-05-25T08:08:26.742Z] >> Starting Web Scraper ......
[2025-05-25T08:08:26.974Z] ┌─────────┬───────┬────────────────────────────────────────┐
│ (index) │ tries │               identifier               │
├─────────┼───────┼────────────────────────────────────────┤
│    0    │   0   │  { id: 0, identifier: 'github.com' }   │
│    1    │   0   │ { id: 1, identifier: 'ranbot.online' } │
└─────────┴───────┴────────────────────────────────────────┘
[2025-05-25T08:08:26.974Z] >> Queue Size: 2
[2025-05-25T08:08:26.974Z] { tries: 0, identifier: { id: 0, identifier: 'github.com' } }
[2025-05-25T08:08:27.029Z] [github.com] -> visiting: https://github.com
[2025-05-25T08:08:35.876Z] [github.com] -> page loaded
[2025-05-25T08:08:41.790Z] [github.com] -> screenshot written to ./screenshots/github.jpg
[2025-05-25T08:08:41.792Z] [github.com] -> data written to ./outputs/github.com_2025-05-25T08-08-41.791Z.json
[2025-05-25T08:08:41.794Z] { tries: 0, identifier: { id: 1, identifier: 'ranbot.online' } }
[2025-05-25T08:08:41.864Z] [ranbot.online] -> visiting: https://ranbot.online
[2025-05-25T08:08:49.567Z] [ranbot.online] -> page loaded
[2025-05-25T08:08:52.675Z] [ranbot.online] -> screenshot written to ./screenshots/ranbot.jpg
[2025-05-25T08:08:52.675Z] [ranbot.online] -> data written to ./outputs/ranbot.online_2025-05-25T08-08-52.675Z.json
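
The queue entries in the log above have the shape `{ tries, identifier: { id, identifier } }`. The sketch below builds such a queue and drains it one item at a time. The retry policy (`maxTries`, re-queueing on failure) is an assumption suggested by the `tries` counter, and `QueueItem`, `buildQueue`, and `drainQueue` are hypothetical names rather than the project's own.

```typescript
interface QueueItem {
  tries: number;
  identifier: { id: number; identifier: string };
}

// One queue entry per domain, numbered in input order, as in the logged table.
function buildQueue(domains: string[]): QueueItem[] {
  return domains.map((domain, id) => ({
    tries: 0,
    identifier: { id, identifier: domain },
  }));
}

// Process items sequentially. A failed item is re-queued until it has been
// tried maxTries times (assumed policy); returns the domains that never succeeded.
async function drainQueue(
  queue: QueueItem[],
  scrape: (domain: string) => Promise<void>,
  maxTries = 3,
): Promise<string[]> {
  const failed: string[] = [];
  while (queue.length > 0) {
    const item = queue.shift()!;
    try {
      await scrape(item.identifier.identifier);
    } catch {
      item.tries += 1;
      if (item.tries < maxTries) queue.push(item);
      else failed.push(item.identifier.identifier);
    }
  }
  return failed;
}
```

The sample run processes github.com to completion before starting ranbot.online, which is why this sketch is sequential; how the real scraper applies CONCURRENCY is not shown in this README.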
