Crawling Project

_________                       ______            
__  ____/____________ ___      ____  /____________
_  /    __  ___/  __ `/_ | /| / /_  /_  _ \_  ___/
/ /___  _  /   / /_/ /__ |/ |/ /_  / /  __/  /    
\____/  /_/    \__,_/ ____/|__/ /_/  \___//_/     
  

Hybrid Web Crawler

A powerful and resource-efficient web crawler built with Node.js. It features a hybrid crawling strategy, an interactive CLI dashboard, and robust job management.


Key Features · Tech Stack · Getting Started · Running the App · Contributing · License


✨ Key Features

  • Hybrid Crawling Strategy: Starts with a lightweight HTTP fetch and escalates to a full Puppeteer browser only when necessary (e.g., for sites protected by Cloudflare). This saves significant system resources; a minimal sketch of the escalation logic appears after this list.
  • Multi-Threaded Performance: Utilizes all available CPU cores with Node.js worker_threads to process multiple jobs in parallel.
  • Interactive CLI Dashboard: A real-time dashboard shows the status of the crawl queue (pending, processing, done, failed) and provides detailed statistics for each source.
  • Dynamic Job Control: Add new crawling jobs on the fly through the interactive command line.
  • Robust Job Management:
    • Automatic Retry: Retries jobs that fail due to temporary network errors.
    • Stuck Job Recovery: On startup, automatically requeues jobs that were stuck in a processing state from a previous run (a recovery sketch also follows this list).
    • Bad Source Pruning: Automatically detects and stops crawling from source URLs that consistently produce errors, preventing wasted resources.
  • Efficient Link Extraction: Discovers and queues new, same-origin links from crawled pages to expand the crawl frontier.
  • Configurable Data Extraction: Specify exactly what content to extract from pages using CSS selectors for each job.
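
The hybrid strategy could look roughly like the sketch below: try a plain Undici request first and launch Puppeteer only when the response looks like a block page. The heuristics and function names here are illustrative assumptions, not the project's actual code.

import { request } from 'undici';
import puppeteer from 'puppeteer';

async function fetchPage(url) {
  // Cheap path: plain HTTP fetch via Undici.
  const res = await request(url);
  const html = await res.body.text();

  // Crude "needs a real browser" heuristic (status codes plus a common
  // Cloudflare interstitial phrase) -- an assumption, not the project's
  // actual detection logic.
  const looksBlocked =
    res.statusCode === 403 ||
    res.statusCode === 503 ||
    html.includes('Just a moment...');

  if (!looksBlocked) return html;

  // Expensive path: full browser render with Puppeteer.
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    return await page.content();
  } finally {
    await browser.close();
  }
}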

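Stuck Job Recovery is essentially a status reset at startup. A minimal sketch with the official MongoDB driver, assuming a jobs collection with a status field (collection and field names are illustrative, not the project's schema):

import { MongoClient } from 'mongodb';

async function requeueStuckJobs(mongoUrl, dbName) {
  const client = new MongoClient(mongoUrl);
  await client.connect();
  const jobs = client.db(dbName).collection('jobs');

  // Any job still marked "processing" from a previous run has no live
  // worker, so move it back to "pending" for re-dispatch.
  const result = await jobs.updateMany(
    { status: 'processing' },
    { $set: { status: 'pending' } }
  );

  console.log(`Requeued ${result.modifiedCount} stuck job(s)`);
  await client.close();
}
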
🛠️ Tech Stack

  • Core: Node.js, Worker Threads
  • Crawling: Undici (for lightweight fetching), Puppeteer (for heavy-duty, JS-rendered pages)
  • Data Extraction: Cheerio
  • Database: MongoDB
  • CLI: Chalk, cli-table3
  • Development: Nodemon
  • Testing: Vitest

🚀 Getting Started

Prerequisites

  • Node.js (v20 or higher)
  • npm (v10 or higher)
  • MongoDB

Installation

  1. Clone the repository:
    git clone https://github.com/ademchaoua/crawling.git
  2. Navigate to the project directory:
    cd crawling
  3. Install the dependencies:
    npm install

Configuration

Project configuration is located in config/index.js. You can modify settings like database connections, crawl concurrency, and retry logic there. No .env file is required by default.
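
For orientation, here is a hypothetical shape for config/index.js based only on the settings described above (database connection, crawl concurrency, retry logic); the key names and values are illustrative assumptions, not the project's actual configuration.

export default {
  mongo: {
    url: 'mongodb://localhost:27017', // local MongoDB instance
    dbName: 'crawler',
  },
  crawl: {
    concurrency: 4,      // parallel worker threads
    maxRetries: 3,       // retries for transient network errors
    retryDelayMs: 5000,  // wait between retry attempts
  },
};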

🏃 Running the Application

  • To start the crawler, run:
    npm start
  • For development with auto-reloading, run:
    npm run dev

Once running, the interactive dashboard will appear.

Interactive Commands

  • Add a new crawl job:

    add <url> <cssSelector1,cssSelector2,...>
    
    • <url>: The starting URL to crawl.
    • <cssSelectors>: A comma-separated list of CSS selectors to extract content from (an extraction sketch follows these commands).
    • Example: add https://news.ycombinator.com .titleline,.sitebit a
  • Exit the application:

    exit
    
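A rough sketch of how the selectors passed to add, and the same-origin link discovery, could work with Cheerio; this mirrors the feature list above rather than the project's exact implementation.

import * as cheerio from 'cheerio';

function extract(html, pageUrl, selectors) {
  const $ = cheerio.load(html);

  // Collect the text of every element matching each configured selector.
  const data = {};
  for (const selector of selectors) {
    data[selector] = $(selector)
      .map((_, el) => $(el).text().trim())
      .get();
  }

  // Collect same-origin links to grow the crawl frontier.
  const origin = new URL(pageUrl).origin;
  const links = new Set();
  $('a[href]').each((_, el) => {
    try {
      const resolved = new URL($(el).attr('href'), pageUrl);
      if (resolved.origin === origin) links.add(resolved.href);
    } catch {} // ignore malformed hrefs
  });

  return { data, links: [...links] };
}

// e.g. extract(html, 'https://news.ycombinator.com', ['.titleline', '.sitebit a'])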

📂 Project Structure

.
├── config/
│   └── index.js        # Main project configuration
├── src/
│   ├── core/
│   │   ├── processer.js  # HTML fetching, data/link extraction (Cheerio)
│   │   └── worker.js     # Core crawling logic for both fetch and Puppeteer workers
│   ├── db/
│   │   └── index.js      # MongoDB connection, collections, and queries
│   ├── logger/
│   │   └── index.js      # Console and file logging setup
│   └── main.js         # Application entry point, CLI dashboard, and worker management
├── tests/
│   └── processor.test.js # Unit tests
├── package.json
└── README.md

🤝 Contributing

Contributions are welcome! We have a Code of Conduct that we expect all contributors to adhere to. Please read it before contributing.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📜 License

Distributed under the ISC License.

📧 Contact

Adem Chaoua - adem.chaoua.1444@gmail.com

Project Link: https://github.com/ademchaoua/crawling
