Crawler
A powerful and resource-efficient web crawler built with Node.js. It features a hybrid crawling strategy, an interactive CLI dashboard, and robust job management.
Key Features • Tech Stack • Getting Started • Running the App • Contributing • License
- Hybrid Crawling Strategy: Starts with a lightweight HTTP fetch and intelligently escalates to a full Puppeteer browser only when necessary (e.g., for sites protected by Cloudflare). This saves significant system resources (see the escalation sketch after this list).
- Multi-Threaded Performance: Utilizes all available CPU cores with Node.js `worker_threads` to process multiple jobs in parallel (see the worker-pool sketch after this list).
- Interactive CLI Dashboard: A real-time dashboard shows the status of the crawl queue (pending, processing, done, failed) and provides detailed statistics for each source.
- Dynamic Job Control: Add new crawling jobs on the fly through the interactive command line.
- Robust Job Management:
- Automatic Retry: Retries jobs that fail due to temporary network errors.
- Stuck Job Recovery: On startup, automatically requeues jobs that were stuck in a `processing` state from a previous run (see the recovery sketch after this list).
- Bad Source Pruning: Automatically detects and stops crawling from source URLs that consistently produce errors, preventing wasted resources.
- Efficient Link Extraction: Discovers and queues new, same-origin links from crawled pages to expand the crawl frontier.
- Configurable Data Extraction: Specify exactly what content to extract from pages using CSS selectors for each job (see the extraction sketch after this list).
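
Conceptually, the escalation works like this: try a cheap HTTP request first, and only launch a browser when the response looks like a bot challenge. The sketch below is illustrative, not the project's actual worker code; the `looksBlocked` heuristic and function names are assumptions.

```js
// Sketch: lightweight fetch first, escalate to Puppeteer only when needed.
// Assumes undici and puppeteer are installed (both are in the tech stack).
import { request } from 'undici';
import puppeteer from 'puppeteer';

// Heuristic (assumption): treat challenge-style responses as "needs a real browser".
function looksBlocked(statusCode, body) {
  return statusCode === 403 || statusCode === 503 || /cf-challenge|just a moment/i.test(body);
}

export async function fetchPage(url) {
  // 1. Cheap path: plain HTTP request via undici.
  const res = await request(url, { maxRedirections: 5 });
  const body = await res.body.text();
  if (!looksBlocked(res.statusCode, body)) {
    return { html: body, usedBrowser: false };
  }

  // 2. Expensive path: full browser render via Puppeteer.
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    return { html: await page.content(), usedBrowser: true };
  } finally {
    await browser.close();
  }
}
```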
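
The worker fan-out follows a standard `worker_threads` pattern: one worker per CPU core, each processing jobs from the shared queue. A minimal sketch, assuming the entry point spawns src/core/worker.js directly (the `workerData` fields and message shape here are illustrative, not the project's actual protocol):

```js
// Sketch: spawn one worker per CPU core and let each pull jobs in parallel.
import os from 'node:os';
import { Worker } from 'node:worker_threads';

const cores = os.cpus().length;

for (let i = 0; i < cores; i++) {
  const worker = new Worker(new URL('./src/core/worker.js', import.meta.url), {
    workerData: { workerId: i }, // illustrative payload
  });
  worker.on('message', (msg) => console.log(`worker ${i}:`, msg));
  worker.on('error', (err) => console.error(`worker ${i} failed:`, err));
  worker.on('exit', (code) => console.log(`worker ${i} exited with code ${code}`));
}
```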
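
Stuck-job recovery and retry both reduce to MongoDB updates at startup. The sketch below assumes a `jobs` collection with `status` and `attempts` fields and a local connection string; the real schema and connection settings live in src/db/index.js and config/index.js.

```js
// Sketch: on startup, requeue anything left in "processing" by a previous run,
// and cap retries so repeatedly failing jobs are eventually left as failed.
// Collection, field names, and connection string are illustrative assumptions.
import { MongoClient } from 'mongodb';

const client = new MongoClient('mongodb://localhost:27017');
await client.connect();
const jobs = client.db('crawler').collection('jobs');

// Requeue jobs orphaned in "processing" by a crashed or interrupted run.
await jobs.updateMany({ status: 'processing' }, { $set: { status: 'pending' } });

// Retry transient failures, but give up after a few attempts.
await jobs.updateMany(
  { status: 'failed', attempts: { $lt: 3 } },
  { $set: { status: 'pending' }, $inc: { attempts: 1 } }
);
```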
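
Selector-based extraction and same-origin link discovery can both come out of a single Cheerio pass over the fetched HTML. This is a minimal sketch, not the actual src/core/processer.js; the function name and return shape are assumptions.

```js
// Sketch: apply the job's CSS selectors and collect same-origin links.
import * as cheerio from 'cheerio';

export function processHtml(html, pageUrl, selectors) {
  const $ = cheerio.load(html);
  const origin = new URL(pageUrl).origin;

  // Extract text for every selector the job was configured with.
  const extracted = selectors.map((selector) => ({
    selector,
    values: $(selector).map((_, el) => $(el).text().trim()).get(),
  }));

  // Discover new, same-origin links to push onto the crawl frontier.
  const links = new Set();
  $('a[href]').each((_, el) => {
    try {
      const href = new URL($(el).attr('href'), pageUrl);
      if (href.origin === origin) links.add(href.href);
    } catch {
      // Ignore malformed hrefs.
    }
  });

  return { extracted, links: [...links] };
}
```

With the Hacker News example shown later in this README, `selectors` would be `['.titleline', '.sitebit a']`.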
- Core: Node.js, Worker Threads
- Crawling: Undici (for lightweight fetching), Puppeteer (for heavy-duty, JS-rendered pages)
- Data Extraction: Cheerio
- Database: MongoDB
- CLI: Chalk, cli-table3
- Development: Nodemon
- Testing: Vitest
- Node.js (v20 or higher)
- npm (v10 or higher)
- MongoDB
- Clone the repository:
git clone https://github.com/ademchaoua/crawling.git
- Navigate to the project directory:
cd crawling
- Install the dependencies:
npm install
Project configuration is located in config/index.js. You can modify settings like database connections, crawl concurrency, and retry logic there. No .env file is required by default.
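
For orientation, a configuration module of this kind typically looks something like the sketch below; the key names and defaults are assumptions for illustration, so treat config/index.js itself as the source of truth.

```js
// Hypothetical shape of config/index.js -- key names and defaults are
// assumptions; check the real file for the actual options.
export default {
  mongo: {
    uri: 'mongodb://localhost:27017',
    dbName: 'crawler',
  },
  crawl: {
    concurrency: 4,        // parallel jobs per worker
    maxRetries: 3,         // attempts before a job is marked failed
    requestTimeoutMs: 15000,
  },
};
```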
- To start the crawler, run:
npm start
- For development with auto-reloading, run:
npm run dev
Once running, the interactive dashboard will appear.
- Add a new crawl job:
add <url> <cssSelector1,cssSelector2,...>
- <url>: The starting URL to crawl.
- <cssSelectors>: A comma-separated list of CSS selectors to extract content from.
- Example:
add https://news.ycombinator.com .titleline,.sitebit a
- Exit the application:
exit
.
├── config/
│ └── index.js # Main project configuration
├── src/
│ ├── core/
│ │ ├── processer.js # HTML fetching, data/link extraction (Cheerio)
│ │ └── worker.js # Core crawling logic for both fetch and Puppeteer workers
│ ├── db/
│ │ └── index.js # MongoDB connection, collections, and queries
│ ├── logger/
│ │ └── index.js # Console and file logging setup
│ └── main.js # Application entry point, CLI dashboard, and worker management
├── tests/
│ └── processor.test.js # Unit tests
├── package.json
└── README.md
Contributions are welcome! We have a Code of Conduct that we expect all contributors to adhere to. Please read it before contributing.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the ISC License.
Adem Chaoua - adem.chaoua.1444@gmail.com
Project Link: https://github.com/ademchaoua/crawling