Crawler
A powerful and resource-efficient web crawler built with Node.js. It features a hybrid crawling strategy, an interactive CLI dashboard, and robust job management.
Key Features • Tech Stack • Getting Started • Running the App • Contributing • License
- Hybrid Crawling Strategy: Starts with a lightweight HTTP fetch and intelligently escalates to a full Puppeteer browser only when necessary (e.g., for sites protected by Cloudflare). This saves significant system resources (see the escalation sketch after this list).
- Multi-Threaded Performance: Utilizes all available CPU cores with Node.js `worker_threads` to process multiple jobs in parallel (see the worker-pool sketch after this list).
- Interactive CLI Dashboard: A real-time dashboard shows the status of the crawl queue (pending, processing, done, failed) and provides detailed statistics for each source.
- Dynamic Job Control: Add new crawling jobs on the fly through the interactive command line.
- Robust Job Management:
- Automatic Retry: Retries jobs that fail due to temporary network errors.
- Stuck Job Recovery: On startup, automatically requeues jobs that were stuck in a `processing` state from a previous run (see the recovery sketch after this list).
- Bad Source Pruning: Automatically detects and stops crawling from source URLs that consistently produce errors, preventing wasted resources.
- Efficient Link Extraction: Discovers and queues new, same-origin links from crawled pages to expand the crawl frontier.
- Configurable Data Extraction: Specify exactly what content to extract from pages using CSS selectors for each job (see the extraction sketch after this list).
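
Conceptually, the escalation works like this: try a cheap HTTP request first, and only launch a browser when the response looks like a bot challenge. The sketch below is illustrative, not the project's actual worker code; the `looksBlocked` heuristic and function names are assumptions.

```js
// Sketch: lightweight fetch first, escalate to Puppeteer only when needed.
// Assumes undici and puppeteer are installed (both are in the tech stack).
import { request } from 'undici';
import puppeteer from 'puppeteer';

// Heuristic (assumption): treat challenge-style responses as "needs a real browser".
function looksBlocked(statusCode, body) {
  return statusCode === 403 || statusCode === 503 || /cf-challenge|just a moment/i.test(body);
}

export async function fetchPage(url) {
  // 1. Cheap path: plain HTTP request via undici.
  const res = await request(url, { maxRedirections: 5 });
  const body = await res.body.text();
  if (!looksBlocked(res.statusCode, body)) {
    return { html: body, usedBrowser: false };
  }

  // 2. Expensive path: full browser render via Puppeteer.
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    return { html: await page.content(), usedBrowser: true };
  } finally {
    await browser.close();
  }
}
```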
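
The worker fan-out follows a standard `worker_threads` pattern: one worker per CPU core, each processing jobs from the shared queue. A minimal sketch, assuming the entry point spawns src/core/worker.js directly (the `workerData` fields and message shape here are illustrative, not the project's actual protocol):

```js
// Sketch: spawn one worker per CPU core and let each pull jobs in parallel.
import os from 'node:os';
import { Worker } from 'node:worker_threads';

const cores = os.cpus().length;

for (let i = 0; i < cores; i++) {
  const worker = new Worker(new URL('./src/core/worker.js', import.meta.url), {
    workerData: { workerId: i }, // illustrative payload
  });
  worker.on('message', (msg) => console.log(`worker ${i}:`, msg));
  worker.on('error', (err) => console.error(`worker ${i} failed:`, err));
  worker.on('exit', (code) => console.log(`worker ${i} exited with code ${code}`));
}
```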
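
Stuck-job recovery and retry both reduce to MongoDB updates at startup. The sketch below assumes a `jobs` collection with `status` and `attempts` fields and a local connection string; the real schema and connection settings live in src/db/index.js and config/index.js.

```js
// Sketch: on startup, requeue anything left in "processing" by a previous run,
// and cap retries so repeatedly failing jobs are eventually left as failed.
// Collection, field names, and connection string are illustrative assumptions.
import { MongoClient } from 'mongodb';

const client = new MongoClient('mongodb://localhost:27017');
await client.connect();
const jobs = client.db('crawler').collection('jobs');

// Requeue jobs orphaned in "processing" by a crashed or interrupted run.
await jobs.updateMany({ status: 'processing' }, { $set: { status: 'pending' } });

// Retry transient failures, but give up after a few attempts.
await jobs.updateMany(
  { status: 'failed', attempts: { $lt: 3 } },
  { $set: { status: 'pending' }, $inc: { attempts: 1 } }
);
```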
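
Selector-based extraction and same-origin link discovery can both come out of a single Cheerio pass over the fetched HTML. This is a minimal sketch, not the actual src/core/processer.js; the function name and return shape are assumptions.

```js
// Sketch: apply the job's CSS selectors and collect same-origin links.
import * as cheerio from 'cheerio';

export function processHtml(html, pageUrl, selectors) {
  const $ = cheerio.load(html);
  const origin = new URL(pageUrl).origin;

  // Extract text for every selector the job was configured with.
  const extracted = selectors.map((selector) => ({
    selector,
    values: $(selector).map((_, el) => $(el).text().trim()).get(),
  }));

  // Discover new, same-origin links to push onto the crawl frontier.
  const links = new Set();
  $('a[href]').each((_, el) => {
    try {
      const href = new URL($(el).attr('href'), pageUrl);
      if (href.origin === origin) links.add(href.href);
    } catch {
      // Ignore malformed hrefs.
    }
  });

  return { extracted, links: [...links] };
}
```

With the Hacker News example shown later in this README, `selectors` would be `['.titleline', '.sitebit a']`.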
- Core: Node.js, Worker Threads
- Crawling: Undici (for lightweight fetching), Puppeteer (for heavy-duty, JS-rendered pages)
- Data Extraction: Cheerio
- Database: MongoDB
- CLI: Chalk, cli-table3
- Development: Nodemon
- Testing: Vitest
- Node.js (v20 or higher)
- npm (v10 or higher)
- MongoDB
- Clone the repository:
git clone https://github.com/ademchaoua/crawling.git
- Navigate to the project directory:
cd crawling
- Install the dependencies:
npm install
Project configuration is located in config/index.js. You can modify settings like database connections, crawl concurrency, and retry logic there. No .env file is required by default.
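
For orientation, a configuration module of this kind typically looks something like the sketch below; the key names and defaults are assumptions for illustration, so treat config/index.js itself as the source of truth.

```js
// Hypothetical shape of config/index.js -- key names and defaults are
// assumptions; check the real file for the actual options.
export default {
  mongo: {
    uri: 'mongodb://localhost:27017',
    dbName: 'crawler',
  },
  crawl: {
    concurrency: 4,        // parallel jobs per worker
    maxRetries: 3,         // attempts before a job is marked failed
    requestTimeoutMs: 15000,
  },
};
```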
- To start the crawler, run:
npm start
- For development with auto-reloading, run:
npm run dev
Once running, the interactive dashboard will appear.
- Add a new crawl job:
add <url> <cssSelector1,cssSelector2,...>
- <url>: The starting URL to crawl.
- <cssSelectors>: A comma-separated list of CSS selectors to extract content from.
- Example:
add https://news.ycombinator.com .titleline,.sitebit a
- Exit the application:
exit
.
├── config/
│ └── index.js # Main project configuration
├── src/
│ ├── core/
│ │ ├── processer.js # HTML fetching, data/link extraction (Cheerio)
│ │ └── worker.js # Core crawling logic for both fetch and Puppeteer workers
│ ├── db/
│ │ └── index.js # MongoDB connection, collections, and queries
│ ├── logger/
│ │ └── index.js # Console and file logging setup
│ └── main.js # Application entry point, CLI dashboard, and worker management
├── tests/
│ └── processor.test.js # Unit tests
├── package.json
└── README.md
Contributions are welcome! We have a Code of Conduct that we expect all contributors to adhere to. Please read it before contributing.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the ISC License.
Adem Chaoua - adem.chaoua.1444@gmail.com
Project Link: https://github.com/ademchaoua/crawling