Web Crawler Utility

This utility is a web crawler built using Playwright and Node.js. It is designed to crawl web pages, collect metrics such as HTML and JSON request counts, and handle retries for failed links. The results are saved in CSV files for further analysis.

The HTML and JSON request counts include only GET requests that returned a status code below 400 (2xx and 3xx).
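
For a sense of how such counts can be captured, here is a minimal Playwright sketch that tallies successful GET responses by content type. The function name, the networkidle wait, and the option values are illustrative assumptions, not the utility's actual implementation:

```javascript
const { chromium } = require('playwright');

// Illustrative sketch: tally successful GET responses by content type.
async function countRequests(url, timeoutMs) {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  let htmlCount = 0;
  let jsonCount = 0;

  page.on('response', (response) => {
    // Count only GET requests with status < 400 (2xx and 3xx).
    if (response.request().method() !== 'GET' || response.status() >= 400) return;
    const contentType = response.headers()['content-type'] || '';
    if (contentType.includes('text/html')) htmlCount += 1;
    else if (contentType.includes('application/json')) jsonCount += 1;
  });

  await page.goto(url, { timeout: timeoutMs, waitUntil: 'networkidle' });
  await browser.close();
  return { htmlCount, jsonCount };
}
```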

Features

Crawl Web Pages: Iterates through URLs and collects metrics like HTML and JSON request counts.
Authentication: Supports HTTP and SSO authentication for specific domains.
Retry Mechanism: Retries failed links up to a configurable number of attempts.
Depth Control: Configurable crawling depth to limit or expand the scope of crawling (-1 for unlimited, 0 for the main page only, greater than 0 to crawl the main page and its links up to the specified depth).
Restart and Retry Modes:
Restart Mode: Restarts crawling from a saved state.
Retry Mode: Retries crawling for only the failed links from the previous state, i.e., those recorded with an HTML or JSON request count of -1.
CSV Integration: Reads input configurations and writes results to CSV files.
Error Handling: Handles HTTP errors, timeouts, and redirects gracefully.

Prerequisites

Node.js: Ensure Node.js is installed on your system.
Dependencies: Install the required dependencies using npm install.
Configuration: Ensure the config.json file is properly configured (see the Configuration section below).

Configuration

The script uses a config.json file for its configuration. Below are the key parameters:

Key Parameters:
MAX_CONNECTIONS: Maximum number of concurrent connections.
TIMEOUT: Timeout for page loads (in milliseconds).
RETRY_COUNT: Number of retry attempts for failed links.
DEPTH: Crawling depth (0 for the main page only, greater than 0 to crawl the main page and its links up to the specified depth, -1 for unlimited crawling until there are no more pages or links to visit in the application).
HEADLESS: Run browser in headless mode (true or false).
SEARCH_CONFIG: Path to the input CSV file containing URLs to crawl.
CRAWLER_STATE: Path to the state file for restart or retry modes.
RESULT_CSV: Path to the output CSV file for results.
AUTH: HTTP authentication credentials for specific domains.
DOMAINS: List of allowed domains to crawl.  
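
A minimal config.json sketch using the keys above; every value, and the shape of the AUTH object, is an illustrative assumption rather than the project's actual defaults:

```json
{
  "MAX_CONNECTIONS": 5,
  "TIMEOUT": 30000,
  "RETRY_COUNT": 3,
  "DEPTH": 1,
  "HEADLESS": true,
  "SEARCH_CONFIG": "search_config.csv",
  "CRAWLER_STATE": "crawler_state.json",
  "RESULT_CSV": "result.csv",
  "AUTH": {
    "example.com": { "username": "user", "password": "secret" }
  },
  "DOMAINS": ["example.com"]
}
```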

Usage

  • Set the token value for authentication as an environment variable
For PowerShell:
$env:access_token="TOKEN-VALUE"
echo $env:access_token

For Windows Command Prompt:
set access_token=TOKEN-VALUE
echo %access_token%

For Linux/macOS:
export access_token=TOKEN-VALUE

From Jenkins params:
Values can be set before running the job.
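
Inside a Node.js script, the exported variable would typically be read via process.env; a one-line sketch (the actual handling in crawler.js may differ):

```javascript
// Undefined if access_token was not set in the current shell or Jenkins job.
const accessToken = process.env.access_token;
```
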
  • Run the Crawler
npm run crawler
OR
node crawler.js
  • Restart Mode: Restart crawling from the saved state. Backs up the old state file, then re-visits all URLs and updates the results.
npm run restart
OR
node crawler.js --restart
  • Retry Mode: Retry crawling for only the failed links (those with an HTML and JSON request count of -1).
npm run retry
OR
node crawler.js --retry

Input and Output
Input: search_config.csv
The input CSV file should contain the following columns:

Domain | Path | parameters | Login
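
An illustrative search_config.csv; the row values, and the HTTP/SSO labels in the Login column, are assumptions based on the authentication feature described above:

```csv
Domain,Path,parameters,Login
example.com,/home,,HTTP
example.com,/search,q=test,SSO
```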

Output: result.csv
The output CSV file will contain the following columns:

Domain: The domain of the crawled URL.
Path: The path of the crawled URL.
HTML request count: Number of HTML requests made.
JSON request count: Number of JSON requests made.
Forwarded: Whether the URL was forwarded.
Error: Any error encountered during crawling.

Domain | Path | HTML request count | JSON request count | Forwarded | Error
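
An illustrative result.csv with made-up values; note the -1 convention for failed links, which is what retry mode picks up:

```csv
Domain,Path,HTML request count,JSON request count,Forwarded,Error
example.com,/home,3,12,false,
example.com,/search,-1,-1,false,Timeout exceeded
```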
