
🎯 ConcurCrawler

Badges: ConcurCrawler · License · Python

A fast, polite, and configurable concurrent request engine for security testing, reconnaissance, and scraping — built with respect for robots.txt, rotating User-Agents, rate-limiting, and robust logging.
Perfect for bug bounty recon, site inventorying, and safe automated scanning.


Table of Contents

  • Features
  • Why ConcurCrawler?
  • Architecture & Flow
  • Installation
  • Usage
  • Configuration
  • Technical Details
  • Examples
  • Tips & Safety
  • Presentation & Theming (Hacker Aesthetic)
  • Contributing
  • License

Features

  • ✅ Concurrent HTTP requests with configurable concurrency
  • ✅ Rate-limiting using a semaphore with optional per-request delay
  • ✅ Rotating User-Agent list (configurable)
  • ✅ robots.txt parsing & honoring crawl-delay / disallowed paths
  • ✅ Timeout and retry logic with exponential backoff
  • ✅ Console logging + structured results.json
  • ✅ Configurable wordlist, concurrency, delay, timeouts, retries

Why ConcurCrawler?

ConcurCrawler is built for security engineers and bug bounty hunters who need a reliable, respectful, and configurable tool to perform large-scale URL probing without hammering targets. It balances speed and politeness: when you need to be fast — it’s fast; when you must be careful — it’s polite.


Architecture & Flow

Component Diagram (Mermaid)

graph TB
  subgraph Core
    Q[Queue Manager] --> S[Semaphore Controller]
    R[Request Worker] --> L[Logger]
    U[User-Agent Rotator] --> R
    V[robots.txt Parser] --> Q
  end
  subgraph IO
    W[Wordlist Loader] --> Q
    X[results.json Writer] --> L
    Y[Console Output] --> L
  end
  Z[CLI / Config] --> Core
  Core --> IO

Installation

# clone your repo
git clone https://github.com/<your-username>/ConcurCrawler.git
cd ConcurCrawler

# create venv and install
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Usage

# basic usage
python concurcrawler.py \
  --target https://example.com \
  --wordlist wordlist.txt \
  --concurrency 20 \
  --delay 0.5 \
  --timeout 10 \
  --retries 2 \
  --ua-list uas.txt \
  --output results.json

Flag reference (an argparse sketch follows the list):

  • --target Target base URL (required)
  • --wordlist Newline-separated wordlist of paths to append
  • --concurrency Number of concurrent workers (semaphore permits)
  • --delay Optional per-request delay (seconds) applied after release
  • --timeout Per-request timeout (seconds)
  • --retries Number of retries per request (exponential backoff)
  • --ua-list Newline-separated User-Agent strings
  • --output Path to results JSON (default results.json)
  • --respect-robots Enable robots.txt checks (default: true)
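
A minimal argparse sketch matching the flags above; defaults other than --output and --respect-robots are illustrative rather than taken from the README, and the boolean flag handling assumes Python 3.9+:

import argparse

parser = argparse.ArgumentParser(
    description="ConcurCrawler - polite, concurrent endpoint scanner")
parser.add_argument("--target", required=True, help="Target base URL")
parser.add_argument("--wordlist", help="Newline-separated wordlist of paths to append")
parser.add_argument("--concurrency", type=int, default=10, help="Concurrent workers (semaphore permits)")
parser.add_argument("--delay", type=float, default=0.0, help="Per-request delay in seconds")
parser.add_argument("--timeout", type=float, default=10.0, help="Per-request timeout in seconds")
parser.add_argument("--retries", type=int, default=2, help="Retries per request (exponential backoff)")
parser.add_argument("--ua-list", help="Newline-separated User-Agent strings")
parser.add_argument("--output", default="results.json", help="Path to results JSON")
parser.add_argument("--respect-robots", action=argparse.BooleanOptionalAction, default=True,
                    help="Enable robots.txt checks")
args = parser.parse_args()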

Configuration

Example config.yaml:

target: "https://example.com"
concurrency: 25
delay: 0.2
timeout: 10
retries: 3
respect_robots: true
wordlist: "wordlist.txt"
ua_list: "uas.txt"
output: "results.json"
log_level: "INFO"
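
Loading the config takes a few lines with PyYAML; a sketch, assuming pyyaml is installed and that missing keys fall back to illustrative defaults:

import yaml

with open("config.yaml", encoding="utf-8") as fh:
    cfg = yaml.safe_load(fh) or {}

concurrency = cfg.get("concurrency", 10)       # defaults here are illustrative
delay = cfg.get("delay", 0.0)
respect_robots = cfg.get("respect_robots", True)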

Technical Details

Concurrency & Rate-Limiting (Semaphore + Delay)

ConcurCrawler uses a counting semaphore to limit the number of in-flight requests to the target. This ensures you don't exceed the configured concurrency (or the target's tolerance).

Pseudo-flow:

  1. Acquire semaphore permit.
  2. Perform HTTP request.
  3. On completion (success/final-failure/skip), release permit.
  4. Optionally sleep delay seconds after releasing to space out bursts.

Why semaphore + delay?

  • Semaphore prevents excessive parallel connections.
  • Delay smooths bursts over time and respects polite crawling behavior (alongside robots.txt crawl-delay).

Implementation note (Python asyncio example):

import asyncio

sem = asyncio.Semaphore(concurrency)

async def worker(url):
    # hold a permit only while the request is in flight
    async with sem:
        resp = await fetch(url)   # fetch() handles UA rotation, timeout, retries
    # sleep after releasing the permit to space out bursts
    await asyncio.sleep(delay)
    return resp

Rotating User-Agent Header List

Rotate User-Agent per request (round-robin or random) from a user-supplied list. This helps emulate diverse client profiles and avoid trivial blocks.

Best practices:

  • Include a configurable list (file) with many realistic UA strings.
  • Avoid overly aggressive rotation frequency that looks like bot behavior.
  • Optionally randomize order with a seeded RNG for reproducibility.

Example:

import random

ua = random.choice(ua_list)        # ua_list loaded from the --ua-list file
headers = {"User-Agent": ua}
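
For reproducible rotation, a small rotator sketch; the class and its seeding behaviour are illustrative, not part of the shipped CLI:

import itertools
import random

class UserAgentRotator:
    """Round-robin or seeded-random rotation over a User-Agent list."""

    def __init__(self, ua_list, mode="round_robin", seed=None):
        self.ua_list = list(ua_list)
        self.mode = mode
        self._rng = random.Random(seed)              # seeded for reproducible runs
        self._cycle = itertools.cycle(self.ua_list)

    def next(self):
        if self.mode == "random":
            return self._rng.choice(self.ua_list)
        return next(self._cycle)

# per request: headers = {"User-Agent": rotator.next()}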

Respecting robots.txt

ConcurCrawler fetches https://target/robots.txt (or http:// accordingly), parses Disallow rules and Crawl-delay (if present) and filters the URL queue.

Key behaviors:

  • If --respect-robots is enabled, any path matching a Disallow entry is skipped and logged as skipped.
  • When Crawl-delay exists, it can override or augment the configured --delay behaviour (the crawler takes the maximum of the two).
  • If robots.txt is unreachable, default to configured delay and proceed (but log a warning).

Parsing note: use a standard parser (e.g., urllib.robotparser or reppy for richer semantics).
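
A minimal sketch using urllib.robotparser; the variable names (target, ua, url, delay, respect_robots) follow the config above, and skip() is a hypothetical helper standing in for the crawler's skip-and-log step:

import logging
from urllib import robotparser
from urllib.parse import urljoin

rp = robotparser.RobotFileParser()
rp.set_url(urljoin(target, "/robots.txt"))
try:
    rp.read()                                     # fetch and parse robots.txt
except OSError:
    logging.warning("robots.txt unreachable; falling back to configured delay")
else:
    if respect_robots and not rp.can_fetch(ua, url):
        skip(url, reason="robots.txt disallow")   # hypothetical skip/log helper
    crawl_delay = rp.crawl_delay(ua)              # None if no Crawl-delay directive
    effective_delay = max(delay, crawl_delay or 0)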

Timeouts & Retries

  • Each request uses a configurable timeout.
  • Retries use exponential backoff with jitter:
    • wait = base * (2 ** attempt) + random_jitter
  • Retries happen for network errors and retryable HTTP statuses (e.g., 429, 502, 503, 504), not for 4xx (except optionally 429).
  • Each attempt respects the semaphore and UA rotation.

Pseudo:

import logging
import random
import time

def backoff_time(attempt, base=0.5):
    # exponential backoff with jitter: wait = base * 2**attempt + jitter
    return base * (2 ** attempt) + random.uniform(0, base)

retryable = {429, 502, 503, 504}       # retryable HTTP statuses

for attempt in range(retries + 1):
    try:
        response = session.get(url, headers=headers, timeout=timeout)
        if response.status_code in retryable:
            raise RetryableHTTP(response.status_code)   # small custom exception (not shown)
        break
    except Exception as exc:
        if attempt < retries:
            time.sleep(backoff_time(attempt))
        else:
            logging.error("final failure for %s: %s", url, exc)

Logging (screen + results.json)

Two-fold logging:

  1. Console — human-friendly, colorized lines (INFO/WARN/ERROR) for live monitoring.
    • e.g. [200] GET /admin -> 134ms
  2. results.json — append structured JSON lines with keys:
    • timestamp, url, status, latency_ms, ua, attempts, error, skipped_reason

Use newline-delimited JSON (NDJSON) to allow streaming and easy jq processing.
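
A sketch of the NDJSON writer; the field names mirror the list above and the helper name is illustrative:

import json
from datetime import datetime, timezone

def log_result(path, url, status=None, latency_ms=None, ua=None,
               attempts=0, error=None, skipped_reason=None):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "url": url,
        "status": status,
        "latency_ms": latency_ms,
        "ua": ua,
        "attempts": attempts,
        "error": error,
        "skipped_reason": skipped_reason,
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")    # one JSON object per line (NDJSON)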


Examples

Example CLI run

$ python concurcrawler.py --target https://onmeridian.com --wordlist tiny.txt --concurrency 10 --delay 0.5 --timeout 8 --retries 2 --ua-list uas.txt
[INFO] 2025-10-16 12:34:22 | [200] GET /admin -> 84ms | ua: Chrome/117 ...
[WARN] 2025-10-16 12:34:24 | [403] GET /secret -> blocked by WAF | ua: curl/7.68
[INFO] Completed. Results saved to results.json

Sample results.json (NDJSON)

{"timestamp":"2025-10-16T12:34:22.123Z","url":"https://onmeridian.com/admin","status":200,"latency_ms":84,"ua":"Mozilla/5.0 (Windows NT 10.0; ...)","attempts":1,"error":null,"skipped_reason":null}
{"timestamp":"2025-10-16T12:34:24.532Z","url":"https://onmeridian.com/secret","status":403,"latency_ms":110,"ua":"curl/7.68.0","attempts":1,"error":"403 Forbidden","skipped_reason":null}
{"timestamp":"2025-10-16T12:34:30.999Z","url":"https://onmeridian.com/ignored","status":null,"latency_ms":null,"ua":null,"attempts":0,"error":null,"skipped_reason":"robots.txt disallow"}

Tips & Safety

  • Always follow the target's Terms of Service and local laws. Do not use ConcurCrawler for unauthorized scanning.
  • For bug bounty targets: check the program's policy and allowed paths before running.
  • Test on staging or sites you own first.
  • Start with low concurrency (2-5) and increase carefully.

Presentation & Theming (Hacker Aesthetic)

To make the README feel like cybersecurity tooling, we used:

  • Terminal-looking logs
  • Mermaid diagrams for flow & architecture
  • NDJSON examples
  • Shields & badges
  • Monospace code blocks and clear sectioning

You can optionally include animated GIFs/screenshots in your repo's /docs folder and embed them:

![screenshot](docs/screenshot.gif)

Contributing

PRs welcome. Please open issues for feature requests (e.g., headless browser support, distributed workers, proxy support, CAPTCHA handling, rate limit auto-throttling).


License

MIT © Vibhu Dixit
