
🎯 ConcurCrawler

Badges: ConcurCrawler · License · Python

A fast, polite, and configurable concurrent request engine for security testing, reconnaissance, and scraping — built with respect for robots.txt, rotating User-Agents, rate-limiting, and robust logging.
Perfect for bug bounty recon, site inventorying, and safe automated scanning.


Table of Contents

  • Features
  • Why ConcurCrawler?
  • Architecture & Flow
  • Installation
  • Usage
  • Configuration
  • Technical Details
  • Examples
  • Tips & Safety
  • Presentation & Theming (Hacker Aesthetic)
  • Contributing
  • License

Features

  • ✅ Concurrent HTTP requests with configurable concurrency
  • ✅ Rate-limiting using a semaphore with optional per-request delay
  • ✅ Rotating User-Agent list (configurable)
  • ✅ robots.txt parsing & honoring crawl-delay / disallowed paths
  • ✅ Timeout and retry logic with exponential backoff
  • ✅ Console logging + structured results.json
  • ✅ Configurable wordlist, concurrency, delay, timeouts, retries

Why ConcurCrawler?

ConcurCrawler is built for security engineers and bug bounty hunters who need a reliable, respectful, and configurable tool to perform large-scale URL probing without hammering targets. It balances speed and politeness: when you need to be fast — it’s fast; when you must be careful — it’s polite.


Architecture & Flow

Component Diagram (Mermaid)

graph TB
  subgraph Core
    Q[Queue Manager] --> S[Semaphore Controller]
    R[Request Worker] --> L[Logger]
    U[User-Agent Rotator] --> R
    V[robots.txt Parser] --> Q
  end
  subgraph IO
    W[Wordlist Loader] --> Q
    X[results.json Writer] --> L
    Y[Console Output] --> L
  end
  Z[CLI / Config] --> Core
  Core --> IO

Installation

# clone your repo
git clone https://github.com/<your-username>/ConcurCrawler.git
cd ConcurCrawler

# create venv and install
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Usage

# basic usage
python concurcrawler.py \
  --target https://example.com \
  --wordlist wordlist.txt \
  --concurrency 20 \
  --delay 0.5 \
  --timeout 10 \
  --retries 2 \
  --ua-list uas.txt \
  --output results.json

Flag reference (an argparse sketch follows the list):

  • --target Target base URL (required)
  • --wordlist Newline-separated wordlist of paths to append
  • --concurrency Number of concurrent workers (semaphore permits)
  • --delay Optional per-request delay (seconds) applied after release
  • --timeout Per-request timeout (seconds)
  • --retries Number of retries per request (exponential backoff)
  • --ua-list Newline-separated User-Agent strings
  • --output Path to results JSON (default results.json)
  • --respect-robots Enable robots.txt checks (default: true)
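
A minimal argparse sketch matching the flags above; defaults other than --output and --respect-robots are illustrative rather than taken from the README, and the boolean flag handling assumes Python 3.9+:

import argparse

parser = argparse.ArgumentParser(
    description="ConcurCrawler - polite, concurrent endpoint scanner")
parser.add_argument("--target", required=True, help="Target base URL")
parser.add_argument("--wordlist", help="Newline-separated wordlist of paths to append")
parser.add_argument("--concurrency", type=int, default=10, help="Concurrent workers (semaphore permits)")
parser.add_argument("--delay", type=float, default=0.0, help="Per-request delay in seconds")
parser.add_argument("--timeout", type=float, default=10.0, help="Per-request timeout in seconds")
parser.add_argument("--retries", type=int, default=2, help="Retries per request (exponential backoff)")
parser.add_argument("--ua-list", help="Newline-separated User-Agent strings")
parser.add_argument("--output", default="results.json", help="Path to results JSON")
parser.add_argument("--respect-robots", action=argparse.BooleanOptionalAction, default=True,
                    help="Enable robots.txt checks")
args = parser.parse_args()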

Configuration

Example config.yaml:

target: "https://example.com"
concurrency: 25
delay: 0.2
timeout: 10
retries: 3
respect_robots: true
wordlist: "wordlist.txt"
ua_list: "uas.txt"
output: "results.json"
log_level: "INFO"
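
Loading the config takes a few lines with PyYAML; a sketch, assuming pyyaml is installed and that missing keys fall back to illustrative defaults:

import yaml

with open("config.yaml", encoding="utf-8") as fh:
    cfg = yaml.safe_load(fh) or {}

concurrency = cfg.get("concurrency", 10)       # defaults here are illustrative
delay = cfg.get("delay", 0.0)
respect_robots = cfg.get("respect_robots", True)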

Technical Details

Concurrency & Rate-Limiting (Semaphore + Delay)

ConcurCrawler uses a counting semaphore to limit the number of in-flight requests to the target. This ensures you don't exceed the configured concurrency (or the target's tolerance).

Pseudo-flow:

  1. Acquire semaphore permit.
  2. Perform HTTP request.
  3. On completion (success/final-failure/skip), release permit.
  4. Optionally sleep delay seconds after releasing to space out bursts.

Why semaphore + delay?

  • Semaphore prevents excessive parallel connections.
  • Delay smooths bursts over time and respects polite crawling behavior (alongside robots.txt crawl-delay).

Implementation note (Python asyncio example):

import asyncio

sem = asyncio.Semaphore(concurrency)

async def worker(url):
    # hold a permit only while the request is in flight
    async with sem:
        resp = await fetch(url)   # fetch() handles UA rotation, timeout, retries
    # sleep after releasing the permit to space out bursts
    await asyncio.sleep(delay)
    return resp

Rotating User-Agent Header List

Rotate User-Agent per request (round-robin or random) from a user-supplied list. This helps emulate diverse client profiles and avoid trivial blocks.

Best practices:

  • Include a configurable list (file) with many realistic UA strings.
  • Avoid overly aggressive rotation frequency that looks like bot behavior.
  • Optionally randomize order with a seeded RNG for reproducibility.

Example:

import random

ua = random.choice(ua_list)        # ua_list loaded from the --ua-list file
headers = {"User-Agent": ua}
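
For reproducible rotation, a small rotator sketch; the class and its seeding behaviour are illustrative, not part of the shipped CLI:

import itertools
import random

class UserAgentRotator:
    """Round-robin or seeded-random rotation over a User-Agent list."""

    def __init__(self, ua_list, mode="round_robin", seed=None):
        self.ua_list = list(ua_list)
        self.mode = mode
        self._rng = random.Random(seed)              # seeded for reproducible runs
        self._cycle = itertools.cycle(self.ua_list)

    def next(self):
        if self.mode == "random":
            return self._rng.choice(self.ua_list)
        return next(self._cycle)

# per request: headers = {"User-Agent": rotator.next()}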

Respecting robots.txt

ConcurCrawler fetches https://target/robots.txt (or http:// accordingly), parses Disallow rules and Crawl-delay (if present) and filters the URL queue.

Key behaviors:

  • If --respect-robots is enabled, any path matching a Disallow entry is skipped and logged as skipped.
  • When Crawl-delay exists, it can override or augment the configured --delay behaviour (the crawler takes the maximum of the two).
  • If robots.txt is unreachable, default to configured delay and proceed (but log a warning).

Parsing note: use a standard parser (e.g., urllib.robotparser or reppy for richer semantics).
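
A minimal sketch using urllib.robotparser; the variable names (target, ua, url, delay, respect_robots) follow the config above, and skip() is a hypothetical helper standing in for the crawler's skip-and-log step:

import logging
from urllib import robotparser
from urllib.parse import urljoin

rp = robotparser.RobotFileParser()
rp.set_url(urljoin(target, "/robots.txt"))
try:
    rp.read()                                     # fetch and parse robots.txt
except OSError:
    logging.warning("robots.txt unreachable; falling back to configured delay")
else:
    if respect_robots and not rp.can_fetch(ua, url):
        skip(url, reason="robots.txt disallow")   # hypothetical skip/log helper
    crawl_delay = rp.crawl_delay(ua)              # None if no Crawl-delay directive
    effective_delay = max(delay, crawl_delay or 0)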

Timeouts & Retries

  • Each request uses a configurable timeout.
  • Retries use exponential backoff with jitter:
    • wait = base * (2 ** attempt) + random_jitter
  • Retries happen for network errors and retryable HTTP statuses (e.g., 429, 502, 503, 504), not for 4xx (except optionally 429).
  • Each attempt respects the semaphore and UA rotation.

Pseudo:

import logging
import random
import time

def backoff_time(attempt, base=0.5):
    # exponential backoff with jitter: wait = base * 2**attempt + jitter
    return base * (2 ** attempt) + random.uniform(0, base)

retryable = {429, 502, 503, 504}       # retryable HTTP statuses

for attempt in range(retries + 1):
    try:
        response = session.get(url, headers=headers, timeout=timeout)
        if response.status_code in retryable:
            raise RetryableHTTP(response.status_code)   # small custom exception (not shown)
        break
    except Exception as exc:
        if attempt < retries:
            time.sleep(backoff_time(attempt))
        else:
            logging.error("final failure for %s: %s", url, exc)

Logging (screen + results.json)

Two-fold logging:

  1. Console — human-friendly, colorized lines (INFO/WARN/ERROR) for live monitoring.
    • e.g. [200] GET /admin -> 134ms
  2. results.json — append structured JSON lines with keys:
    • timestamp, url, status, latency_ms, ua, attempts, error, skipped_reason

Use newline-delimited JSON (NDJSON) to allow streaming and easy jq processing.
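
A sketch of the NDJSON writer; the field names mirror the list above and the helper name is illustrative:

import json
from datetime import datetime, timezone

def log_result(path, url, status=None, latency_ms=None, ua=None,
               attempts=0, error=None, skipped_reason=None):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "url": url,
        "status": status,
        "latency_ms": latency_ms,
        "ua": ua,
        "attempts": attempts,
        "error": error,
        "skipped_reason": skipped_reason,
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")    # one JSON object per line (NDJSON)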


Examples

Example CLI run

$ python concurcrawler.py --target https://onmeridian.com --wordlist tiny.txt --concurrency 10 --delay 0.5 --timeout 8 --retries 2 --ua-list uas.txt
[INFO] 2025-10-16 12:34:22 | [200] GET /admin -> 84ms | ua: Chrome/117 ...
[WARN] 2025-10-16 12:34:24 | [403] GET /secret -> blocked by WAF | ua: curl/7.68
[INFO] Completed. Results saved to results.json

Sample results.json (NDJSON)

{"timestamp":"2025-10-16T12:34:22.123Z","url":"https://onmeridian.com/admin","status":200,"latency_ms":84,"ua":"Mozilla/5.0 (Windows NT 10.0; ...)","attempts":1,"error":null,"skipped_reason":null}
{"timestamp":"2025-10-16T12:34:24.532Z","url":"https://onmeridian.com/secret","status":403,"latency_ms":110,"ua":"curl/7.68.0","attempts":1,"error":"403 Forbidden","skipped_reason":null}
{"timestamp":"2025-10-16T12:34:30.999Z","url":"https://onmeridian.com/ignored","status":null,"latency_ms":null,"ua":null,"attempts":0,"error":null,"skipped_reason":"robots.txt disallow"}

Tips & Safety

  • Always follow the target's Terms of Service and local laws. Do not use ConcurCrawler for unauthorized scanning.
  • For bug bounty targets: check the program's policy and allowed paths before running.
  • Test on staging or sites you own first.
  • Start with low concurrency (2-5) and increase carefully.

Presentation & Theming (Hacker Aesthetic)

To make the README feel like cybersecurity tooling, we used:

  • Terminal-looking logs
  • Mermaid diagrams for flow & architecture
  • NDJSON examples
  • Shields & badges
  • Monospace code blocks and clear sectioning

You can optionally include animated GIFs/screenshots in your repo's /docs folder and embed them:

![screenshot](docs/screenshot.gif)

Contributing

PRs welcome. Please open issues for feature requests (e.g., headless browser support, distributed workers, proxy support, CAPTCHA handling, rate limit auto-throttling).


License

MIT © Vibhu Dixit
