A fast, polite, and configurable concurrent request engine for security testing, reconnaissance, and scraping — built with robots.txt compliance, rotating User-Agents, rate-limiting, and robust logging.
Perfect for bug bounty recon, site inventorying, and safe automated scanning.
- Features
- Why ConcurCrawler?
- Architecture & Flow
- Installation
- Usage
- Configuration
- Technical Details
- Examples
- Output Schema (results.json)
- Tips & Safety
- License
- ✅ Concurrent HTTP requests with configurable concurrency
- ✅ Rate-limiting using a semaphore with optional per-request delay
- ✅ Rotating User-Agent list (configurable)
- ✅ robots.txt parsing & honoring crawl-delay / disallowed paths
- ✅ Timeout and retry logic with exponential backoff
- ✅ Console logging + structured `results.json` output
- ✅ Configurable wordlist, concurrency, delay, timeouts, retries
ConcurCrawler is built for security engineers and bug bounty hunters who need a reliable, respectful, and configurable tool to perform large-scale URL probing without hammering targets. It balances speed and politeness: when you need to be fast — it’s fast; when you must be careful — it’s polite.

```mermaid
graph TB
    subgraph Core
        Q[Queue Manager] --> S[Semaphore Controller]
        R[Request Worker] --> L[Logger]
        U[User-Agent Rotator] --> R
        V[robots.txt Parser] --> Q
    end
    subgraph IO
        W[Wordlist Loader] --> Q
        X[results.json Writer] --> L
        Y[Console Output] --> L
    end
    Z[CLI / Config] --> Core
    Core --> IO
```
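A minimal sketch of how these pieces could fit together with `asyncio` (names and structure here are illustrative assumptions, not the tool's actual internals):

```python
import asyncio

async def run(paths, concurrency):
    queue = asyncio.Queue()               # Queue Manager: candidate paths to probe
    for path in paths:                    # Wordlist Loader fills the queue up front
        queue.put_nowait(path)

    sem = asyncio.Semaphore(concurrency)  # Semaphore Controller caps in-flight requests

    async def worker():
        while True:
            try:
                path = queue.get_nowait()
            except asyncio.QueueEmpty:
                return                    # queue drained: worker exits
            async with sem:               # Request Worker holds a permit while fetching
                ...                       # fetch path, rotate UA, hand result to logger
            queue.task_done()

    await asyncio.gather(*(worker() for _ in range(concurrency)))

# asyncio.run(run(["/admin", "/login"], concurrency=5))
```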
```bash
# clone your repo
git clone https://github.com/<your-username>/ConcurCrawler.git
cd ConcurCrawler

# create venv and install
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```
```bash
# basic usage
python concurcrawler.py \
  --target https://example.com \
  --wordlist wordlist.txt \
  --concurrency 20 \
  --delay 0.5 \
  --timeout 10 \
  --retries 2 \
  --ua-list uas.txt \
  --output results.json
```
Short flags and examples:

| Flag | Description |
|------|-------------|
| `--target` | Target base URL (required) |
| `--wordlist` | Newline-separated wordlist of paths to append |
| `--concurrency` | Number of concurrent workers (semaphore permits) |
| `--delay` | Optional per-request delay (seconds) applied after release |
| `--timeout` | Per-request timeout (seconds) |
| `--retries` | Number of retries per request (exponential backoff) |
| `--ua-list` | Newline-separated User-Agent strings |
| `--output` | Path to results JSON (default: `results.json`) |
| `--respect-robots` | Enable robots.txt checks (default: true) |
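A hedged sketch of how these flags might be declared with `argparse` (the defaults here are assumptions, not necessarily the tool's):

```python
import argparse

parser = argparse.ArgumentParser(description="ConcurCrawler - polite concurrent URL prober")
parser.add_argument("--target", required=True, help="Target base URL")
parser.add_argument("--wordlist", required=True, help="Newline-separated wordlist of paths")
parser.add_argument("--concurrency", type=int, default=10, help="Concurrent workers (semaphore permits)")
parser.add_argument("--delay", type=float, default=0.0, help="Per-request delay in seconds")
parser.add_argument("--timeout", type=float, default=10.0, help="Per-request timeout in seconds")
parser.add_argument("--retries", type=int, default=2, help="Retries per request (exponential backoff)")
parser.add_argument("--ua-list", help="Newline-separated User-Agent strings")
parser.add_argument("--output", default="results.json", help="Path to results NDJSON file")
parser.add_argument("--respect-robots", action=argparse.BooleanOptionalAction, default=True,
                    help="Honor robots.txt (default: true)")
args = parser.parse_args()
```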
Example `config.yaml`:

```yaml
target: "https://example.com"
concurrency: 25
delay: 0.2
timeout: 10
retries: 3
respect_robots: true
wordlist: "wordlist.txt"
ua_list: "uas.txt"
output: "results.json"
log_level: "INFO"
```
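If you drive the tool from the config file, loading and defaulting it could look like this minimal sketch (assumes PyYAML is installed; keys mirror the example above):

```python
import yaml  # PyYAML

def load_config(path="config.yaml"):
    with open(path, "r", encoding="utf-8") as fh:
        cfg = yaml.safe_load(fh) or {}
    # fall back to assumed defaults for optional keys
    cfg.setdefault("concurrency", 10)
    cfg.setdefault("delay", 0.0)
    cfg.setdefault("respect_robots", True)
    cfg.setdefault("output", "results.json")
    cfg.setdefault("log_level", "INFO")
    return cfg

config = load_config()
```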
ConcurCrawler uses a counting semaphore to limit the number of in-flight requests to the target. This ensures you don't exceed the configured concurrency (or the target's tolerance).
Pseudo-flow:
- Acquire semaphore permit.
- Perform HTTP request.
- On completion (success/final-failure/skip), release permit.
- Optionally sleep `delay` seconds after releasing to space out bursts.
Why semaphore + delay?
- Semaphore prevents excessive parallel connections.
- Delay smooths bursts over time and respects polite crawling behavior (alongside `robots.txt` crawl-delay).
Implementation note (Python `asyncio` example; the sleep happens after the permit is released, matching the flow above):

```python
import asyncio

sem = asyncio.Semaphore(concurrency)

async def worker(url):
    async with sem:
        resp = await fetch(url)   # handles UA rotation, timeout, retries
    # permit released here; sleep outside the critical section to space out bursts
    await asyncio.sleep(delay)
    return resp
```
Rotate User-Agent per request (round-robin or random) from a user-supplied list. This helps emulate diverse client profiles and avoid trivial blocks.
Best practices:
- Include a configurable list (file) with many realistic UA strings.
- Avoid overly aggressive rotation frequency that looks like bot behavior.
- Optionally randomize order with a seeded RNG for reproducibility.
Example:

```python
import random

ua = random.choice(ua_list)        # or cycle through the list round-robin
headers = {"User-Agent": ua}
```
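For reproducible rotation, one option is a seeded shuffle followed by round-robin cycling; a sketch (the `uas.txt` filename and the seed are assumptions):

```python
import itertools
import random

def make_ua_rotator(ua_list, seed=None):
    # shuffle once with a seeded RNG for reproducibility, then cycle round-robin
    rng = random.Random(seed)
    uas = list(ua_list)
    rng.shuffle(uas)
    return itertools.cycle(uas)

with open("uas.txt", encoding="utf-8") as fh:
    ua_list = [line.strip() for line in fh if line.strip()]

rotator = make_ua_rotator(ua_list, seed=42)
headers = {"User-Agent": next(rotator)}   # take the next UA for each request
```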
ConcurCrawler fetches `https://target/robots.txt` (or `http://` accordingly), parses `Disallow` rules and `Crawl-delay` (if present), and filters the URL queue.
Key behaviors:
- If `--respect-robots` is enabled, any path matching a `Disallow` entry is skipped and logged as skipped.
- When `Crawl-delay` exists, it can override or augment the configured `--delay` behaviour (the crawler takes the maximum of the two).
- If `robots.txt` is unreachable, default to the configured delay and proceed (but log a warning).
Parsing note: use a standard parser (e.g., `urllib.robotparser`, or `reppy` for richer semantics).
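A minimal sketch with the standard-library `urllib.robotparser` (the User-Agent string, the probed path, and `configured_delay` are assumptions):

```python
from urllib import robotparser

configured_delay = 0.5                  # e.g. the value passed via --delay
ua = "ConcurCrawler/1.0"                # UA used to match robots.txt rules

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
try:
    rp.read()                           # fetch and parse robots.txt
except OSError:
    rp = None                           # unreachable: log a warning and proceed

if rp is None or rp.can_fetch(ua, "https://example.com/admin"):
    crawl_delay = (rp.crawl_delay(ua) if rp else None) or 0.0
    effective_delay = max(crawl_delay, configured_delay)   # take the maximum of the two
else:
    print("skipped: robots.txt disallow")
```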
- Each request uses a configurable timeout.
- Retries use exponential backoff with jitter: `wait = base * (2 ** attempt) + random_jitter`
- Retries happen for network errors and retryable HTTP statuses (e.g., 429, 502, 503, 504), not for 4xx (except optionally 429).
- Each attempt respects the semaphore and UA rotation.
Sketch (assuming an `aiohttp`-style async session; the exception and helper names are illustrative):

```python
import asyncio, logging, random

RETRYABLE = {429, 502, 503, 504}

class RetryableHTTP(Exception):
    pass

async def fetch_with_retries(session, url, headers, timeout, retries, base=0.5):
    for attempt in range(retries + 1):
        try:
            response = await session.get(url, headers=headers, timeout=timeout)
            if response.status in RETRYABLE:
                raise RetryableHTTP(response.status)
            return response                                   # success: stop retrying
        except Exception as exc:
            if attempt < retries:
                # exponential backoff with jitter before the next attempt
                await asyncio.sleep(base * (2 ** attempt) + random.uniform(0, base))
            else:
                logging.error("final failure for %s: %s", url, exc)
    return None
```
Two-fold logging:
- Console — human-friendly, colorized lines (INFO/WARN/ERROR) for live monitoring.
  - e.g. `[200] GET /admin -> 134ms`
- `results.json` — append structured JSON lines with keys: `timestamp`, `url`, `status`, `latency_ms`, `ua`, `attempts`, `error`, `skipped_reason`
Use newline-delimited JSON (NDJSON) to allow streaming and easy `jq` processing.
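Appending NDJSON is one `json.dumps` per result; a minimal sketch (the helper name and record values are illustrative):

```python
import datetime
import json

def write_result(path, record):
    # one JSON object per line (NDJSON), appended so the file can be streamed or tailed
    record.setdefault(
        "timestamp",
        datetime.datetime.now(datetime.timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z"),
    )
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

write_result("results.json", {"url": "https://example.com/admin", "status": 200,
                              "latency_ms": 84, "ua": "Mozilla/5.0", "attempts": 1,
                              "error": None, "skipped_reason": None})
```

Each line is then easy to filter, e.g. `jq -c 'select(.status == 200) | .url' results.json`.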
```console
$ python concurcrawler.py --target https://onmeridian.com --wordlist tiny.txt --concurrency 10 --delay 0.5 --timeout 8 --retries 2 --ua-list uas.txt
[INFO] 2025-10-16 12:34:22 | [200] GET /admin -> 84ms | ua: Chrome/117 ...
[WARN] 2025-10-16 12:34:24 | [403] GET /secret -> blocked by WAF | ua: curl/7.68
[INFO] Completed. Results saved to results.json
```
{"timestamp":"2025-10-16T12:34:22.123Z","url":"https://onmeridian.com/admin","status":200,"latency_ms":84,"ua":"Mozilla/5.0 (Windows NT 10.0; ...)","attempts":1,"error":null,"skipped_reason":null}
{"timestamp":"2025-10-16T12:34:24.532Z","url":"https://onmeridian.com/secret","status":403,"latency_ms":110,"ua":"curl/7.68.0","attempts":1,"error":"403 Forbidden","skipped_reason":null}
{"timestamp":"2025-10-16T12:34:30.999Z","url":"https://onmeridian.com/ignored","status":null,"latency_ms":null,"ua":null,"attempts":0,"error":null,"skipped_reason":"robots.txt disallow"}
- Always follow the target's Terms of Service and local laws. Do not use ConcurCrawler for unauthorized scanning.
- For bug bounty targets: check the program's policy and allowed paths before running.
- Test on staging or sites you own first.
- Start with low concurrency (2-5) and increase carefully.
To give this README the feel of cybersecurity tooling, it uses:
- Terminal-looking logs
- Mermaid diagrams for flow & architecture
- NDJSON examples
- Shields & badges
- Monospace code blocks and clear sectioning
You can optionally include animated GIFs/screenshots in your repo's `/docs` folder and embed them:

PRs welcome. Please open issues for feature requests (e.g., headless browser support, distributed workers, proxy support, CAPTCHA handling, rate limit auto-throttling).
MIT © Vibhu Dixit