A high-performance Python utility that fetches and merges multiple adlists into domain-only output for DNS blockers like Pi-hole, AdGuard, and similar DNS filtering solutions.
- Features
- Quick Start
- Installation (optional)
- Configuration
- Output Format
- Performance
- How It Works
- Architecture
- VS Code Tasks
- Redundancy Analysis
- Troubleshooting
- Example Output
- Requirements
- Use Cases
- Contributing
- License
- Acknowledgments
 
- Fast Concurrent Processing: Processes 1.6M+ entries from 50+ sources in ~50-60 seconds
 - Zero Dependencies: Uses only Python standard library (3.8+)
 - Dual Output: Generates both adlists and whitelists simultaneously
 - Smart Content Processing: Handles domains, wildcards, regex patterns, and Pi-hole format conversions
 - ABP Filter Support: Converts Pi-hole regex patterns to AdBlock Plus (ABP) format with automatic wildcard normalization
- Intelligent Separation: Automatically separates exception rules (`@@||`) from the blocklist to the whitelist
- Domain Validation: Validates and filters invalid domain entries during post-processing
 - Real-time Progress: Animated progress spinners with detailed status updates
 - Error Resilient: Failed fetches don't crash the pipeline; they're logged and filtered out
 
- Clone the repository:

  ```bash
  git clone https://github.com/Toomas633/Adlist-Parser.git
  cd Adlist-Parser
  ```

- Run the parser:

  - From a repo checkout:

    ```bash
    python -m adparser
    ```

  - If installed as a package:

    ```bash
    adlist-parser
    ```

  On Windows PowerShell:

  ```powershell
  python -m adparser
  # or, if installed
  adlist-parser
  ```

- Find your results:

  - `output/adlist.txt` - Merged blocklist (~1.6M entries)
  - `output/whitelist.txt` - Merged whitelist (~2K entries)
 
You can run from a checkout (above) or install locally to get the `adlist-parser` command:

```bash
# Editable install for development
python -m pip install -e .

# Then run
adlist-parser
```

Configure your sources in JSON files:
`data/adlists.json` - Blocklist sources:

```json
{
  "lists": ["blacklist.txt", "old_adlist.txt"],
  "urls": [
    "https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts",
    "https://adaway.org/hosts.txt",
    "https://v.firebog.net/hosts/AdguardDNS.txt"
  ]
}
```

`data/whitelists.json` - Whitelist sources:

```json
{
  "lists": ["whitelist.txt"],
  "urls": [
    "https://raw.githubusercontent.com/hagezi/dns-blocklists/main/domains/whitelist-referral.txt"
  ]
}
```

- URLs: HTTP/HTTPS links to remote lists
- Local files: Relative paths to files in the `data/` directory
- Mixed format: Each source can contain domains, wildcards, regex patterns, or Pi-hole entries
 
Notes:
- Path resolution: relative paths inside the JSON files are resolved relative to the JSON file's location (not the CWD).
- Accepted keys: both files accept any of `lists`, `urls`, `adlists`, or `sources` for compatibility; they are merged.
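
For illustration, a minimal sketch of this loading behavior (a hypothetical `load_sources` helper; the real logic lives in `adparser/io.py` and may differ):

```python
import json
from pathlib import Path
from typing import List

# Hypothetical helper mirroring the documented behavior above.
ACCEPTED_KEYS = ("lists", "urls", "adlists", "sources")

def load_sources(config_path: str) -> List[str]:
    config_file = Path(config_path)
    data = json.loads(config_file.read_text(encoding="utf-8"))
    merged = []
    for key in ACCEPTED_KEYS:
        merged.extend(data.get(key, []))  # all accepted keys are merged
    resolved = []
    for src in merged:
        if src.startswith(("http://", "https://")):
            resolved.append(src)  # remote URL, used as-is
        else:
            # relative paths resolve against the JSON file's directory, not the CWD
            resolved.append(str((config_file.parent / src).resolve()))
    return resolved
```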
The parser intelligently handles multiple input formats:
- Plain domains: `example.com`
- Wildcards: `*.example.com`
- Pi-hole regex: `(\.|^)example\.com$`
- AdBlock patterns: `/pattern/flags`
- Host file entries: `0.0.0.0 example.com`
- Comments: Lines starting with `#`, `!`, `//`, or `;`
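
For example, a conservative sketch of how hosts-format lines and comments might be recognized (illustrative only, not the exact logic in `adparser/content.py`):

```python
import re

# Illustrative hosts-format extraction and comment filtering.
HOSTS_LINE = re.compile(r"^(?:0\.0\.0\.0|127\.0\.0\.1|::1?)\s+(\S+)")
COMMENT_PREFIXES = ("#", "!", "//", ";")

def extract_token(line: str):
    line = line.strip()
    if not line or line.startswith(COMMENT_PREFIXES):
        return None  # blank line or whole-line comment
    match = HOSTS_LINE.match(line)
    if match:
        return match.group(1)  # "0.0.0.0 example.com" -> "example.com"
    return line  # bare domain/wildcard/pattern; later stages classify it

assert extract_token("0.0.0.0 example.com") == "example.com"
assert extract_token("# comment") is None
```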
- Domain Extraction: Extracts clean domains from various host file formats
- Wildcard Handling: `*.domain.com` is preserved as a domain token (the wildcard is not expanded). In the final domain output the leading `*.` is stripped to `domain.com`.
- ABP Normalization: Fixes broken ABP patterns automatically (see the sketch after this list):
  - `||*cdn.domain.com^` → `||*.cdn.domain.com^` (missing dot after wildcard)
  - `||app.*.adjust.com^` → `||*.adjust.com^` (wildcard-only label removed)
  - `||domain.google.*^` → `||domain.google^` (wildcard TLD removed - not supported)
  - `-domain.com^` → `||-domain.com^` (adds missing `||` prefix)
  - `@@|domain.com^|` → `@@||domain.com^` (fixes single pipe + trailing pipe)
- ABP Conversion: Pi-hole regex patterns convert to `||domain^` format when possible
- Blocklist/Whitelist Separation: Automatically moves `@@||` exception entries from the blocklist to the whitelist
- Domain Validation: Validates and removes invalid domain entries during post-processing
- Regex Handling: Complex regexes that can't convert to ABP are discarded (the pipeline doesn't crash)
- Deduplication: Preserves first-seen order during normalization; final outputs are sorted case-insensitively during post-processing
- Comment Filtering: Strips whole-line and inline comments (`#`, `!`, `//`, `;`)
- HTML Filtering: Removes HTML tags and attributes from lists
- Error Resilience: Failed fetches are logged and filtered out during normalization
- Adlist merge: The adlist pipeline merges with the prior `output/adlist.txt` before writing, preserving entries across transient source failures (the whitelist writes directly)
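
As referenced above, a minimal sketch of how such normalization fixes could be applied with stdlib regex substitutions (assumed rules; the real implementation lives in `adparser/content.py`):

```python
import re

# Illustrative regex substitutions for the normalization fixes listed above.
def normalize_abp(rule: str) -> str:
    if rule.startswith("-"):
        rule = "||" + rule                                    # add missing || prefix
    rule = re.sub(r"^(@?@?\|\|)\*(?!\.)", r"\g<1>*.", rule)   # ||*cdn. -> ||*.cdn.
    rule = re.sub(r"^(@?@?\|\|)[^*|]+\.\*\.", r"\g<1>*.", rule)  # drop wildcard-only label
    rule = re.sub(r"\.\*\^$", "^", rule)                      # remove unsupported wildcard TLD
    rule = re.sub(r"^@@\|(?!\|)", "@@||", rule)               # @@| -> @@||
    if rule.startswith("@@||") and rule.endswith("^|"):
        rule = rule[:-1]                                      # drop stray trailing pipe
    return rule

assert normalize_abp("||*cdn.domain.com^") == "||*.cdn.domain.com^"
assert normalize_abp("@@|domain.com^|") == "@@||domain.com^"
```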
- Outputs use LF-only line endings.
 - Sorting is deterministic and case-insensitive; deduplication is case-insensitive and whitespace-trimmed.
 - Headers are regenerated during post-processing (don’t hand-edit outputs).
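
A sketch of these guarantees (an assumed helper, not the actual `io.write_output` signature):

```python
# Case-insensitive dedupe and sort with LF-only line endings.
def write_entries(path, entries):
    seen = set()
    unique = []
    for entry in entries:
        cleaned = entry.strip()
        key = cleaned.lower()
        if cleaned and key not in seen:  # whitespace-trimmed, case-insensitive dedupe
            seen.add(key)
            unique.append(cleaned)
    unique.sort(key=str.lower)  # deterministic, case-insensitive ordering
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        fh.write("\n".join(unique) + "\n")  # newline="\n" forces LF-only endings
```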
 
Each output starts with a generated header like this:
```text
# Adlist - Generated by Adlist-Parser
# https://github.com/Toomas633/Adlist-Parser
#
# Created/modified: 2025-01-01 00:00:00 UTC
# Total entries: 1,684,272
# Domains: 400,527
# ABP-style rules: 1,283,745
# Sources processed: 50
#
# This file is automatically generated. Do not edit manually.
# To update, run: adlist-parser or python -m adparser
#
```
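
A minimal sketch of how such a header could be assembled (field names are taken from the sample above; the exact builder inside the parser may differ):

```python
from datetime import datetime, timezone

# Hypothetical header builder matching the sample header format.
def build_header(total: int, domains: int, abp_rules: int, sources: int) -> str:
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    return "\n".join([
        "# Adlist - Generated by Adlist-Parser",
        "# https://github.com/Toomas633/Adlist-Parser",
        "#",
        f"# Created/modified: {now}",
        f"# Total entries: {total:,}",
        f"# Domains: {domains:,}",
        f"# ABP-style rules: {abp_rules:,}",
        f"# Sources processed: {sources}",
        "#",
        "# This file is automatically generated. Do not edit manually.",
        "# To update, run: adlist-parser or python -m adparser",
        "#",
    ])
```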
- Concurrency: Fetches multiple sources simultaneously (max 16 workers)
 - Async Processing: Adlists and whitelists processed in parallel
 - Memory Efficient: Line-by-line processing for large datasets
 - Real-world Scale: Tested with 1.6M+ entries from 50+ sources
 
- Concurrency: network fetching uses up to 16 workers (see `adparser/fetcher.py`). You can adjust the cap there if needed for your environment.
- I/O: most heavy I/O runs off the event loop using `asyncio.to_thread()`; disk speed can impact total time.
- Output size: `output/adlist.txt` can reach ~1.6-1.7M lines depending on sources.
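
For reference, a stdlib-only sketch of a capped concurrent fetch stage like the one described (illustrative, not the actual `adparser/fetcher.py` code):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

MAX_WORKERS = 16  # the documented worker cap; adjustable for your environment

def fetch(url: str) -> str:
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")

def fetch_all(urls):
    results, failures = {}, []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                failures.append(url)  # logged and filtered out, never fatal
    return results, failures
```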
Two concurrent pipelines run via `asyncio.gather()` in `adparser/cli.py`:
Pipeline Flow (for both adlist and whitelist):
- Load Sources → Parse JSON configs and resolve paths
 - Fetch Content → Concurrent downloads (16 workers max)
 - Generate List → Normalize, categorize, and convert entries
- Adlist-only merge → Merge new entries with the prior `output/adlist.txt` before writing (preserves previous content across transient source failures); the whitelist writes directly
- Write Output → Save with auto-generated headers (LF-only line endings)
 - Post-Processing → Separate blocklist/whitelist entries, validate domains, regenerate headers, and re-write both files
 - Redundancy Report → Analyze duplicates and overlaps
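
A minimal sketch of that two-pipeline layout (`build_list` here is a hypothetical stand-in for the real fetch/normalize/write work):

```python
import asyncio

def build_list(name: str) -> str:
    return f"{name}: done"  # placeholder for blocking fetch/parse/write I/O

async def run_pipeline(name: str) -> str:
    # heavy I/O runs off the event loop so both pipelines stay responsive
    return await asyncio.to_thread(build_list, name)

async def main() -> None:
    adlist, whitelist = await asyncio.gather(
        run_pipeline("Adlist"),
        run_pipeline("Whitelist"),
    )
    print(adlist, whitelist)

if __name__ == "__main__":
    asyncio.run(main())
```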
 
Key Processing Steps:
- All heavy I/O wrapped with `asyncio.to_thread()` to keep the event loop responsive
- Progress displayed via animated spinners (`adparser/status.py`)
- Domain validation using the `DOMAIN_RE` regex pattern
- Pi-hole regex → ABP conversion when patterns are simple enough
- ABP wildcard normalization fixes malformed patterns automatically
- Post-processing separates `@@||` exception entries to the whitelist
- Cross-list deduplication ensures no conflicts between blocklist and whitelist
- Failed sources tracked separately and reported at the end
- Files are written with LF-only line endings
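
To illustrate the regex-to-ABP step, a conservative sketch under the assumption that only simple anchored patterns qualify (`PIHOLE_SIMPLE` is illustrative, not the parser's actual regex):

```python
import re

# Matches only anchored Pi-hole patterns like (\.|^)example\.com$
PIHOLE_SIMPLE = re.compile(r"^\(\\\.\|\^\)(?P<domain>[a-z0-9\\.-]+)\$$", re.IGNORECASE)

def pihole_to_abp(pattern: str):
    match = PIHOLE_SIMPLE.match(pattern)
    if not match:
        return None  # complex regex: discarded rather than crashing the pipeline
    domain = match.group("domain").replace("\\.", ".")
    return f"||{domain}^"

assert pihole_to_abp(r"(\.|^)example\.com$") == "||example.com^"
assert pihole_to_abp(r"^ad[0-9]+\..*tracker.*$") is None  # too complex: discarded
```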
 
The codebase follows a modular async architecture with strict separation of concerns:
```text
adparser/cli.py            # Main orchestrator with async/await
├── adparser/io.py         # JSON parsing, path resolution, file I/O
├── adparser/fetcher.py    # Concurrent HTTP fetching (ThreadPoolExecutor)
├── adparser/content.py    # Domain extraction, normalization, regex conversion
├── adparser/models.py     # Source descriptor dataclass (URL vs local files)
├── adparser/status.py     # Progress spinners and terminal UI updates
├── adparser/reporting.py  # Results summary with emoji formatting
├── adparser/redundancy.py # Duplicate detection and overlap analysis
└── adparser/constants.py  # File path constants
```
Design Principles:
- Single Responsibility: Each module handles one concern (fetch/parse/write separated)
- Error Isolation: Failed sources don't crash the pipeline
- Async I/O: Heavy operations run in a thread pool via `to_thread()`
- Progress Feedback: A global `status_display` coordinates concurrent spinners
- Order Preservation: Deduplication via `_dedupe_preserve_order()` before the final sort
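
A sketch of the order-preserving deduplication idea behind `_dedupe_preserve_order()` (the actual implementation may differ):

```python
def dedupe_preserve_order(entries):
    seen = set()
    out = []
    for entry in entries:
        if entry not in seen:
            seen.add(entry)
            out.append(entry)  # first occurrence wins; original order kept
    return out

assert dedupe_preserve_order(["a.com", "b.com", "a.com"]) == ["a.com", "b.com"]
```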
This repository includes ready-to-run tasks for Windows PowerShell:
- Adlist-Parser: runs the tool end-to-end (equivalent to `python -m adparser`).
- Tests: Pytest (coverage): runs `pytest` with coverage as configured in `pyproject.toml`.
- Lint: Pylint (report): runs pylint and writes `pylint-report.txt`; a non-zero exit is allowed while still generating the report.
Quality gates expectations:
- Build: N/A (pure Python). The runtime entry point is `adparser.cli:main`.
- Tests: PASS on the local data set; full runs may take ~50-60s.
- Lint: PASS is ideal; if not, review `pylint-report.txt`.
The parser includes built-in redundancy detection to help optimize your source lists:
Features:
- Duplicate Detection: Identifies sources with identical content
 - Local File Analysis: Shows which entries in local files are already covered by remote sources
 - Removal Suggestions: Lists first 20 redundant entries with count of remaining
 
Example Output:
```text
🔁 Duplicate sources (identical content): 2 groups
├─ 🌐 https://example.com/list1.txt
└─ 🌐 https://example.com/list2.txt
💡 Tip: Keep one source from this group, remove the others

📄 Local file redundancy analysis:
• blacklist.txt: 150/200 entries (75.0%) already in remote sources
```
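
One plausible way to detect identical sources, shown as a sketch (assumed approach; `adparser/redundancy.py` holds the real logic):

```python
import hashlib

def group_identical(sources):
    """sources: mapping of source name -> raw text content."""
    groups = {}
    for name, content in sources.items():
        # normalize so ordering and exact duplicates within a list don't matter
        normalized = "\n".join(sorted(set(content.splitlines())))
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups.setdefault(digest, []).append(name)
    # only groups with more than one source are redundant
    return [names for names in groups.values() if len(names) > 1]
```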
The parser intelligently processes various input formats and converts them appropriately:
| Input Format | Processing Result | Notes |
|---|---|---|
| `example.com` | → `example.com` (domain) | Plain domain preserved |
| `*.example.com` | → `\|\|*.example.com^` (ABP rule) | Wildcard converted to ABP |
| `0.0.0.0 example.com` | → `example.com` (domain) | Host file format extracted |
| `(\.\|^)example\.com$` | → `\|\|example.com^` (ABP rule) | Pi-hole regex converted |
| `/ads?/` | → ABP rule or discarded | Converted if simple, discarded if complex |
| `# Comment line` | → filtered | Comment removed |
| `domain.com # inline` | → `domain.com` | Inline comment stripped |
| `<div>html</div>` | → filtered | HTML tags removed |
| `@@\|\|exception.com^` | → moved to whitelist as `\|\|exception.com^` | Exception rule separated |
| `\|\|*cdn.example.com^` | → `\|\|*.cdn.example.com^` | Malformed ABP pattern normalized |
Common Issues:
- Network Errors: Failed sources are listed as "UNAVAILABLE SOURCES" in the final report with 🌐 (remote) or 📄 (local) indicators
- Proxy Issues: Configure system proxy settings, or mirror remote sources locally in `data/` and update the JSON configs
- Large Files: `output/adlist.txt` can be 30MB+; use command-line tools (`grep`, `wc -l`) for inspection
- Slow Performance: Check network speed; adjust the worker count in `adparser/fetcher.py` (default: 16)
- Memory Usage: The parser uses line-by-line processing, so the memory footprint stays low even with 1.6M+ entries
 
- Why are element-hiding rules (e.g., `##`, `#@?#`) missing from the outputs?
  - This tool targets DNS blocklists. Element hiding is cosmetic (browser-side), so such rules are dropped during normalization.
- Why do some regex rules disappear?
  - Only simple, anchored Pi-hole patterns are converted to ABP (`||domain^`). Complex/JS-like regex is discarded for safety and DNS relevance.
- My local file entries are already covered by remotes; how do I find them?
  - Check the redundancy section at the end of the run; it lists duplicates and local entries already provided by remote sources.
 
 
```text
🚀 Starting Adlist-Parser...
⚡ Processing adlists and whitelists concurrently...
⚡ Adlist: Fetching content... |/-\ [48/50 (96%)]
⚡ Whitelist: Processing domains...
Adlist: ✅ Complete - 1684272 entries (400527 domains, 1283745 ABP rules)
Whitelist: ✅ Complete - 2337 entries (1346 domains, 991 ABP rules)

=== Adlists redundancy analysis ===
Analyzed 50 sources.
✅ No redundancy issues detected

============================================================
🎉 ALL PROCESSING COMPLETED IN 53.16 SECONDS! 🎉
============================================================
📊 RESULTS SUMMARY:
┌──────────────────────────────────────────────────────────┐
│ 🛡️  ADLIST:    50 sources → 1684272 entries              │
│   📝 Domains:  400527 | ABP rules: 1283745              │
├──────────────────────────────────────────────────────────┤
│ ✅ WHITELIST:  6 sources →    2337 entries               │
│   📝 Domains:    1346 | ABP rules:     991              │
├──────────────────────────────────────────────────────────┤
│ 📁 Output files:                                         │
│   • output/adlist.txt                                    │
│   • output/whitelist.txt                                 │
└──────────────────────────────────────────────────────────┘
```
- Python 3.8 or higher
 - No external dependencies (uses only standard library)
 
- Pi-hole: Use `output/adlist.txt` as a blocklist and `output/whitelist.txt` as an allowlist
- AdGuard Home: Import both files as custom filtering rules
 - DNS Filtering: Any DNS-based ad blocker that supports domain lists
 - Network Security: Corporate firewall domain blocking lists
 
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Here's how to get started:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes following the existing patterns:
  - Keep modules single-responsibility
  - Use `asyncio.to_thread()` for I/O operations
  - Wrap long operations with `spinner.show_progress()`
  - Add inline documentation for complex logic
 
- Keep the runtime stdlib-only; do not add dependencies or modify `pyproject.toml` beyond `[project.optional-dependencies].dev`
- Preserve public contracts: `fetcher.fetch`, `content.generate_list`, `content.separate_blocklist_whitelist`, `io.write_output`
- Do not widen the domain regex or IDN heuristics; keep `_maybe_extract_domain` and `DOMAIN_RE` conservative
- Test thoroughly with `python -m adparser`
- Verify output files are generated correctly
 - Submit a pull request with clear description
 
Development Tips:
- Read `.github/copilot-instructions.md` for an architecture overview
- Check `adparser/content.py` for parsing rules and regex patterns
adparser/content.pyfor parsing rules and regex patterns - Use existing regex patterns rather than adding new ones
 - Maintain backward compatibility with existing JSON configs
 
For quick iterations, limit sources to local files to reduce runtime:
- Edit `data/adlists.json` and `data/whitelists.json` to include only the local files:

  ```json
  { "lists": ["blacklist.txt"], "urls": [] }
  ```

  ```json
  { "lists": ["whitelist.txt"], "urls": [] }
  ```

- Add a few test lines to `data/blacklist.txt` and `data/whitelist.txt` and run:

  ```bash
  python -m adparser
  ```
This exercises the full pipeline (status UI, normalization, separation, reporting) in seconds.
- Built for the DNS filtering community
 - Inspired by the need for fast, reliable adlist aggregation
 - Uses high-quality sources from the community (StevenBlack, Hagezi, FadeMind, and others)