A comprehensive tool for extracting user agent strings from online sources, featuring advanced data processing and cleaning. Released for demonstration and portfolio purposes.
Real-world performance: Successfully extracted and processed nearly 80k unique user agents in just a few hours, demonstrating enterprise-level data collection capabilities.
This project extracts user agents from web sources across all categories (Browser Types, Devices, Platforms), subcategories, and makers. It includes multithreaded data extraction, real-time duplicate detection, user agent rotation, and comprehensive data cleaning pipelines.
- High-Volume Data Extraction: Proven to extract nearly 80K user agents efficiently
- Comprehensive Coverage: Extracts user agents from all categories, subcategories, and makers
- Multithreaded Performance: Optimized with 4 workers for speed while avoiding rate limits
- Real-time Duplicate Detection: Thread-safe duplicate prevention during extraction
- User Agent Rotation: Rotates through different user agents to avoid detection
- Pagination Handling: Automatically navigates through all pages with loop detection
- Data Cleaning Pipeline: Multiple processing stages to clean and validate data
- Rate Limiting Protection: Conservative delays and exponential backoff
- CSV + Text Output: Structured CSV with metadata and clean text files
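For illustration, the user agent rotation listed above could be implemented along these lines (a minimal sketch assuming the config/rotation_agents.txt file described in the setup steps; not the project's actual code):

```python
# Minimal sketch of user agent rotation (illustrative only, not the project's actual code).
# Assumes config/rotation_agents.txt contains one user agent per line.
from itertools import cycle
from pathlib import Path

def load_rotation_agents(path="config/rotation_agents.txt"):
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return cycle(line.strip() for line in lines if line.strip())

rotation = load_rotation_agents()
current_agent = next(rotation)  # e.g. switch to the next agent every few pages
```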
user_agent_extractor/
├── main.py                               # Main entry point - run this script
├── README.md                             # This file
├── LICENSE.txt                           # Creative Commons license
├── requirements.txt                      # Python dependencies
├── config/                               # Configuration files (create your own)
│   ├── rotation_agents.txt               # User agents for rotation separated by newlines
│   └── config.json                       # Application settings (not in repo)
└── src/                                  # Source code
    ├── __init__.py                       # Package initialization
    ├── data_extractor_multithreaded.py   # Core optimized multithreaded data extractor
    └── csv_processing/                   # Data processing utilities
        ├── README.md                     # Processing documentation
        ├── process_data.py               # Unified pipeline script
        ├── 01_sort_raw_csv.py            # Sort raw CSV data
        ├── 02_clean_csv_data.py          # Clean CSV with metadata
        ├── 03_extract_to_txt.py          # Extract clean agents to text
        ├── 04_extract_csv_to_txt.py      # Alternative CSV to text conversion
        ├── 05_clean_txt_file.py          # Clean text file format
        ├── 06_final_analysis.py          # Advanced analysis and cleaning
        └── 07_validate_data.py           # Verify data integrity
- Python 3.8+
- pip
- Clone or download the project
- Create a virtual environment:
  python -m venv venv
  source venv/bin/activate  # On macOS/Linux
- Install dependencies:
  pip install -r requirements.txt
- Install Playwright browsers:
  playwright install chromium
- Create configuration file: create config/config.json with your settings:
  {
    "base_url": "https://your-target-site.com",
    "max_workers": 4,
    "max_retries": 2,
    "batch_delay_min": 1.5,
    "batch_delay_max": 3.0,
    "page_delay_min": 0.4,
    "page_delay_max": 1.0,
    "worker_start_delay_min": 0.7,
    "worker_start_delay_max": 1.8,
    "request_timeout": 30000,
    "rate_limit_delay": 3.0,
    "max_pages_per_maker": 20,
    "max_pages_sequential": 30,
    "progress_update_interval": 20,
    "duplicate_report_interval": 100
  }
- Create rotation agents file: create config/rotation_agents.txt with user agents for rotation, one per line:
  Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
  Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
  Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0
- Activate the virtual environment:
  source venv/bin/activate
- Run the main data extractor:
  python main.py
Process all data with one command:
# Run the complete processing pipeline
python src/csv_processing/process_data.py
# Or run specific steps only
python src/csv_processing/process_data.py --step 1 # Sort only
python src/csv_processing/process_data.py --step 2 # Clean only
# ... (steps 1-7 available)
Process and clean the extracted data manually:
cd src/csv_processing
# 1. Sort the raw data
python 01_sort_raw_csv.py
# 2. Clean the CSV (preserves metadata)
python 02_clean_csv_data.py
# 3. Extract clean user agents to text file
python 03_extract_to_txt.py
# Optional: Additional processing
python 04_extract_csv_to_txt.py # Alternative CSV to text conversion
python 05_clean_txt_file.py # Clean text file format
python 06_final_analysis.py # Advanced analysis
python 07_validate_data.py # Verify data integrity
- agents.txt: Final cleaned user agents (project root)
- all_useragents.txt: All extracted user agents (project root)
- useragents.csv: Raw extracted data with metadata (project root)
- src/csv_processing/useragents_sorted.csv: Organized by category
- src/csv_processing/useragents_sorted_clean.csv: Clean CSV with metadata
- src/csv_processing/all_useragents_clean_txt.txt: Clean text format
- src/csv_processing/all_useragents_final_txt.txt: Final processed text
category,subcategory,maker,user_agent
Browser Types,Application,AOL,"Mozilla/5.0 (iPhone; CPU iPhone OS 8_2..."
Browser Types,Application,Adobe Systems,"Mozilla/5.0 (Windows; U; en-US)..."
Devices,Desktop,Apple Inc,"Mozilla/5.0 (Macintosh; U; PPC Mac OS X..."
Platforms,Windows,Microsoft Corporation,"Mozilla/5.0 (Windows NT 10.0..."
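If you want to consume useragents.csv programmatically, a minimal sketch using Python's csv module (assuming the four columns shown above) might look like:

```python
# Sketch: read the extracted CSV (column names as shown above).
import csv
from collections import Counter

with open("useragents.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Example: count entries per top-level category
print(Counter(row["category"] for row in rows).most_common())
```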
The processing pipeline removes:
- ✅ Invalid/corrupted entries (null values, undefined data)
- ✅ Duplicate entries across all processing stages
- ✅ Malformed user agents (suspicious characters, encoding issues)
- ✅ Very short entries (< 8 characters)
- ✅ Corrupted data suffixes and formatting issues
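A hypothetical filter implementing checks like those above might look like this (illustrative sketch only; the actual pipeline lives in src/csv_processing/):

```python
# Hypothetical cleaning filter (illustrative only; the real steps live in src/csv_processing/).
def keep_user_agent(ua, seen):
    ua = ua.strip()
    if len(ua) < 8:                                      # drop very short entries
        return False
    if ua.lower() in {"null", "none", "undefined"}:      # drop invalid/corrupted values
        return False
    if not ua.isprintable():                             # drop suspicious characters / encoding issues
        return False
    if ua in seen:                                       # drop duplicates
        return False
    seen.add(ua)
    return True

seen = set()
raw_agents = ["Mozilla/5.0 (X11; Linux x86_64) ...", "null", "Mozilla/5.0 (X11; Linux x86_64) ..."]
clean_agents = [ua for ua in raw_agents if keep_user_agent(ua, seen)]
```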
- Workers: 4 (optimized for speed vs rate limiting)
- Delays: 1.5-3.0s between batches, 0.4-1.0s between pages
- Timeout: 30s per request
- Page Limit: 20-30 pages per maker (prevents infinite loops)
- User Agent Rotation: Every 5 pages
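These defaults correspond to the keys in config/config.json; a sketch of how such values could drive randomized delays (an assumed pattern, not the extractor's exact code):

```python
# Sketch: randomized delays driven by config values (assumed pattern, not the exact implementation).
import json
import random
import time

with open("config/config.json", encoding="utf-8") as f:
    cfg = json.load(f)

def page_delay():
    # Random pause between page requests to stay under rate limits
    time.sleep(random.uniform(cfg["page_delay_min"], cfg["page_delay_max"]))

def batch_delay():
    # Longer random pause between batches of work
    time.sleep(random.uniform(cfg["batch_delay_min"], cfg["batch_delay_max"]))
```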
- Conservative delays between requests
- Exponential backoff on errors
- User agent rotation
- Request timeout handling
- Automatic retry with increasing delays
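A hypothetical helper showing exponential backoff of the kind listed above (not the extractor's actual retry logic):

```python
# Hypothetical retry helper with exponential backoff (illustrative only).
import random
import time

def fetch_with_backoff(fetch, max_retries=2, base_delay=3.0):
    """Call fetch(); on failure wait base_delay * 2**attempt (plus jitter), then retry."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
```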
- Proven Scale: Successfully extracted nearly 80k user agents in a single session
- Total Runtime: Completed full extraction in just a few hours
- Success Rate: 99.9% (with error handling and retries)
- Data Quality: High quality data after comprehensive processing
- Categories: 3 main categories (Browser Types, Devices, Platforms)
- Subcategories: Multiple subcategories per main category
- Makers: Hundreds of different makers/vendors
- User Agents: Thousands to tens of thousands of unique, validated entries
- Output Formats: CSV with metadata, clean text files
- Playwright: Web automation and scraping
- CSV: Data processing and output
- Threading: Concurrent processing
- urllib.parse: URL handling
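As a generic example of how Playwright can fetch a page under a custom user agent (a sketch only; the URL and function name are placeholders, not this project's extractor):

```python
# Generic Playwright fetch with a custom user agent (sketch; the URL is a placeholder).
from playwright.sync_api import sync_playwright

def fetch_page_text(url, user_agent):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(user_agent=user_agent)
        page = context.new_page()
        page.goto(url, timeout=30000)  # milliseconds, matching the request_timeout default
        text = page.inner_text("body")
        browser.close()
        return text
```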
- Multithreaded Design: ThreadPoolExecutor with worker pools
- Thread-safe Operations: Locks for file writing and duplicate detection
- Memory Efficient: Real-time processing and writing
- Error Resilient: Comprehensive exception handling
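The worker-pool-plus-lock pattern described above could look roughly like this (an illustrative sketch with a placeholder extraction function and output file, not the actual data_extractor_multithreaded.py):

```python
# Sketch of the worker-pool pattern with thread-safe writes and duplicate detection
# (illustrative only; extract_maker and example_output.csv are placeholders).
import csv
import threading
from concurrent.futures import ThreadPoolExecutor

write_lock = threading.Lock()
seen_agents = set()

def extract_maker(maker):
    # Placeholder for the real per-maker extraction logic
    return {"category": "Browser Types", "subcategory": "Application",
            "maker": maker, "user_agent": f"Example UA for {maker}"}

def process_maker(maker, writer):
    row = extract_maker(maker)
    with write_lock:  # one lock guards both the duplicate set and the CSV writer
        if row["user_agent"] not in seen_agents:
            seen_agents.add(row["user_agent"])
            writer.writerow(row)

with open("example_output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["category", "subcategory", "maker", "user_agent"])
    writer.writeheader()
    with ThreadPoolExecutor(max_workers=4) as pool:
        for fut in [pool.submit(process_maker, m, writer) for m in ("AOL", "Adobe Systems", "Apple Inc")]:
            fut.result()
```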
- No Duplicates: Verified across all output files
- Complete Coverage: All categories, subcategories, and makers
- Data Integrity: No corruption or invalid entries
- Format Validation: Proper CSV structure and encoding
- 07_validate_data.py: Validate data integrity
- Built-in duplicate detection during data extraction
- Comprehensive cleaning pipeline with validation
- Respectful Data Extraction: Implements delays and rate limiting
- Incremental Processing: Can resume from existing data
- Modular Design: Separate extraction and processing stages
- Well Documented: Comprehensive logging and progress reporting
- Rate Limiting: The extractor respects target sites with conservative delays
- Resource Usage: Multithreading uses significant CPU and memory
- Runtime: Complete data extraction can take several hours depending on site size
- Storage: Requires adequate disk space for collected data
- Ethical Use: Always respect robots.txt and site terms of service
Note: This project is licensed under CC BY-NC-ND 4.0 (no derivative works permitted). You may still:
- Study the code for educational purposes
- Share the project with proper attribution
- Use as reference for your own projects
- Multithreaded data extraction with rate limiting
- Thread-safe file operations and duplicate detection
- Configuration-driven architecture
- Comprehensive error handling and retry logic
Status: ✅ Complete - Successfully demonstrates comprehensive web data extraction and processing techniques
This project was developed by yumeangelica. For more information on how this work can be used, please refer to the LICENSE.txt file.
Copyright © 2025 - present; yumeangelica
This project is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. This allows you to share the work, with appropriate credit given, but not to use it for commercial purposes or to create derivative works.
For more details about the license, please see the Creative Commons license page.