User Agent Extractor

A comprehensive tool for extracting user agent strings from online sources, featuring advanced data processing and cleaning. Released for demonstration and portfolio purposes.

Real-world performance: Successfully extracted and processed nearly 80k unique user agents in just a few hours, demonstrating enterprise-level data collection capabilities.

Project Overview

This project extracts user agents from web sources across all categories (Browser Types, Devices, Platforms), subcategories, and makers. It includes multithreaded data extraction, real-time duplicate detection, user agent rotation, and comprehensive data cleaning pipelines.

Features

  • High-Volume Data Extraction: Proven to extract nearly 80k user agents efficiently
  • Comprehensive Coverage: Extracts user agents from all categories, subcategories, and makers
  • Multithreaded Performance: Optimized with 4 workers for speed while avoiding rate limits
  • Real-time Duplicate Detection: Thread-safe duplicate prevention during extraction (see the sketch after this list)
  • User Agent Rotation: Rotates through different user agents to avoid detection
  • Pagination Handling: Automatically navigates through all pages with loop detection
  • Data Cleaning Pipeline: Multiple processing stages to clean and validate data
  • Rate Limiting Protection: Conservative delays and exponential backoff
  • CSV + Text Output: Structured CSV with metadata and clean text files
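
The thread-safe duplicate detection above can be illustrated with a minimal sketch; the names seen_agents and register_agent are illustrative, not the actual identifiers used in the source:

import threading

# Shared state guarded by a lock so multiple extraction workers
# can check-and-insert atomically.
seen_agents = set()
seen_lock = threading.Lock()

def register_agent(user_agent):
    """Return True if the agent is new, False if it is a duplicate."""
    with seen_lock:
        if user_agent in seen_agents:
            return False
        seen_agents.add(user_agent)
        return True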

Project Structure

user_agent_extractor/
├── main.py                       # Main entry point - run this script
├── README.md                     # This file
├── LICENSE.txt                   # Creative Commons license
├── requirements.txt              # Python dependencies
├── config/                       # Configuration files (create your own)
│   ├── rotation_agents.txt       # User agents for rotation, one per line
│   └── config.json               # Application settings (not in repo)
└── src/                         # Source code
    ├── __init__.py              # Package initialization
    ├── data_extractor_multithreaded.py # Core optimized multithreaded data extractor
    └── csv_processing/          # Data processing utilities
        ├── README.md            # Processing documentation
        ├── process_data.py      # Unified pipeline script
        ├── 01_sort_raw_csv.py   # Sort raw CSV data
        ├── 02_clean_csv_data.py # Clean CSV with metadata
        ├── 03_extract_to_txt.py # Extract clean agents to text
        ├── 04_extract_csv_to_txt.py # Alternative CSV to text conversion
        ├── 05_clean_txt_file.py # Clean text file format
        ├── 06_final_analysis.py # Advanced analysis and cleaning
        └── 07_validate_data.py  # Verify data integrity

Setup & Installation

Prerequisites

  • Python 3.8+
  • pip

Installation

  1. Clone or download the project

  2. Create virtual environment:

    python -m venv venv
    source venv/bin/activate  # On macOS/Linux

  3. Install dependencies:

    pip install -r requirements.txt

  4. Install Playwright browsers:

    playwright install chromium

  5. Create the configuration file config/config.json with your settings:

    {
      "base_url": "https://your-target-site.com",
      "max_workers": 4,
      "max_retries": 2,
      "batch_delay_min": 1.5,
      "batch_delay_max": 3.0,
      "page_delay_min": 0.4,
      "page_delay_max": 1.0,
      "worker_start_delay_min": 0.7,
      "worker_start_delay_max": 1.8,
      "request_timeout": 30000,
      "rate_limit_delay": 3.0,
      "max_pages_per_maker": 20,
      "max_pages_sequential": 30,
      "progress_update_interval": 20,
      "duplicate_report_interval": 100
    }

  6. Create the rotation agents file config/rotation_agents.txt with user agents for rotation, one per line:

    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
    Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0
    
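With both files in place, they can be read with the standard library alone. A minimal sketch, assuming helper names load_config and load_rotation_agents (not necessarily the extractor's actual API):

import json
from pathlib import Path

def load_config(path="config/config.json"):
    # Settings such as max_workers and the delay ranges come from here.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def load_rotation_agents(path="config/rotation_agents.txt"):
    # One user agent per line; blank lines are skipped.
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]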

Usage

Quick Start

  1. Activate virtual environment:

    source venv/bin/activate

  2. Run the main data extractor:

    python main.py

Data Processing

Option 1: Use the Unified Pipeline (Recommended)

Process all data with one command:

# Run the complete processing pipeline
python src/csv_processing/process_data.py

# Or run specific steps only
python src/csv_processing/process_data.py --step 1    # Sort only
python src/csv_processing/process_data.py --step 2    # Clean only
# ... (steps 1-7 available)
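
A --step flag like this is typically wired up with argparse. The sketch below is a hypothetical reconstruction of that dispatch, not the actual contents of process_data.py:

import argparse
import subprocess
import sys

# Ordered pipeline steps; --step N maps to the Nth script.
# Assumes it is run from src/csv_processing/ alongside the step scripts.
STEPS = [
    "01_sort_raw_csv.py",
    "02_clean_csv_data.py",
    "03_extract_to_txt.py",
    "04_extract_csv_to_txt.py",
    "05_clean_txt_file.py",
    "06_final_analysis.py",
    "07_validate_data.py",
]

parser = argparse.ArgumentParser(description="Run the CSV processing pipeline")
parser.add_argument("--step", type=int, choices=range(1, 8),
                    help="run a single step (1-7); default is all steps")
args = parser.parse_args()

selected = [STEPS[args.step - 1]] if args.step else STEPS
for script in selected:
    subprocess.run([sys.executable, script], check=True)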

Option 2: Manual Step-by-Step Processing

Process and clean the extracted data manually:

cd src/csv_processing

# 1. Sort the raw data
python 01_sort_raw_csv.py

# 2. Clean the CSV (preserves metadata)
python 02_clean_csv_data.py

# 3. Extract clean user agents to text file
python 03_extract_to_txt.py

# Optional: Additional processing
python 04_extract_csv_to_txt.py  # Alternative CSV to text conversion
python 05_clean_txt_file.py      # Clean text file format
python 06_final_analysis.py      # Advanced analysis
python 07_validate_data.py       # Verify data integrity

Output Files

Final Results

  • agents.txt: Final cleaned user agents (project root)
  • all_useragents.txt: All extracted user agents (project root)
  • useragents.csv: Raw extracted data with metadata (project root)

Intermediate Processing Files

  • src/csv_processing/useragents_sorted.csv: Organized by category
  • src/csv_processing/useragents_sorted_clean.csv: Clean CSV with metadata
  • src/csv_processing/all_useragents_clean_txt.txt: Clean text format
  • src/csv_processing/all_useragents_final_txt.txt: Final processed text

Data Structure (CSV)

category,subcategory,maker,user_agent
Browser Types,Application,AOL,"Mozilla/5.0 (iPhone; CPU iPhone OS 8_2..."
Browser Types,Application,Adobe Systems,"Mozilla/5.0 (Windows; U; en-US)..."
Devices,Desktop,Apple Inc,"Mozilla/5.0 (Macintosh; U; PPC Mac OS X..."
Platforms,Windows,Microsoft Corporation,"Mozilla/5.0 (Windows NT 10.0..."
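
Because the output is plain CSV with a header row, it can be consumed directly with the standard csv module. For example, counting extracted agents per top-level category:

import csv
from collections import Counter

# Tally user agents per top-level category from the raw output.
with open("useragents.csv", newline="", encoding="utf-8") as f:
    counts = Counter(row["category"] for row in csv.DictReader(f))

for category, total in counts.most_common():
    print(f"{category}: {total}")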

Data Cleaning

The processing pipeline removes the following (a simplified filter sketch follows the list):

  • Invalid/corrupted entries (null values, undefined data)
  • Duplicate entries across all processing stages
  • Malformed user agents (suspicious characters, encoding issues)
  • Very short entries (< 8 characters)
  • Corrupted data suffixes and formatting issues
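
A simplified standalone version of these checks might look like the following; the exact rules live in the numbered pipeline scripts and may differ in detail:

def is_valid_agent(agent):
    """Apply the cleaning rules above to a single user agent string."""
    agent = agent.strip()
    if not agent or agent.lower() in {"null", "undefined", "none"}:
        return False                        # invalid/corrupted entries
    if len(agent) < 8:
        return False                        # very short entries
    if any(ord(ch) < 32 or ord(ch) > 126 for ch in agent):
        return False                        # suspicious characters / encoding issues
    return True

def clean_agents(agents):
    """Filter a list of agents, dropping invalid entries and duplicates."""
    seen, result = set(), []
    for agent in (a.strip() for a in agents):
        if is_valid_agent(agent) and agent not in seen:
            seen.add(agent)
            result.append(agent)
    return result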

Configuration

Data Extractor Settings (defined in config/config.json, used by data_extractor_multithreaded.py)

  • Workers: 4 (optimized for speed vs rate limiting)
  • Delays: 1.5-3.0s between batches, 0.4-1.0s between pages
  • Timeout: 30s per request
  • Page Limit: 20-30 pages per maker (prevents infinite loops)
  • User Agent Rotation: Every 5 pages
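
The rotation schedule (a new user agent every 5 pages) can be expressed with itertools.cycle. A minimal sketch, assuming the agents were loaded from config/rotation_agents.txt:

from itertools import cycle

ROTATE_EVERY = 5  # pages per user agent, matching the setting above

def agent_for_pages(agents, total_pages):
    """Yield the user agent to use for each page, switching every 5 pages."""
    pool = cycle(agents)
    current = next(pool)
    for page in range(total_pages):
        if page and page % ROTATE_EVERY == 0:
            current = next(pool)
        yield current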

Rate Limiting Protection

  • Conservative delays between requests
  • Exponential backoff on errors
  • User agent rotation
  • Request timeout handling
  • Automatic retry with increasing delays
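
Exponential backoff of this kind is straightforward to sketch. The defaults below mirror the max_retries and rate_limit_delay values from config.json, but the function itself is illustrative, not the extractor's actual code:

import random
import time

def fetch_with_backoff(fetch, url, max_retries=2, base_delay=3.0):
    """Call fetch(url), retrying with exponentially increasing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise                       # out of retries; surface the error
            # Delays double each retry (3.0s, 6.0s, ...), plus jitter
            # to avoid synchronized retries across workers.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))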

Performance & Results

Data Extraction Performance

  • Proven Scale: Successfully extracted nearly 80k user agents in a single session
  • Total Runtime: Completed full extraction in just a few hours
  • Success Rate: 99.9% (with error handling and retries)
  • Data Quality: High quality data after comprehensive processing

Final Statistics

  • Categories: 3 main categories (Browser Types, Devices, Platforms)
  • Subcategories: Multiple subcategories per main category
  • Makers: Hundreds of different makers/vendors
  • User Agents: Tens of thousands of unique, validated entries (nearly 80k in the largest run)
  • Output Formats: CSV with metadata, clean text files

Technical Details

Dependencies

  • Playwright: Web automation and scraping
  • csv (standard library): Data processing and output
  • threading (standard library): Concurrent processing
  • urllib.parse (standard library): URL handling

Architecture

  • Multithreaded Design: ThreadPoolExecutor with worker pools
  • Thread-safe Operations: Locks for file writing and duplicate detection
  • Memory Efficient: Real-time processing and writing
  • Error Resilient: Comprehensive exception handling
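
Put together, the worker pool and thread-safe writing pattern looks roughly like this. This is a structural sketch only; scrape_maker stands in for the real Playwright-based extraction routine:

import csv
import threading
from concurrent.futures import ThreadPoolExecutor

write_lock = threading.Lock()

def scrape_maker(maker):
    # Placeholder for the real per-maker extraction logic.
    return [{"category": "Devices", "subcategory": "Desktop",
             "maker": maker, "user_agent": "Mozilla/5.0 (...)"}]

def worker(maker, writer):
    rows = scrape_maker(maker)
    with write_lock:                        # serialize writes to the shared file
        writer.writerows(rows)

with open("useragents.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["category", "subcategory", "maker", "user_agent"])
    writer.writeheader()
    with ThreadPoolExecutor(max_workers=4) as pool:  # 4 workers, as configured
        for maker in ["Apple Inc", "Microsoft Corporation"]:
            pool.submit(worker, maker, writer)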

Data Validation

Quality Assurance

  • No Duplicates: Verified across all output files
  • Complete Coverage: All categories, subcategories, and makers
  • Data Integrity: No corruption or invalid entries
  • Format Validation: Proper CSV structure and encoding

Verification Tools

  • 07_validate_data.py: Validate data integrity
  • Built-in duplicate detection during data extraction
  • Comprehensive cleaning pipeline with validation

Notes

  • Respectful Data Extraction: Implements delays and rate limiting
  • Incremental Processing: Can resume from existing data
  • Modular Design: Separate extraction and processing stages
  • Well Documented: Comprehensive logging and progress reporting

Important Considerations

  • Rate Limiting: The extractor respects target sites with conservative delays
  • Resource Usage: Multithreading uses significant CPU and memory
  • Runtime: Complete data extraction can take several hours depending on site size
  • Storage: Requires adequate disk space for collected data
  • Ethical Use: Always respect robots.txt and site terms of service

Usage Guidelines

Note: This project is licensed under CC BY-NC-ND 4.0 (no derivative works permitted).

Allowed Uses:

  • Study the code for educational purposes
  • Share the project with proper attribution
  • Use as reference for your own projects

Key Implementation Techniques:

  • Multithreaded data extraction with rate limiting
  • Thread-safe file operations and duplicate detection
  • Configuration-driven architecture
  • Comprehensive error handling and retry logic

Status: ✅ Complete - Successfully demonstrates comprehensive web data extraction and processing techniques


Credits

This project was developed by yumeangelica. For more information on how this work can be used, please refer to the LICENSE.txt file.

Copyright © 2025 - present, yumeangelica

License

This project is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. This allows you to share the work, with appropriate credit given, but not to use it for commercial purposes or to create derivative works.

For more details about the license, please visit https://creativecommons.org/licenses/by-nc-nd/4.0/.
