User Agent Extractor

A comprehensive tool for extracting user agent strings from online sources, featuring advanced data processing and cleaning. Released for demonstration and portfolio purposes.

Real-world performance: Successfully extracted and processed nearly 80k unique user agents in just a few hours, demonstrating enterprise-level data collection capabilities.

Project Overview

This project extracts user agents from web sources across all categories (Browser Types, Devices, Platforms), subcategories, and makers. It includes multithreaded data extraction, real-time duplicate detection, user agent rotation, and comprehensive data cleaning pipelines.

Features

  • High-Volume Data Extraction: Proven to extract nearly 80k user agents efficiently
  • Comprehensive Coverage: Extracts user agents from all categories, subcategories, and makers
  • Multithreaded Performance: Optimized with 4 workers for speed while avoiding rate limits
  • Real-time Duplicate Detection: Thread-safe duplicate prevention during extraction (see the sketch after this list)
  • User Agent Rotation: Rotates through different user agents to avoid detection
  • Pagination Handling: Automatically navigates through all pages with loop detection
  • Data Cleaning Pipeline: Multiple processing stages to clean and validate data
  • Rate Limiting Protection: Conservative delays and exponential backoff
  • CSV + Text Output: Structured CSV with metadata and clean text files
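
The thread-safe duplicate detection above can be illustrated with a minimal sketch; the names seen_agents and register_agent are illustrative, not the actual identifiers used in the source:

import threading

# Shared state guarded by a lock so multiple extraction workers
# can check-and-insert atomically.
seen_agents = set()
seen_lock = threading.Lock()

def register_agent(user_agent):
    """Return True if the agent is new, False if it is a duplicate."""
    with seen_lock:
        if user_agent in seen_agents:
            return False
        seen_agents.add(user_agent)
        return True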

Project Structure

user_agent_extractor/
├── main.py                       # Main entry point - run this script
├── README.md                     # This file
├── LICENSE.txt                   # Creative Commons license
├── requirements.txt              # Python dependencies
├── config/                       # Configuration files (create your own)
│   ├── rotation_agents.txt       # User agents for rotation, one per line
│   └── config.json               # Application settings (not in repo)
└── src/                         # Source code
    ├── __init__.py              # Package initialization
    ├── data_extractor_multithreaded.py # Core optimized multithreaded data extractor
    └── csv_processing/          # Data processing utilities
        ├── README.md            # Processing documentation
        ├── process_data.py      # Unified pipeline script
        ├── 01_sort_raw_csv.py   # Sort raw CSV data
        ├── 02_clean_csv_data.py # Clean CSV with metadata
        ├── 03_extract_to_txt.py # Extract clean agents to text
        ├── 04_extract_csv_to_txt.py # Alternative CSV to text conversion
        ├── 05_clean_txt_file.py # Clean text file format
        ├── 06_final_analysis.py # Advanced analysis and cleaning
        └── 07_validate_data.py  # Verify data integrity

Setup & Installation

Prerequisites

  • Python 3.8+
  • pip

Installation

  1. Clone or download the project

  2. Create virtual environment:

    python -m venv venv
    source venv/bin/activate  # On macOS/Linux

  3. Install dependencies:

    pip install -r requirements.txt

  4. Install Playwright browsers:

    playwright install chromium

  5. Create the configuration file config/config.json with your settings:

    {
      "base_url": "https://your-target-site.com",
      "max_workers": 4,
      "max_retries": 2,
      "batch_delay_min": 1.5,
      "batch_delay_max": 3.0,
      "page_delay_min": 0.4,
      "page_delay_max": 1.0,
      "worker_start_delay_min": 0.7,
      "worker_start_delay_max": 1.8,
      "request_timeout": 30000,
      "rate_limit_delay": 3.0,
      "max_pages_per_maker": 20,
      "max_pages_sequential": 30,
      "progress_update_interval": 20,
      "duplicate_report_interval": 100
    }

  6. Create the rotation agents file config/rotation_agents.txt with user agents for rotation, one per line:

    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
    Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0
    
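With both files in place, they can be read with the standard library alone. A minimal sketch, assuming helper names load_config and load_rotation_agents (not necessarily the extractor's actual API):

import json
from pathlib import Path

def load_config(path="config/config.json"):
    # Settings such as max_workers and the delay ranges come from here.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def load_rotation_agents(path="config/rotation_agents.txt"):
    # One user agent per line; blank lines are skipped.
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]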

Usage

Quick Start

  1. Activate virtual environment:

    source venv/bin/activate

  2. Run the main data extractor:

    python main.py

Data Processing

Option 1: Use the Unified Pipeline (Recommended)

Process all data with one command:

# Run the complete processing pipeline
python src/csv_processing/process_data.py

# Or run specific steps only
python src/csv_processing/process_data.py --step 1    # Sort only
python src/csv_processing/process_data.py --step 2    # Clean only
# ... (steps 1-7 available)
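
A --step flag like this is typically wired up with argparse. The sketch below is a hypothetical reconstruction of that dispatch, not the actual contents of process_data.py:

import argparse
import subprocess
import sys

# Ordered pipeline steps; --step N maps to the Nth script.
# Assumes it is run from src/csv_processing/ alongside the step scripts.
STEPS = [
    "01_sort_raw_csv.py",
    "02_clean_csv_data.py",
    "03_extract_to_txt.py",
    "04_extract_csv_to_txt.py",
    "05_clean_txt_file.py",
    "06_final_analysis.py",
    "07_validate_data.py",
]

parser = argparse.ArgumentParser(description="Run the CSV processing pipeline")
parser.add_argument("--step", type=int, choices=range(1, 8),
                    help="run a single step (1-7); default is all steps")
args = parser.parse_args()

selected = [STEPS[args.step - 1]] if args.step else STEPS
for script in selected:
    subprocess.run([sys.executable, script], check=True)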

Option 2: Manual Step-by-Step Processing

Process and clean the extracted data manually:

cd src/csv_processing

# 1. Sort the raw data
python 01_sort_raw_csv.py

# 2. Clean the CSV (preserves metadata)
python 02_clean_csv_data.py

# 3. Extract clean user agents to text file
python 03_extract_to_txt.py

# Optional: Additional processing
python 04_extract_csv_to_txt.py  # Alternative CSV to text conversion
python 05_clean_txt_file.py      # Clean text file format
python 06_final_analysis.py      # Advanced analysis
python 07_validate_data.py       # Verify data integrity

Output Files

Final Results

  • agents.txt: Final cleaned user agents (project root)
  • all_useragents.txt: All extracted user agents (project root)
  • useragents.csv: Raw extracted data with metadata (project root)

Intermediate Processing Files

  • src/csv_processing/useragents_sorted.csv: Organized by category
  • src/csv_processing/useragents_sorted_clean.csv: Clean CSV with metadata
  • src/csv_processing/all_useragents_clean_txt.txt: Clean text format
  • src/csv_processing/all_useragents_final_txt.txt: Final processed text

Data Structure (CSV)

category,subcategory,maker,user_agent
Browser Types,Application,AOL,"Mozilla/5.0 (iPhone; CPU iPhone OS 8_2..."
Browser Types,Application,Adobe Systems,"Mozilla/5.0 (Windows; U; en-US)..."
Devices,Desktop,Apple Inc,"Mozilla/5.0 (Macintosh; U; PPC Mac OS X..."
Platforms,Windows,Microsoft Corporation,"Mozilla/5.0 (Windows NT 10.0..."
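
Because the output is plain CSV with a header row, it can be consumed directly with the standard csv module. For example, counting extracted agents per top-level category:

import csv
from collections import Counter

# Tally user agents per top-level category from the raw output.
with open("useragents.csv", newline="", encoding="utf-8") as f:
    counts = Counter(row["category"] for row in csv.DictReader(f))

for category, total in counts.most_common():
    print(f"{category}: {total}")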

Data Cleaning

The processing pipeline removes the following (a simplified filter sketch follows the list):

  • Invalid/corrupted entries (null values, undefined data)
  • Duplicate entries across all processing stages
  • Malformed user agents (suspicious characters, encoding issues)
  • Very short entries (< 8 characters)
  • Corrupted data suffixes and formatting issues
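
A simplified standalone version of these checks might look like the following; the exact rules live in the numbered pipeline scripts and may differ in detail:

def is_valid_agent(agent):
    """Apply the cleaning rules above to a single user agent string."""
    agent = agent.strip()
    if not agent or agent.lower() in {"null", "undefined", "none"}:
        return False                        # invalid/corrupted entries
    if len(agent) < 8:
        return False                        # very short entries
    if any(ord(ch) < 32 or ord(ch) > 126 for ch in agent):
        return False                        # suspicious characters / encoding issues
    return True

def clean_agents(agents):
    """Filter a list of agents, dropping invalid entries and duplicates."""
    seen, result = set(), []
    for agent in (a.strip() for a in agents):
        if is_valid_agent(agent) and agent not in seen:
            seen.add(agent)
            result.append(agent)
    return result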

Configuration

Data Extractor Settings (defined in config/config.json, used by data_extractor_multithreaded.py)

  • Workers: 4 (optimized for speed vs rate limiting)
  • Delays: 1.5-3.0s between batches, 0.4-1.0s between pages
  • Timeout: 30s per request
  • Page Limit: 20-30 pages per maker (prevents infinite loops)
  • User Agent Rotation: Every 5 pages
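
The rotation schedule (a new user agent every 5 pages) can be expressed with itertools.cycle. A minimal sketch, assuming the agents were loaded from config/rotation_agents.txt:

from itertools import cycle

ROTATE_EVERY = 5  # pages per user agent, matching the setting above

def agent_for_pages(agents, total_pages):
    """Yield the user agent to use for each page, switching every 5 pages."""
    pool = cycle(agents)
    current = next(pool)
    for page in range(total_pages):
        if page and page % ROTATE_EVERY == 0:
            current = next(pool)
        yield current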

Rate Limiting Protection

  • Conservative delays between requests
  • Exponential backoff on errors
  • User agent rotation
  • Request timeout handling
  • Automatic retry with increasing delays
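
Exponential backoff of this kind is straightforward to sketch. The defaults below mirror the max_retries and rate_limit_delay values from config.json, but the function itself is illustrative, not the extractor's actual code:

import random
import time

def fetch_with_backoff(fetch, url, max_retries=2, base_delay=3.0):
    """Call fetch(url), retrying with exponentially increasing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise                       # out of retries; surface the error
            # Delays double each retry (3.0s, 6.0s, ...), plus jitter
            # to avoid synchronized retries across workers.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))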

Performance & Results

Data Extraction Performance

  • Proven Scale: Successfully extracted nearly 80k user agents in a single session
  • Total Runtime: Completed full extraction in just a few hours
  • Success Rate: 99.9% (with error handling and retries)
  • Data Quality: High quality data after comprehensive processing

Final Statistics

  • Categories: 3 main categories (Browser Types, Devices, Platforms)
  • Subcategories: Multiple subcategories per main category
  • Makers: Hundreds of different makers/vendors
  • User Agents: Tens of thousands of unique, validated entries (nearly 80k in the largest run)
  • Output Formats: CSV with metadata, clean text files

Technical Details

Dependencies

  • Playwright: Web automation and scraping
  • csv (standard library): Data processing and output
  • threading (standard library): Concurrent processing
  • urllib.parse (standard library): URL handling

Architecture

  • Multithreaded Design: ThreadPoolExecutor with worker pools
  • Thread-safe Operations: Locks for file writing and duplicate detection
  • Memory Efficient: Real-time processing and writing
  • Error Resilient: Comprehensive exception handling
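
Put together, the worker pool and thread-safe writing pattern looks roughly like this. This is a structural sketch only; scrape_maker stands in for the real Playwright-based extraction routine:

import csv
import threading
from concurrent.futures import ThreadPoolExecutor

write_lock = threading.Lock()

def scrape_maker(maker):
    # Placeholder for the real per-maker extraction logic.
    return [{"category": "Devices", "subcategory": "Desktop",
             "maker": maker, "user_agent": "Mozilla/5.0 (...)"}]

def worker(maker, writer):
    rows = scrape_maker(maker)
    with write_lock:                        # serialize writes to the shared file
        writer.writerows(rows)

with open("useragents.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["category", "subcategory", "maker", "user_agent"])
    writer.writeheader()
    with ThreadPoolExecutor(max_workers=4) as pool:  # 4 workers, as configured
        for maker in ["Apple Inc", "Microsoft Corporation"]:
            pool.submit(worker, maker, writer)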

Data Validation

Quality Assurance

  • No Duplicates: Verified across all output files
  • Complete Coverage: All categories, subcategories, and makers
  • Data Integrity: No corruption or invalid entries
  • Format Validation: Proper CSV structure and encoding

Verification Tools

  • 07_validate_data.py: Validate data integrity
  • Built-in duplicate detection during data extraction
  • Comprehensive cleaning pipeline with validation

Notes

  • Respectful Data Extraction: Implements delays and rate limiting
  • Incremental Processing: Can resume from existing data
  • Modular Design: Separate extraction and processing stages
  • Well Documented: Comprehensive logging and progress reporting

Important Considerations

  • Rate Limiting: The extractor respects target sites with conservative delays
  • Resource Usage: Multithreading uses significant CPU and memory
  • Runtime: Complete data extraction can take several hours depending on site size
  • Storage: Requires adequate disk space for collected data
  • Ethical Use: Always respect robots.txt and site terms of service

Usage Guidelines

Note: This project is licensed under CC BY-NC-ND 4.0 (no derivative works permitted).

Allowed Uses:

  • Study the code for educational purposes
  • Share the project with proper attribution
  • Use as reference for your own projects

Key Implementation Techniques:

  • Multithreaded data extraction with rate limiting
  • Thread-safe file operations and duplicate detection
  • Configuration-driven architecture
  • Comprehensive error handling and retry logic

Status: ✅ Complete - Successfully demonstrates comprehensive web data extraction and processing techniques


Credits

This project was developed by yumeangelica. For more information on how this work can be used, please refer to the LICENSE.txt file.

Copyright © 2025 - present, yumeangelica

License

This project is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. This allows you to share the work, with appropriate credit given, but not to use it for commercial purposes or to create derivative works.

For more details about the license, please visit https://creativecommons.org/licenses/by-nc-nd/4.0/.
