srinivas-skr/Website-Email-Scraper

Advanced Email Web Scraper

This tool is designed to extract email addresses from a list of websites. It uses a two-step approach:

  1. First, it tries a fast method using requests and BeautifulSoup
  2. If that fails, it falls back to a more robust method using Selenium with Chrome WebDriver

Features

  • Two-step scraping approach for maximum effectiveness
  • Automatically checks contact pages for additional emails
  • Filters out false positives (strings that look like emails but are image filenames containing an @ symbol, for example logo@2x.png)
  • Creates example URLs file if none exists
  • Saves results to CSV for easy analysis
  • Detailed console output with progress information
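
The false-positive filtering mentioned above can be sketched as follows. This is a minimal illustration, not the script's actual code: the pattern assigned to EMAIL_REGEX, the ASSET_EXTENSIONS tuple, and the find_emails helper are all assumptions made for the example.

```python
import re

# Assumed pattern; the EMAIL_REGEX in the actual script may differ.
EMAIL_REGEX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical list of file extensions treated as false positives.
ASSET_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")


def find_emails(text):
    """Return unique, lowercased email-like strings, minus asset filenames."""
    candidates = {match.lower() for match in EMAIL_REGEX.findall(text)}
    # Drop matches such as "logo@2x.png" that end in an image extension.
    return sorted(e for e in candidates if not e.endswith(ASSET_EXTENSIONS))
```

Filenames such as logo@2x.png match a typical email pattern because "2x.png" looks like a domain, which is why a separate extension check is useful.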

Requirements

  • Python 3.6 or higher
  • Chrome browser installed (for Selenium fallback method)
  • Required Python packages (see Installation)

Installation

  1. Make sure Python is installed (on Windows, check the "Add to PATH" option in the installer)
  2. Install the required packages:

    pip install -r requirements.txt

Or install them individually:

    pip install selenium pandas beautifulsoup4 requests webdriver-manager

Usage

  1. Create a file named urls.txt with one URL per line, for example:

    https://example.com
    https://example.org
    
  2. Run the script:

    python local_scraper.py
    
  3. The script will create a file named extracted_emails.csv with the results
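
The input and output handling described in these steps might look roughly like the sketch below. It uses the standard csv module instead of pandas to stay self-contained, and the helper names (parse_urls, write_results) and the column names are hypothetical, not taken from the script.

```python
import csv


def parse_urls(lines):
    """Return non-empty, stripped lines in order, skipping duplicates."""
    seen, urls = set(), []
    for line in lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def write_results(rows, fileobj):
    """Write (source_url, email) pairs as CSV with a header row."""
    writer = csv.writer(fileobj)
    writer.writerow(["source_url", "email"])  # hypothetical column names
    writer.writerows(rows)
```

In practice, parse_urls would be given the open urls.txt file, and write_results a file opened with open("extracted_emails.csv", "w", newline="").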

How It Works

  1. For each URL in your list, the scraper first tries the fast method using requests
  2. If no emails are found, it automatically switches to the more powerful Selenium method
  3. Both methods also check for contact pages and scan them for additional emails
  4. All unique emails are saved to a CSV file with their source URLs
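
The fallback flow in the steps above can be sketched with the two fetch methods passed in as plain functions. This is an illustration under assumptions: fast_fetch stands in for the requests-based method and browser_fetch for the Selenium-based one, contact links are found with a simple regular expression in place of BeautifulSoup, and every name here is hypothetical.

```python
import re
from urllib.parse import urljoin

# Assumed patterns; the script's own expressions may differ.
EMAIL_REGEX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CONTACT_LINK = re.compile(r'href=["\']([^"\']*contact[^"\']*)["\']', re.IGNORECASE)


def emails_from(fetch, url):
    """Fetch url and its linked contact pages with `fetch`; return emails found."""
    html = fetch(url) or ""
    found = set(EMAIL_REGEX.findall(html))
    for href in CONTACT_LINK.findall(html):
        if href.lower().startswith("mailto:"):
            continue  # a mailto link is not a page to fetch
        contact_html = fetch(urljoin(url, href)) or ""
        found.update(EMAIL_REGEX.findall(contact_html))
    return found


def scrape(url, fast_fetch, browser_fetch):
    """Try the fast method first; use the browser method if it finds nothing."""
    try:
        emails = emails_from(fast_fetch, url)
    except Exception:
        emails = set()  # treat a failed fast fetch like an empty result
    if not emails:
        emails = emails_from(browser_fetch, url)
    return emails
```

Passing the fetchers as arguments keeps the fallback logic independent of requests and Selenium, so it can be exercised with stub functions.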

Customization

You can modify the following variables at the top of the script:

  • URLS_FILE: Change the input file name (default: 'urls.txt')
  • OUTPUT_CSV: Change the output file name (default: 'extracted_emails.csv')
  • EMAIL_REGEX: Modify the regular expression used to find emails
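
As an example of such a change, the settings below restrict matches to a single domain. The file names are the documented defaults; the regular expression is a hypothetical replacement, and whether the script stores EMAIL_REGEX as a string or a compiled pattern is an assumption.

```python
import re

URLS_FILE = "urls.txt"               # documented default input file
OUTPUT_CSV = "extracted_emails.csv"  # documented default output file

# Hypothetical stricter pattern: only addresses at example.com.
EMAIL_REGEX = r"[A-Za-z0-9._%+-]+@example\.com\b"

matches = re.findall(EMAIL_REGEX, "a@example.com, b@other.org")
```

Here matches contains only the example.com address. A narrow pattern like this reduces noise in the output but misses addresses on any other domain.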

Troubleshooting

If you encounter issues with Selenium:

  1. Make sure Chrome is installed on your system
  2. Try updating Chrome to the latest version
  3. If you're on Linux, you might need to install additional dependencies

Notes

  • The script includes a 3-second delay when using Selenium to allow JavaScript to load
  • A 1-second delay is added between URLs to avoid overloading servers
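
The pause between URLs could be implemented along these lines. This is a sketch only: process_all and its parameters are hypothetical names, and the sleep function is injectable so the loop can be tested without actually waiting.

```python
import time


def process_all(urls, handle, delay=1.0, sleep=time.sleep):
    """Call handle(url) for each URL, pausing `delay` seconds between URLs."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause before every URL except the first
        results[url] = handle(url)
    return results
```

Raising delay is a simple way to be gentler on slow or rate-limited servers, at the cost of a longer total run time.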
