This tool is designed to extract email addresses from a list of websites. It uses a two-step approach:
- First, it tries a fast method using `requests` and `BeautifulSoup` (sketched below)
- If that fails, it falls back to a more robust method using `Selenium` with Chrome WebDriver
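For illustration, the fast path can be as small as the sketch below. The function name `fast_extract` and the regex shown are illustrative assumptions, not necessarily the script's actual identifiers:

```python
import re

import requests
from bs4 import BeautifulSoup

# Illustrative email pattern; the script's own EMAIL_REGEX may differ.
EMAIL_REGEX = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

def fast_extract(url):
    """Fetch the page with requests and pull email-looking strings from it."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    # Scan both the visible text and the raw HTML (catches mailto: links).
    matches = re.findall(EMAIL_REGEX, soup.get_text()) + re.findall(EMAIL_REGEX, response.text)
    return set(matches)
```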
Key features:
- Two-step scraping approach for maximum effectiveness
- Automatically checks contact pages for additional emails
- Filters out false positives such as image filenames that contain an @ symbol (see the sketch after this list)
- Creates example URLs file if none exists
- Saves results to CSV for easy analysis
- Detailed console output with progress information
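The false-positive filter can be illustrated with a small helper like the one below. The name `looks_like_real_email` and the extension list are hypothetical; the script's actual check may differ:

```python
# Extensions that commonly produce false positives such as "logo@2x.png".
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")

def looks_like_real_email(candidate):
    """Return False for regex matches that are actually image filenames."""
    return not candidate.lower().endswith(IMAGE_EXTENSIONS)

# Example usage: emails = {e for e in emails if looks_like_real_email(e)}
```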
Requirements:
- Python 3.6 or higher
- Chrome browser installed (for Selenium fallback method)
- Required Python packages (see Installation)
Installation:
- Make sure you have Python installed (with the "Add to PATH" option checked)
- Install the required packages (a sample `requirements.txt` is shown below): `pip install -r requirements.txt`
- Or install them individually: `pip install selenium pandas beautifulsoup4 requests webdriver-manager`
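If a `requirements.txt` is not already present in the repository, one matching the packages above would simply list them, one per line:

```
selenium
pandas
beautifulsoup4
requests
webdriver-manager
```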
Usage:
- Create a file named `urls.txt` with one URL per line, for example `https://example.com` and `https://example.org`
- Run the script: `python local_scraper.py`
- The script will create a file named `extracted_emails.csv` with the results
How it works:
- For each URL in your list, the scraper first tries the fast method using `requests`
- If no emails are found, it automatically switches to the more powerful `Selenium` method (see the sketch after this list)
- Both methods also check for contact pages and scan them for additional emails
- All unique emails are saved to a CSV file with their source URLs
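Under the hood, the Selenium fallback likely looks roughly like the sketch below. The function name `selenium_extract`, the headless flag, and the 3-second wait are assumptions based on the notes at the end of this README, not the script's exact code:

```python
import re
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

EMAIL_REGEX = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

def selenium_extract(url):
    """Render the page in headless Chrome so JavaScript-loaded emails are visible."""
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    try:
        driver.get(url)
        time.sleep(3)  # matches the 3-second JavaScript delay noted below
        return set(re.findall(EMAIL_REGEX, driver.page_source))
    finally:
        driver.quit()
```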
You can modify the following variables at the top of the script (see the sketch after this list):
- `URLS_FILE`: Change the input file name (default: 'urls.txt')
- `OUTPUT_CSV`: Change the output file name (default: 'extracted_emails.csv')
- `EMAIL_REGEX`: Modify the regular expression used to find emails
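For reference, those settings might look like the following near the top of `local_scraper.py`. The defaults match what this README describes, but the regex shown is a common email pattern and an assumption, not necessarily the one shipped with the script:

```python
# Settings near the top of local_scraper.py (file names are the documented defaults)
URLS_FILE = "urls.txt"                # input: one URL per line
OUTPUT_CSV = "extracted_emails.csv"   # output: emails with their source URLs
EMAIL_REGEX = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"  # assumed pattern
```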
If you encounter issues with Selenium:
- Make sure Chrome is installed on your system
- Try updating Chrome to the latest version
- If you're on Linux, you may need to install additional system libraries required by Chrome
Notes:
- The script includes a 3-second delay when using Selenium to allow JavaScript to load
- A 1-second delay is added between URLs to avoid overloading servers