alfredo-villa/webextractor

WebExtractor

WebExtractor is a modular Python application that analyzes the structure and content of any website (static, dynamic, WordPress, etc.), generating a detailed technical HTML report and a user-friendly AI-powered report. The app is designed to be easily extensible and features a modern, intuitive UI.


Disclaimer

Important Notice:
WebExtractor is provided for educational, research, and legitimate business purposes only.
You are solely responsible for ensuring that your use of this software complies with all applicable laws, regulations, and the terms of service of any websites you access or analyze.
WebExtractor's authors and contributors disclaim any liability for misuse or illegal activity.
Do not use this tool to scrape or analyze websites without proper authorization or in violation of copyright or data protection laws.


Main Features

  • Automatic analysis of website structure (site map, hierarchies, menus, pages, articles)
  • Collection of structured data from all reachable pages
  • Support for static HTML sites and dynamic sites (JavaScript-loaded content)
  • Multilanguage support: All generated reports and AI outputs are available in 5 languages (Italian, English, Spanish, French, German). The user can select the language at project creation. All UI, report, and AI-generated texts are localized, while scraped content remains in its original language.
  • Generation of a detailed technical HTML report, downloadable in a project-named folder
  • Generation of a user-friendly report via AI (OpenAI, Anthropic, Gemini, Deepseek)
  • Modern, easy-to-use frontend (local web app)
  • Dynamic progress bar
  • Dynamic AI configuration (OpenAI, Anthropic, Gemini, Deepseek) with secure local API key storage
  • One-click reset of all saved API keys from the AI Configuration page, with confirmation and feedback
  • Modular architecture, ready for future extensions (plugins)
  • Local configuration file
  • Cross-platform: works on macOS, Windows, and Linux (see installation notes below)
  • Fully tested on macOS

Multilanguage Support

WebExtractor now supports five languages for all generated content:

  • Italiano (it)
  • English (en)
  • Español (es)
  • Français (fr)
  • Deutsch (de)

You can select your preferred language when creating a new project. All technical and user-friendly reports, as well as AI-generated summaries, will be presented in the selected language. The localization system is easily extendable: to add a new language, simply update the config/translations.py file.
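To illustrate the extension point, here is a minimal sketch of what a localization dictionary and lookup helper might look like. The names `TRANSLATIONS` and `translate` are illustrative assumptions; the actual contents of config/translations.py may differ.

```python
# Hypothetical sketch of config/translations.py -- the real module may
# differ; TRANSLATIONS and translate() are illustrative names only.

TRANSLATIONS = {
    "report_title": {
        "it": "Rapporto tecnico",
        "en": "Technical report",
        "es": "Informe técnico",
        "fr": "Rapport technique",
        "de": "Technischer Bericht",
    },
}

def translate(key: str, lang: str = "en") -> str:
    """Look up a UI string, falling back to English, then the key itself."""
    entry = TRANSLATIONS.get(key, {})
    return entry.get(lang) or entry.get("en", key)
```

Adding a sixth language would then amount to adding one more code/value pair to each entry.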


Project Structure

webextractor/
│
├── ai/                # AI management module (dynamic providers, plugins)
├── config/            # Configuration management and local storage
│   └── translations.py # Multilanguage dictionaries
├── frontend/          # HTML templates, static, JS, CSS (modern UI)
│   ├── static/
│   └── templates/
├── plugins/           # Space for future extensions/plugins
├── report/            # HTML report generation and AI intermediate files
├── scraper/           # Scraping and site structure analysis (modular, static/dynamic)
│
├── README.md          # This guide
└── requirements.txt   # Python dependencies

Installation

1. Prerequisites

  • Python 3.9+ (Python 3.10+ recommended)
  • pip (Python package manager)
  • Google Chrome or Chromium (for dynamic site scraping with Selenium)
  • ChromeDriver (for Selenium, managed automatically)
  • macOS, Windows, or Linux (see notes below)

2. Install Python dependencies

Open a terminal in the project folder and run:

pip install -r requirements.txt

3. Requirements for dynamic scraping

  • Google Chrome (https://www.google.com/chrome/) or Chromium must be installed on your system.
    • On Windows: Download and install Chrome from the official site.
    • On Linux: Install with your package manager, e.g. sudo apt install chromium-browser or sudo apt install google-chrome-stable.
    • On macOS: Download from the official site or use Homebrew.
  • ChromeDriver is managed automatically by webdriver-manager; no manual installation needed.

4. Start the app

python main.py

The app will be accessible in your browser at http://localhost:5000.

Screenshot Troubleshooting

  • If screenshot generation fails or does not start, ensure Google Chrome or Chromium is installed and up to date.
  • webdriver-manager automatically downloads the correct ChromeDriver version.
  • If you use a non-standard OS or a very recent Chrome version, update both Chrome and Python packages (pip install -U selenium webdriver-manager).
  • For "cannot find Chrome binary" errors:
    • Windows: Make sure Chrome is installed in the default location or is in your PATH.
    • Linux: Ensure google-chrome or chromium-browser is in your PATH.
    • macOS: Ensure Chrome is in /Applications or in your PATH.
  • If you encounter issues, check the console output for error messages about Chrome/ChromeDriver.


Scraping Modes

  • Static (default): Faster, suitable for classic HTML sites.
  • Dynamic (JS): Enable the “Enable dynamic scraping (JS)” checkbox on the home page to capture content loaded via JavaScript (SPA, dynamic menus, etc.).
    Requires Chrome/Chromium and ChromeDriver installed.
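In static mode, the scraper works on raw HTML without executing JavaScript. As a standard-library-only sketch of that idea (the real scraper/ module likely uses richer tooling), link extraction from a fetched page can be done with `html.parser`:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href targets from <a> tags in static HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = '<nav><a href="/about">About</a><a href="/blog">Blog</a></nav>'
parser = LinkCollector()
parser.feed(html)
print(parser.links)  # ['/about', '/blog']
```

Content injected by JavaScript never appears in the raw HTML, which is why SPA-style sites need the dynamic (Selenium-based) mode instead.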

AI Configuration

  • From the frontend, you can select the AI provider (OpenAI, Anthropic, Gemini, Deepseek) and enter your API key.
  • The key is stored securely and locally for future use.
  • You can change provider or key at any time from the configuration section.
  • The user-friendly report is generated via AI from an optimized intermediate file.
  • Reset All API Keys: You can reset (delete) all saved API keys at any time by clicking the "Reset All API Keys" button at the bottom of the AI Configuration page. This action requires confirmation and will remove all stored API credentials from your local configuration.

Analysis Pipeline

  1. Project setup: Enter project name, URL, choose AI, language, and scraping mode.
  2. Scraping: Collect data from all public pages (static or dynamic).
  3. Technical report generation: Detailed HTML, faithful to the original.
  4. AI intermediate file generation: Optimized data to reduce AI costs.
  5. User-friendly report generation: Via AI, designed for clients or non-technical users.
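The five steps above can be sketched as a simple driver that threads data from one stage to the next. Every function name here is an assumption for illustration, not WebExtractor's actual internal API; stub callables stand in for real scraping and AI calls to show the data flow only.

```python
# Illustrative pipeline driver -- names are assumptions, not the app's real API.

def run_pipeline(project, url, scrape, build_technical_report,
                 build_ai_intermediate, build_friendly_report):
    pages = scrape(url)                              # step 2: collect pages
    technical = build_technical_report(pages)        # step 3: technical HTML
    intermediate = build_ai_intermediate(pages)      # step 4: cost-optimized AI input
    friendly = build_friendly_report(intermediate)   # step 5: AI summary
    return {"project": project, "technical": technical, "friendly": friendly}

result = run_pipeline(
    "demo", "https://example.com",
    scrape=lambda url: [{"url": url, "title": "Home"}],
    build_technical_report=lambda pages: f"<h1>{len(pages)} pages</h1>",
    build_ai_intermediate=lambda pages: [p["title"] for p in pages],
    build_friendly_report=lambda data: "Summary of: " + ", ".join(data),
)
print(result["friendly"])  # Summary of: Home
```

The intermediate file in step 4 exists so that the AI provider receives a condensed payload rather than full page HTML, which keeps token costs down.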

How to Add New Features/Modules

  • Create a new Python module in the appropriate folder (ai/, scraper/, report/, plugins/).
  • Follow the interfaces and patterns already present (see docstrings in files).
  • Register the new module/plugin in the configuration file or via the plugin system.
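A common way to register modules like this is a decorator-based registry. The sketch below is one plausible shape; WebExtractor's real plugin system may use a different mechanism, and the names here are illustrative.

```python
# Sketch of a plugin registry -- names and mechanism are assumptions,
# not WebExtractor's actual plugin system.

PLUGINS = {}

def register_plugin(name):
    """Decorator that records a plugin class under a lookup name."""
    def wrapper(cls):
        PLUGINS[name] = cls
        return cls
    return wrapper

@register_plugin("word_count")
class WordCountPlugin:
    """Toy plugin: total word count across scraped pages."""
    def run(self, pages):
        return sum(len(p.get("text", "").split()) for p in pages)

plugin = PLUGINS["word_count"]()
print(plugin.run([{"text": "hello world"}, {"text": "one two three"}]))  # 5
```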

FAQ & Troubleshooting

Q: Why doesn't the app start?
A: Make sure you have Python 3.9+ and all dependencies installed.

Q: Why doesn't dynamic scraping work?
A: Ensure Chrome/Chromium and ChromeDriver are installed and up to date, and check that chromedriver is in your PATH.

Q: How can I add a new AI provider?
A: Implement a new class in ai/ following the interface and register it in the system.
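The interface itself lives in the ai/ module's docstrings; as a hedged sketch of what such a contract might look like (the base class name and method signature below are assumptions, not the real API):

```python
# Hypothetical provider interface -- the real base class in ai/ may
# differ; check the docstrings there for the actual contract.
from abc import ABC, abstractmethod

class AIProvider(ABC):
    """Minimal contract every provider must satisfy."""
    name: str

    @abstractmethod
    def generate_report(self, intermediate_text: str, language: str) -> str:
        ...

class EchoProvider(AIProvider):
    """Toy subclass used here only to show the shape of a provider."""
    name = "echo"

    def generate_report(self, intermediate_text, language):
        return f"[{language}] {intermediate_text}"

provider = EchoProvider()
print(provider.generate_report("site summary", "en"))  # [en] site summary
```

A real provider would call its vendor's API inside generate_report, using the API key from the local configuration.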

Q: Where are API keys stored?
A: In a local file in the config/ folder.


Security: Keeping API Keys and Local Config Private

Important:
Do NOT commit your API keys or local configuration files to the public repository.

  • Add the following lines to your .gitignore file to keep sensitive files out of version control:

        # Local config and API keys
        config/config.json
        .env
        output/

  • Never share your API keys publicly.
  • The app is designed to read API keys from local config only.

Support the Project

If you find WebExtractor useful and want to support the development of free and open-source projects, you can make a donation.
Click here to donate via PayPal


License

This project is licensed under the MIT License.
See the LICENSE file for details.


Note:
WebExtractor is cross-platform and works on macOS, Windows, and Linux.
It has been fully tested on macOS. For Windows and Linux, ensure Chrome/Chromium is installed and accessible in your system PATH. If you encounter issues with browser detection, consult the troubleshooting section above.
