WebExtractor is a modular Python application that analyzes the structure and content of any website (static, dynamic, WordPress, etc.), generating a detailed technical HTML report and a user-friendly AI-powered report. The app is designed to be easily extensible and features a modern, intuitive UI.
Important Notice:
WebExtractor is provided for educational, research, and legitimate business purposes only.
You are solely responsible for ensuring that your use of this software complies with all applicable laws, regulations, and the terms of service of any websites you access or analyze.
WebExtractor's authors and contributors disclaim any liability for misuse or illegal activity.
Do not use this tool to scrape or analyze websites without proper authorization or in violation of copyright or data protection laws.
- Automatic analysis of website structure (site map, hierarchies, menus, pages, articles)
- Collection of structured data from all reachable pages
- Support for static HTML sites and dynamic sites (JavaScript-loaded content)
- Multilanguage support: All generated reports and AI outputs are available in 5 languages (Italian, English, Spanish, French, German). The user can select the language at project creation. All UI, report, and AI-generated texts are localized, while scraped content remains in its original language.
- Generation of a detailed technical HTML report, downloadable in a project-named folder
- Generation of a user-friendly report via AI (OpenAI, Anthropic, Gemini, Deepseek)
- Modern, easy-to-use frontend (local web app)
- Dynamic progress bar
- Dynamic AI configuration (OpenAI, Anthropic, Gemini, Deepseek) with secure local API key storage
- One-click reset of all saved API keys from the AI Configuration page, with confirmation and feedback
- Modular architecture, ready for future extensions (plugins)
- Local configuration file
- Cross-platform: works on macOS, Windows, and Linux (see installation notes below)
- Fully tested on macOS
WebExtractor now supports five languages for all generated content:
- Italiano (it)
- English (en)
- Español (es)
- Français (fr)
- Deutsch (de)
You can select your preferred language when creating a new project. All technical and user-friendly reports, as well as AI-generated summaries, will be presented in the selected language. The localization system is easily extendable: to add a new language, simply update the `config/translations.py` file.
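As a rough illustration, the dictionaries in `config/translations.py` might look like the sketch below. The actual keys and structure used by WebExtractor may differ; this only shows how a new language block could sit next to the existing ones.

```python
# Hypothetical sketch of config/translations.py; the real keys and structure
# used by WebExtractor may differ. It only illustrates adding a new language.
TRANSLATIONS = {
    "en": {"report_title": "Technical Report", "summary": "Site Overview"},
    "it": {"report_title": "Report Tecnico", "summary": "Panoramica del sito"},
    # New language: copy an existing block and translate the values.
    "pt": {"report_title": "Relatório Técnico", "summary": "Visão geral do site"},
}

def translate(key: str, lang: str = "en") -> str:
    """Look up a UI/report string, falling back to English if missing."""
    return TRANSLATIONS.get(lang, {}).get(key, TRANSLATIONS["en"].get(key, key))
```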
webextractor/
│
├── ai/ # AI management module (dynamic providers, plugins)
├── config/ # Configuration management and local storage
│ └── translations.py # Multilanguage dictionaries
├── frontend/ # HTML templates, static, JS, CSS (modern UI)
│ ├── static/
│ └── templates/
├── plugins/ # Space for future extensions/plugins
├── report/ # HTML report generation and AI intermediate files
├── scraper/ # Scraping and site structure analysis (modular, static/dynamic)
│
├── README.md # This guide
├── requirements.txt # Python dependencies
- Python 3.9+ (Python 3.10+ recommended)
- pip (Python package manager)
- Google Chrome or Chromium (for dynamic site scraping with Selenium)
- ChromeDriver (for Selenium, managed automatically)
- macOS, Windows, or Linux (see notes below)
Open a terminal in the project folder and run:
pip install -r requirements.txt
- Google Chrome (https://www.google.com/chrome/) or Chromium must be installed on your system.
- On Windows: Download and install Chrome from the official site.
- On Linux: Install with your package manager, e.g. `sudo apt install chromium-browser` or `sudo apt install google-chrome-stable`.
- On macOS: Download from the official site or use Homebrew.
- ChromeDriver is managed automatically by webdriver-manager; no manual installation needed.
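For reference, this is how Selenium is typically paired with webdriver-manager so that the matching ChromeDriver is downloaded on demand; WebExtractor's scraper module may configure this differently, so treat it as a sketch of the mechanism rather than the app's actual code.

```python
# Illustrative only: webdriver-manager fetches (and caches) the ChromeDriver
# build that matches your installed Chrome, so no manual driver setup is needed.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("--headless=new")  # render pages without opening a window

service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)
try:
    driver.get("https://example.com")
    print(driver.title)
finally:
    driver.quit()
```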
python main.py
- If screenshot generation fails or does not start, ensure Google Chrome or Chromium is installed and up to date.
- webdriver-manager automatically downloads the correct ChromeDriver version.
- If you use a non-standard OS or a very recent Chrome version, update both Chrome and the Python packages (`pip install -U selenium webdriver-manager`).
- For "cannot find Chrome binary" errors:
  - Windows: Make sure Chrome is installed in the default location or is in your PATH.
  - Linux: Ensure `google-chrome` or `chromium-browser` is in your PATH.
  - macOS: Ensure Chrome is in `/Applications` or in your PATH.
- If you encounter issues, check the console output for error messages about Chrome/Chromedriver.
The app will be accessible in your browser at http://localhost:5000.
- Static (default): Faster, suitable for classic HTML sites.
- Dynamic (JS): Enable the “Enable dynamic scraping (JS)” checkbox on the home page to capture content loaded via JavaScript (SPA, dynamic menus, etc.). Requires Chrome/Chromium and ChromeDriver installed; see the sketch after this list.
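To make the difference concrete, here is a minimal sketch of the two approaches. It is not WebExtractor's actual scraper code: static mode reads the HTML exactly as the server delivers it, while dynamic mode lets headless Chrome execute the page's JavaScript before parsing.

```python
# Minimal sketch of the two scraping modes; WebExtractor's scraper/ module
# is more elaborate (site maps, menus, hierarchies, etc.).
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def fetch_static(url: str) -> BeautifulSoup:
    """Static mode: fetch the raw HTML as served (fast, no JavaScript executed)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

def fetch_dynamic(url: str) -> BeautifulSoup:
    """Dynamic mode: let headless Chrome run the page's JavaScript before parsing."""
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)  # recent Selenium resolves the driver automatically
    try:
        driver.get(url)
        return BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()
```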
- From the frontend, you can select the AI provider (OpenAI, Anthropic, Gemini, Deepseek) and enter your API key.
- The key is stored securely and locally for future use.
- You can change provider or key at any time from the configuration section.
- The user-friendly report is generated via AI from an optimized intermediate file.
- Reset All API Keys: You can reset (delete) all saved API keys at any time by clicking the "Reset All API Keys" button at the bottom of the AI Configuration page. This action requires confirmation and will remove all stored API credentials from your local configuration.
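Since the security section below lists `config/config.json` among the files to keep out of version control, a plausible, purely illustrative layout for this local key store is sketched here. The real format may differ; the point is that keys live in a local JSON file and "Reset All API Keys" simply empties it.

```python
# Hypothetical layout of the local key store; the real format used by
# WebExtractor may differ. Keys stay on your machine only.
import json
from pathlib import Path

CONFIG_PATH = Path("config") / "config.json"

def save_api_key(provider: str, api_key: str) -> None:
    """Store or update the key for one provider (openai, anthropic, gemini, deepseek)."""
    CONFIG_PATH.parent.mkdir(exist_ok=True)
    config = json.loads(CONFIG_PATH.read_text()) if CONFIG_PATH.exists() else {}
    config.setdefault("api_keys", {})[provider] = api_key
    CONFIG_PATH.write_text(json.dumps(config, indent=2))

def reset_api_keys() -> None:
    """Conceptually what 'Reset All API Keys' does: remove every stored credential."""
    if CONFIG_PATH.exists():
        config = json.loads(CONFIG_PATH.read_text())
        config["api_keys"] = {}
        CONFIG_PATH.write_text(json.dumps(config, indent=2))
```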
- Project setup: Enter project name, URL, choose AI, language, and scraping mode.
- Scraping: Collect data from all public pages (static or dynamic).
- Technical report generation: Detailed HTML, faithful to the original.
- AI intermediate file generation: Optimized data to reduce AI costs.
- User-friendly report generation: Via AI, designed for clients or non-technical users.
- Create a new Python module in the appropriate folder (`ai/`, `scraper/`, `report/`, `plugins/`).
- Follow the interfaces and patterns already present (see the docstrings in the files).
- Register the new module/plugin in the configuration file or via the plugin system (see the sketch below).
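As an example of the kind of module you might add, here is a hypothetical AI provider for the `ai/` folder. The actual base class, method names, and registration mechanism are defined by the existing code, so check its docstrings rather than relying on the names used here.

```python
# Hypothetical new provider for ai/; the real interface and registration
# mechanism are defined by WebExtractor's existing modules.
from abc import ABC, abstractmethod

class AIProvider(ABC):
    """Assumed minimal contract every provider implements (illustrative only)."""

    name: str = "base"

    @abstractmethod
    def generate_report(self, intermediate_text: str, language: str) -> str:
        """Turn the optimized intermediate file into a user-friendly report."""

class MyProvider(AIProvider):
    name = "myprovider"

    def __init__(self, api_key: str):
        self.api_key = api_key

    def generate_report(self, intermediate_text: str, language: str) -> str:
        # Call your provider's HTTP API here; this stub just returns a placeholder.
        return f"[{language}] summary built from {len(intermediate_text)} characters of input"
```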
Q: The app does not start?
A: Make sure you have Python 3.9+ and all dependencies installed.
Q: Dynamic scraping does not work?
A: Ensure Chrome/Chromium and ChromeDriver are installed and up to date.
Check that `chromedriver` is in your PATH.
Q: How can I add a new AI provider?
A: Implement a new class in `ai/` following the existing interface and register it in the system.
Q: Where are API keys stored?
A: In a local file in the `config/` folder.
Important:
Do NOT commit your API keys or local configuration files to the public repository.
- Add the following lines to your `.gitignore` file to keep sensitive files out of version control:
# Local config and API keys
config/config.json
.env
output/
- Never share your API keys publicly.
- The app is designed to read API keys from local config only.
If you find WebExtractor useful and want to support the development of free and open-source projects, you can make a donation.
Click here to donate via PayPal
This project is licensed under the MIT License.
See the LICENSE file for details.
Note:
WebExtractor is cross-platform and works on macOS, Windows, and Linux.
It has been fully tested on macOS. For Windows and Linux, ensure Chrome/Chromium is installed and accessible in your system PATH. If you encounter issues with browser detection, consult the troubleshooting section above.