alfredo-villa/webextractor

WebExtractor

WebExtractor is a modular Python application that analyzes the structure and content of any website (static, dynamic, WordPress, etc.), generating a detailed technical HTML report and a user-friendly AI-powered report. The app is designed to be easily extensible and features a modern, intuitive UI.


Disclaimer

Important Notice:
WebExtractor is provided for educational, research, and legitimate business purposes only.
You are solely responsible for ensuring that your use of this software complies with all applicable laws, regulations, and the terms of service of any websites you access or analyze.
WebExtractor's authors and contributors disclaim any liability for misuse or illegal activity.
Do not use this tool to scrape or analyze websites without proper authorization or in violation of copyright or data protection laws.


Main Features

  • Automatic analysis of website structure (site map, hierarchies, menus, pages, articles)
  • Collection of structured data from all reachable pages
  • Support for static HTML sites and dynamic sites (JavaScript-loaded content)
  • Multilanguage support: All generated reports and AI outputs are available in 5 languages (Italian, English, Spanish, French, German). The user can select the language at project creation. All UI, report, and AI-generated texts are localized, while scraped content remains in its original language.
  • Generation of a detailed technical HTML report, downloadable in a project-named folder
  • Generation of a user-friendly report via AI (OpenAI, Anthropic, Gemini, Deepseek)
  • Modern, easy-to-use frontend (local web app)
  • Dynamic progress bar
  • Dynamic AI configuration (OpenAI, Anthropic, Gemini, Deepseek) with secure local API key storage
  • One-click reset of all saved API keys from the AI Configuration page, with confirmation and feedback
  • Modular architecture, ready for future extensions (plugins)
  • Local configuration file
  • Cross-platform: works on macOS, Windows, and Linux (see installation notes below)
  • Fully tested on macOS

Multilanguage Support

WebExtractor now supports five languages for all generated content:

  • Italiano (it)
  • English (en)
  • Español (es)
  • Français (fr)
  • Deutsch (de)

You can select your preferred language when creating a new project. All technical and user-friendly reports, as well as AI-generated summaries, will be presented in the selected language. The localization system is easily extendable: to add a new language, simply update the config/translations.py file.
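To illustrate the extension point, here is a minimal sketch of what a localization dictionary and lookup helper might look like. The names `TRANSLATIONS` and `translate` are illustrative assumptions; the actual contents of config/translations.py may differ.

```python
# Hypothetical sketch of config/translations.py -- the real module may
# differ; TRANSLATIONS and translate() are illustrative names only.

TRANSLATIONS = {
    "report_title": {
        "it": "Rapporto tecnico",
        "en": "Technical report",
        "es": "Informe técnico",
        "fr": "Rapport technique",
        "de": "Technischer Bericht",
    },
}

def translate(key: str, lang: str = "en") -> str:
    """Look up a UI string, falling back to English, then the key itself."""
    entry = TRANSLATIONS.get(key, {})
    return entry.get(lang) or entry.get("en", key)
```

Adding a sixth language would then amount to adding one more code/value pair to each entry.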


Project Structure

webextractor/
│
├── ai/                # AI management module (dynamic providers, plugins)
├── config/            # Configuration management and local storage
│   └── translations.py # Multilanguage dictionaries
├── frontend/          # HTML templates, static, JS, CSS (modern UI)
│   ├── static/
│   └── templates/
├── plugins/           # Space for future extensions/plugins
├── report/            # HTML report generation and AI intermediate files
├── scraper/           # Scraping and site structure analysis (modular, static/dynamic)
│
├── README.md          # This guide
└── requirements.txt   # Python dependencies

Installation

1. Prerequisites

  • Python 3.9+ (Python 3.10+ recommended)
  • pip (Python package manager)
  • Google Chrome or Chromium (for dynamic site scraping with Selenium)
  • ChromeDriver (for Selenium, managed automatically)
  • macOS, Windows, or Linux (see notes below)

2. Install Python dependencies

Open a terminal in the project folder and run:

pip install -r requirements.txt

3. Requirements for dynamic scraping

  • Google Chrome (https://www.google.com/chrome/) or Chromium must be installed on your system.
    • On Windows: Download and install Chrome from the official site.
    • On Linux: Install with your package manager, e.g. sudo apt install chromium-browser or sudo apt install google-chrome-stable.
    • On macOS: Download from the official site or use Homebrew.
  • ChromeDriver is managed automatically by webdriver-manager; no manual installation needed.

4. Start the app

python main.py

The app will be accessible in your browser at http://localhost:5000.

Screenshot Troubleshooting

  • If screenshot generation fails or does not start, ensure Google Chrome or Chromium is installed and up to date.
  • webdriver-manager automatically downloads the correct ChromeDriver version.
  • If you use a non-standard OS or a very recent Chrome version, update both Chrome and Python packages (pip install -U selenium webdriver-manager).
  • For "cannot find Chrome binary" errors:
    • Windows: Make sure Chrome is installed in the default location or is in your PATH.
    • Linux: Ensure google-chrome or chromium-browser is in your PATH.
    • macOS: Ensure Chrome is in /Applications or in your PATH.
  • If you encounter issues, check the console output for error messages about Chrome/ChromeDriver.


Scraping Modes

  • Static (default): Faster, suitable for classic HTML sites.
  • Dynamic (JS): Enable the “Enable dynamic scraping (JS)” checkbox on the home page to capture content loaded via JavaScript (SPA, dynamic menus, etc.).
    Requires Chrome/Chromium and ChromeDriver installed.
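In static mode, the scraper works on raw HTML without executing JavaScript. As a standard-library-only sketch of that idea (the real scraper/ module likely uses richer tooling), link extraction from a fetched page can be done with `html.parser`:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href targets from <a> tags in static HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = '<nav><a href="/about">About</a><a href="/blog">Blog</a></nav>'
parser = LinkCollector()
parser.feed(html)
print(parser.links)  # ['/about', '/blog']
```

Content injected by JavaScript never appears in the raw HTML, which is why SPA-style sites need the dynamic (Selenium-based) mode instead.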

AI Configuration

  • From the frontend, you can select the AI provider (OpenAI, Anthropic, Gemini, Deepseek) and enter your API key.
  • The key is stored securely and locally for future use.
  • You can change provider or key at any time from the configuration section.
  • The user-friendly report is generated via AI from an optimized intermediate file.
  • Reset All API Keys: You can reset (delete) all saved API keys at any time by clicking the "Reset All API Keys" button at the bottom of the AI Configuration page. This action requires confirmation and will remove all stored API credentials from your local configuration.

Analysis Pipeline

  1. Project setup: Enter project name, URL, choose AI, language, and scraping mode.
  2. Scraping: Collect data from all public pages (static or dynamic).
  3. Technical report generation: Detailed HTML, faithful to the original.
  4. AI intermediate file generation: Optimized data to reduce AI costs.
  5. User-friendly report generation: Via AI, designed for clients or non-technical users.
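The five steps above can be sketched as a simple driver that threads data from one stage to the next. Every function name here is an assumption for illustration, not WebExtractor's actual internal API; stub callables stand in for real scraping and AI calls to show the data flow only.

```python
# Illustrative pipeline driver -- names are assumptions, not the app's real API.

def run_pipeline(project, url, scrape, build_technical_report,
                 build_ai_intermediate, build_friendly_report):
    pages = scrape(url)                              # step 2: collect pages
    technical = build_technical_report(pages)        # step 3: technical HTML
    intermediate = build_ai_intermediate(pages)      # step 4: cost-optimized AI input
    friendly = build_friendly_report(intermediate)   # step 5: AI summary
    return {"project": project, "technical": technical, "friendly": friendly}

result = run_pipeline(
    "demo", "https://example.com",
    scrape=lambda url: [{"url": url, "title": "Home"}],
    build_technical_report=lambda pages: f"<h1>{len(pages)} pages</h1>",
    build_ai_intermediate=lambda pages: [p["title"] for p in pages],
    build_friendly_report=lambda data: "Summary of: " + ", ".join(data),
)
print(result["friendly"])  # Summary of: Home
```

The intermediate file in step 4 exists so that the AI provider receives a condensed payload rather than full page HTML, which keeps token costs down.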

How to Add New Features/Modules

  • Create a new Python module in the appropriate folder (ai/, scraper/, report/, plugins/).
  • Follow the interfaces and patterns already present (see docstrings in files).
  • Register the new module/plugin in the configuration file or via the plugin system.
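A common way to register modules like this is a decorator-based registry. The sketch below is one plausible shape; WebExtractor's real plugin system may use a different mechanism, and the names here are illustrative.

```python
# Sketch of a plugin registry -- names and mechanism are assumptions,
# not WebExtractor's actual plugin system.

PLUGINS = {}

def register_plugin(name):
    """Decorator that records a plugin class under a lookup name."""
    def wrapper(cls):
        PLUGINS[name] = cls
        return cls
    return wrapper

@register_plugin("word_count")
class WordCountPlugin:
    """Toy plugin: total word count across scraped pages."""
    def run(self, pages):
        return sum(len(p.get("text", "").split()) for p in pages)

plugin = PLUGINS["word_count"]()
print(plugin.run([{"text": "hello world"}, {"text": "one two three"}]))  # 5
```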

FAQ & Troubleshooting

Q: Why doesn't the app start?
A: Make sure you have Python 3.9+ and all dependencies installed.

Q: Why doesn't dynamic scraping work?
A: Ensure Chrome/Chromium and ChromeDriver are installed and up to date, and check that chromedriver is in your PATH.

Q: How can I add a new AI provider?
A: Implement a new class in ai/ following the interface and register it in the system.
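The interface itself lives in the ai/ module's docstrings; as a hedged sketch of what such a contract might look like (the base class name and method signature below are assumptions, not the real API):

```python
# Hypothetical provider interface -- the real base class in ai/ may
# differ; check the docstrings there for the actual contract.
from abc import ABC, abstractmethod

class AIProvider(ABC):
    """Minimal contract every provider must satisfy."""
    name: str

    @abstractmethod
    def generate_report(self, intermediate_text: str, language: str) -> str:
        ...

class EchoProvider(AIProvider):
    """Toy subclass used here only to show the shape of a provider."""
    name = "echo"

    def generate_report(self, intermediate_text, language):
        return f"[{language}] {intermediate_text}"

provider = EchoProvider()
print(provider.generate_report("site summary", "en"))  # [en] site summary
```

A real provider would call its vendor's API inside generate_report, using the API key from the local configuration.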

Q: Where are API keys stored?
A: In a local file in the config/ folder.


Security: Keeping API Keys and Local Config Private

Important:
Do NOT commit your API keys or local configuration files to the public repository.

  • Add the following lines to your .gitignore file to keep sensitive files out of version control:

        # Local config and API keys
        config/config.json
        .env
        output/

  • Never share your API keys publicly.
  • The app is designed to read API keys from local config only.

Support the Project

If you find WebExtractor useful and want to support the development of free and open-source projects, you can make a donation.
Click here to donate via PayPal


License

This project is licensed under the MIT License.
See the LICENSE file for details.


Note:
WebExtractor is cross-platform and works on macOS, Windows, and Linux.
It has been fully tested on macOS. For Windows and Linux, ensure Chrome/Chromium is installed and accessible in your system PATH. If you encounter issues with browser detection, consult the troubleshooting section above.
