AI Web Scraper

👉 View Demo Here 🎈

Overview

This project is an AI-powered web scraper that utilizes OpenAI's API, BrightData's Scraping Browser, Selenium, and other libraries to extract and process website data. It is designed to bypass CAPTCHA challenges and interact with web pages dynamically. The extracted content can be processed with a large language model (LLM) for structured data extraction.

Features

Automated Web Scraping: Uses Selenium and BrightData's Scraping Browser to access and extract content.
CAPTCHA Handling: Overcomes website CAPTCHA challenges using BrightData's Scraping Browser.
Content Cleaning: Removes unnecessary elements such as scripts and styles from the extracted HTML.
AI-Powered Parsing: Uses OpenAI's API to process and analyze extracted content.
Streamlit UI: Provides a simple user interface for inputting website URLs and processing scraped data.
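
The content-cleaning step can be illustrated with a short sketch. The project lists BeautifulSoup4 for this job; the version below swaps in Python's standard-library `html.parser` so that it runs with no extra installs. The names `_TextExtractor` and `clean_html` are hypothetical and are not taken from `scrape.py`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text while skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def clean_html(html: str) -> str:
    """Return the text of `html`, one non-empty stripped line per row."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for chunk in parser.chunks for line in chunk.splitlines())
    return "\n".join(line for line in lines if line)
```

Calling `clean_html(page_source)` on raw HTML returns plain text that is smaller and easier to hand to a language model than the original markup.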

Installation

Prerequisites

Ensure you have Python installed (version 3.7+ recommended).

Steps

Clone this repository:

git clone https://github.com/naoufalcb/AI-Web-Scraper
cd AI-Web-Scraper

Install dependencies:
```
pip install -r requirements.txt
```
Set up environment variables:
- Adjust the .env file in the root directory.
- Add your BrightData Scraping Browser WebDriver URL:
```
SBR_WEBDRIVER=your_scraping_browser_webdriver_url
```
- Add your OpenAI API key:
```
GITHUB_TOKEN=your_openai_api_key
```
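
Before launching the app, it can help to check that both variables are actually set. The sketch below is one possible check, not code from the repository: the helper name `missing_settings` is hypothetical, and it only inspects a mapping such as `os.environ` (the project lists python-dotenv for loading the `.env` file itself).

```python
import os
from typing import List, Mapping

# Variable names as they appear in the .env instructions above.
REQUIRED_SETTINGS = ("SBR_WEBDRIVER", "GITHUB_TOKEN")


def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or blank in `env`."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.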

Demo

1. Run the Streamlit app:

streamlit run main.py

2. Choose a website to scrape.

3. Enter the website URL in the input field and click "Scrape".

4. Review the extracted content.

5. Provide instructions on what data to extract and click "Parse" to process it with AI.

6. The results are displayed in the Streamlit app.
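
The "Parse" step sends the scraped text to the model together with the user's instructions. How `llm.py` builds that request is not shown in this README, so the following is only a sketch of one way to combine the two into a single prompt. The function name `build_prompt` and the prompt wording are assumptions.

```python
def build_prompt(page_text: str, instruction: str) -> str:
    """Combine scraped page text and a user instruction into one prompt string."""
    if not instruction.strip():
        raise ValueError("instruction must not be empty")
    return (
        "Extract only the information that matches the instruction below "
        "from the page content. If nothing matches, return an empty string.\n\n"
        f"Instruction: {instruction.strip()}\n\n"
        f"Page content:\n{page_text.strip()}"
    )
```

Keeping the instruction and the page content in clearly labeled sections makes it easier to see what the model was asked when a result looks wrong.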

Project Structure

├── scrape.py         # Web scraping functions
├── main.py           # Streamlit UI
├── llm.py            # AI-powered parsing functions
├── requirements.txt  # Required dependencies
└── .env              # Environment variables

Dependencies

Streamlit: UI for interacting with the scraper.
Selenium: Automates browser interactions.
BeautifulSoup4: Parses and cleans HTML content.
BrightData Scraping Browser: Enables CAPTCHA bypass and advanced scraping.
OpenAI API (via LangChain): Processes extracted content.
Python-dotenv: Manages environment variables.

Notes

Ensure that your BrightData Scraping Browser and OpenAI API credentials are correctly set up in the .env file.

License

This project is licensed under the MIT License.

Author

Naoufal CHABAA

📧 Email: nchabaa3@gmail.com
