Web_Scraper 📎

Welcome to Web_Scraper 🌐, a tool designed to scrape webpages in a neat fashion. It is crafted with Python.

This guide equips you with everything you need to use Web_Scraper effectively.


🛠️ Installation and Setup 🛠️

Prerequisites

Ensure your system meets these requirements:

  • Python 3.8 or higher installed.
  • All required dependencies installed (see Step 3 below).

Step-by-Step Installation

  1. Clone the Repository: Use Git to clone Web_Scraper to your local machine. Open a terminal (Command Prompt on Windows) and run:

    git clone https://github.com/DefinetlyNotAI/Web_Scraper.git
  2. Navigate to the Project Directory: Change your current directory to the cloned Web_Scraper folder:

    cd Web_Scraper
  3. Install Dependencies: Install the required packages:

    pip install -r requirements.txt

  4. Run the Web Scraper: Run python scrape.py with the options described under Basic Usage below.

Basic Usage

The utility is executed from the command line. Here's a basic example of how to use it:

python scrape.py --url "https://example.com" --name "ExampleSite" --zip --full -y

For beta-test functionality, you may use secrets.scrape.py instead.

Options

  • --url: Required. The URL of the website you wish to scrape.
  • --name: Optional. A custom name for the scraped website. If not provided, the domain name will be used.
  • --zip: Optional. If set, the utility will compress the downloaded files into a ZIP archive.
  • --full: Optional. If set, the utility will download the full HTML content along with associated resources. Otherwise, it downloads only the basic HTML content.
  • -y: Optional. Automatically proceeds with the download without asking for confirmation.
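
Since only --url is required, the simplest invocation omits everything else. The output name then defaults to the domain (example.com here), and you are asked to confirm before the download starts:

    python scrape.py --url "https://example.com"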

Functions Overview

download_basic_html(url)

Downloads the basic HTML content from a given URL and saves it to a file.
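
The implementation is not shown here, but a minimal sketch matching this description might look as follows; the domain-based output filename is an assumption, not documented behavior:

    import requests
    from urllib.parse import urlparse

    def download_basic_html(url):
        # Fetch the raw HTML; raise on HTTP errors.
        response = requests.get(url)
        response.raise_for_status()
        # Assumed naming scheme: file named after the domain, e.g. "example.com.html".
        filename = urlparse(url).netloc + ".html"
        with open(filename, "w", encoding="utf-8") as f:
            f.write(response.text)
        return filename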

download_with_resources(url)

Downloads the HTML content and associated resources from a given URL, saves them to a file, and returns the filename.
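
A rough sketch of this behavior, assuming resources are written to a resources/ folder (the folder name and the _full.html suffix are illustrative, not documented):

    import os
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin, urlparse

    def download_with_resources(url):
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        # Collect stylesheet, script, and image references from the page.
        resources = []
        for tag, attr in (("link", "href"), ("script", "src"), ("img", "src")):
            for node in soup.find_all(tag):
                if node.get(attr):
                    resources.append(urljoin(url, node[attr]))

        # Save each resource next to the HTML file (assumed layout).
        os.makedirs("resources", exist_ok=True)
        for res_url in resources:
            res_name = os.path.basename(urlparse(res_url).path) or "index"
            try:
                res = requests.get(res_url)
                res.raise_for_status()
                with open(os.path.join("resources", res_name), "wb") as f:
                    f.write(res.content)
            except requests.RequestException:
                continue  # skip resources that fail to download

        filename = urlparse(url).netloc + "_full.html"
        with open(filename, "w", encoding="utf-8") as f:
            f.write(response.text)
        return filename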

download_images(base_url, url)

Downloads images from a given URL after processing them to get the absolute image URLs.
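
A possible sketch, using BeautifulSoup to resolve relative src attributes against base_url and tqdm for the progress bar the Dependencies section mentions; the images/ output folder is an assumption:

    import os
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin
    from tqdm import tqdm

    def download_images(base_url, url):
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        # Resolve each <img> src against base_url to get absolute URLs.
        image_urls = [urljoin(base_url, img["src"])
                      for img in soup.find_all("img") if img.get("src")]
        os.makedirs("images", exist_ok=True)
        for img_url in tqdm(image_urls, desc="Downloading images"):
            name = os.path.basename(img_url.split("?")[0]) or "image"
            try:
                resp = requests.get(img_url)
                resp.raise_for_status()
                with open(os.path.join("images", name), "wb") as f:
                    f.write(resp.content)
            except requests.RequestException:
                continue  # skip images that fail to download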

zip_files(zip_filename, files, delete_after=False)

Zips the files given in the 'files' list into a zip file named 'zip_filename'. Optionally deletes the files after zipping.
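
This maps almost directly onto the standard-library zipfile module; a minimal sketch:

    import os
    import zipfile

    def zip_files(zip_filename, files, delete_after=False):
        # Write each file into the archive under its base name.
        with zipfile.ZipFile(zip_filename, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in files:
                zf.write(path, arcname=os.path.basename(path))
        # Optionally remove the originals once they are safely zipped.
        if delete_after:
            for path in files:
                os.remove(path)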

parse()

Main function that serves as the entry point for the web scraping application. Parses command-line arguments to scrape a given URL, download content based on the arguments provided, and optionally zip the downloaded files.
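
A sketch of how parse() might wire the documented options together, reusing the functions above; the confirmation prompt wording and deleting files after zipping are assumptions:

    import argparse
    from urllib.parse import urlparse

    def parse():
        parser = argparse.ArgumentParser(description="Scrape a webpage.")
        parser.add_argument("--url", required=True, help="URL to scrape")
        parser.add_argument("--name", help="custom name for the scraped site")
        parser.add_argument("--zip", action="store_true", help="zip the output")
        parser.add_argument("--full", action="store_true",
                            help="download HTML plus associated resources")
        parser.add_argument("-y", dest="yes", action="store_true",
                            help="skip the confirmation prompt")
        args = parser.parse_args()

        # Fall back to the domain name when --name is not provided.
        name = args.name or urlparse(args.url).netloc
        if not args.yes and input(f"Scrape {args.url}? [y/N] ").lower() != "y":
            return
        filename = (download_with_resources(args.url) if args.full
                    else download_basic_html(args.url))
        if args.zip:
            zip_files(name + ".zip", [filename], delete_after=True)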

Dependencies

  • argparse: For parsing command-line options and arguments.
  • os, shutil: For file and directory operations.
  • requests: For making HTTP requests.
  • BeautifulSoup: For parsing HTML content.
  • zipfile: For creating ZIP archives.
  • tqdm: For displaying progress bars during downloads.
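
Note that argparse, os, shutil, and zipfile ship with the Python standard library; only the last three are third-party. If requirements.txt is unavailable, they can presumably be installed directly (BeautifulSoup is distributed as beautifulsoup4):

    pip install requests beautifulsoup4 tqdm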

Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue on GitHub.

Read the CONTRIBUTING file for more information.

License

This project is licensed under the MIT License - see the LICENSE file for details.