Welcome to Web_Scraper 🌐, a Python tool designed to scrape webpages in a clean, organized fashion.
This guide equips you with everything you need to install and use Web_Scraper effectively.
Ensure your system meets these requirements:
- Python 3.8 or higher is installed.
- All required dependencies are installed (covered in the installation steps below).
- Clone the Repository: Use Git to clone Web_Scraper to your local machine. Open a terminal (Command Prompt on Windows) and run:

  ```
  git clone https://github.com/DefinetlyNotAI/Web_Scraper.git
  ```
- Navigate to the Project Directory: Change your current directory to the cloned Web_Scraper folder:

  ```
  cd Web_Scraper
  ```
- Install Dependencies: Run:

  ```
  pip install -r requirements.txt
  ```
- Run the Web Scraper: Run the script with Python, supplying the options described in the Usage section below:

  ```
  python scrape.py --url "https://example.com"
  ```
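Because the utility parses its options with argparse (see the dependency list below), it should also respond to the auto-generated help flag; this is an assumption based on the listed dependencies rather than documented behavior:

```
python scrape.py --help
```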
The utility is executed from the command line. Here's a basic example of how to use it:

```
python scrape.py --url "https://example.com" --name "ExampleSite" --zip --full -y
```
You may use `secrets.scrape.py` for beta-testing functionality.
- `--url`: Required. The URL of the website you wish to scrape.
- `--name`: Optional. A custom name for the scraped website. If not provided, the domain name will be used.
- `--zip`: Optional. If set, the utility will compress the downloaded files into a ZIP archive.
- `--full`: Optional. If set, the utility will download the full HTML content along with associated resources; otherwise, it downloads only the basic HTML content.
- `-y`: Optional. Automatically proceeds with the download without asking for confirmation.
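For orientation, here is a minimal sketch of how such a CLI could be wired up with argparse; it mirrors the documented flags, but the actual parser in scrape.py may differ:

```python
import argparse


def parse_args():
    # Sketch of the CLI described above; the real scrape.py may define
    # these options differently.
    parser = argparse.ArgumentParser(description="Scrape a webpage and save its content.")
    parser.add_argument("--url", required=True, help="URL of the website to scrape")
    parser.add_argument("--name", help="Custom name for the scraped site (defaults to the domain name)")
    parser.add_argument("--zip", action="store_true", help="Compress downloaded files into a ZIP archive")
    parser.add_argument("--full", action="store_true", help="Download full HTML plus associated resources")
    parser.add_argument("-y", dest="yes", action="store_true", help="Proceed without asking for confirmation")
    return parser.parse_args()
```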
The script is organized around a handful of core functions:

- Downloading the basic HTML content from a given URL and saving it to a file.
- Downloading the HTML content and associated resources from a given URL, saving them to a file, and returning the filename.
- Downloading images from a given URL after resolving them to absolute image URLs.
- Zipping the files given in the 'files' list into a ZIP file named 'zip_filename', optionally deleting the files after zipping.
- A main function that serves as the entry point for the web scraping application: it parses command-line arguments to scrape a given URL, downloads content based on the arguments provided, and optionally zips the downloaded files.
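As an illustration of the basic-download and zipping responsibilities, here is a hedged sketch; the function names, signatures, and details are assumptions, not the project's actual code:

```python
import os
import zipfile

import requests


def download_basic_html(url, filename):
    # Hypothetical sketch: fetch the page and save its raw HTML to a file.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(filename, "w", encoding=response.encoding or "utf-8") as f:
        f.write(response.text)
    return filename


def zip_files(files, zip_filename, delete_after=False):
    # Hypothetical sketch: bundle the listed files into a ZIP archive,
    # optionally removing the originals afterwards.
    with zipfile.ZipFile(zip_filename, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            zf.write(path, arcname=os.path.basename(path))
    if delete_after:
        for path in files:
            os.remove(path)
```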
The utility relies on the following libraries:

- `argparse`: For parsing command-line options and arguments.
- `os`, `shutil`: For file and directory operations.
- `requests`: For making HTTP requests.
- `BeautifulSoup`: For parsing HTML content.
- `zipfile`: For creating ZIP archives.
- `tqdm`: For displaying progress bars during downloads.
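To show how these libraries fit together, here is a hedged sketch of the image-download step described earlier; the function name and overall layout are illustrative assumptions:

```python
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm


def download_images(url, out_dir="images"):
    # Illustrative sketch: find <img> tags, resolve each src to an
    # absolute URL, and download the images with a tqdm progress bar.
    os.makedirs(out_dir, exist_ok=True)
    page = requests.get(url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    img_urls = [urljoin(url, img["src"]) for img in soup.find_all("img") if img.get("src")]
    saved = []
    for img_url in tqdm(img_urls, desc="Downloading images"):
        name = os.path.basename(urlparse(img_url).path) or "image"
        response = requests.get(img_url, timeout=30)
        if response.ok:
            path = os.path.join(out_dir, name)
            with open(path, "wb") as f:
                f.write(response.content)
            saved.append(path)
    return saved
```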
Contributions are welcome! Please feel free to submit a pull request or open an issue on GitHub.
Read the CONTRIBUTING file for more information.
This project is licensed under the MIT License - see the LICENSE file for details.