Minimal-Web-Scraper

A minimal library for scraping web pages. Users must implement their own parsers, using libraries such as Beautiful Soup.

Here, you can find an overview of the library. For more details, read the documentation.

This project is still under development, and the author does not guarantee the stability of its API.

This is a personal project with no ambition other than learning how to develop a Python library.

Installation

pip install git+https://github.com/Gamma120/minimal-web-scraper.git

Dependencies

This library depends on:

  • Python 3.10 or newer
  • requests

Overview

The target is a website designed for practicing scraping: https://books.toscrape.com/.

Here is a small demonstration script. The full sources are in example.py and parser_example.py.

# example.py

import pandas as pd

import minimal_web_scraper as scraper
from minimal_web_scraper import parsers

# import the parser modules so their parsers can be added to the scraper's list of parsers
import parser_example

# add all imported parsers (subclasses of BaseParser)
parsers.add_parser()

# or add them manually
# parsers.add_parser(parser_example.BookParser)
# parsers.add_parser(parser_example.BooksParser)

# scrape the given URL and return a dictionary of parsed data
data = scraper.scrape("https://books.toscrape.com/")

# Pretty output formatting with pandas
books = pd.DataFrame(data=data, columns=["name", "price"])
print(books.head(5))

Output:

                                        name  price
    0                   A Light in the Attic  51.77
    1                     Tipping the Velvet  53.74
    2                             Soumission  50.10
    3                          Sharp Objects  47.82
    4  Sapiens: A Brief History of Humankind  54.23
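
Since parser_example.py is not reproduced here, the following is a minimal sketch of the kind of extraction a user-written parser might perform to produce the name and price fields shown above. It is not the library's API: BookListingParser, parse_books, and SAMPLE_HTML are hypothetical names, the markup is only assumed to resemble a listing on books.toscrape.com, and the standard-library html.parser module is used in place of Beautiful Soup to keep the sketch dependency-free. How such logic is wrapped in a BaseParser subclass is covered by the documentation, not by this sketch.

```python
from html.parser import HTMLParser

# Sample markup, assumed to resemble a book listing on books.toscrape.com.
SAMPLE_HTML = """
<article class="product_pod">
  <h3><a href="a-light-in-the-attic/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a href="tipping-the-velvet/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
  <p class="price_color">£53.74</p>
</article>
"""


class BookListingParser(HTMLParser):
    """Hypothetical parser: collects one {"name", "price"} dict per book."""

    def __init__(self):
        super().__init__()
        self.books = []
        self._in_h3 = False
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3":
            self._in_h3 = True
        elif tag == "a" and self._in_h3 and "title" in attrs:
            # The untruncated title is stored in the link's title attribute.
            self.books.append({"name": attrs["title"], "price": None})
        elif tag == "p" and attrs.get("class") == "price_color":
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
        elif tag == "p":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and self.books:
            # Drop the currency symbol and convert the price to a float.
            self.books[-1]["price"] = float(data.strip().lstrip("£"))


def parse_books(html):
    """Return a list of {"name", "price"} dicts found in the given HTML."""
    parser = BookListingParser()
    parser.feed(html)
    return parser.books


books = parse_books(SAMPLE_HTML)
print(books)
```

The resulting list of dictionaries uses the same name and price keys as the DataFrame columns in the example script. With Beautiful Soup, the same extraction would likely be shorter, for instance by selecting the link and price elements with CSS selectors.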

Links
