Minimal-Web-Scraper

A minimal library to scrape web pages. Users must implement their own parsers using libraries such as Beautiful Soup.

This README gives an overview of the library. For more details, read the documentation.

This project is still in development, and the author does not guarantee the stability of its API.

This is a personal project with no ambition other than learning how to develop a Python library.

Installation

pip install git+https://github.com/Gamma120/minimal-web-scraper.git

Dependencies

This library depends on:

Python 3.10 or newer
requests

Overview

The example targets https://books.toscrape.com/, a website intended for practicing scraping.

Here is a small demonstration script. The source files are example.py and parser_example.py.

# example.py

import pandas as pd

import minimal_web_scraper as scraper
from minimal_web_scraper import parsers

# import the parser modules so their parsers can be added to the scraper's list of parsers
import parser_example

# add all imported parsers (every subclass of BaseParser)
parsers.add_parser()

# or add them manually
# parsers.add_parser(parser_example.BookParser)
# parsers.add_parser(parser_example.BooksParser)

# scrape the given URL and return a dictionary of parsed data
data = scraper.scrape("https://books.toscrape.com/")

# Pretty output formatting with pandas
books = pd.DataFrame(data=data, columns=["name", "price"])
print(books.head(5))

Output:

                                        name  price
    0                   A Light in the Attic  51.77
    1                     Tipping the Velvet  53.74
    2                             Soumission  50.10
    3                          Sharp Objects  47.82
    4  Sapiens: A Brief History of Humankind  54.23
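
parser_example.py is not reproduced in this overview. To give a rough idea of the kind of extraction such a parser performs, here is a minimal, standalone sketch. It does not use BaseParser (its interface is described in the documentation, not here), and it swaps Beautiful Soup for the standard-library html.parser so that it runs without extra dependencies. The class name BookListingParser, the helper parse_books, and the sample HTML (modelled on the listing markup of books.toscrape.com) are assumptions made for illustration only.

```python
from html.parser import HTMLParser

# Hypothetical sample, modelled on the book listing of books.toscrape.com.
SAMPLE_HTML = """
<article class="product_pod">
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html"
         title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a href="catalogue/tipping-the-velvet_999/index.html"
         title="Tipping the Velvet">Tipping the Velvet</a></h3>
  <p class="price_color">£53.74</p>
</article>
"""


class BookListingParser(HTMLParser):
    """Collect one {"name", "price"} dict per book found in a listing page."""

    def __init__(self):
        super().__init__()
        self.books = []
        self._in_heading = False
        self._in_price = False
        self._name = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3":
            self._in_heading = True
        elif tag == "a" and self._in_heading and "title" in attrs:
            # The link text may be truncated; the title attribute holds the full name.
            self._name = attrs["title"]
        elif tag == "p" and "price_color" in (attrs.get("class") or "").split():
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_heading = False
        elif tag == "p":
            self._in_price = False

    def handle_data(self, data):
        # Pair the price with the most recently seen book name.
        if self._in_price and self._name is not None:
            price = float(data.strip().lstrip("£"))
            self.books.append({"name": self._name, "price": price})
            self._name = None


def parse_books(html):
    """Return the list of books parsed from an HTML string."""
    parser = BookListingParser()
    parser.feed(html)
    return parser.books


books = parse_books(SAMPLE_HTML)
print(books)
```

A list of dictionaries shaped like this can be passed to pd.DataFrame with columns=["name", "price"]. Whether scraper.scrape returns exactly this shape depends on the parsers you register.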

Links

Documentation: Readthedocs
Contact: gamma_120@simplelogin.com
