
Web Scraper Project

A simple, respectful web scraper built with Python and BeautifulSoup.

Features

  • Respectful scraping: Built-in delays between requests
  • Flexible search: Search in titles, body text, or links
  • Context extraction: Get surrounding text for matches
  • Multiple output formats: Console output and JSON file export (see the sketch after this list)
  • Error handling: Graceful handling of failed requests
  • Configurable: Easy to customize URLs, terms, and settings
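
The README does not show the JSON export itself; as a minimal sketch, since scrape_urls returns a plain dictionary (see the API reference below), results can be written out with the standard-library json module, assuming the dictionary values are JSON-serializable:

import json

from web_scraper_project import WebScraper

scraper = WebScraper(delay=1.0)
results = scraper.scrape_urls(
    urls=["https://example.com"],
    search_terms=["example"]
)

# Write the results dictionary to a JSON file
# (assumes all values in the results dict are JSON-serializable).
with open("results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)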

Installation

From PyPI (if published)

pip install web_scraper_project

Development Installation

  1. Clone the repository:

    git clone https://github.com/kwein123/web-scraper.git
    cd web-scraper
  2. Install in development mode:

    pip install -e .
  3. Or install with development dependencies:

    pip install -e ".[dev]"

Quick Start

Command Line (Easiest)

# Use default configuration
webscraper

# Custom URLs and terms
webscraper --urls https://example.com --terms python web

# See all options
webscraper --help

# Run examples
webscraper --examples

Python Code

from web_scraper_project import WebScraper

scraper = WebScraper(delay=1.0)
results = scraper.scrape_urls(
    urls=["https://example.com"],
    search_terms=["example", "test"]
)
scraper.print_results(results)

File Structure

web_scraper_project/
├── web_scraper_project/      # Main package directory
│   ├── __init__.py          # Package initialization
│   ├── scraper.py           # Main WebScraper class
│   ├── config.py            # Configuration settings
│   ├── cli.py               # Command-line interface
│   └── examples.py          # Usage examples
├── tests/                   # Test directory
│   ├── __init__.py
│   └── test_scraper.py      # Unit tests
├── setup.py                 # Minimal setup script
├── setup.cfg                # Main configuration
├── pyproject.toml           # Modern Python packaging
├── requirements.txt         # Dependencies
├── MANIFEST.in              # Package manifest
├── LICENSE                  # MIT License
└── README.md                # This file

Configuration

Edit config.py to customize:

# URLs to scrape
URLS = [
    "https://example.com",
    "https://python.org"
]

# Terms to search for
SEARCH_TERMS = [
    "python",
    "programming",
    "example"
]

# Scraper settings
SCRAPER_DELAY = 1.0  # Delay between requests, in seconds
CASE_SENSITIVE = False
SEARCH_IN = 'all'  # 'all', 'title', 'body', 'links'
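
The same settings can also be reused from your own code; a minimal sketch, assuming config is importable as a submodule of the package:

from web_scraper_project import WebScraper
from web_scraper_project import config  # assumption: config.py is exposed as a submodule

scraper = WebScraper(delay=config.SCRAPER_DELAY)
results = scraper.scrape_urls(
    urls=config.URLS,
    search_terms=config.SEARCH_TERMS,
    case_sensitive=config.CASE_SENSITIVE,
    search_in=config.SEARCH_IN,
)
scraper.print_results(results)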

Usage Examples

Basic Usage

from web_scraper_project import WebScraper

scraper = WebScraper(delay=1.0)
results = scraper.scrape_urls(
    urls=["https://example.com"],
    search_terms=["example", "test"]
)
scraper.print_results(results)

Search Only in Titles

results = scraper.scrape_urls(
    urls=urls,
    search_terms=terms,
    search_in='title'
)

Case-Sensitive Search

results = scraper.scrape_urls(
    urls=urls,
    search_terms=["Python", "PYTHON"],
    case_sensitive=True
)

Custom Configuration

scraper = WebScraper(delay=2.0)  # Longer delay for politeness
results = scraper.scrape_urls(
    urls=custom_urls,
    search_terms=custom_terms,
    case_sensitive=False,
    search_in='body'  # Search only in page body
)

API Reference

WebScraper Class

__init__(delay: float = 1.0)

Initialize the scraper with an optional delay (in seconds) between requests.

fetch_page(url: str) -> BeautifulSoup

Fetch and parse a single web page.
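
A minimal sketch of calling it on its own (an assumption: in line with the error-handling feature above, a failed fetch is treated here as returning None; the README does not pin down the actual failure behaviour):

scraper = WebScraper(delay=1.0)
soup = scraper.fetch_page("https://example.com")

# Assumption: a failed request yields None rather than raising.
if soup is not None:
    print(soup.title.string if soup.title else "no <title> element")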

search_terms_in_text(text: str, search_terms: List[str], case_sensitive: bool = False) -> Dict[str, List[str]]

Search for terms in text and return matches with context.
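
For example, per the signature above, the return value maps each term to a list of context snippets (the exact snippet format is not specified in this README):

scraper = WebScraper()
matches = scraper.search_terms_in_text(
    "Python is a popular programming language.",
    ["python", "java"],
    case_sensitive=False,
)
# matches is a Dict[str, List[str]] of term -> context snippets;
# whether unmatched terms appear at all is not specified here.
print(matches.get("python", []))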

scrape_urls(urls: List[str], search_terms: List[str], case_sensitive: bool = False, search_in: str = 'all') -> Dict[str, Dict]

Scrape multiple URLs and search for terms.

print_results(results: Dict[str, Dict])

Print search results in a formatted way.

Running Tests

# Run all tests
python -m pytest

# Run with coverage
python -m pytest --cov=web_scraper_project

# Run specific test file
python -m pytest tests/test_scraper.py

Development

Setting up development environment

# Clone the repo
git clone https://github.com/kwein123/web-scraper.git
cd web-scraper

# Install in development mode with dev dependencies
pip install -e ".[dev]"

# Run tests
python -m pytest

# Run linting
flake8 .

# Run type checking
mypy .

# Format code
black .

Building the package

# Build source and wheel distributions
python -m build

# Upload to PyPI (requires account and setup)
python -m twine upload dist/*

Important Notes

  • Respect robots.txt: Always check and respect website robots.txt files (see the sketch after this list)
  • Rate limiting: The scraper includes delays between requests to be respectful
  • Terms of service: Some websites prohibit automated access in their terms or actively block it; review them before scraping
  • Legal considerations: Consider legal and ethical implications of web scraping
  • User agent: The scraper sets a realistic user agent to avoid blocks
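
As a minimal sketch of the robots.txt check mentioned in the first note, using only the standard library (the user-agent string below is a placeholder, not the one the scraper actually sends):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Hypothetical user-agent string; substitute whatever your scraper sends.
if rp.can_fetch("MyScraperBot/1.0", "https://example.com/some/page"):
    print("Allowed to fetch")
else:
    print("Disallowed by robots.txt")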

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer

This project is for educational purposes. Always ensure you have permission to scrape websites and comply with their terms of service and applicable laws.
