A simple, respectful web scraper built with Python and BeautifulSoup.
- Respectful scraping: Built-in delays between requests
- Flexible search: Search in titles, body text, or links
- Context extraction: Get surrounding text for matches
- Multiple output formats: Console output and JSON file export
- Error handling: Graceful handling of failed requests
- Configurable: Easy to customize URLs, terms, and settings
pip install web_scraper_project
- Clone the repository:
  git clone https://github.com/kwein123/web-scraper.git
  cd web-scraper
- Install in development mode:
  pip install -e .
- Or install with development dependencies:
  pip install -e ".[dev]"
# Use default configuration
webscraper
# Custom URLs and terms
webscraper --urls https://example.com --terms python web
# See all options
webscraper --help
# Run examples
webscraper --examples
from web_scraper_project import WebScraper
scraper = WebScraper(delay=1.0)
results = scraper.scrape_urls(
urls=["https://example.com"],
search_terms=["example", "test"]
)
scraper.print_results(results)
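The feature list mentions JSON file export. If you want the raw results on disk yourself, a minimal sketch using the standard library's json module is shown below (this assumes the dict returned by scrape_urls is JSON-serializable; the package's own export helper may differ):

import json
from web_scraper_project import WebScraper

scraper = WebScraper(delay=1.0)
results = scraper.scrape_urls(
    urls=["https://example.com"],
    search_terms=["example", "test"]
)

# Write the results dict to a JSON file for later analysis
with open("results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)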
web_scraper_project/
├── web_scraper_project/ # Main package directory
│ ├── __init__.py # Package initialization
│ ├── scraper.py # Main WebScraper class
│ ├── config.py # Configuration settings
│ ├── cli.py # Command-line interface
│ └── examples.py # Usage examples
├── tests/ # Test directory
│ ├── __init__.py
│ └── test_scraper.py # Unit tests
├── setup.py # Minimal setup script
├── setup.cfg # Main configuration
├── pyproject.toml # Modern Python packaging
├── requirements.txt # Dependencies
├── MANIFEST.in # Package manifest
├── LICENSE # MIT License
└── README.md # This file
Edit config.py to customize:
# URLs to scrape
URLS = [
"https://example.com",
"https://python.org"
]
# Terms to search for
SEARCH_TERMS = [
"python",
"programming",
"example"
]
# Scraper settings
SCRAPER_DELAY = 1.0 # Delay between requests
CASE_SENSITIVE = False
SEARCH_IN = 'all' # 'all', 'title', 'body', 'links'
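Because these settings are plain module-level constants, they can also be reused from your own scripts. A minimal sketch, assuming the module is importable as web_scraper_project.config (per the project structure above):

from web_scraper_project import WebScraper
from web_scraper_project import config  # assumed import path

scraper = WebScraper(delay=config.SCRAPER_DELAY)
results = scraper.scrape_urls(
    urls=config.URLS,
    search_terms=config.SEARCH_TERMS,
    case_sensitive=config.CASE_SENSITIVE,
    search_in=config.SEARCH_IN
)
scraper.print_results(results)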
from web_scraper_project import WebScraper
scraper = WebScraper(delay=1.0)
results = scraper.scrape_urls(
urls=["https://example.com"],
search_terms=["example", "test"]
)
scraper.print_results(results)
results = scraper.scrape_urls(
urls=urls,
search_terms=terms,
search_in='title'
)
results = scraper.scrape_urls(
urls=urls,
search_terms=["Python", "PYTHON"],
case_sensitive=True
)
scraper = WebScraper(delay=2.0) # Longer delay for politeness
results = scraper.scrape_urls(
urls=custom_urls,
search_terms=custom_terms,
case_sensitive=False,
search_in='body' # Search only in page body
)
WebScraper(delay: float = 1.0)
Initialize the scraper with an optional delay between requests.
Fetch and parse a single web page.
search_terms_in_text(text: str, search_terms: List[str], case_sensitive: bool = False) -> Dict[str, List[str]]
Search for terms in text and return matches with context.
scrape_urls(urls: List[str], search_terms: List[str], case_sensitive: bool = False, search_in: str = 'all') -> Dict[str, Dict]
Scrape multiple URLs and search for terms.
print_results(results)
Print search results in a formatted way.
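search_terms_in_text can also be used on text you have fetched by other means. A minimal sketch based on the signature above (the exact shape of each context string is an assumption; see scraper.py for details):

from web_scraper_project import WebScraper

scraper = WebScraper(delay=1.0)
text = "Python is a programming language. This example mentions Python twice."
matches = scraper.search_terms_in_text(text, ["python", "example"], case_sensitive=False)

# Per the signature, matches maps each term to a list of matching context strings
for term, contexts in matches.items():
    print(f"{term}: {len(contexts)} match(es)")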
# Run all tests
python -m pytest
# Run with coverage
python -m pytest --cov=web_scraper_project
# Run a specific test file
python -m pytest tests/test_scraper.py
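If you add tests of your own, a minimal pytest sketch exercising search_terms_in_text might look like the following (the assertion on the returned keys is an assumption based on the documented Dict[str, List[str]] return type):

# tests/test_search_example.py -- hypothetical extra test module
from web_scraper_project import WebScraper

def test_search_terms_in_text_finds_term():
    scraper = WebScraper(delay=0)  # no network access needed for text search
    matches = scraper.search_terms_in_text(
        "Python makes web scraping straightforward.",
        ["python"],
        case_sensitive=False
    )
    # The documented return type is Dict[str, List[str]]: term -> contexts
    assert len(matches.get("python", [])) >= 1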
# Clone the repo
git clone https://github.com/kwein123/web-scraper.git
cd web-scraper
# Install in development mode with dev dependencies
pip install -e ".[dev]"
# Run tests
python -m pytest
# Run linting
flake8 .
# Run type checking
mypy .
# Format code
black .
# Build source and wheel distributions
python -m build
# Upload to PyPI (requires account and setup)
python -m twine upload dist/*
- Respect robots.txt: Always check and respect website robots.txt files (a sketch follows this list)
- Rate limiting: The scraper includes delays between requests to be respectful
- Terms of service: Some websites may block automated requests
- Legal considerations: Consider legal and ethical implications of web scraping
- User agent: The scraper sets a realistic user agent to avoid blocks
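A minimal robots.txt check using the standard library's urllib.robotparser, run before handing URLs to the scraper (the user agent string below is a placeholder; match it to whatever user agent the scraper actually sends):

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser
from web_scraper_project import WebScraper

USER_AGENT = "Mozilla/5.0 (compatible; web-scraper)"  # placeholder, not the scraper's real UA

def allowed_by_robots(url, user_agent=USER_AGENT):
    """Return True if the site's robots.txt permits fetching this URL."""
    parsed = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)

urls = ["https://example.com", "https://python.org"]
permitted = [u for u in urls if allowed_by_robots(u)]

scraper = WebScraper(delay=2.0)
results = scraper.scrape_urls(urls=permitted, search_terms=["python"])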
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
This project is for educational purposes. Always ensure you have permission to scrape websites and comply with their terms of service and applicable laws.