Releases: JustAzul/web-scrapper-stdio

v1.3.0

14 Jun 22:10
70c668a

Summary:
This release delivers new scraping features, performance optimizations, improved test coverage, and major enhancements to CI/CD workflows and documentation. The codebase is now more robust, maintainable, and easier to extend, with a focus on reliability and developer experience.

Features

  • Support for custom_elements_to_remove in API scrape arguments and extraction
  • Added filter_none_values utility with comprehensive tests
  • Added click_selector support to extract_text_from_url
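
As a rough illustration of the filter_none_values utility listed above, a helper of this kind might look like the sketch below. The signature and the example argument values are assumptions; these notes do not show the project's actual implementation.

```python
from typing import Any, Dict


def filter_none_values(arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `arguments` without keys whose value is None.

    Hypothetical sketch: falsy values such as 0, "" or [] are kept,
    only None is dropped.
    """
    return {key: value for key, value in arguments.items() if value is not None}


# Example tool arguments (values are illustrative only).
raw_args = {
    "url": "https://example.com",
    "click_selector": None,
    "custom_elements_to_remove": ["nav", "footer"],
    "grace_period_seconds": 0,
}
clean_args = filter_none_values(raw_args)
```

Checking `is not None` rather than truthiness matters here: an explicit 0 or an empty list is a deliberate value and should still reach the scraper.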

Performance

  • Reuse singleton browser instance for all scrapes, reducing resource usage
  • Avoid repeated BeautifulSoup parsing by reusing soup objects
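
The singleton reuse described above can be sketched roughly as follows. This is a minimal sketch, not the project's code: FakeBrowser is a hypothetical stand-in for a real browser handle, and get_browser is an assumed name. Note that the Fixes section of this release also mentions restoring per-scrape launches for test reliability.

```python
import asyncio
from typing import Optional


class FakeBrowser:
    """Stand-in for a real browser handle (hypothetical, for illustration)."""

    launches = 0  # counts how many times a browser was "launched"

    def __init__(self) -> None:
        FakeBrowser.launches += 1


_browser: Optional[FakeBrowser] = None
_lock: Optional[asyncio.Lock] = None


async def get_browser() -> FakeBrowser:
    """Launch the browser on first use and return the same instance afterwards."""
    global _browser, _lock
    if _lock is None:
        # Create the lock lazily so it is built inside the running event loop.
        _lock = asyncio.Lock()
    async with _lock:
        if _browser is None:
            _browser = FakeBrowser()
    return _browser


async def main() -> bool:
    # Three concurrent scrapes should all receive the same instance.
    first, second, third = await asyncio.gather(
        get_browser(), get_browser(), get_browser()
    )
    return first is second is third


all_same = asyncio.run(main())
```

The lock guards against two concurrent scrapes both seeing an unset instance and launching twice.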

Refactors

  • Refactored mcp_server to use filter_none_values and support click_selector
  • Merged dynamic article extraction tests into a single random-domain test

Documentation

Chore

Tests

Fixes

  • Update badge Gist URLs and workflow gistIDs for build, test, and coverage
  • Filter out None values from tool arguments to prevent type errors
  • Ensure output_format string is converted to OutputFormat enum in get_prompt
  • Refactor JS-delay test to use a real demo page and improve reliability
  • Restore per-scrape browser launch and cleanup for test reliability
  • Ensure browser is always closed using a finally block in extract_text_from_url
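
The output_format fix above amounts to normalizing a string into an enum member before use. A minimal sketch, assuming hypothetical member names and values for OutputFormat (the real enum's members are not listed in these notes) and an assumed helper name to_output_format:

```python
from enum import Enum
from typing import Union


class OutputFormat(Enum):
    # Member names and values are assumptions for this sketch.
    MARKDOWN = "markdown"
    TEXT = "text"
    HTML = "html"


def to_output_format(value: Union[str, OutputFormat]) -> OutputFormat:
    """Accept either an enum member or its string value and return the enum."""
    if isinstance(value, OutputFormat):
        return value
    try:
        return OutputFormat(value.strip().lower())
    except ValueError:
        allowed = ", ".join(fmt.value for fmt in OutputFormat)
        raise ValueError(
            f"Unknown output_format {value!r}; expected one of: {allowed}"
        ) from None
```

Converting once at the boundary keeps the rest of the code comparing enum members instead of raw strings.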

Full Changelog: 1.2.0...1.3.0

v1.2.0

12 Jun 16:11
2c3de28

Release Notes: 1.2.0

Summary:
This release introduces a new grace_period_seconds feature for improved JavaScript rendering support, significant refactoring for configuration and test logic, improved documentation, and enhanced test coverage. The codebase is now more robust, with clearer configuration and more reliable extraction logic.

Features

  • Add grace_period_seconds parameter for JS rendering delay
    • Allows fine-tuning of wait time for JavaScript-rendered content.
    • Commit: eb3b841
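
One way to picture grace_period_seconds is as a pause between navigation and reading the page, giving client-side scripts time to render. The sketch below is an assumption about the general shape only; wait_grace_period and scrape_with_grace are hypothetical names, and the navigation and extraction steps are left as comments.

```python
import asyncio
import time


async def wait_grace_period(grace_period_seconds: float) -> None:
    """Pause after navigation so client-side scripts can finish rendering."""
    if grace_period_seconds < 0:
        raise ValueError("grace_period_seconds must be non-negative")
    if grace_period_seconds > 0:
        await asyncio.sleep(grace_period_seconds)


async def scrape_with_grace(grace_period_seconds: float) -> float:
    """Return the elapsed time of a scrape that honors the grace period."""
    start = time.monotonic()
    # ... navigate to the page here ...
    await wait_grace_period(grace_period_seconds)
    # ... read the rendered HTML here ...
    return time.monotonic() - start


elapsed = asyncio.run(scrape_with_grace(0.05))
```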

Refactors

  • Configuration and server logic improvements
    • Removed unused content length and grace period options from config.
    • Updated server and scraper logic to use new parameters and improve clarity.
    • Commit: 61a9264

Documentation

  • README and usage updates
    • Updated documentation to reflect new parameters and configuration changes.
    • Commit: c80fcd6

Tests

  • Test updates and improvements
    • Refactored tests for new result format and grace_period_seconds support.
    • Improved test assertions and coverage for error handling and edge cases.
    • Commit: 2c3de28

Affected Files

  • Modified: README.md, src/config.py, src/mcp_server.py, src/scraper/__init__.py, tests/test_mcp_server.py, tests/test_scraper.py

v1.1.1

12 Jun 15:15
3f6a953

Refactor

  • refactor(config): add env-based config and type-safe parsing helpers
  • refactor(mcp): adapt to new dict return type from extract_text_from_url
  • refactor(scraper): return dict with metadata and error from extract_text_from_url
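
The "type-safe parsing helpers" for env-based config might resemble the sketch below. The helper names and the variable names DEFAULT_TIMEOUT_SECONDS and DEBUG_LOGS_ENABLED are hypothetical; the project's actual settings are not listed in these notes.

```python
import os
from typing import Mapping, Optional


def get_env_int(name: str, default: int, env: Optional[Mapping[str, str]] = None) -> int:
    """Read an integer setting, falling back to `default` if unset or invalid."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None:
        return default
    try:
        return int(raw.strip())
    except ValueError:
        return default


def get_env_bool(name: str, default: bool, env: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean setting, accepting common spellings such as 1/true/yes/on."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None:
        return default
    normalized = raw.strip().lower()
    if normalized in {"1", "true", "yes", "on"}:
        return True
    if normalized in {"0", "false", "no", "off"}:
        return False
    return default


# Illustrative environment; a real run would read os.environ instead.
sample_env = {"DEFAULT_TIMEOUT_SECONDS": "45", "DEBUG_LOGS_ENABLED": "Yes"}
timeout = get_env_int("DEFAULT_TIMEOUT_SECONDS", 30, env=sample_env)
debug = get_env_bool("DEBUG_LOGS_ENABLED", False, env=sample_env)
```

Passing the environment as an optional mapping keeps the helpers easy to test without mutating os.environ.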

Docs

  • docs(readme): rewrite and reorganize documentation for clarity, usage, and config

Test

  • test(tests): add test_mcp_server.py for MCP server testing

CI

  • ci(test): add separate test services for mcp and scraper in docker-compose

Fix

  • fix(docker): format environment variable assignments in Dockerfile
  • fix(mcp_server): remove hardcoded stdio to properly serve the mcp tool requests

v1.1.0

11 Jun 20:29
dd9d284

Release Notes: 1.1.0

Summary:
This release introduces a modular scraper architecture, significant refactoring for maintainability, improved documentation, and enhanced test coverage. Obsolete files and legacy code have been removed, and the codebase is now more consistent and easier to extend.

Features

  • New modular scraper implementation and helpers
    • Introduced a new scraper module with improved structure and extensibility.
    • Added helpers for browser automation, content selection, error handling, HTML utilities, and rate limiting.
    • Commit: 74c7852

Refactors

  • Core and server logic improvements
    • Centralized configuration, added a Logger, and improved debug/error handling.
    • Reformatted and improved readability of server logic.
    • Removed legacy and obsolete files (CLI, main, stdio_server, test runner, old scraper).
    • Improved extraction logic and cleaned up dependencies.
    • Commits: 3f6f876, fb95ddc, de5e0c7, c8167d9, 3752cf1, 01c145e

Style

  • Code formatting and consistency
    • Reformatted logger and test files for PEP8 compliance and readability.
    • Removed trailing whitespace from __init__.py files.
    • Commits: cd753ac, d272992, c053c4d

Documentation

  • Documentation and configuration updates
    • Updated README with new usage instructions, removed CLI references, and added Cursor IDE integration.
    • Updated environment, Docker, and documentation for new config structure.
    • Commits: dd9d284, 6fbba89

Chore

  • Cleanup and maintenance
    • Removed obsolete files and updated project structure.
    • Commit: c8167d9

Tests

  • Test updates and improvements
    • Updated and added tests for new config, timing constants, and scraper logic.
    • Reformatted test files for clarity.
    • Commits: cd753ac, 47c70d0

Affected Files

  • Added: src/logger.py, src/mcp_server.py, src/scraper/__init__.py, src/scraper/helpers/*, tests/test_scraper.py
  • Modified: .env.example, Dockerfile, README.md, docker-compose.yml, requirements.txt, src/__init__.py, src/config.py, tests/__init__.py, tests/test_helpers.py
  • Deleted: CHANGELOG.md, src/main.py, src/scraper.py, src/stdio_server.py, tests/test_mcp.py

v1.0.0

09 Jun 18:30
077e662

1. Release Overview

  • Version: 1.0.0
  • Goal: First stable release with robust anti-bot scraping, Dockerized deployment, and automated testing.

2. Major Features

  • Playwright-based Scraper:
    • Async scraping with Playwright and BeautifulSoup.
    • Domain-specific and generic content extraction.
  • Anti-Bot Evasion:
    • Integrated playwright-stealth for fingerprint evasion.
    • Randomized user agent, viewport, and language per request.
    • Navigator property spoofing.
    • Rate limiting per domain.
  • Robust Extraction Logic:
    • Fallback to <body> text for edge cases.
    • Handles redirects, 404s, and Cloudflare blocks gracefully.
  • Dockerized Workflow:
    • Dockerfile and docker-compose for reproducible builds and test runs.
  • Automated Testing:
    • Pytest suite with coverage for extraction, error handling, and edge cases.
    • CI-ready test execution via Docker Compose.
  • Documentation:
    • Comprehensive README.md with setup, usage, and development workflow.
    • Changelog and release notes.
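
The per-domain rate limiting listed under Anti-Bot Evasion can be sketched as below. This is an illustrative design, not the project's implementation: DomainRateLimiter and its methods are hypothetical names, and the injectable clock exists only to make the example deterministic.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(
        self,
        min_interval_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.min_interval_seconds = min_interval_seconds
        self._clock = clock
        self._last_request: Dict[str, float] = {}

    @staticmethod
    def _domain(url: str) -> str:
        return urlparse(url).netloc.lower()

    def seconds_to_wait(self, url: str) -> float:
        """Return how long to sleep before requesting `url` (0.0 if none)."""
        last = self._last_request.get(self._domain(url))
        if last is None:
            return 0.0
        remaining = self.min_interval_seconds - (self._clock() - last)
        return max(0.0, remaining)

    def record_request(self, url: str) -> None:
        """Remember that a request to this URL's domain was just made."""
        self._last_request[self._domain(url)] = self._clock()


# Deterministic demonstration with a fake clock.
fake_now = [100.0]
limiter = DomainRateLimiter(2.0, clock=lambda: fake_now[0])
limiter.record_request("https://example.com/a")
fake_now[0] = 100.5
wait_same_domain = limiter.seconds_to_wait("https://example.com/b")
wait_other_domain = limiter.seconds_to_wait("https://example.org/")
```

Keying the timestamps by domain lets requests to different sites proceed immediately while repeated hits on one site are spaced out.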