Releases: JustAzul/web-scrapper-stdio
v1.3.0
Summary:
This release delivers new scraping features, performance optimizations, improved test coverage, and major enhancements to CI/CD workflows and documentation. The codebase is now more robust, maintainable, and easier to extend, with a focus on reliability and developer experience.
Features
- Support for `custom_elements_to_remove` in API scrape arguments and extraction
- Commit: 851e38d
- Added `filter_none_values` utility with comprehensive tests
- Commit: 8c58e5a
- Added `click_selector` support to `extract_text_from_url`
- Commit: 558f513
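
As a rough illustration of the `filter_none_values` utility listed above, the sketch below drops `None` entries from a dictionary of tool arguments before they are forwarded. The signature and the `max_length` key are assumptions made for this example; the project's actual implementation may differ.

```python
from typing import Any, Dict


def filter_none_values(arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `arguments` without keys whose value is None."""
    # Only None is dropped; falsy values such as 0, "" or False are kept.
    return {key: value for key, value in arguments.items() if value is not None}


# Hypothetical tool arguments: `max_length` is an illustrative key only.
tool_arguments = {
    "url": "https://example.com",
    "click_selector": None,
    "custom_elements_to_remove": ["nav", "footer"],
    "max_length": 0,
}
cleaned = filter_none_values(tool_arguments)
print(cleaned)
```

Comparing with `is not None` rather than relying on truthiness keeps legitimate values such as `0` or an empty string, so only arguments that were never supplied are removed.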
Performance
- Reuse singleton browser instance for all scrapes, reducing resource usage
- Commit: b905f3d
- Avoid repeated BeautifulSoup parsing by reusing soup objects
- Commit: 096bb1b
Refactors
- Refactored `mcp_server` to use `filter_none_values` and support `click_selector`
- Commit: 045de45
- Merged dynamic article extraction tests into a single random-domain test
- Commit: db7d1e0
Documentation
- Updated README: removed roadmap/contact/robots.txt sections, improved badge clarity, and added usage examples
Chore
- Added MIT license file
- Commit: 9791b2b
- Ignored all files/directories under `cursor/`
- Commit: 8ed75e1
- Removed empty `__init__.py` files from `src` and `tests/helpers`
- Commit: 48e723a
- Ensured trailing newline in config files
- Commit: 3b5ddee
- Updated and cleaned up workflows (build, test, release, coverage)
- Standardized on master branch, removed legacy/duplicate workflow files
Tests
- Improved JS delay and user agent tests for robustness and coverage
- Added and expanded tests for new utilities and features
Fixes
- Update badge Gist URLs and workflow gistIDs for build, test, and coverage
- Commit: 09e6060
- Filter out None values from tool arguments to prevent type errors
- Commit: a4d6fed
- Ensure output_format string is converted to OutputFormat enum in get_prompt
- Commit: d25f2c0
- Refactor JS-delay test to use real demo page and improve reliability
- Commit: 0623313
- Restore per-scrape browser launch and cleanup for test reliability
- Commit: 2af0a3b
- Ensure browser is always closed using finally block in extract_text_from_url
- Commit: d779341
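
The `output_format` fix above amounts to coercing a plain string into an enum member before it is used. A minimal sketch of that kind of conversion follows. The member names and values of `OutputFormat` and the helper name `to_output_format` are assumptions for illustration, not the project's actual definitions.

```python
from enum import Enum
from typing import Union


class OutputFormat(Enum):
    # Member names and values are assumed for this sketch.
    MARKDOWN = "markdown"
    TEXT = "text"
    HTML = "html"


def to_output_format(value: Union[str, OutputFormat]) -> OutputFormat:
    """Coerce a string (or an existing member) to an OutputFormat member."""
    if isinstance(value, OutputFormat):
        return value
    try:
        # Enum lookup by value, after normalising case and whitespace.
        return OutputFormat(value.strip().lower())
    except ValueError:
        allowed = ", ".join(member.value for member in OutputFormat)
        raise ValueError(
            f"Unsupported output_format {value!r}; expected one of: {allowed}"
        ) from None


print(to_output_format("Markdown"))
```

Doing the conversion once at the boundary means the rest of the code can compare against enum members instead of raw strings.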
Full Changelog: 1.2.0...1.3.0
v1.2.0
Release Notes: 1.2.0
Summary:
This release introduces a new `grace_period_seconds` feature for improved JavaScript rendering support, significant refactoring for configuration and test logic, improved documentation, and enhanced test coverage. The codebase is now more robust, with clearer configuration and more reliable extraction logic.
Features
- Add `grace_period_seconds` parameter for JS rendering delay
- Allows fine-tuning of wait time for JavaScript-rendered content.
- Commit: eb3b841
Refactors
- Configuration and server logic improvements
- Removed unused content length and grace period options from config.
- Updated server and scraper logic to use new parameters and improve clarity.
- Commit: 61a9264
Documentation
- README and usage updates
- Updated documentation to reflect new parameters and configuration changes.
- Commit: c80fcd6
Tests
- Test updates and improvements
- Refactored tests for new result format and `grace_period_seconds` support.
- Improved test assertions and coverage for error handling and edge cases.
- Commit: 2c3de28
Affected Files
- Modified: `README.md`, `src/config.py`, `src/mcp_server.py`, `src/scraper/__init__.py`, `tests/test_mcp_server.py`, `tests/test_scraper.py`
- Full Changelog: 1.1.1...1.2.0
v1.1.1
Refactor
- refactor(config): add env-based config and type-safe parsing helpers
- refactor(mcp): adapt to new dict return type from extract_text_from_url
- refactor(scraper): return dict with metadata and error from extract_text_from_url
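
To illustrate what env-based configuration with type-safe parsing helpers can look like, here is a small sketch. The helper names `env_int` and `env_bool` and the variable names `SCRAPER_TIMEOUT_SECONDS` and `SCRAPER_DEBUG` are hypothetical; the release notes do not state what `src/config.py` actually defines.

```python
import os
from typing import Mapping, Optional


def env_int(name: str, default: int, env: Optional[Mapping[str, str]] = None) -> int:
    """Read an integer setting, falling back to `default` if unset or invalid."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError:
        return default


def env_bool(name: str, default: bool, env: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean setting, accepting common true/false spellings."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None:
        return default
    normalized = raw.strip().lower()
    if normalized in {"1", "true", "yes", "on"}:
        return True
    if normalized in {"0", "false", "no", "off"}:
        return False
    return default


# Hypothetical variable names, read from the real process environment.
timeout_seconds = env_int("SCRAPER_TIMEOUT_SECONDS", 30)
debug_enabled = env_bool("SCRAPER_DEBUG", False)
print(timeout_seconds, debug_enabled)
```

Falling back to the default on malformed input is one possible design choice. Raising an error instead would surface misconfiguration earlier.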
Docs
- docs(readme): rewrite and reorganize documentation for clarity, usage, and config
Test
- test(tests): add test_mcp_server.py for MCP server testing
CI
- ci(test): add separate test services for mcp and scraper in docker-compose
Fix
- fix(docker): format environment variable assignments in Dockerfile
- fix(mcp_server): remove hardcoded stdio to properly serve the mcp tool requests
v1.1.0
Release Notes: 1.1.0
Summary:
This release introduces a modular scraper architecture, significant refactoring for maintainability, improved documentation, and enhanced test coverage. Obsolete files and legacy code have been removed, and the codebase is now more consistent and easier to extend.
Features
- New modular scraper implementation and helpers
- Introduced a new scraper module with improved structure and extensibility.
- Added helpers for browser automation, content selection, error handling, HTML utilities, and rate limiting.
- Commit: 74c7852
Refactors
- Core and server logic improvements
- Centralized configuration, added a Logger, and improved debug/error handling.
- Reformatted and improved readability of server logic.
- Removed legacy and obsolete files (CLI, main, stdio_server, test runner, old scraper).
- Improved extraction logic and cleaned up dependencies.
- Commits: 3f6f876, fb95ddc, de5e0c7, c8167d9, 3752cf1, 01c145e
Style
- Code formatting and consistency
Documentation
- Documentation and configuration updates
Chore
- Cleanup and maintenance
- Removed obsolete files and updated project structure.
- Commit: c8167d9
Tests
- Test updates and improvements
Affected Files
- Added: `src/logger.py`, `src/mcp_server.py`, `src/scraper/__init__.py`, `src/scraper/helpers/*`, `tests/test_scraper.py`
- Modified: `.env.example`, `Dockerfile`, `README.md`, `docker-compose.yml`, `requirements.txt`, `src/__init__.py`, `src/config.py`, `tests/__init__.py`, `tests/test_helpers.py`
- Deleted: `CHANGELOG.md`, `src/main.py`, `src/scraper.py`, `src/stdio_server.py`, `tests/test_mcp.py`
- Full Changelog: 1.0.0...1.1.0
v1.0.0
1. Release Overview
- Version: 1.0.0
- Goal: First stable release with robust anti-bot scraping, Dockerized deployment, and automated testing.
2. Major Features
- Playwright-based Scraper:
- Async scraping with Playwright and BeautifulSoup.
- Domain-specific and generic content extraction.
- Anti-Bot Evasion:
- Integrated `playwright-stealth` for fingerprint evasion.
- Randomized user agent, viewport, and language per request.
- Navigator property spoofing.
- Rate limiting per domain.
- Robust Extraction Logic:
- Fallback to `<body>` text for edge cases.
- Handles redirects, 404s, and Cloudflare blocks gracefully.
- Dockerized Workflow:
- Dockerfile and docker-compose for reproducible builds and test runs.
- Automated Testing:
- Pytest suite with coverage for extraction, error handling, and edge cases.
- CI-ready test execution via Docker Compose.
- Documentation:
- Comprehensive `README.md` with setup, usage, and development workflow.
- Changelog and release notes.
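
The "Rate limiting per domain" item under Anti-Bot Evasion can be pictured with the sketch below, which tracks the next allowed request time for each domain. The class name `DomainRateLimiter`, its parameters, and the injectable clock are assumptions for illustration. The release notes do not describe how the project's rate-limiting helper is actually implemented.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(
        self,
        min_interval_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.min_interval_seconds = min_interval_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._next_allowed: Dict[str, float] = {}

    def reserve(self, url: str) -> float:
        """Reserve a slot for `url` and return how long the caller should wait."""
        domain = urlparse(url).netloc.lower()
        now = self._clock()
        # The request starts now, or when the domain's previous slot expires.
        start = max(now, self._next_allowed.get(domain, now))
        self._next_allowed[domain] = start + self.min_interval_seconds
        return start - now


# A fixed clock keeps the example deterministic.
limiter = DomainRateLimiter(2.0, clock=lambda: 100.0)
print(limiter.reserve("https://example.com/a"))  # 0.0
print(limiter.reserve("https://example.com/b"))  # 2.0
print(limiter.reserve("https://other.test/"))    # 0.0
```

An async scraper would typically `await asyncio.sleep(delay)` with the returned value before navigating, so requests to different domains are not held up by one another.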