
Commit bb9bd75

v1.4.5: Streamline docs, fix CHANGELOG dates, optimize performance testing
1 parent 841df68 commit bb9bd75

26 files changed: +987 / -5594 lines changed

CHANGELOG.md

Lines changed: 11 additions & 2 deletions
@@ -5,7 +5,16 @@ All notable changes to the "Scrape-LE" extension will be documented in this file
55
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
66
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
77

8-
## [1.1.2] - 2025-01-27
8+
## [1.4.5] - 2025-10-15
9+
10+
### Changed
11+
12+
- **Documentation streamlined** - Reduced from 13 to 4 core docs (Architecture, Commands, I18N, Performance) for easier maintenance
13+
- **Performance transparency** - Added verified benchmarks (simple pages < 2s, heavy JS 5-10s) with detection accuracy metrics
14+
- **Language visibility** - Enhanced README to clearly show all 13 supported languages with flags and native names
15+
- **Governance compliance** - Implemented FALSE_CLAIMS_GOVERNANCE and CHANGELOG_GOVERNANCE for accuracy and consistency
16+
17+
## [1.1.2] - 2025-10-13
918

1019
### Changed
1120

@@ -17,7 +26,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
1726
- Updated README to accurately document existing multi-language capabilities
1827
- Maintained 100% backward compatibility with existing installations
1928

20-
## [1.1.1] - 2025-01-27
29+
## [1.1.1] - 2025-10-13
2130

2231
### Fixed
2332
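The date corrections in this CHANGELOG hunk suggest a check that could be automated. The sketch below is one possible way to do it, not code from the Scrape-LE repository: `parseChangelogHeadings` and `findOutOfOrderDates` are hypothetical names, and the heading pattern assumes the `## [x.y.z] - YYYY-MM-DD` form shown in the diff.

```typescript
// Hypothetical helpers for illustration; not part of the Scrape-LE codebase.
interface ChangelogEntry {
  version: string;
  date: string; // ISO date string, e.g. "2025-10-15"
}

// Collect "## [x.y.z] - YYYY-MM-DD" headings in document order.
function parseChangelogHeadings(markdown: string): ChangelogEntry[] {
  const heading = /^## \[(\d+\.\d+\.\d+)\] - (\d{4}-\d{2}-\d{2})[ \t]*$/gm;
  const entries: ChangelogEntry[] = [];
  let match: RegExpExecArray | null;
  while ((match = heading.exec(markdown)) !== null) {
    entries.push({ version: match[1], date: match[2] });
  }
  return entries;
}

// Return versions whose date is later than the entry listed directly above them.
// ISO dates compare correctly as plain strings, so no Date parsing is needed.
function findOutOfOrderDates(entries: ChangelogEntry[]): string[] {
  const problems: string[] = [];
  for (let i = 1; i < entries.length; i++) {
    if (entries[i].date > entries[i - 1].date) {
      problems.push(entries[i].version);
    }
  }
  return problems;
}
```

With the headings as they stand after this commit (1.4.5, 1.1.2, 1.1.1), the check reports no problems, because entries that share a date are treated as correctly ordered.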

README.md

Lines changed: 56 additions & 200 deletions
@@ -37,18 +37,16 @@
3737
<i>💡 First time? Run <b>"Scrape-LE: Setup Browser"</b> from Command Palette to install Chromium (~130MB one-time setup)</i>
3838
</p>
3939

40-
## 🙏 Thank You!
40+
## 🙏 Thank You
4141

42-
Thank you for your interest in Scrape-LE! If this extension has helped verify your scraper targets or validate site accessibility, please consider leaving a rating on [VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=nolindnaidoo.scrape-le). Your feedback helps other developers discover this tool and motivates continued development.
42+
If Scrape-LE saves you time, a quick rating helps other developers discover it:
43+
[VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=nolindnaidoo.scrape-le)[Open VSX](https://open-vsx.org/extension/nolindnaidoo/scrape-le)
4344

44-
**Star this repository** to get notified about updates and new features!
45+
## ✅ Why Scrape-LE?
4546

46-
## ✅ Why Scrape-LE
47+
Validate scraper targets **before debugging your code**. Check if sites are reachable, detect auth walls, and verify selectors — all from your editor.
4748

48-
**Web scraping projects fail when target sites are unreachable or behave unexpectedly.** Debugging scraper failures is time-consuming when you don't know if the issue is your code or the target site.
49-
50-
**Scrape-LE makes site validation effortless.**
51-
Quickly verify that pages load, render JavaScript correctly, and are accessible before deploying your scrapers.
49+
Scrape-LE uses real browser automation (Playwright) to catch issues early: JavaScript rendering errors, authentication requirements, CAPTCHA detection, and selector validation. Stop wasting time debugging code when the problem is the target site.
5250

5351
- **Pre-deployment validation**
5452
Test target URLs before writing scraper code. Catch unreachable sites, JS errors, and rendering issues early.
@@ -70,81 +68,26 @@ Quickly verify that pages load, render JavaScript correctly, and are accessible
7068

7169
## 🚀 More from the LE Family
7270

73-
**Scrape-LE** is part of a growing family of developer tools designed to make your workflow effortless:
74-
75-
- **Strings-LE** - Extract every user-visible string from JSON, YAML, CSV, TOML, INI, and .env files with zero hassle
76-
[[VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=nolindnaidoo.string-le)] [[Open VSX](https://open-vsx.org/extension/nolindnaidoo/string-le)]
77-
78-
- **EnvSync-LE** - Effortlessly detect, compare, and synchronize .env files across your workspace with visual diffs
79-
[[VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=nolindnaidoo.envsync-le)] [[Open VSX](https://open-vsx.org/extension/nolindnaidoo/envsync-le)]
80-
81-
- **Numbers-LE** - Extract and analyze numeric data from JSON, YAML, CSV, TOML, INI, and .env files
82-
[[VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=nolindnaidoo.numbers-le)] [[Open VSX](https://open-vsx.org/extension/nolindnaidoo/numbers-le)]
83-
84-
- **Paths-LE** - Extract and analyze file paths from imports, configs, and dependencies
85-
[[VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=nolindnaidoo.paths-le)] [[Open VSX](https://open-vsx.org/extension/nolindnaidoo/paths-le)]
86-
87-
- **Colors-LE** - Extract and analyze colors from CSS, HTML, JavaScript, and TypeScript
88-
[[VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=nolindnaidoo.colors-le)] [[Open VSX](https://open-vsx.org/extension/nolindnaidoo/colors-le)]
89-
90-
- **Dates-LE** - Extract and analyze dates from logs, APIs, and temporal data
91-
[[VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=nolindnaidoo.dates-le)] [[Open VSX](https://open-vsx.org/extension/nolindnaidoo/dates-le)]
92-
93-
- **URLs-LE** - Extract and analyze URLs from web content, APIs, and resources
94-
[[VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=nolindnaidoo.urls-le)] [[Open VSX](https://open-vsx.org/extension/nolindnaidoo/urls-le)]
95-
96-
Each tool follows the same philosophy: **Zero Hassle, Maximum Productivity**.
97-
98-
## 💡 Use Cases & Examples
99-
100-
### Pre-Scraper Validation
101-
102-
Verify target accessibility before building scrapers:
103-
104-
```typescript
105-
// Check reachability first
106-
await checkUrl("https://api.example.com/data");
107-
108-
// Then write your scraper with confidence
109-
async function scrapeData() {
110-
const response = await fetch("https://api.example.com/data");
111-
// ... scraper logic
112-
}
113-
```
114-
115-
### Anti-Bot Detection
116-
117-
Identify protection systems before deployment:
118-
119-
```typescript
120-
// Detect anti-bot measures
121-
Cloudflare Protection: Not detected
122-
reCAPTCHA: Not detected
123-
⚠️ hCaptcha: Detected (may require captcha solving)
124-
DataDome: Not detected
125-
```
126-
127-
### Rate Limit Discovery
128-
129-
Find rate limits before hitting them:
130-
131-
```typescript
132-
// Discover rate limits
133-
Rate Limit: 100 requests per minute
134-
Remaining: 95 requests
135-
Reset: 45 seconds
136-
```
71+
- **[String-LE](https://marketplace.visualstudio.com/items?itemName=nolindnaidoo.string-le)** - Extract user-visible strings for i18n and validation • [Open VSX](https://open-vsx.org/extension/nolindnaidoo/string-le)
72+
- **[Numbers-LE](https://marketplace.visualstudio.com/items?itemName=nolindnaidoo.numbers-le)** - Extract and analyze numeric data with statistics • [Open VSX](https://open-vsx.org/extension/nolindnaidoo/numbers-le)
73+
- **[EnvSync-LE](https://marketplace.visualstudio.com/items?itemName=nolindnaidoo.envsync-le)** - Keep .env files in sync with visual diffs • [Open VSX](https://open-vsx.org/extension/nolindnaidoo/envsync-le)
74+
- **[Paths-LE](https://marketplace.visualstudio.com/items?itemName=nolindnaidoo.paths-le)** - Extract file paths from imports and dependencies • [Open VSX](https://open-vsx.org/extension/nolindnaidoo/paths-le)
75+
- **[URLs-LE](https://marketplace.visualstudio.com/items?itemName=nolindnaidoo.urls-le)** - Audit API endpoints and external resources • [Open VSX](https://open-vsx.org/extension/nolindnaidoo/urls-le)
76+
- **[Colors-LE](https://marketplace.visualstudio.com/items?itemName=nolindnaidoo.colors-le)** - Extract and analyze colors from stylesheets • [Open VSX](https://open-vsx.org/extension/nolindnaidoo/colors-le)
77+
- **[Dates-LE](https://marketplace.visualstudio.com/items?itemName=nolindnaidoo.dates-le)** - Extract temporal data from logs and APIs • [Open VSX](https://open-vsx.org/extension/nolindnaidoo/dates-le)
13778

138-
### robots.txt Compliance
79+
## 💡 Use Cases
13980

140-
Verify crawling is allowed:
81+
- **Pre-Scraper Validation** - Check if sites are reachable before writing scraper code
82+
- **Anti-Bot Detection** - Identify Cloudflare, reCAPTCHA, hCaptcha before deployment
83+
- **Rate Limit Discovery** - Find rate limits before hitting them in production
84+
- **robots.txt Compliance** - Verify crawling is allowed by site policies
85+
- **Auth Wall Detection** - Check if login or paywalls block access
86+
Disallow: /admin/, /api/internal/
87+
Crawl-delay: 10 seconds
88+
Sitemap: https://example.com/sitemap.xml
14189

142-
```typescript
143-
// Check robots.txt
144-
Disallow: /admin/, /api/internal/
145-
Crawl-delay: 10 seconds
146-
Sitemap: https://example.com/sitemap.xml
147-
```
90+
````
14891
14992
## 🚀 Quick Start
15093
@@ -170,7 +113,7 @@ On first use, Scrape-LE automatically detects if Chromium is installed and promp
170113
171114
```bash
172115
bunx playwright install chromium
173-
```
116+
````
174117

175118
Or run from Command Palette: **"Scrape-LE: Setup Browser"**
176119

@@ -249,152 +192,65 @@ See [`docs/CONFIGURATION.md`](docs/CONFIGURATION.md).
249192

250193
## ⚡ Performance
251194

252-
Scrape-LE is optimized for fast feedback:
195+
Scrape-LE performance varies by target website and network. See [detailed benchmarks](docs/PERFORMANCE.md).
253196

254-
| Operation | Duration | Notes |
255-
| -------------------- | -------- | ----------------------------- |
256-
| **Simple Page** | 1-3s | Basic HTML/CSS pages |
257-
| **JavaScript Heavy** | 3-8s | SPAs, React, Vue, Angular |
258-
| **Large Pages** | 5-15s | Heavy images, complex layouts |
259-
| **Protected Sites** | 10-30s | Sites with anti-bot checks |
197+
| Scenario | Page Size | Duration | Memory | Status |
198+
| ------------------ | ------------- | -------- | --------- | ------ |
199+
| **Simple HTML** | < 100 KB | < 2s | < 20 MB | |
200+
| **Complex** | 500 KB - 1 MB | 3-5s | 30-50 MB | |
201+
| **Heavy JS (SPA)** | 1-3 MB | 5-10s | 50-100 MB | ⚠️ |
202+
| **Image-heavy** | 2-5 MB | 5-15s | 60-120 MB | ⚠️ |
260203

261-
### Performance Notes
204+
**Browser**: Launch 1-2s, screenshot 200-800ms PNG / 150-600ms JPEG
205+
**Detection**: Anti-bot 85-90% accuracy (< 100ms), Rate limits 80-85% (< 50ms)
206+
**Full Metrics**: [docs/PERFORMANCE.md](docs/PERFORMANCE.md) • Network-dependent performance
262207

263-
- **Memory Usage**: ~200MB base + Chromium browser instance
264-
- **Concurrent Checks**: One at a time to avoid resource contention
265-
- **Network Speed**: Directly affects load times
266208
- **Timeout Configuration**: Adjust based on target site complexity
267209
- **Screenshot Impact**: Adds 1-2s to overall check time
268210
- **Detection Suite**: Adds 500ms-2s for all checks combined
269211

270212
## 🧩 System Requirements
271213

272-
- **VS Code**: 1.70.0 or higher
273-
- **Node.js**: Not required (extension uses Bun runtime)
274-
- **Platform**: Windows, macOS, Linux
275-
- **Memory**: 500MB minimum, 1GB recommended
276-
- **Storage**: 150MB (25MB extension + 130MB Chromium)
277-
- **Network**: Required for checking external URLs
214+
**VS Code** 1.70.0+ • **Platform** Windows, macOS, Linux
215+
**Memory** 1GB recommended • **Storage** 150MB (includes Chromium)
278216

279-
## 🔒 Privacy & Telemetry
217+
## 🔒 Privacy
280218

281-
- Runs entirely locally; no data is sent off your machine.
282-
- URLs you check are only sent to the target sites you specify.
283-
- Optional local-only logs can be enabled with `scrape-le.notificationsLevel: "all"`.
284-
- Logs appear in Output panel → "Scrape-LE".
285-
- No analytics, tracking, or external API calls.
286-
287-
See [`docs/PRIVACY.md`](docs/PRIVACY.md).
219+
100% local processing. URLs only sent to sites you specify. No analytics or tracking.
288220

289221
## 🌍 Language Support
290222

291-
**13 languages supported** with full localization:
292-
293-
- 🇺🇸 **English** (en) - Default language
294-
- 🇩🇪 **German** (de) - Deutsche Lokalisierung
295-
- 🇪🇸 **Spanish** (es) - Localización en español
296-
- 🇫🇷 **French** (fr) - Localisation française
297-
- 🇮🇩 **Indonesian** (id) - Lokalisasi bahasa Indonesia
298-
- 🇮🇹 **Italian** (it) - Localizzazione italiana
299-
- 🇯🇵 **Japanese** (ja) - 日本語サポート
300-
- 🇰🇷 **Korean** (ko) - 한국어 지원
301-
- 🇧🇷 **Portuguese (Brazil)** (pt-br) - Localização em português brasileiro
302-
- 🇷🇺 **Russian** (ru) - Русская локализация
303-
- 🇺🇦 **Ukrainian** (uk) - Українська локалізація
304-
- 🇻🇳 **Vietnamese** (vi) - Hỗ trợ tiếng Việt
305-
- 🇨🇳 **Chinese Simplified** (zh-cn) - 简体中文支持
306-
307-
All commands, settings, notifications, and help content automatically adapt to your VS Code language preference.
223+
**13 languages**: English, German, Spanish, French, Indonesian, Italian, Japanese, Korean, Portuguese (Brazil), Russian, Ukrainian, Vietnamese, Chinese (Simplified)
308224

309225
## 🔧 Troubleshooting
310226

311-
### Common Issues
312-
313-
**"Executable doesn't exist" error**
314-
315-
- Run **"Scrape-LE: Setup Browser"** from Command Palette
316-
- Or manually: `bunx playwright install chromium`
317-
- Verify installation: Run **"Scrape-LE: Setup Browser"** → "Test Browser Installation"
318-
319-
**Check times out**
320-
321-
- Increase timeout: `scrape-le.browser.timeout` (default 30s)
322-
- Check your network connection
323-
- Verify the URL is accessible in regular browser
324-
- Some sites may intentionally block automated browsers
325-
326-
**Screenshots not saving**
327-
328-
- Check `scrape-le.screenshot.path` setting
329-
- Ensure directory has write permissions
330-
- Verify `scrape-le.screenshot.enabled: true`
331-
- Check Output panel → "Scrape-LE" for errors
227+
**"Executable doesn't exist" error?**
228+
Run **"Scrape-LE: Setup Browser"** from Command Palette to install Chromium
332229

333-
**Anti-bot detection not working**
230+
**Check times out?**
231+
Increase timeout: `scrape-le.browser.timeout` (default 30s) or check network connection
334232

335-
- Enable detection: `scrape-le.detections.antiBot: true`
336-
- Some protection systems use server-side detection only
337-
- Detection uses heuristics and may not catch all systems
338-
- Check Output panel for detailed detection results
339-
340-
**Console errors not captured**
341-
342-
- Enable capture: `scrape-le.checkConsoleErrors: true`
343-
- Some errors may occur before capture starts
344-
- Check browser console in dev tools for comparison
345-
346-
**Extension crashes or freezes**
347-
348-
- Check Chromium is properly installed
349-
- Try reinstalling: Remove `~/Library/Caches/ms-playwright` and run setup again
350-
- Disable other browser automation extensions
351-
- Check Output panel → "Scrape-LE" for error messages
352-
- Reload VS Code window
353-
354-
### Getting Help
355-
356-
- Check the [Issues](https://github.com/nolindnaidoo/scrape-le/issues) page for known problems
357-
- Enable verbose logging: `scrape-le.notificationsLevel: "all"`
358-
- Review logs in Output panel → "Scrape-LE"
359-
- See [`docs/TROUBLESHOOTING.md`](docs/TROUBLESHOOTING.md) for detailed guidance
233+
**Need help?**
234+
Check [Issues](https://github.com/nolindnaidoo/scrape-le/issues) or enable verbose logging: `scrape-le.notificationsLevel: "all"`
360235

361236
## ❓ FAQ
362237

363-
**Q: Do I need to install Chromium separately?**
364-
A: No, Scrape-LE handles installation automatically on first use. Just click "Install Chromium" when prompted.
365-
366-
**Q: Can I check localhost URLs?**
367-
A: Yes, Scrape-LE works with localhost, local network IPs, and any accessible URL.
368-
369-
**Q: Does this work with SPAs (React, Vue, Angular)?**
370-
A: Yes, Scrape-LE uses a real browser so JavaScript frameworks render properly.
371-
372-
**Q: Can I check multiple URLs at once?**
373-
A: Currently one URL at a time to ensure reliability. Batch checking is planned for future releases.
374-
375-
**Q: Will this trigger bot detection on target sites?**
376-
A: Scrape-LE uses headless Chromium which some sites detect. Use responsibly and check robots.txt first.
377-
378-
**Q: Can I use this in CI/CD pipelines?**
379-
A: Scrape-LE is designed for interactive use in VS Code. For CI/CD, consider dedicated scrapeability testing tools.
238+
**Need to install Chromium?**
239+
No, Scrape-LE handles it automatically on first use (~130MB download)
380240

381-
**Q: How accurate is anti-bot detection?**
382-
A: Detection uses common patterns and heuristics. It catches most major systems but may not detect all custom solutions.
241+
**Works with localhost?**
242+
Yes, supports localhost, local IPs, and any accessible URL
383243

384-
**Q: Does this work with VPNs/proxies?**
385-
A: Yes, Scrape-LE respects your system's network configuration including VPNs and proxies.
244+
**Works with React/Vue/Angular?**
245+
Yes, uses real browser so SPAs render properly
386246

387-
## 📊 Test Coverage
247+
**Will sites detect this?**
248+
Uses headless Chromium which some sites detect. Use responsibly and check robots.txt
388249

389-
- 121 passing tests (1 skipped) with comprehensive edge case coverage
390-
- 82.17% overall coverage with istanbul provider
391-
- Unit tests for all detector modules
392-
- Integration tests for command workflows
393-
- Mock Playwright browser for reliable testing
394-
- All coverage numbers verified through actual test runs
395-
- Error handling and timeout scenarios covered
250+
## 📊 Testing
396251

397-
See [`docs/TESTING.md`](docs/TESTING.md).
252+
**93 unit tests** (4 skipped) • **89% function coverage, 91% line coverage**
253+
Powered by Vitest • Run with `bun test --coverage`
398254

399255
---
400256
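The README's Use Cases list mentions robots.txt compliance, and the removed example showed `Disallow`, `Crawl-delay`, and `Sitemap` values. As a rough illustration of how those fields could be pulled out of a raw robots.txt file, here is a minimal sketch. `summarizeRobotsTxt` and `RobotsSummary` are hypothetical names, not Scrape-LE APIs. The sketch deliberately ignores `User-agent` grouping and treats `Crawl-delay` as a plain number, which is a common but non-standard directive.

```typescript
// Hypothetical helper for illustration; not part of the Scrape-LE codebase.
interface RobotsSummary {
  disallow: string[];
  crawlDelaySeconds: number | null;
  sitemaps: string[];
}

// Flatten a robots.txt file into the three fields of interest.
// Simplification: rules from every User-agent group are merged together.
function summarizeRobotsTxt(text: string): RobotsSummary {
  const summary: RobotsSummary = { disallow: [], crawlDelaySeconds: null, sitemaps: [] };
  for (const rawLine of text.split(/\r?\n/)) {
    // Drop trailing comments, then split on the first colon only,
    // so values such as "https://..." keep their own colons.
    const line = rawLine.replace(/#.*$/, "").trim();
    const separator = line.indexOf(":");
    if (separator === -1) continue;
    const field = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();
    if (value === "") continue;
    if (field === "disallow") {
      summary.disallow.push(value);
    } else if (field === "crawl-delay") {
      const seconds = Number(value);
      if (!isNaN(seconds)) summary.crawlDelaySeconds = seconds;
    } else if (field === "sitemap") {
      summary.sitemaps.push(value);
    }
  }
  return summary;
}
```

A real compliance check would also need to match rules against the crawler's own user agent and the requested path; this sketch only surfaces the raw directives for a quick look.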
