A simple Node.js web scraper that collects article titles from The Verge website.
- Install dependencies: npm install express axios cheerio
- Run the app: node app.js
- Open browser: http://localhost:3000

- Scrapes articles from The Verge homepage and RSS feeds
- Web interface with year filtering (2022-2025)
- JSON API at /api/articles
- 30-minute caching to avoid overloading the server
- Auto-refresh every 30 minutes
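
The 30-minute caching listed above could look roughly like the sketch below. This is not the code from app.js: `createCache`, `CACHE_DURATION_MS`, the injectable `now` clock, and the sample article URL are all assumed names and placeholders, chosen only to illustrate a time-based cache.

```javascript
// Minimal time-based (TTL) cache. All names are illustrative, not taken from app.js.
const CACHE_DURATION_MS = 30 * 60 * 1000; // 30 minutes

function createCache(ttlMs = CACHE_DURATION_MS, now = () => Date.now()) {
  let value = null;
  let storedAt = null;

  return {
    // Returns the cached value, or null if nothing is stored or the entry expired.
    get() {
      if (storedAt === null || now() - storedAt >= ttlMs) return null;
      return value;
    },
    // Stores a value and records when it was stored.
    set(newValue) {
      value = newValue;
      storedAt = now();
    },
  };
}

// Usage with a placeholder article
const cache = createCache();
cache.set([{ title: 'Example', url: 'https://www.theverge.com/example' }]);
console.log(cache.get().length); // 1
```

Passing the clock in as a parameter is a design choice that makes expiry testable without waiting 30 minutes; a real server would most likely just rely on `Date.now()`.
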
- GET / - Main web interface
- GET /api/articles - JSON API with all articles
- GET /refresh - Force refresh articles
- GET /debug - Debug information
- GET /health - Server health check
- Set PORT environment variable (default: 3000)
- Cache duration: 30 minutes (configurable in code)
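
As a rough illustration of these two settings, the sketch below reads PORT from the environment with a default of 3000 and defines the cache duration as a constant. The helper name `resolvePort` and the constant name `CACHE_DURATION_MS` are assumptions; the actual names in app.js may differ.

```javascript
// Hypothetical helper: resolve the listening port from an environment object.
function resolvePort(env = process.env) {
  const parsed = Number.parseInt(env.PORT, 10);
  // Fall back to 3000 when PORT is unset or not a valid port number.
  const isValid = Number.isInteger(parsed) && parsed > 0 && parsed < 65536;
  return isValid ? parsed : 3000;
}

// Assumed constant name; change the value here to adjust the cache duration.
const CACHE_DURATION_MS = 30 * 60 * 1000; // 30 minutes

console.log(resolvePort({ PORT: '8080' })); // 8080
console.log(resolvePort({})); // 3000
```

In a POSIX shell, the port can then be overridden at launch with `PORT=8080 node app.js`.
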
- Primary: Scrapes The Verge homepage HTML
- Fallback: Uses RSS feeds if main scraping fails
- Filters: Only shows articles from 2022 onwards
- Deduplicates: Removes duplicate articles by URL
- Caches: Stores results for 30 minutes
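
The steps above can be sketched as a small pipeline. This is a simplified illustration under stated assumptions, not the code in app.js: the article shape `{ title, url, date }`, the function names, and the rule that an empty or failed homepage scrape triggers the RSS fallback are all assumptions. The two fetchers are passed in as parameters so the sketch stays independent of axios and cheerio.

```javascript
// Assumed article shape: { title, url, date }, where date is an ISO 8601 string.
const MIN_YEAR = 2022;

// Keep only articles dated in minYear or later; unparseable dates are dropped.
function filterFromYear(articles, minYear = MIN_YEAR) {
  return articles.filter((article) => {
    const year = new Date(article.date).getUTCFullYear();
    return !Number.isNaN(year) && year >= minYear;
  });
}

// Keep the first article seen for each URL.
function dedupeByUrl(articles) {
  const seen = new Set();
  return articles.filter((article) => {
    if (seen.has(article.url)) return false;
    seen.add(article.url);
    return true;
  });
}

// scrapeHomepage and fetchRss are hypothetical async functions
// that each resolve to an array of articles.
async function collectArticles(scrapeHomepage, fetchRss) {
  let articles = [];
  try {
    articles = await scrapeHomepage();
  } catch (err) {
    articles = []; // treat a thrown error as a failed scrape
  }
  if (articles.length === 0) {
    articles = await fetchRss(); // fallback source
  }
  return dedupeByUrl(filterFromYear(articles));
}
```

Filtering before deduplicating keeps the set of seen URLs small; reversing the order would give the same result whenever every copy of a URL carries the same date.
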
- No articles? Check /debug endpoint
- Scraping blocked? RSS feeds provide fallback
- Performance issues? Caching reduces server load
- express - Web server
- axios - HTTP requests
- cheerio - HTML parsing
For educational/personal use only. The scraper respects The Verge's content and includes appropriate delays to avoid overwhelming their servers.