A Python application that scrapes articles from various tech news sources and stores them in MongoDB. Designed to run as a scheduled job via GitHub Actions.
- Multi-source scraping: Extracts articles from InShorts, Medium, and 30+ RSS feeds including BBC, CNN, TechCrunch, and more
- Quality validation: Filters out non-articles (user profiles, sign-in pages) so that only valid articles with proper metadata are stored (see the sketch after this list)
- Global coverage: Articles from diverse sources across technology, business, science, and international news
- MongoDB integration: Stores articles with deduplication and indexing
- Automatic cleanup: Removes articles older than 2 months (configurable)
- GitHub Actions workflow: Automated daily scraping via cron jobs
- Professional structure: Modular codebase with proper configuration management
- Comprehensive logging: Detailed logging with file and console output
- Error handling: Robust error handling and retry mechanisms
- Rate limiting: Respectful scraping with configurable delays
- Database monitoring: Statistics and health checking tools
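As a rough illustration of the quality-validation idea, a filter along these lines could reject obvious non-article pages. The patterns and helper name below are assumptions for illustration, not the project's actual implementation:

```python
# Hypothetical validation helper illustrating the kind of checks described above.
NON_ARTICLE_PATTERNS = ("/signin", "/login", "/account", "/users/")


def is_valid_article(article: dict) -> bool:
    """Return True only for entries that look like real articles with metadata."""
    url = article.get("url", "")
    title = article.get("title", "")
    summary = article.get("summary", "")

    if any(pattern in url for pattern in NON_ARTICLE_PATTERNS):
        return False  # user profiles, sign-in pages, and other non-article URLs
    if len(title) < 10 or not summary:
        return False  # missing or implausibly short metadata
    return True
```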
```
daily-article-scrapper/
├── .github/
│   └── workflows/
│       └── daily-scraper.yml    # GitHub Actions workflow
├── config/
│   ├── __init__.py
│   └── settings.py              # Configuration settings
├── src/
│   ├── __init__.py
│   ├── database.py              # MongoDB operations
│   └── scraper.py               # Core scraping logic
├── tests/                       # Test files
├── scripts/                     # Utility scripts
│   ├── setup.sh                 # Environment setup
│   ├── manage.sh                # Project management
│   ├── status_check.py          # Health monitoring
│   └── cleanup_articles.py      # Database cleanup
├── logs/                        # Log files (created at runtime)
├── .env.example                 # Environment variables template
├── .gitignore                   # Git ignore file
├── main.py                      # Main application entry point
├── requirements.txt             # Production dependencies
├── requirements-dev.txt         # Development dependencies
├── CLEANUP_GUIDE.md             # Database cleanup documentation
└── README.md                    # This file
```
```bash
git clone <your-repo-url>
cd daily-article-scrapper

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

pip install -r requirements.txt

cp .env.example .env
# Edit .env with your MongoDB configuration
```

Make sure you have access to a MongoDB instance (a quick connectivity check follows the list below). You can use:
- Local MongoDB installation
- MongoDB Atlas (cloud)
- Docker container
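Whichever option you use, a quick connectivity check can save debugging time later. A minimal sketch, assuming `pymongo` is installed and `MONGODB_URI` points at your instance:

```python
import os

from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

# Read the connection string from the environment (falls back to a local instance)
uri = os.getenv("MONGODB_URI", "mongodb://localhost:27017/")

try:
    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    client.admin.command("ping")  # Raises if the server is unreachable
    print("MongoDB connection OK")
except ConnectionFailure as exc:
    print(f"MongoDB connection failed: {exc}")
```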
Create a .env file based on .env.example:
```
# MongoDB Configuration
MONGODB_URI=mongodb://localhost:27017/
MONGODB_DATABASE=article_scraper
MONGODB_COLLECTION=articles

# Scraping Configuration
TARGET_ARTICLE_COUNT=20
RATE_LIMIT_DELAY=2
MAX_RETRIES=3

# Logging Configuration
LOG_LEVEL=INFO
LOG_FILE=logs/scraper.log

# Cleanup Configuration
AUTO_CLEANUP_ENABLED=true
CLEANUP_MONTHS_OLD=2
```

The application will automatically:
- Create the database and collection if they don't exist
- Set up indexes for optimal performance
- Handle duplicate articles based on URL (see the sketch below)
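A minimal sketch, assuming `pymongo`, of how the index setup and URL-based deduplication could work. The actual behavior is implemented in `src/database.py`; the exact keys and index choices here are illustrative:

```python
from datetime import datetime, timezone

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["article_scraper"]["articles"]

# Indexes for common queries; the unique URL index prevents duplicates
collection.create_index([("url", ASCENDING)], unique=True)
collection.create_index([("scraped_at", ASCENDING)])

article = {
    "title": "Article Title",
    "url": "https://example.com/article",
    "source": "techcrunch.com",
    "scraped_at": datetime.now(timezone.utc),
}

# Upsert keyed on URL: re-scraping the same article updates it instead of duplicating it
collection.update_one({"url": article["url"]}, {"$set": article}, upsert=True)
```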
```bash
# Run the scraper once
python main.py

# Run with a custom article count
TARGET_ARTICLE_COUNT=50 python main.py

# Check database statistics
python scripts/cleanup_articles.py --stats

# Manual cleanup (dry run)
python scripts/cleanup_articles.py --dry-run

# Manual cleanup (execute)
python scripts/cleanup_articles.py
```
1. Set up repository secrets:
   - Go to your repository Settings → Secrets and variables → Actions
   - Add the following secrets:
     - `MONGODB_URI`: Your MongoDB connection string
     - `MONGODB_DATABASE`: Database name
     - `MONGODB_COLLECTION`: Collection name
2. Configure the schedule:
   - Edit `.github/workflows/daily-scraper.yml`
   - Modify the cron expression to your preferred time
3. Manual trigger:
   - Go to the Actions tab in your repository
   - Select "Daily Article Scraper"
   - Click "Run workflow"
```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Format code
black src/ config/ main.py

# Lint code
flake8 src/ config/ main.py

# Run tests
pytest tests/ -v

# Check database statistics
bash scripts/manage.sh stats

# Manual cleanup
bash scripts/manage.sh cleanup
```

To add new article sources:

- RSS feeds: Add to the `RSS_FEEDS` dictionary in `config/settings.py` (see the sketch after this list)
- Medium publications: Add to the `MEDIUM_PUBLICATIONS` list in `config/settings.py`
- Custom scrapers: Add methods to `src/scraper.py`
- Configuration: Update environment variables as needed
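For example, adding a feed could look like the following. The exact shapes of `RSS_FEEDS` and `MEDIUM_PUBLICATIONS` in `config/settings.py` are assumptions (a name-to-URL mapping and a list of publication slugs), so adapt this to the structures actually used there:

```python
# config/settings.py (hypothetical excerpt)

# Map of source name -> RSS feed URL; assumed structure for illustration
RSS_FEEDS = {
    "techcrunch": "https://techcrunch.com/feed/",
    "bbc-technology": "http://feeds.bbci.co.uk/news/technology/rss.xml",
    # Add a new source here:
    "ars-technica": "https://feeds.arstechnica.com/arstechnica/index",
}

# Medium publications to scrape; assumed to be a list of publication slugs
MEDIUM_PUBLICATIONS = [
    "better-programming",
    "towards-data-science",
]
```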
See ARTICLE_QUALITY_IMPROVEMENTS.md for details on quality validation and source selection.
The application includes an automated cleanup system that removes articles older than 2 months by default. This prevents the database from growing indefinitely and ensures optimal performance.
- Automatic cleanup: Runs before each scraping session
- Configurable retention: Adjust with the `CLEANUP_MONTHS_OLD` environment variable
- Manual control: Can be disabled with `AUTO_CLEANUP_ENABLED=false`
- Safe operations: Dry-run mode available for testing (see the sketch after this list)
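A minimal sketch of how the age-based cleanup might be implemented. The real logic lives in `scripts/cleanup_articles.py` and `src/database.py`; the `scraped_at` field name is taken from the article schema shown later in this README, and the 30-day month approximation is an assumption:

```python
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

MONTHS_OLD = 2   # mirrors CLEANUP_MONTHS_OLD
DRY_RUN = True   # preview only; set to False to actually delete

client = MongoClient("mongodb://localhost:27017/")
collection = client["article_scraper"]["articles"]

# Approximate months as 30-day blocks for the cutoff calculation
cutoff = datetime.now(timezone.utc) - timedelta(days=30 * MONTHS_OLD)
query = {"scraped_at": {"$lt": cutoff}}

if DRY_RUN:
    print(f"Would delete {collection.count_documents(query)} articles older than {cutoff:%Y-%m-%d}")
else:
    result = collection.delete_many(query)
    print(f"Deleted {result.deleted_count} articles older than {cutoff:%Y-%m-%d}")
```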
```bash
# View database statistics
python scripts/cleanup_articles.py --stats
bash scripts/manage.sh stats

# Preview cleanup (dry run)
python scripts/cleanup_articles.py --dry-run

# Manual cleanup
python scripts/cleanup_articles.py
bash scripts/manage.sh cleanup

# Custom retention period
python scripts/cleanup_articles.py --months 3
```

Set cleanup behavior in your `.env` file:

```
AUTO_CLEANUP_ENABLED=true   # Enable/disable automatic cleanup
CLEANUP_MONTHS_OLD=2        # Keep articles for 2 months
```

For detailed cleanup documentation, see CLEANUP_GUIDE.md.
Main scraping functionality:

```python
from src.scraper import ArticleScraper

scraper = ArticleScraper()
articles = scraper.scrape_daily_articles(target_count=20)
```

MongoDB operations:

```python
from src.database import DatabaseManager

with DatabaseManager() as db:
    db.save_articles(articles)
    recent = db.get_recent_articles(days=7)
```

Articles are stored with the following structure:
```json
{
  "_id": "https://example.com/article_20250706",
  "title": "Article Title",
  "url": "https://example.com/article",
  "published": "2025-07-06T10:30:00Z",
  "summary": "Article summary text",
  "source": "techcrunch.com",
  "tags": ["technology", "ai"],
  "scraped_at": "2025-07-06T13:22:46.123Z"
}
```

- Local logs: Check `logs/scraper.log`
- Database stats: Run `python scripts/cleanup_articles.py --stats`
- Management tools: Use `bash scripts/manage.sh [command]`
- GitHub Actions: View logs in the Actions tab
- MongoDB: Query the database for article statistics (see the example after this list)
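As a hedged example of querying statistics directly, the following counts stored articles per source with an aggregation pipeline; it assumes the default database and collection names from `.env.example`:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["article_scraper"]["articles"]

# Count articles per source, most prolific sources first
pipeline = [
    {"$group": {"_id": "$source", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in collection.aggregate(pipeline):
    print(f"{row['_id']}: {row['count']} articles")

print("Total articles:", collection.count_documents({}))
```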
- MongoDB connection failed:
  - Check your `MONGODB_URI` configuration
  - Ensure MongoDB is running and accessible
  - Verify network connectivity
- No articles found:
  - Check internet connectivity
  - Some RSS feeds might be temporarily unavailable
  - Increase `MAX_RETRIES` in the configuration
- Rate limiting:
  - Increase `RATE_LIMIT_DELAY` to be more respectful to servers (see the sketch after this list)
  - Some sites might block requests; consider using proxies
- Database growing too large:
  - Check that cleanup is enabled: `AUTO_CLEANUP_ENABLED=true`
  - Adjust the retention period: `CLEANUP_MONTHS_OLD=2`
  - Run manual cleanup: `bash scripts/manage.sh cleanup`
- Old articles not being deleted:
  - Verify the cleanup configuration in `.env`
  - Check logs for cleanup errors
  - Run cleanup manually to test
- Secrets not configured:
  - Ensure all required secrets are set in repository settings
- Workflow not running:
  - Check the cron expression syntax
  - Ensure the repository is not dormant (GitHub disables scheduled workflows on inactive repos)
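The retry and rate-limit behavior referenced above could look roughly like this. It is a minimal sketch that assumes `requests` is used for HTTP fetches, with `RATE_LIMIT_DELAY` and `MAX_RETRIES` read from the environment; the actual implementation lives in `src/scraper.py`:

```python
import os
import time

import requests

RATE_LIMIT_DELAY = float(os.getenv("RATE_LIMIT_DELAY", "2"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))


def fetch_with_retries(url: str) -> str | None:
    """Fetch a URL, retrying on failure and pausing between attempts."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            print(f"Attempt {attempt}/{MAX_RETRIES} failed for {url}: {exc}")
        # Respectful delay before the next request or retry
        time.sleep(RATE_LIMIT_DELAY)
    return None
```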
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is licensed under the MIT License. See the LICENSE file for details.
For issues and questions:
- Check the troubleshooting section
- Search existing GitHub issues
- Create a new issue with detailed information
- Built with Python 3.11+
- Uses feedparser for RSS parsing
- Beautiful Soup for web scraping
- MongoDB for data storage
- GitHub Actions for automation
- `ARTICLE_QUALITY_IMPROVEMENTS.md` - Details on article validation and quality filtering
- `IMAGE_EXTRACTION_IMPROVEMENTS.md` - Image URL extraction enhancements
- `INSHORTS_INTEGRATION.md` - InShorts API integration documentation
- `CLEANUP_GUIDE.md` - Database cleanup guide
- `PROJECT_OVERVIEW.md` - Overall project architecture