Daily Article Scraper

A Python application that scrapes articles from various tech news sources and stores them in MongoDB. Designed to run as a scheduled job via GitHub Actions.

Features

  • Multi-source scraping: Extracts articles from InShorts, Medium, and 30+ RSS feeds including BBC, CNN, TechCrunch, and more
  • Quality validation: Filters out non-articles (user profiles, sign-in pages) so that only valid articles with proper metadata are stored
  • Global coverage: Articles from diverse sources across technology, business, science, and international news
  • MongoDB integration: Stores articles with deduplication and indexing
  • Automatic cleanup: Removes articles older than 2 months (configurable)
  • GitHub Actions workflow: Automated daily scraping via cron jobs
  • Professional structure: Modular codebase with proper configuration management
  • Comprehensive logging: Detailed logging with file and console output
  • Error handling: Robust error handling and retry mechanisms
  • Rate limiting: Respectful scraping with configurable delays
  • Database monitoring: Statistics and health checking tools
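
The rate limiting and retry behaviour above is driven by the RATE_LIMIT_DELAY and MAX_RETRIES settings (see Configuration below). As a rough sketch, a polite fetch loop can look like the following; the helper name and the use of the requests library are assumptions, not the project's actual implementation:

import time
import requests

def fetch_with_retries(url, max_retries=3, rate_limit_delay=2):
    """Fetch a URL with retries and a growing back-off delay (hypothetical helper)."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            if attempt == max_retries:
                raise
            # Back off before retrying to stay respectful to servers.
            time.sleep(rate_limit_delay * attempt)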

Project Structure

daily-article-scrapper/
├── .github/
│   └── workflows/
│       └── daily-scraper.yml      # GitHub Actions workflow
├── config/
│   ├── __init__.py
│   └── settings.py                # Configuration settings
├── src/
│   ├── __init__.py
│   ├── database.py               # MongoDB operations
│   └── scraper.py                # Core scraping logic
├── tests/                        # Test files
├── scripts/                      # Utility scripts
│   ├── setup.sh                  # Environment setup
│   ├── manage.sh                 # Project management
│   ├── status_check.py           # Health monitoring
│   └── cleanup_articles.py       # Database cleanup
├── logs/                         # Log files (created at runtime)
├── .env.example                  # Environment variables template
├── .gitignore                    # Git ignore file
├── main.py                       # Main application entry point
├── requirements.txt              # Production dependencies
├── requirements-dev.txt          # Development dependencies
├── CLEANUP_GUIDE.md              # Database cleanup documentation
└── README.md                     # This file

Installation

1. Clone the repository

git clone <your-repo-url>
cd daily-article-scrapper

2. Create a virtual environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install dependencies

pip install -r requirements.txt

4. Set up environment variables

cp .env.example .env
# Edit .env with your MongoDB configuration

5. Set up MongoDB

Make sure you have access to a MongoDB instance. You can use:

  • Local MongoDB installation
  • MongoDB Atlas (cloud)
  • Docker container
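
To confirm your instance is reachable before running the scraper, a quick ping with pymongo works; the URI below is the local default from .env.example, so substitute your own:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/", serverSelectionTimeoutMS=5000)
client.admin.command("ping")  # Raises an exception if the server is unreachable.
print("MongoDB is reachable")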

Configuration

Environment Variables

Create a .env file based on .env.example:

# MongoDB Configuration
MONGODB_URI=mongodb://localhost:27017/
MONGODB_DATABASE=article_scraper
MONGODB_COLLECTION=articles

# Scraping Configuration
TARGET_ARTICLE_COUNT=20
RATE_LIMIT_DELAY=2
MAX_RETRIES=3

# Logging Configuration
LOG_LEVEL=INFO
LOG_FILE=logs/scraper.log

# Cleanup Configuration
AUTO_CLEANUP_ENABLED=true
CLEANUP_MONTHS_OLD=2
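
config/settings.py reads these variables at startup. A simplified sketch of that pattern follows; the defaults mirror the values above, and the use of python-dotenv is an assumption about the implementation:

import os
from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # Pick up .env from the project root, if present.

MONGODB_URI = os.getenv("MONGODB_URI", "mongodb://localhost:27017/")
MONGODB_DATABASE = os.getenv("MONGODB_DATABASE", "article_scraper")
TARGET_ARTICLE_COUNT = int(os.getenv("TARGET_ARTICLE_COUNT", "20"))
RATE_LIMIT_DELAY = float(os.getenv("RATE_LIMIT_DELAY", "2"))
AUTO_CLEANUP_ENABLED = os.getenv("AUTO_CLEANUP_ENABLED", "true").lower() == "true"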

MongoDB Setup

The application will automatically:

  • Create the database and collection if they don't exist
  • Set up indexes for optimal performance
  • Handle duplicate articles based on URL
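
URL-based deduplication is typically a unique index plus upserts. A minimal pymongo sketch, using the database and collection names from the configuration above (the actual module's internals may differ):

from pymongo import ASCENDING, MongoClient

collection = MongoClient("mongodb://localhost:27017/")["article_scraper"]["articles"]

# A unique index on the URL field prevents duplicate inserts outright.
collection.create_index([("url", ASCENDING)], unique=True)

# Upserting by URL keeps exactly one document per article across runs.
article = {"url": "https://example.com/article", "title": "Article Title"}
collection.update_one({"url": article["url"]}, {"$set": article}, upsert=True)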

Usage

Local Development

# Run the scraper once
python main.py

# Run with custom article count
TARGET_ARTICLE_COUNT=50 python main.py

# Check database statistics
python scripts/cleanup_articles.py --stats

# Manual cleanup (dry run)
python scripts/cleanup_articles.py --dry-run

# Manual cleanup (execute)
python scripts/cleanup_articles.py

GitHub Actions Setup

  1. Set up repository secrets:

    • Go to your repository Settings → Secrets and variables → Actions
    • Add the following secrets:
      • MONGODB_URI: Your MongoDB connection string
      • MONGODB_DATABASE: Database name
      • MONGODB_COLLECTION: Collection name
  2. Configure the schedule:

    • Edit .github/workflows/daily-scraper.yml
    • Modify the cron expression to your preferred time
  3. Manual trigger:

    • Go to Actions tab in your repository
    • Select "Daily Article Scraper"
    • Click "Run workflow"

Development

Setting up development environment

# Install development dependencies
pip install -r requirements-dev.txt

# Format code
black src/ config/ main.py

# Lint code
flake8 src/ config/ main.py

# Run tests
pytest tests/ -v

# Check database statistics
bash scripts/manage.sh stats

# Manual cleanup
bash scripts/manage.sh cleanup

Adding new sources

  1. RSS feeds: Add to config/settings.py in the RSS_FEEDS dictionary (see the sketch below)
  2. Medium publications: Add to MEDIUM_PUBLICATIONS list in config/settings.py
  3. Custom scrapers: Add methods to src/scraper.py
  4. Configuration: Update environment variables as needed
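
For example, new source entries might look like this; a sketch only, since the exact shapes of RSS_FEEDS and MEDIUM_PUBLICATIONS in config/settings.py may differ:

# config/settings.py (illustrative entries)
RSS_FEEDS = {
    "techcrunch": "https://techcrunch.com/feed/",
    "bbc_tech": "http://feeds.bbci.co.uk/news/technology/rss.xml",
}

MEDIUM_PUBLICATIONS = [
    "towards-data-science",
    "better-programming",
]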

See ARTICLE_QUALITY_IMPROVEMENTS.md for details on quality validation and source selection.

Database Cleanup

The application includes an automated cleanup system that removes articles older than 2 months by default. This keeps the database from growing indefinitely and keeps query performance steady.

Cleanup Features

  • Automatic cleanup: Runs before each scraping session
  • Configurable retention: Adjust with CLEANUP_MONTHS_OLD environment variable
  • Manual control: Can be disabled with AUTO_CLEANUP_ENABLED=false
  • Safe operations: Dry-run mode available for testing

Cleanup Commands

# View database statistics
python scripts/cleanup_articles.py --stats
bash scripts/manage.sh stats

# Preview cleanup (dry run)
python scripts/cleanup_articles.py --dry-run

# Manual cleanup
python scripts/cleanup_articles.py
bash scripts/manage.sh cleanup

# Custom retention period
python scripts/cleanup_articles.py --months 3

Cleanup Configuration

Set cleanup behavior in your .env file:

AUTO_CLEANUP_ENABLED=true    # Enable/disable automatic cleanup
CLEANUP_MONTHS_OLD=2         # Keep articles for 2 months
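
Under the hood, cleanup amounts to deleting documents older than a cutoff date. A minimal sketch, assuming scraped_at is stored as a BSON datetime (adjust the filter if your documents store it as a string):

from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

CLEANUP_MONTHS_OLD = 2
cutoff = datetime.now(timezone.utc) - timedelta(days=30 * CLEANUP_MONTHS_OLD)  # approx. months

collection = MongoClient("mongodb://localhost:27017/")["article_scraper"]["articles"]
result = collection.delete_many({"scraped_at": {"$lt": cutoff}})
print(f"Deleted {result.deleted_count} old articles")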

For detailed cleanup documentation, see CLEANUP_GUIDE.md.

API Documentation

ArticleScraper Class

Main scraping functionality:

from src.scraper import ArticleScraper

scraper = ArticleScraper()
articles = scraper.scrape_daily_articles(target_count=20)

DatabaseManager Class

MongoDB operations:

from src.database import DatabaseManager

with DatabaseManager() as db:
    db.save_articles(articles)
    recent = db.get_recent_articles(days=7)

Data Structure

Articles are stored with the following structure:

{
  "_id": "https://example.com/article_20250706",
  "title": "Article Title",
  "url": "https://example.com/article",
  "published": "2025-07-06T10:30:00Z",
  "summary": "Article summary text",
  "source": "techcrunch.com",
  "tags": ["technology", "ai"],
  "scraped_at": "2025-07-06T13:22:46.123Z"
}
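
For RSS sources, a document in this shape can be assembled from a feedparser entry. A rough sketch; the field mapping is an assumption, not the scraper's exact code:

from datetime import datetime, timezone
from urllib.parse import urlparse

import feedparser

feed = feedparser.parse("https://techcrunch.com/feed/")
entry = feed.entries[0]

article = {
    "title": entry.get("title", ""),
    "url": entry.get("link", ""),
    "published": entry.get("published", ""),
    "summary": entry.get("summary", ""),
    "source": urlparse(entry.get("link", "")).netloc,
    "scraped_at": datetime.now(timezone.utc).isoformat(),
}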

Monitoring and Logs

  • Local logs: Check logs/scraper.log
  • Database stats: Run python scripts/cleanup_articles.py --stats
  • Management tools: Use bash scripts/manage.sh [command]
  • GitHub Actions: View logs in the Actions tab
  • MongoDB: Query the database for article statistics
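
The file-plus-console logging controlled by LOG_LEVEL and LOG_FILE can be wired up with the standard library alone; a sketch of one way to do it (the project's actual setup may differ):

import logging
import os

level = os.getenv("LOG_LEVEL", "INFO")
log_file = os.getenv("LOG_FILE", "logs/scraper.log")
log_dir = os.path.dirname(log_file)
if log_dir:
    os.makedirs(log_dir, exist_ok=True)

logging.basicConfig(
    level=getattr(logging, level.upper(), logging.INFO),
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    handlers=[logging.FileHandler(log_file), logging.StreamHandler()],
)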

Troubleshooting

Common Issues

  1. MongoDB connection failed:

    • Check your MONGODB_URI configuration
    • Ensure MongoDB is running and accessible
    • Verify network connectivity
  2. No articles found:

    • Check internet connectivity
    • Some RSS feeds might be temporarily unavailable
    • Increase MAX_RETRIES in configuration
  3. Rate limiting:

    • Increase RATE_LIMIT_DELAY to be more respectful to servers
    • Some sites might block requests; consider using proxies
  4. Database growing too large:

    • Check if cleanup is enabled: AUTO_CLEANUP_ENABLED=true
    • Adjust retention period: CLEANUP_MONTHS_OLD=2
    • Run manual cleanup: bash scripts/manage.sh cleanup
  5. Old articles not being deleted:

    • Verify cleanup configuration in .env
    • Check logs for cleanup errors
    • Run cleanup manually to test

GitHub Actions Issues

  1. Secrets not configured:

    • Ensure all required secrets are set in repository settings
  2. Workflow not running:

    • Check the cron expression syntax
    • Ensure the repository is not dormant (GitHub disables scheduled workflows after 60 days of repository inactivity)

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

License

This project is licensed under the MIT License. See the LICENSE file for details.

Support

For issues and questions:

  1. Check the troubleshooting section
  2. Search existing GitHub issues
  3. Create a new issue with detailed information

Acknowledgments

  • Built with Python 3.11+
  • Uses feedparser for RSS parsing
  • Beautiful Soup for web scraping
  • MongoDB for data storage
  • GitHub Actions for automation

Documentation

  • ARTICLE_QUALITY_IMPROVEMENTS.md - Details on article validation and quality filtering
  • IMAGE_EXTRACTION_IMPROVEMENTS.md - Image URL extraction enhancements
  • INSHORTS_INTEGRATION.md - InShorts API integration documentation
  • CLEANUP_GUIDE.md - Database cleanup guide
  • PROJECT_OVERVIEW.md - Overall project architecture
