A comprehensive, production-ready Reddit scraping platform with a modern React.js dashboard, advanced analytics, and enterprise-grade features.
**Modern Dashboard**

- Beautiful dark theme with Material-UI components
- Real-time WebSocket updates for live monitoring
- Interactive charts with Chart.js and Recharts
- Responsive design that works on all devices
- Modern color palette with gradients and animations
**Robust Storage**

- SQLite database for persistent storage
- Data versioning and backup capabilities
- Advanced querying with filtering and pagination
- Performance optimization with indexing
- Automatic cleanup of old data
**Advanced Analytics**

- Sentiment analysis with VADER and TextBlob
- Trend prediction with machine learning
- Viral potential scoring algorithm
- Content categorization and insights
- Subreddit growth analysis
**High Performance**

- Parallel processing with up to 10x speed improvement
- Memory optimization for large datasets
- Intelligent caching system
- Performance monitoring with detailed metrics
- Resource usage tracking
**Production Deployment**

- Docker containerization with multi-stage builds
- Docker Compose for easy orchestration
- Nginx reverse proxy with load balancing
- Health checks and monitoring
- SSL/TLS support ready
## Table of Contents

- [Features](#features)
- [Quick Start](#quick-start)
- [Installation](#installation)
- [Configuration](#configuration)
- [Usage](#usage)
- [API Documentation](#api-documentation)
- [Development](#development)
- [Deployment](#deployment)
- [Contributing](#contributing)
- [License](#license)
## Features

### Core Scraping

- ✅ Multi-subreddit parallel scraping
- ✅ Flexible sorting options (hot, new, top, rising)
- ✅ User profile collection
- ✅ Content extraction from external links
- ✅ Rate limiting with intelligent backoff
- ✅ Error handling and retry mechanisms
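The exact backoff policy isn't documented in this README; the sketch below shows the general retry pattern (exponential delay plus jitter), with the schedule itself being an assumption:

```python
import random
import time

def fetch_with_backoff(fetch, max_retries: int = 3):
    """Retry `fetch` with exponential backoff plus jitter.

    `max_retries` mirrors the `scraping.max_retries` setting below;
    the delay schedule is illustrative, not the project's exact policy.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt + random.random())
```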
### Analytics & Intelligence

- ✅ Sentiment analysis (VADER + TextBlob + custom patterns)
- ✅ Trend prediction with ML algorithms
- ✅ Viral potential scoring (see the sketch below)
- ✅ Content categorization
- ✅ Engagement analysis
- ✅ Time-based pattern recognition
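The viral-potential formula itself isn't published here. The toy heuristic below conveys the idea (engagement velocity damped by post age) and is an illustrative assumption, not the project's actual algorithm:

```python
def viral_potential(score: int, num_comments: int, age_hours: float) -> float:
    """Toy viral-potential score in [0, 1]: engagement velocity vs. age."""
    age_hours = max(age_hours, 0.1)    # avoid division by zero
    velocity = (score + 2 * num_comments) / age_hours
    return min(1.0, velocity / 100.0)  # squash into [0, 1]

# A 3-hour-old post with 500 upvotes and 120 comments hits the cap.
print(viral_potential(score=500, num_comments=120, age_hours=3.0))
```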
### Dashboard & Monitoring

- ✅ Real-time monitoring with WebSocket
- ✅ Interactive charts and visualizations
- ✅ Data browser with advanced filtering
- ✅ Session management
- ✅ Performance metrics
- ✅ Export capabilities
### Infrastructure

- ✅ Database persistence with SQLite
- ✅ RESTful API with FastAPI
- ✅ Docker containerization
- ✅ Nginx reverse proxy
- ✅ Health checks and monitoring
- ✅ Comprehensive logging
### Export Formats

- ✅ **JSON** - structured data with metadata
- ✅ **CSV** - multiple files with breakdowns
- ✅ **HTML** - interactive reports with charts
- ✅ **Database** - persistent storage with querying
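For a feel of the CSV path, here is a minimal sketch using only the standard library; the field set is an assumption, and the real exporter writes several breakdown files:

```python
import csv

posts = [
    {"id": "abc123", "subreddit": "python", "title": "Example post", "score": 42},
]

with open("posts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "subreddit", "title", "score"])
    writer.writeheader()
    writer.writerows(posts)
```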
## Quick Start

### With Docker (Recommended)

```bash
# Clone the repository
git clone https://github.com/your-username/reddit-scraper-v2.git
cd reddit-scraper-v2

# Start with Docker Compose
docker-compose up -d

# Access the dashboard
open http://localhost:3000
```
### Manual Setup

```bash
# Clone and set up the backend
git clone https://github.com/your-username/reddit-scraper-v2.git
cd reddit-scraper-v2

# Install Python dependencies
pip install -r requirements.txt

# Set up Reddit API credentials
python run.py setup

# Start the API server
uvicorn src.api.dashboard_api:create_app --reload --factory

# In another terminal, set up the frontend
cd frontend
npm install
npm start

# Access the dashboard
open http://localhost:3000
```
## Installation

### Prerequisites

- Python 3.9+
- Node.js 18+
- Docker & Docker Compose (for containerized deployment)
- Reddit API credentials
### Backend Setup

1. Clone the repository:
   ```bash
   git clone https://github.com/your-username/reddit-scraper-v2.git
   cd reddit-scraper-v2
   ```

2. Create a virtual environment:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

4. Set up Reddit API credentials:
   ```bash
   python run.py setup
   ```

### Frontend Setup

1. Navigate to the frontend directory:
   ```bash
   cd frontend
   ```

2. Install dependencies:
   ```bash
   npm install
   ```

3. Start the development server:
   ```bash
   npm start
   ```
## Configuration

### Reddit API Setup

1. Go to [Reddit App Preferences](https://www.reddit.com/prefs/apps)
2. Create a new app (choose the **script** type)
3. Note your `client_id` and `client_secret`
4. Run the setup command:
   ```bash
   python run.py setup
   ```
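To verify the credentials outside the scraper, a minimal check with PRAW (the subreddit and limit are arbitrary):

```python
import praw  # pip install praw

reddit = praw.Reddit(
    client_id="your_client_id",
    client_secret="your_client_secret",
    user_agent="RedditScraper/2.0",
)

# Read-only smoke test: list the five hottest posts in r/python.
for post in reddit.subreddit("python").hot(limit=5):
    print(post.score, post.title)
```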
### Configuration File

Edit `config/settings.yaml`:
```yaml
reddit_api:
  client_id: "your_client_id"
  client_secret: "your_client_secret"
  user_agent: "RedditScraper/2.0"

scraping:
  rate_limit: 1.0  # requests per second
  max_retries: 3
  timeout: 30
  parallel_workers: 5

filtering:
  min_score: 1
  max_age_days: 365
  exclude_nsfw: true
  exclude_deleted: true

database:
  path: "data/reddit_scraper.db"
  cleanup_interval_days: 30

performance:
  cache_enabled: true
  cache_duration: 3600
  memory_limit_mb: 1024
```
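A quick way to confirm the file parses, using PyYAML; how the scraper itself loads settings may differ:

```python
import yaml  # pip install pyyaml

with open("config/settings.yaml", encoding="utf-8") as f:
    settings = yaml.safe_load(f)

print(settings["scraping"]["rate_limit"])        # 1.0 requests per second
print(settings["scraping"]["parallel_workers"])  # 5
```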
## Usage

### Web Dashboard

1. Start the application:
   ```bash
   # Backend
   uvicorn src.api.dashboard_api:create_app --factory

   # Frontend (in another terminal)
   cd frontend && npm start
   ```

2. Access the dashboard at http://localhost:3000

3. Available features:
   - **Dashboard**: Real-time metrics and monitoring
   - **Scraping**: Start and manage scraping sessions
   - **Analytics**: Sentiment analysis and trend prediction
   - **Data**: Browse and export scraped data
   - **Settings**: Configure API and preferences
### Command Line Interface

```bash
# Basic scraping
python run.py scrape --subreddit python --posts 100

# Advanced scraping with all features
python run.py scrape \
  --subreddit "python,datascience,MachineLearning" \
  --posts 200 \
  --parallel \
  --extract-content \
  --include-users \
  --performance-monitor \
  --output "json,csv,html"

# Analytics
python run.py analyze --sentiment --trends --subreddit python

# Database management
python run.py db --stats
python run.py db --cleanup --days 30
```
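Under the hood, `analyze --sentiment` combines VADER and TextBlob (plus custom patterns). A minimal sketch of blending the two library scores; the equal weighting is an assumption, and the custom patterns are omitted:

```python
from textblob import TextBlob  # pip install textblob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer  # pip install vaderSentiment

_vader = SentimentIntensityAnalyzer()

def blended_sentiment(text: str) -> float:
    """Average VADER's compound score and TextBlob's polarity, both in [-1, 1]."""
    vader_score = _vader.polarity_scores(text)["compound"]
    blob_score = TextBlob(text).sentiment.polarity
    return (vader_score + blob_score) / 2

print(blended_sentiment("This library is fantastic!"))
```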
### Python API

```python
import requests

# Start a scraping session
response = requests.post('http://localhost:8000/scrape/start', json={
    'subreddits': ['python', 'datascience'],
    'posts_per_subreddit': 100,
    'parallel': True,
    'extract_content': True
})
session_id = response.json()['session_id']

# Check status
status = requests.get(f'http://localhost:8000/scrape/status/{session_id}')
print(status.json())

# Get analytics
analytics = requests.get('http://localhost:8000/analytics/summary?days=7')
print(analytics.json())
```
## API Documentation

### Scraping Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/health` | Health check |
| GET | `/config` | Get configuration |
| POST | `/scrape/start` | Start scraping session |
| GET | `/scrape/status/{id}` | Get session status |
| GET | `/scrape/sessions` | List all sessions |
| DELETE | `/scrape/stop/{id}` | Stop session |
### Data & Analytics Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/data/posts` | Get posts with filtering |
| GET | `/analytics/summary` | Get analytics summary |
| POST | `/analytics/sentiment` | Run sentiment analysis |
| POST | `/analytics/trends` | Run trend analysis |
| GET | `/analytics/realtime` | Get real-time metrics |
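For example, pulling filtered posts from Python. The query parameters and response shape here are assumptions; the live schema is browsable at FastAPI's auto-generated `/docs`:

```python
import requests

resp = requests.get(
    "http://localhost:8000/data/posts",
    params={"subreddit": "python", "min_score": 10, "limit": 50},
)
resp.raise_for_status()
for post in resp.json().get("posts", []):
    print(post.get("score"), post.get("title"))
```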
### WebSocket

Connect to `ws://localhost:8000/ws` for real-time updates:

```javascript
const ws = new WebSocket('ws://localhost:8000/ws');

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log('Update:', data);
};
```
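The same stream can be consumed from Python; a minimal sketch using the third-party `websockets` package:

```python
import asyncio
import json

import websockets  # pip install websockets

async def listen() -> None:
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        async for message in ws:
            print("Update:", json.loads(message))

asyncio.run(listen())
```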
## Development

### Project Structure

```
reddit-scraper-v2/
├── src/                    # Backend source code
│   ├── api/                # FastAPI application
│   ├── core/               # Core scraping logic
│   ├── database/           # Database management
│   ├── analytics/          # Analytics and ML
│   ├── processors/         # Data processing
│   └── exporters/          # Export functionality
├── frontend/               # React.js dashboard
│   ├── src/
│   │   ├── components/     # React components
│   │   ├── pages/          # Page components
│   │   ├── services/       # API services
│   │   └── utils/          # Utilities
│   └── public/             # Static files
├── tests/                  # Test suite
├── config/                 # Configuration files
├── docs/                   # Documentation
├── docker/                 # Docker configurations
└── monitoring/             # Monitoring configs
```
### Running Tests

```bash
# Run all tests
python run_tests.py --all

# Run specific test types
python run_tests.py --unit
python run_tests.py --integration
python run_tests.py --performance

# Run with coverage
python run_tests.py --coverage

# Frontend tests
cd frontend && npm test
```
### Code Quality

```bash
# Format code
python run_tests.py --format

# Lint code
python run_tests.py --lint

# Security scan
python run_tests.py --security

# Type checking
mypy src/
```
### Development Workflow

1. Create a feature branch:
   ```bash
   git checkout -b feature/new-feature
   ```

2. Make changes and test:
   ```bash
   python run_tests.py --all
   ```

3. Format and lint:
   ```bash
   python run_tests.py --format --lint
   ```

4. Commit and push:
   ```bash
   git add .
   git commit -m "Add new feature"
   git push origin feature/new-feature
   ```
## Deployment

### Docker Deployment

1. Build and run:
   ```bash
   docker-compose up -d
   ```

2. Scale services:
   ```bash
   docker-compose up -d --scale reddit-scraper-api=3
   ```

3. View logs:
   ```bash
   docker-compose logs -f
   ```
### Production Deployment

1. Set up the environment:
   ```bash
   # Copy production config
   cp config/settings.example.yaml config/settings.yaml

   # Edit with production values
   nano config/settings.yaml
   ```

2. Configure SSL:
   ```bash
   # Generate SSL certificates
   mkdir -p nginx/ssl
   openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
     -keyout nginx/ssl/key.pem -out nginx/ssl/cert.pem
   ```

3. Deploy with monitoring:
   ```bash
   docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d
   ```
### Monitoring

- Application: http://localhost:3000
- API Health: http://localhost:8000/health
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3001 (default login: admin/admin)
## Design System

### Color Palette

```css
/* Primary Colors */
--primary: #4a9eff;        /* Blue */
--secondary: #ff4500;      /* Reddit Orange */

/* Background */
--bg-primary: #0f0f0f;     /* Dark */
--bg-secondary: #1a1a1a;   /* Card Background */
--bg-tertiary: #2d2d2d;    /* Surface */

/* Text */
--text-primary: #ffffff;   /* Primary Text */
--text-secondary: #b0b0b0; /* Secondary Text */

/* Status Colors */
--success: #4caf50;        /* Green */
--warning: #ff9800;        /* Orange */
--error: #f44336;          /* Red */
--info: #2196f3;           /* Blue */
```
### Typography

- Font family: Inter, Roboto, Helvetica, Arial
- Headings: 700 weight, varied sizes
- Body: 400 weight, 1rem base size
- Captions: 300 weight, 0.875rem
## Performance Benchmarks

**Backend:**

- Sequential scraping: ~100 posts/minute
- Parallel scraping (5 workers): ~500 posts/minute
- Memory usage: <100 MB for 10k posts
- Database queries: <50 ms average
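The sequential-to-parallel jump comes from fanning subreddits out across a worker pool. A minimal sketch of the pattern with `concurrent.futures`; the scrape function is a stand-in, and five workers mirrors the `parallel_workers: 5` setting:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_subreddit(name: str) -> str:
    """Stand-in for the real per-subreddit scrape (I/O-bound)."""
    time.sleep(0.1)  # simulate network latency
    return f"scraped r/{name}"

subreddits = ["python", "datascience", "MachineLearning"]
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(scrape_subreddit, s) for s in subreddits]
    for future in as_completed(futures):
        print(future.result())
```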
**Dashboard:**

- Initial load: <2 seconds
- Real-time updates: <100 ms latency
- Chart rendering: <500 ms for 1k data points
- API response: <200 ms average
## Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

### How to Contribute

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request

### Code Standards

- Follow PEP 8 for Python code
- Use ESLint/Prettier for JavaScript
- Write comprehensive tests
- Document new features
- Update CHANGELOG.md
## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Project Background

The original goal: build a terminal-based Reddit scraping tool to collect posts, content, and user profiles from specific subreddits, or from Reddit in general, for sentiment analysis, research, and trend analysis.

**Data requirements:**

- Volume: hundreds of thousands to millions of posts
- Formats: JSON, CSV, HTML
- Mode: one-time scraping
- Interface: terminal/command line
### Original Architecture

```
Reddit Scraper
├── Core Engine
│   ├── Reddit API Client
│   ├── Web Scraper (Fallback)
│   └── Rate Limiter
├── Data Processing
│   ├── Post Processor
│   ├── User Profile Processor
│   └── Content Extractor
├── Storage Manager
│   ├── JSON Exporter
│   ├── CSV Exporter
│   └── HTML Generator
└── CLI Interface
    ├── Configuration Manager
    ├── Progress Monitor
    └── Error Handler
```
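The Rate Limiter component enforces the configured requests-per-second budget. One common approach is a fixed minimum interval between calls; the sketch below illustrates that idea, though the real implementation may differ:

```python
import time

class RateLimiter:
    """Block until at least 1/rate seconds have passed since the last call."""

    def __init__(self, rate: float = 1.0):  # matches `rate_limit: 1.0` above
        self.min_interval = 1.0 / rate
        self._last = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

limiter = RateLimiter(rate=1.0)
limiter.wait()  # call once before each API request
```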
### Development Phases

**Phase 1 - Foundation:**

- Set up project structure
- Implement Reddit API client
- Basic CLI interface
- Configuration management
- Rate limiting mechanism

**Phase 2 - Core Scraping:**

- Post scraping functionality
- User profile collection
- Content extraction (links)
- Error handling & logging
- Basic filtering options

**Phase 3 - Export:**

- JSON export functionality
- CSV export functionality
- HTML report generation
- Data validation
- Duplicate detection

**Phase 4 - Optimization:**

- Advanced filtering options
- Parallel processing
- Progress monitoring
- Performance optimization
- Testing & documentation
## Technology Stack

- PRAW - Python Reddit API Wrapper
- FastAPI - Modern web framework
- React.js - Frontend framework
- Material-UI - React component library
- Chart.js - Data visualization
- Docker - Containerization
## Support

- Documentation: `docs/`
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: support@reddit-scraper.com
Built with ❤️ by @pixelbrow720
Reddit Scraper v2.0 - Enterprise Edition