# Crypto Sentiment Collector

A modular, asynchronous Python application to collect, analyze, and store cryptocurrency-related sentiment data from multiple sources including CryptoPanic, Reddit, CoinMarketCap (CMC), Twitter, and a Trump RSS feed. The project leverages FinBERT, a financial sentiment analysis model, to provide sentiment scoring on the collected textual data.
## Table of Contents

- Overview
- Features
- Supported Sources
- Architecture
- Installation
- Configuration
- Usage
- Data Storage
- Extending the Collector
- Logging and Error Handling
- Limitations and Considerations
- License
- Acknowledgements
## Overview

The Crypto Sentiment Collector is designed to aggregate real-time sentiment data from various crypto news and social media platforms. It fetches the latest posts, tweets, and news articles, processes them through FinBERT for sentiment classification, and stores the results in both JSON and Parquet formats for easy analysis and integration.
## Features

- Asynchronous data fetching for efficient and concurrent API calls.
- Multi-source support including CryptoPanic, Reddit, CMC, Twitter, and RSS feeds.
- Financial sentiment analysis using the state-of-the-art FinBERT model.
- Data persistence with atomic JSON saving and Parquet file exports.
- Duplicate detection to avoid redundant data storage.
- Configurable tracked currencies and accounts.
- Robust error handling and logging for production readiness.
- Scheduled runs every 15 minutes aligned to quarter hours.
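The atomic JSON saving listed above typically means writing to a temporary file and renaming it over the target, so a crash mid-write never leaves a corrupt file. A minimal sketch of that pattern (the function name `save_json_atomic` is illustrative, not the project's actual API):

```python
import json
import os
import tempfile

def save_json_atomic(records: list, path: str) -> None:
    """Write records to `path` atomically: dump to a temp file in the
    same directory, then rename it over the target in one step."""
    dir_name = os.path.dirname(path) or "."
    os.makedirs(dir_name, exist_ok=True)
    # The temp file must live on the same filesystem as the target
    # for os.replace() to be an atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(records, f, ensure_ascii=False, indent=2)
        os.replace(tmp_path, path)  # atomic on both POSIX and Windows
    except BaseException:
        os.remove(tmp_path)
        raise
```

Readers of `data/sentiment_data.json` therefore always see either the old file or the complete new one, never a half-written state.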
## Supported Sources

| Source | Description | API Type | Notes |
|---|---|---|---|
| CryptoPanic | Aggregated crypto news and social posts | REST API | Requires `CRYPTOPANIC_API_KEY` |
| Reddit | Posts from selected crypto-related subreddits | OAuth2 API | Requires Reddit app credentials |
| CoinMarketCap (CMC) | Crypto news and content | REST API | Requires `CMC_API_KEY` |
| Twitter | Tweets from influential crypto accounts | Twitter API v2 | Requires `TWITTER_BEARER_TOKEN` |
| Trump RSS Feed | RSS feed from Donald J. Trump's official site | RSS feed (XML) | No API key needed |
Note: CMC, Twitter, and Trump RSS sources are implemented but commented out by default.
## Architecture

- **SentimentSource (Base Class):** Defines the interface for all data sources with `fetch`, `process`, and `save_to_parquet` methods.
- **Source Implementations:** Each source inherits from `SentimentSource` and implements its own fetching and processing logic.
- **SentimentCollector:** Manages all sources, runs asynchronous fetch-and-process cycles, maintains a combined dataset, and handles data saving.
- **FinBERT Sentiment Analysis:** Uses the Hugging Face Transformers pipeline with the `ProsusAI/finbert` model to generate sentiment labels and scores.
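For orientation, the FinBERT pipeline returns a label (`positive`, `negative`, or `neutral`) plus a confidence score per text. A sketch of how the pair could be collapsed into one signed value for downstream analysis; the `signed_score` helper is illustrative and not part of the project's actual API:

```python
def signed_score(label: str, score: float) -> float:
    """Collapse FinBERT's (label, score) output into one signed value in
    [-1, 1]: positive -> +score, negative -> -score, neutral -> 0.0."""
    label = label.lower()
    if label == "positive":
        return score
    if label == "negative":
        return -score
    return 0.0

if __name__ == "__main__":
    # Heavy optional dependency; downloads model weights on first run.
    from transformers import pipeline

    finbert = pipeline("sentiment-analysis", model="ProsusAI/finbert")
    result = finbert("Bitcoin rallies after the ETF approval")[0]
    print(result["label"], signed_score(result["label"], result["score"]))
```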
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/crypto-sentiment-collector.git
   cd crypto-sentiment-collector
   ```

2. Create and activate a Python virtual environment (recommended):

   ```bash
   python3 -m venv venv
   source venv/bin/activate   # Linux/macOS
   venv\Scripts\activate      # Windows
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Additional dependencies:
   - `pyarrow` or `fastparquet` for Parquet support in pandas.
   - `transformers` and `torch` for FinBERT sentiment analysis.
   - `tweepy` for Twitter API access.
   - `python-dotenv` for environment variable management.
## Configuration

Create a `.env` file in the project root to store your API keys and credentials:

```env
CRYPTOPANIC_API_KEY=your_cryptopanic_api_key
CMC_API_KEY=your_coinmarketcap_api_key
REDDIT3_CLIENT_ID=your_reddit_client_id
REDDIT3_CLIENT_SECRET=your_reddit_client_secret
REDDIT3_USER_AGENT=your_reddit_user_agent
TWITTER_BEARER_TOKEN=your_twitter_bearer_token
```

**Important:** Keep your `.env` file secure and do not commit it to version control.
## Usage

Run the main script to start the sentiment collector:

```bash
python main.py
```

- The collector runs indefinitely, fetching and processing data every 15 minutes, aligned to quarter-hour marks (e.g., 00:00, 00:15, 00:30, 00:45).
- Press `Ctrl+C` to stop the collector gracefully; it will save data before exiting.
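Aligning runs to quarter-hour marks amounts to sleeping until the next :00/:15/:30/:45 boundary rather than a fixed 15-minute interval. A minimal sketch of that scheduling logic (function names are illustrative, not the project's actual internals):

```python
import asyncio
from datetime import datetime, timedelta, timezone
from typing import Awaitable, Callable, Optional

def seconds_until_next_quarter(now: Optional[datetime] = None) -> float:
    """Seconds from `now` until the next :00/:15/:30/:45 mark (UTC)."""
    now = now or datetime.now(timezone.utc)
    next_mark = (now.replace(second=0, microsecond=0)
                 + timedelta(minutes=15 - now.minute % 15))
    return (next_mark - now).total_seconds()

async def run_aligned(job: Callable[[], Awaitable[None]]) -> None:
    """Run `job` forever, waking at each quarter-hour boundary."""
    while True:
        await asyncio.sleep(seconds_until_next_quarter())
        await job()
```

Recomputing the delay each cycle keeps runs on the boundary even when a fetch cycle itself takes a variable amount of time.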
## Data Storage

- **JSON File:** `data/sentiment_data.json` stores the combined collected sentiment records (capped at the most recent 5000 entries).
- **Parquet Files:** Each source saves its processed data into timestamped Parquet files under `data/sentiment/{source_name}/`.
- **Data Fields:** Include source, unique IDs, titles, content, timestamps, sentiment scores and labels, metadata (votes, author, metrics), and URLs.
## Extending the Collector

To add a new sentiment source:

1. Create a new class inheriting from `SentimentSource`.
2. Implement `fetch` to retrieve raw data asynchronously.
3. Implement `process` to convert raw data into the standardized format and run sentiment analysis.
4. Add an instance of your new source to the `SentimentCollector.sources` list.
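The steps above can be sketched as follows. The stand-in base class, the `HackerNewsSource` example, and all field names are hypothetical, purely to show the shape of the two methods; the real `SentimentSource` signatures live in the repository and may differ:

```python
import abc
from datetime import datetime, timezone

class SentimentSource(abc.ABC):
    """Minimal stand-in for the project's base class (illustrative only)."""
    name: str

    @abc.abstractmethod
    async def fetch(self) -> list[dict]: ...

    @abc.abstractmethod
    def process(self, raw: list[dict]) -> list[dict]: ...

class HackerNewsSource(SentimentSource):
    """Hypothetical new source showing the two methods to implement."""
    name = "hackernews"

    async def fetch(self) -> list[dict]:
        # Real code would call the API with aiohttp; canned data here.
        return [{"id": "hn-1", "title": "Bitcoin hits new high"}]

    def process(self, raw: list[dict]) -> list[dict]:
        # Real code would run FinBERT here; we stub a neutral score.
        now = datetime.now(timezone.utc).isoformat()
        return [
            {
                "source": self.name,
                "id": item["id"],
                "title": item["title"],
                "timestamp": now,
                "sentiment_label": "neutral",
                "sentiment_score": 0.0,
            }
            for item in raw
        ]
```

After that, appending `HackerNewsSource()` to the collector's sources list would put the new source on the same 15-minute cycle as the built-in ones.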
## Logging and Error Handling

- Logs are output to the console with timestamps, log levels, and messages.
- Errors during fetching, processing, or saving are logged with details.
- Rate limits and API errors are handled gracefully with retries or skips.
- Tweepy and aiohttp exceptions are caught to prevent crashes.
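A console logging setup of that shape (timestamp, level, message) is a one-liner with the standard library; the logger name below is illustrative:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("sentiment_collector")
logger.info("Fetch cycle complete")
```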
## Limitations and Considerations

- **API Rate Limits:** Be mindful of API usage limits; the app handles some rate limiting, but excessive calls may cause failures.
- Sentiment Analysis: FinBERT is optimized for financial text but may not perfectly capture all nuances.
- Data Volume: The JSON file is capped at 5000 records to prevent excessive disk usage.
- Timezones: All timestamps are stored in UTC ISO 8601 format.
- RSS Parsing: The Trump RSS feed uses a simple regex parser; consider using a robust XML parser for production.
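Replacing the regex parser needs nothing beyond the standard library. A sketch using `xml.etree.ElementTree` for RSS 2.0 (the function name and returned field set are illustrative):

```python
import xml.etree.ElementTree as ET

def parse_rss_items(xml_text: str) -> list[dict]:
    """Parse RSS 2.0 <item> entries with a real XML parser instead of
    regex, so CDATA, entities, and attribute quirks are handled."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "pubDate": item.findtext("pubDate", default=""),
            "description": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]
```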
## Acknowledgements

- `ProsusAI/finbert` for the sentiment analysis model.
- Tweepy for Twitter API integration.
- CoinMarketCap API for crypto news.
- CryptoPanic for aggregated crypto news.
- Reddit API for subreddit data.
If you find this project useful or want to contribute, please open an issue or submit a pull request!
Last updated: July 13, 2025