A powerful Python CLI tool for analyzing your personal record collection using Discogs data. DiscoStar combines XML data dumps with real-time API calls to provide deep insights into your music collection.
- Hybrid Data Approach: Combines Discogs XML dumps for reference data with API calls for personal collection
- Collection Sync: Sync your personal collection from Discogs API with real-time progress tracking
- High-Performance Ingestion: Memory-efficient XML parsing with batch processing (10,000+ records/second)
- Rate-Limited API Client: Respects Discogs API limits with configurable SSL handling
- Real-time Progress Tracking: Visual progress indicators and detailed status reporting
- Robust Error Handling: Comprehensive error recovery with sub-1% error rates
- Local Database: SQLite for development, with Azure PostgreSQL support for production
- CLI Interface: Clean command-line interface for all operations
- Analytics Engine: Comprehensive collection analysis with multiple output formats
- Web Interface: Future Flask-based web dashboard (coming soon)
- Cloud Ready: Terraform infrastructure for Azure deployment
DiscoStar provides comprehensive analytics for your music collection with multiple output formats:
- Collection Summary: Overview statistics (total releases, artists, labels, year range)
- Decade Analysis: Distribution by decade (prevents duplicate counting of same albums)
- Top Artists: Most collected artists in your collection
- Top Labels: Most collected record labels
- Longest Tracks: Find the longest tracks in your collection
- Multiple Copies: Identify albums where you own multiple variants/pressings
- Genre Analysis: Breakdown by genre and subgenre
- Format Analysis: Distribution by format (vinyl, CD, digital, etc.)
- Year Analysis: Most collected years
- Artist Collaborations: Find releases where two artists collaborated
- Human-readable: Formatted tables for terminal display
- CSV: For spreadsheet analysis and external visualization tools
- JSON: For programmatic use and integration with other tools
# Basic collection summary
discostar analytics
# Decade analysis with CSV output for visualization
discostar analytics --type decades --format csv --output decades.csv
# Top 10 artists in JSON format
discostar analytics --type top-artists --limit 10 --format json
# Find collaborations between Miles Davis and John Coltrane
discostar analytics --type collaborations --artist1 "Miles Davis" --artist2 "John Coltrane"
# Run all analyses and save comprehensive report
discostar analytics --type all --output collection_report.txt
# Export genre data for external analysis
discostar analytics --type genres --format csv --limit 30 --output genres.csv
- Smart duplicate handling: Decade analysis uses earliest release year for each master to prevent duplicate counting
- Flexible limits: Customize result limits for top-N analyses
- File output: Save results directly to files for further processing
- Real-time validation: Checks for collection data before running analyses
- Python 3.9 or higher
- Discogs account and API token
- Clone the repository:
git clone https://github.com/difu/discostar.git
cd discostar
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements-dev.txt
- Set up configuration:
cp .env.example .env
# Edit .env with your Discogs API token and username
- Initialize the database:
discostar init
# Download Discogs XML dumps
discostar download-dumps
# Import XML data into database
discostar ingest-data
# Sync your personal collection from Discogs API
discostar sync-collection
# Check ingestion and sync status
discostar status
# Analyze your collection
discostar analytics
DiscoStar is optimized for processing large Discogs datasets efficiently:
- Processing Speed: ~10,000 records/second
- Memory Efficiency: Uses iterative XML parsing for files >1GB
- Error Rate: <0.001% (sub-1% error tolerance)
- Batch Processing: Configurable batch sizes (default: 1,000 records)
- Progress Tracking: Real-time updates every 10,000 records
- Batch Commits: Every 10,000 records to optimize transaction overhead
- Memory Usage: Minimal memory footprint with streaming processing
- Storage: SQLite for local development, PostgreSQL for production scale
- Collection Sync: 603 collection items synced in ~8 seconds
- Rate Limiting: 60 requests/minute with 1-second minimum between requests
- Error Recovery: Automatic retry logic for transient API failures
- Progress Tracking: Real-time statistics during sync operations
Tested with Discogs June 2025 XML dumps on a Macbook Pro M4:
- Artists: 1,060,000+ records processed in ~2 minutes
- Collection Sync: 603 personal collection items in ~8 seconds
- Releases: Estimated 8+ million records (full dataset)
- Labels: Estimated 1.5+ million records
- Masters: Estimated 2+ million records
DiscoStar uses a normalized database schema with both JSON fields and relational join tables for optimal flexibility:
- JSON Fields: Store raw Discogs data in JSON format for completeness
- Join Tables: Normalized relationships for efficient queries and analytics
- Hybrid Benefits: Maintains data integrity while enabling complex SQL queries
DiscoStar automatically populates join tables during release ingestion:
Table | Purpose | Example Query |
---|---|---|
release_artists |
Artist-release relationships with roles | Find all releases by producer |
release_labels |
Label-release relationships with catalog numbers | Group releases by label |
tracks |
Individual track listings with positions | Search for specific songs |
# Automatic: Join tables populated during release ingestion
discostar ingest-data --type releases
# Manual: Process existing releases to populate join tables
discostar process-relationships
# Check results
discostar status # Shows join table counts
With join tables populated, you can run complex analytics:
π For comprehensive SQL query examples: See
release_analysis_queries.md
for detailed queries to identify original pressings, first presses, and country analysis with advantages/disadvantages for each approach.
-- Find all releases where Artist X collaborated with Artist Y
SELECT r.title FROM releases r
JOIN release_artists ra1 ON r.id = ra1.release_id
JOIN release_artists ra2 ON r.id = ra2.release_id
WHERE ra1.artist_id = 1 AND ra2.artist_id = 2;
-- Count releases by label
SELECT l.name, COUNT(*) FROM labels l
JOIN release_labels rl ON l.id = rl.label_id
GROUP BY l.name ORDER BY COUNT(*) DESC;
-- Find longest tracks in collection
SELECT r.title, t.title, t.duration FROM tracks t
JOIN releases r ON t.release_id = r.id
ORDER BY t.duration_seconds DESC LIMIT 10;
-- Find favorite decade based on collection (earliest version of each master release only)
-- - Groups your music collection by decade using the earliest release year for
-- each album you own. This prevents duplicate counting when you own multiple pressings of the same album
-- (e.g., original + remaster), giving you accurate statistics about
-- which decades your music taste favors most.
WITH earliest_releases AS (
SELECT
r.master_id,
MIN(
COALESCE(
CAST(strftime('%Y', r.released) AS INTEGER),
m.year,
CAST(json_extract(uc.basic_information, '$.year') AS INTEGER)
)
) as earliest_year
FROM releases r
INNER JOIN user_collection uc ON r.id = uc.release_id
LEFT JOIN masters m ON r.master_id = m.id
WHERE r.master_id IS NOT NULL
AND (
r.released IS NOT NULL OR
m.year IS NOT NULL OR
json_extract(uc.basic_information, '$.year') IS NOT NULL
)
GROUP BY r.master_id
UNION ALL
-- Include releases without master_id (standalone releases)
SELECT
NULL as master_id,
COALESCE(
CAST(strftime('%Y', r.released) AS INTEGER),
CAST(json_extract(uc.basic_information, '$.year') AS INTEGER)
) as earliest_year
FROM releases r
INNER JOIN user_collection uc ON r.id = uc.release_id
WHERE r.master_id IS NULL
AND (
r.released IS NOT NULL OR
json_extract(uc.basic_information, '$.year') IS NOT NULL
)
),
decade_counts AS (
SELECT
(earliest_year / 10) * 10 as decade_start,
COUNT(*) as release_count
FROM earliest_releases
WHERE earliest_year IS NOT NULL
GROUP BY (earliest_year / 10) * 10
)
SELECT
decade_start,
(decade_start || 's') as decade,
release_count,
ROUND(100.0 * release_count / SUM(release_count) OVER(), 2) as percentage
FROM decade_counts
ORDER BY release_count DESC;
-- Find releases where you own multiple copies - Identifies albums in your collection where you own more than one pressing or version. Groups by master release to show
-- unique albums with multiple copies, helping you track duplicates, variants, and different pressings of the same album (e.g., original vinyl + remaster + special
-- edition).
WITH duplicate_releases AS (
SELECT
r.master_id,
m.title as master_title,
COUNT(DISTINCT uc.release_id) as copy_count -- COUNT DISTINCT release_ids
FROM releases r
INNER JOIN user_collection uc ON r.id = uc.release_id
INNER JOIN masters m ON r.master_id = m.id
WHERE r.master_id IS NOT NULL AND r.master_id > 0
GROUP BY r.master_id, m.title
HAVING COUNT(DISTINCT uc.release_id) > 1 -- Use DISTINCT here too
)
SELECT
master_title as release_name,
copy_count
FROM duplicate_releases
ORDER BY copy_count DESC, master_title
LIMIT 5;
DiscoStar offers flexible release data management to balance completeness with performance:
Strategy | Records | Use Case | Storage | Query Speed |
---|---|---|---|---|
all |
8M+ releases | Complete dataset, discovery | ~2GB+ | Slower |
skip |
0 releases | Collection-only analysis | ~50MB | Fastest |
collection_only |
100s-1000s | Personal collection focus | ~100MB | Fast |
collection_only + masters |
1000s-10000s | Collection + all variants | ~200MB | Fast |
NEW FEATURE: For collection_only
strategy, you can now include all releases linked to masters in your collection. This gives you comprehensive coverage of all pressings, remasters, and variants of albums you own.
Example: If you own "Abbey Road" (1969 UK pressing), enabling master expansion will also include:
- Abbey Road (1969 US pressing)
- Abbey Road (1987 CD remaster)
- Abbey Road (2019 anniversary edition)
- All other official releases of the album
# Option 1: Start with essential data only
echo "strategy: skip" >> config/settings.yaml
discostar ingest-data --type artists,labels,masters
# Later: sync collection via API
# Option 2: Import everything, optimize later
discostar ingest-data # All data including 8M+ releases
# After collection sync:
discostar optimize-db --clean-unused # Remove unused releases
Edit config/settings.yaml
:
ingestion:
releases:
strategy: "collection_only" # or "all", "skip"
include_master_releases: true # Include all pressings of albums in collection
# 1. Set up collection-only strategy with master expansion
echo "ingestion:
releases:
strategy: 'collection_only'
include_master_releases: true" >> config/settings.yaml
# 2. Sync your collection first
discostar sync-collection
# 3. Import releases with master expansion
discostar ingest-data --type releases
# 4. Check results
discostar status # Shows collection + master variant counts
discostar/
βββ src/
β βββ core/ # Shared business logic
β β βββ database/ # Database models and operations
β β βββ discogs/ # API client and XML processing
β β βββ analytics/ # Statistical analysis
β β βββ utils/ # Utilities and configuration
β βββ cli/ # Command-line interface
β βββ web/ # Web interface (future)
βββ infrastructure/ # Azure deployment resources
βββ tests/ # Test suite
βββ data/ # Local data storage
βββ config/ # Configuration files
DiscoStar uses YAML configuration with environment variable overrides.
# Core commands
discostar init # Initialize database and directories
discostar download-dumps # Download all XML dumps
discostar ingest-data # Import XML data into database
discostar sync-collection # Sync your collection from Discogs API
discostar status # Show database and sync status
# Analytics commands
discostar analytics # Basic collection summary
discostar analytics --type all # Run all available analyses
discostar analytics --type decades --format csv # Decade analysis as CSV
discostar analytics --type collaborations --artist1 "Artist1" --artist2 "Artist2"
# Collection sync options
discostar sync-collection --force # Force refresh of collection data
discostar sync-wantlist # Sync wantlist (coming soon)
# Advanced XML ingestion options
discostar download-dumps --type artists # Download specific dump type
discostar ingest-data --type releases # Import specific data type
discostar ingest-data --force # Force re-ingestion
discostar clear-data --type artists # Clear specific data type
# Relationship processing (join tables)
discostar process-relationships # Populate join tables from release JSON data
# Collection-only workflow guidance
discostar collection-workflow # Interactive guide for collection-only setup
# Database optimization (after collection sync)
discostar optimize-db --clean-unused # Remove releases not in collections
# Master release expansion options
discostar ingest-data --include-masters # CLI override for master expansion
# Verbose logging
discostar -v <command> # Enable detailed logging
# Required
DISCOGS_API_TOKEN=your_discogs_api_token
DISCOGS_USERNAME=your_username
# Optional
DATABASE_URL=sqlite:///data/discostar.db
AZURE_STORAGE_CONNECTION_STRING=your_azure_connection
See config/settings.yaml
for detailed configuration options including:
- Database settings
- Discogs API configuration and rate limiting
- SSL verification settings (for development environments)
- Logging configuration
- Cache settings
- XML ingestion batch processing parameters
pytest
# Format code
black src/ tests/
# Lint code
flake8 src/ tests/
# Type checking
mypy src/
- Core Modules: Business logic separated into focused modules
- CLI Interface: Click-based command structure
- Database Layer: SQLAlchemy models matching Discogs schema
- API Client: Async HTTP client with rate limiting and error handling
- Collection Sync: Real-time synchronization with progress tracking
- Async Processing: aiohttp for concurrent API operations
- Testing: Pytest with async support
TODO nothing done yet π
- Configure Azure credentials
- Deploy infrastructure:
cd infrastructure/terraform
terraform init
terraform plan
terraform apply
- Deploy application:
# Build and push Docker container
docker build -t discostar .
# Deploy to Azure Container Instances
- Fork the repository
- Create a feature branch:
git checkout -b feature-name
- Make your changes
- Add tests for new functionality
- Run the test suite:
pytest
- Format code:
black src/ tests/
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Discogs for providing the comprehensive music database and API
- The open-source community for the excellent Python libraries that make this project possible
- Create an issue for bug reports or feature requests
- Check the documentation for detailed guides
DiscoStar - Illuminate your music collection with data-driven insights! β