vPIC Database Pipeline

A production-ready pipeline for downloading, processing, and optimizing the NHTSA vPIC (Vehicle Product Information Catalog) database for VIN decoding applications.

🚀 Quick Start

# Clone and setup
git clone <repository-url>
cd vpic-pipeline

# Run complete pipeline (downloads latest vPIC data and creates optimized database)
make sqlite-pipeline

# Find your optimized databases in temp/compressed/
ls -lh temp/compressed/

📋 Overview

This pipeline transforms the massive NHTSA vPIC database into optimized, compressed formats suitable for production VIN decoding applications:

  • Download: Latest vPIC database from NHTSA (~180MB ZIP, ~1.1GB uncompressed)
  • Process: Restore SQL Server backup and migrate to SQLite
  • Optimize: Remove tables and data not needed for VIN decoding (336MB → 63MB)
  • Compress: Multiple compression formats (63MB → 12MB with xz)

🎯 Output Results

Database Type     Uncompressed  gzip  bzip2  xz    zstd
Full              336MB         92MB  68MB   -     -
Lite (optimized)  63MB          20MB  17MB   12MB  15MB

The lite database contains only essential tables for VIN decoding while maintaining 100% decoding accuracy.

🏗️ Architecture

vPIC API → SQL Server (Docker) → SQLite → Optimize → Compress → Production DB

Key Components

  • Docker: SQL Server container for .bak file restoration
  • Python: Migration engine with type mapping and validation (sketched below)
  • SQLite: Target database optimized for read-heavy VIN decoding
  • Bash Scripts: Orchestration and optimization pipeline
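
The migration itself lives in src/migrate.py. The sketch below illustrates the idea for a single table; it assumes the container name from .env, a vpic database name, and that the target Wmi table already exists in SQLite with mapped types. The real migration also handles quoting, NULLs, and validation that a naive CSV round-trip does not.

# Conceptual sketch of one table's migration (simplified; see src/migrate.py)
# 1. Export a table from the restored SQL Server database as headerless CSV
docker exec vpic-sql /opt/mssql-tools/bin/sqlcmd \
  -S localhost -U sa -P "$MSSQL_SA_PASSWORD" -d vpic \
  -Q "SET NOCOUNT ON; SELECT * FROM Wmi" -s "," -W -h -1 > temp/Wmi.csv

# 2. Import the CSV into the existing SQLite table
sqlite3 temp/vpic.db <<'EOF'
.mode csv
.import temp/Wmi.csv Wmi
EOF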

📁 Project Structure

vpic-pipeline/
├── Makefile              # Main automation targets
├── docker-compose.yml    # SQL Server & PostgreSQL containers  
├── .env.example          # Environment configuration template
├── scripts/              # Pipeline scripts
│   ├── 01-download.sh    # Download latest vPIC data
│   ├── 02-restore.sh     # Restore SQL Server backup
│   ├── 03-verify.sh      # Verify database restoration
│   ├── 04-optimize.sh    # Remove unnecessary tables/data
│   └── 05-compress.sh    # Compress optimized database
├── src/                  # Python migration code
│   ├── migrate.py        # Main migration orchestrator
│   ├── database.py       # Database connection handlers
│   └── settings.py       # Configuration and type mappings
└── temp/                 # Working directory (ignored by git)
    ├── vpic.db           # Full SQLite database
    ├── vpic.lite.db      # Optimized database
    └── compressed/       # Final compressed outputs

🛠️ Installation

Prerequisites

  • Docker and Docker Compose
  • Python 3.8+
  • Make
  • Standard Unix tools: curl, unzip, gzip, bzip2, xz, sqlite3
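
A quick way to confirm these are available before starting:

# Optional: report any missing prerequisite tools
for tool in docker python3 make curl unzip gzip bzip2 xz sqlite3; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done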

Setup

# 1. Clone repository
git clone <repository-url>
cd vpic-pipeline

# 2. Copy environment template
cp .env.example .env

# 3. Install Python dependencies
make setup

# 4. Run complete pipeline
make sqlite-pipeline

🎮 Usage

Full Pipeline (Recommended)

# Complete pipeline: download → restore → migrate → optimize → compress
make sqlite-pipeline

Individual Steps

# Start containers
make start-containers

# Download latest vPIC data  
make download

# Restore to SQL Server
make restore

# Migrate to SQLite
make migrate-sqlite

# Optimize for VIN decoding
make optimize

# Compress database
make compress

# Clean up
make clean

Database Outputs

After running the pipeline, you'll find:

temp/
├── vpic.db                                    # Full database (336MB)
├── vpic.lite.db                               # Optimized database (63MB)  
└── compressed/
    ├── vpic_lite_YYYYMMDD.db.gz              # 20MB - Most compatible
    ├── vpic_lite_YYYYMMDD.db.bz2             # 17MB - Good balance
    ├── vpic_lite_YYYYMMDD.db.xz              # 12MB - Best compression
    ├── vpic_lite_YYYYMMDD.db.zst             # 15MB - Fastest decompression
    ├── vpic_lite_YYYYMMDD_checksums.sha256   # Integrity verification
    └── vpic_lite_YYYYMMDD_info.json          # Release metadata
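
To verify and unpack a release artifact (substitute the actual build date for YYYYMMDD):

cd temp/compressed

# Check integrity against the published checksums
sha256sum -c vpic_lite_YYYYMMDD_checksums.sha256

# Decompress the xz variant (-d decompress, -k keep the compressed file)
xz -dk vpic_lite_YYYYMMDD.db.xz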

⚙️ Configuration

Environment Variables

Copy .env.example to .env and customize:

# Database Passwords
MSSQL_SA_PASSWORD=DevPassword123#
POSTGRES_PASSWORD=postgres

# Container Names  
SQL_CONTAINER=vpic-sql
PG_CONTAINER=vpic-postgres

# Directories
TEMP_DATA_DIR=temp

# Connection Settings
CONNECTION_TIMEOUT=30
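
The make targets load these values automatically. If you invoke the scripts directly, export the variables into your shell first (a standard shell pattern, assuming the scripts read their configuration from the environment):

# Export every variable defined in .env into the current shell
set -a
source .env
set +a

./scripts/01-download.sh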

Optimization Settings

The optimization process keeps only essential tables for VIN decoding:

  • Core Tables: Make, Model, Pattern, Element, VinSchema, Wmi
  • Plant Info: Plant location and manufacturing data
  • Vehicle Specs: Body style, engine, fuel type, drivetrain
  • Removed: Safety features, specialized vehicles, detailed technical specs
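
As a quick sanity check, the lite database can be queried directly with sqlite3. The example below resolves Honda's '1HG' WMI to a make name; it assumes the lite build retains the standard vPIC Wmi, Wmi_Make, and Make tables with their usual columns:

# Look up the make for a WMI (the first three characters of a VIN)
sqlite3 temp/vpic.lite.db "
  SELECT mk.Name
  FROM Wmi w
  JOIN Wmi_Make wm ON wm.WmiId = w.Id
  JOIN Make mk ON mk.Id = wm.MakeId
  WHERE w.Wmi = '1HG';"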

About

A set of scripts to help migrate the NHTSA vPIC VIN decoder database from SQL Server to PostgreSQL and SQLite.
