SwordFinder identifies and analyzes "sword swings" in baseball - strikeouts where batters have embarrassingly slow bat speeds on fastballs. With the new 2025 MLB bat tracking data, we can now find the worst swings in baseball and instantly get video clips.
- 2-strike count β Strikeout
- Swinging strike (not called)
- Bat speed < 60 mph (embarrassingly slow)
- Swing path tilt > 30Β° (bad swing plane)
- Perceived Velocity Calculation: 20,673 / 353,506 (5.8%) - ETA: ~10-12 hours
- Strike Zone Boundaries: 477 / 353,506 (0.1%) - ETA: ~2-3 hours
- Strike Zone Distance: Waiting for boundaries to complete
- 353,506 MLB pitches (2025 season through June 21) in Supabase
- Enhanced table with 118+ MLB fields plus custom sword scoring
- Play ID mapping fixed (gets correct video for each pitch)
- 426 videos processed and uploaded to Azure
- Basic API with health check and recent swords endpoints
- Daily update script ready with all calculations integrated
-- Core MLB fields
game_pk, game_date, pitcher, batter, pitch_type, release_speed, etc.
-- 2025 Bat tracking
bat_speed, swing_length, swing_path_tilt, attack_angle
-- Custom fields
sword_score -- Calculated score for sword quality
video_azure_blob_url -- For processed videos
velo_percentile_overall -- Pitch percentiles
spin_percentile_pitch_type
bat_speed_percentile_sword
perceived_velocity -- Based on release extension
strike_zone_distance_inches -- Distance from zone edge
sz_top, sz_bot -- Strike zone boundaries
-- ... and more
download_full_2025_season.py
- Downloads all MLB data from pybaseballget_play_ids_on_demand.py
- Gets MLB play IDs for specific pitches
create_mlb_enhanced_table.sql
- Supabase table schemaupload_data_correctly.py
- Uploads data with proper type handlingadd_sz_columns.sql
- Adds strike zone columns to existing table
clean_video_processor.py
- Downloads videos from MLB & uploads to Azureprocess_top_sword_videos.py
- Complete video pipeline with Azure integrationprocess_regular_season_videos_smart.py
- Smart processor that skips completed dates
calculate_perceived_velocity.py
- Calculates perceived velocity based on extensioncalculate_strike_zone_distance.py
- Calculates distance from strike zonecalculate_percentiles_sql.py
- Fast SQL-based percentile calculatorcalculate_all_sword_scores.py
- Calculate sword scores for entire database
daily_update.py
- Automated daily data refresh with all calculationsupdate_percentiles_daily.py
- Efficient percentile updates using cached distributions
api.py
- FastAPI backend with initial endpointssetup_database_for_ui.py
- Interactive database preparation script
# Python 3.9+
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
# Supabase
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_ANON_KEY=your-anon-key
SUPABASE_SERVICE_ROLE_KEY=your-service-role-key
DATABASE_URL=postgresql://...
# Azure Blob Storage (for video hosting)
AZURE_STORAGE_CONNECTION_STRING=...
AZURE_CONTAINER_NAME=swordfinder-videos
.env.example
to .env
and fill in your credentials.
β Status: Both Supabase and Azure are now configured and tested!
python download_full_2025_season.py
# Creates: mlb_2025_full_season_complete.csv (227MB)
- Go to Supabase SQL editor
- Run
create_mlb_enhanced_table.sql
- Creates table with indexes
python upload_data_correctly.py
# Uploads 353,501 pitches in ~3 minutes
-- In Supabase SQL editor
SELECT COUNT(*) FROM mlb_pitches_enhanced;
-- Should return 353,501
-- Worst sword swings
SELECT player_name, bat_speed, release_speed, sword_score, game_date
FROM mlb_pitches_enhanced
WHERE bat_speed < 30 AND bat_speed > 0
ORDER BY bat_speed ASC
LIMIT 10;
-- True sword swings (all criteria)
SELECT *
FROM mlb_pitches_enhanced
WHERE strikes = 2
AND events = 'strikeout'
AND description IN ('swinging_strike', 'swinging_strike_blocked')
AND bat_speed < 60
AND swing_path_tilt > 30
ORDER BY bat_speed ASC;
import pandas as pd
from supabase import create_client
# Connect
supabase = create_client(url, key)
# Query worst swords
result = supabase.table('mlb_pitches_enhanced')\
.select('*')\
.lt('bat_speed', 30)\
.gt('bat_speed', 0)\
.order('bat_speed')\
.limit(10)\
.execute()
# Get play IDs and download videos
from get_play_ids_on_demand import get_play_ids_for_pitches, download_video
df = pd.DataFrame(result.data)
df_with_ids = get_play_ids_for_pitches(df)
for _, pitch in df_with_ids.iterrows():
if pitch['mlb_play_id']:
download_video(pitch['mlb_play_id'], f"sword_{pitch['player_name']}.mp4")
# 1. Query your data
longest_hrs = supabase.table('mlb_pitches_enhanced')\
.select('*')\
.eq('is_home_run', True)\
.order('hit_distance_sc', desc=True)\
.limit(10)\
.execute()
# 2. Get play IDs (only when needed)
df = pd.DataFrame(longest_hrs.data)
df_with_ids = get_play_ids_for_pitches(df)
# 3. Download videos
for _, hr in df_with_ids.iterrows():
download_video(hr['mlb_play_id'], f"HR_{hr['hit_distance_sc']}ft.mp4")
Keep your data fresh with automated daily updates:
# Test the daily update script
python daily_update.py
# Schedule it to run at 1pm daily
# See setup_daily_update.md for options (cron, GitHub Actions, etc.)
The script will:
- Fetch yesterday's MLB data (~3,000 pitches)
- Calculate sword scores
- Insert into Supabase (handles duplicates)
- Calculate percentiles for the new data π
- Log the top sword swings
- Takes ~2 minutes to run
Instead of recalculating ALL percentiles (slow), the daily update:
- Uses cached distributions - Stores percentile breakpoints (1st, 5th, 10th, etc.)
- Interpolates new values - Estimates percentiles based on where they fall
- Refreshes weekly - Rebuilds the cache to stay accurate
- Runs automatically - Integrated into
daily_update.py
This approach is ~100x faster than recalculating everything!
Release extension dramatically affects how hitters perceive velocity:
Perceived Velocity = Actual Velocity Γ (60.5 / (60.5 - Extension))
Example: 95 mph with 7 feet of extension
- Effective distance: 60.5 - 7 = 53.5 feet
- Perceived velocity: 95 Γ (60.5/53.5) = 107.4 mph
- That's a +12.4 mph gain from extension alone!
python calculate_perceived_velocity.py
# Currently running - 5.8% complete, ETA: 10-12 hours
- Extension percentiles: How does a pitcher's extension compare?
- Perceived velocity percentiles: The velocity hitters actually experience
-- Find pitchers with elite extension
SELECT pitcher_name, AVG(release_extension) as avg_extension,
AVG(extension_percentile_overall) as extension_pct
FROM mlb_pitches_enhanced
WHERE release_extension IS NOT NULL
GROUP BY pitcher_name
HAVING COUNT(*) > 100
ORDER BY avg_extension DESC
LIMIT 10;
-- Biggest perceived velocity gains
SELECT pitcher_name, pitch_type, release_speed, release_extension,
perceived_velocity, (perceived_velocity - release_speed) as velo_gain
FROM mlb_pitches_enhanced
WHERE perceived_velocity IS NOT NULL
ORDER BY velo_gain DESC
LIMIT 20;
- Play ID Mapping: Each pitch needs the LAST play ID from its at-bat
- Data Integrity: Only ~10% of "sword candidates" are true sword swings
- Bat Tracking: Available for ~44% of all pitches in 2025
- Spring Training Videos: Spring training games (game_type = 'S') do NOT have video coverage!
- Only regular season games (game_type = 'R') have videos
- Use
process_regular_season_videos.py
to skip spring training automatically
π See TODO.md for the complete post-video roadmap!
-
Set Up Daily Updates β¨
- Run
python daily_update.py
to test - Follow
setup_daily_update.md
to schedule - Keeps your data fresh automatically
- Run
-
Calculate Percentiles β¨
- First, run
create_percentile_functions.sql
in Supabase SQL Editor - Then run
python calculate_percentiles_sql.py
for fast SQL-based calculation - Or use
python calculate_percentiles.py
for Python version - Calculates velocity, spin, movement, and bat speed percentiles
- Daily updates will handle percentiles automatically after this!
- See
percentile_calculation_plan.md
for details
- First, run
-
Process Videos
- Run
process_top_sword_videos.py
for top 100 swords - Upload to Azure Blob Storage
- Update video_azure_blob_url in database
- Run
-
Build API
- FastAPI with endpoints for queries
- Video streaming from Azure CDN
- Real-time updates via Supabase
-
Frontend
- Next.js app with video player
- Leaderboards and statistics
- Search and filter interface
The project has been cleaned up to maintain only essential production files:
- README.md - This documentation
- requirements.txt - Python dependencies
- TODO.md - Comprehensive roadmap (updated June 22, 2024)
- mlb_2025_full_season_complete.csv - Complete 2025 MLB data (227MB)
- mlb_2025_with_sword_scores.csv - MLB data with calculated sword scores (232MB)
- Data collection and upload scripts
- Video processing and Azure upload scripts
- Calculation scripts for perceived velocity, strike zone distance, and percentiles
- Daily update automation
- api.py - FastAPI backend with endpoints
- QUICK_START_UI.md - Step-by-step UI building guide
- Database setup and migration scripts
The legacy/
directory contains the original Flask app for reference
- Configuration and planning documents
- SQL scripts for database operations
- Testing scripts for specific dates/scenarios
- Gunnar Hoglund - 15.3 mph
- Trevor Williams - 15.4 mph
- Charlie Morton - 15.4 mph
- Landen Roupp - 484 ft
- Jack Leiter - 479 ft
- JP Sears - 470 ft
- Mason Miller - 103.9 mph
- Aroldis Chapman - 103.8 mph
- Mason Miller - 103.7 mph
- Check data types match schema
- Ensure no infinity/NaN values
- Verify column names match
- Confirm play_id is valid
- Check MLB video availability
- Some games may not have video
- Use indexes for common queries
- Batch operations when possible
- Consider partitioning by date
This project uses publicly available MLB data. Video content belongs to MLB.
Created: January 2025
Last Updated: June 22, 2024, 4:00 PM PST
Status: Background calculations in progress (ETA: 10-12 hours)
Next Steps: Complete calculations, then build frontend UI