Using CladeTime v2.0.0 in variant-nowcast-hub
Date: November 7, 2025
Context: Assumes CladeTime PR #181 (metadata fallback) has been merged
Purpose: Guide for updating variant-nowcast-hub to use the fixed CladeTime version
Note: Experimenting with Claude to resolve this issue.
Background
The Problem
In October 2025, Nextstrain began deleting historical metadata_version.json files from S3, causing variant-nowcast-hub's target data generation workflows to fail with:
ValueError: No version of files/ncov/open/metadata_version.json found before [date]
This broke the ability to generate:
- Oracle output: Gold standard evaluation data (90 days post-round)
- Time-series data: Historical sequence counts for model training
The Solution
CladeTime v2.0.0 introduces automatic fallback to variant-nowcast-hub's own versioned metadata archives in auxiliary-data/modeled-clades/ when Nextstrain S3 doesn't have historical metadata files.
Key Features:
- Transparent fallback - no code changes needed by users
- Archives date back to September 2024 (first round: 2024-09-04)
- Preserves reproducibility with versioned reference trees
- Works seamlessly for both historical and current dates
Step 1: Update CladeTime Dependency
Once CladeTime v2.0.0 is released to PyPI, update variant-nowcast-hub's dependency:
Option A: Update src/requirements.txt
cd /path/to/variant-nowcast-hub
Edit src/requirements.txt:
- cladetime>=1.5.0
+ cladetime>=2.0.0
Option B: Use inline script metadata (recommended)
If your Python scripts use inline PEP 723 metadata, update the dependency there:
# /// script
# dependencies = [
# "cladetime>=2.0.0",
# "polars>=1.0.0",
# ...
# ]
# ///
Step 2: Generate Target Data for a Specific Round
The primary script for generating target data is src/get_target_data.py. This script:
- Reads round configuration from auxiliary-data/modeled-clades/[round_id].json
- Uses CladeTime to assign clades with the correct reference tree version
- Generates both oracle-output/ and time-series/ data
Manual Execution (Recommended for Testing)
# Navigate to variant-nowcast-hub repo
cd /path/to/variant-nowcast-hub
# Generate target data for a specific round (e.g., 2024-10-09)
uv run --with-requirements src/requirements.txt \
src/get_target_data.py \
--nowcast-date=2024-10-09
What this does:
- Reads auxiliary-data/modeled-clades/2024-10-09.json for reference tree version
- Downloads sequences from Nextstrain (as of nowcast_date + 90 days by default)
- Uses CladeTime with tree_as_of = round opening date (from modeled-clades metadata)
- Assigns clades using the reference tree that was current when the round opened
- Generates:
  - target-data/oracle-output/2024-10-09-variant-nowcast-hub-oracle.parquet
  - target-data/time-series/2024-10-09-variant-nowcast-hub-time-series.parquet
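The output file names above follow a regular pattern, so they can be derived from the nowcast date. The sketch below is only an illustration of that naming pattern; the helper name `target_data_paths` and its `hub_root` parameter are hypothetical and are not part of get_target_data.py.

```python
from datetime import date
from pathlib import Path


def target_data_paths(nowcast_date: date, hub_root: Path = Path(".")) -> dict[str, Path]:
    """Build the expected oracle and time-series paths for one round.

    Hypothetical helper: it mirrors the file names listed in this guide
    and does not check that the files exist.
    """
    stem = f"{nowcast_date.isoformat()}-variant-nowcast-hub"
    base = hub_root / "target-data"
    return {
        "oracle": base / "oracle-output" / f"{stem}-oracle.parquet",
        "time_series": base / "time-series" / f"{stem}-time-series.parquet",
    }
```

Calling `target_data_paths(date(2024, 10, 9))` yields the two paths listed above, which is convenient when checking that a batch run produced every expected file.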
Testing the Fallback
To verify the fallback is working for historical dates:
# Test with a round before Nextstrain's October 2025 cleanup
uv run --with-requirements src/requirements.txt \
src/get_target_data.py \
--nowcast-date=2024-09-11
Expected behavior:
{"event": "Nextstrain S3 metadata not available, will use Hub fallback", "date": "2024-09-11", ...}
{"event": "Attempting fallback to variant-nowcast-hub archives", "date": "2024-09-11", ...}
{"event": "Successfully retrieved metadata from Hub fallback", ...}
The script will seamlessly fall back to the nearest archive in auxiliary-data/modeled-clades/ (up to 30 days prior).
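When checking many runs, it can help to classify each log automatically instead of reading it by eye. The sketch below assumes the logs are emitted one JSON object per line and that the "event" strings match the examples in this guide exactly; the function name `classify_metadata_source` is hypothetical, and the real log format may differ.

```python
import json

# Event strings copied from the log examples in this guide (assumed stable).
S3_OK = "Retrieved ncov metadata from S3"
FALLBACK_OK = "Successfully retrieved metadata from Hub fallback"
FALLBACK_FAILED = "Hub fallback failed"


def classify_metadata_source(log_lines):
    """Return 'failed', 'hub-fallback', 's3', or 'unknown' for one run's log lines."""
    events = set()
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip non-JSON output such as progress messages
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        event = record.get("event")
        if isinstance(event, str):
            events.add(event)
    # A failure takes priority over any success message in the same log.
    if FALLBACK_FAILED in events:
        return "failed"
    if FALLBACK_OK in events:
        return "hub-fallback"
    if S3_OK in events:
        return "s3"
    return "unknown"
```

A result of "unknown" means none of the three events appeared, which is worth investigating rather than treating as success.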
Important Date Considerations
The script uses two key reference dates:
- tree_as_of: Reference tree date from when the round opened
  - Read from auxiliary-data/modeled-clades/[round_id].json metadata
  - Ensures consistent clade definitions for reproducibility
- sequence_as_of: Date to retrieve sequences (default: nowcast_date + 90 days)
  - Ensures nearly all sequences have been reported to Nextstrain
  - Can be overridden with the --sequence-as-of flag
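The default sequence date is simple date arithmetic. The sketch below shows the calculation described above; the function name `default_sequence_as_of` is hypothetical and only illustrates the 90-day rule, not the script's internal code.

```python
from datetime import date, timedelta

DEFAULT_LAG_DAYS = 90  # lag stated in this guide


def default_sequence_as_of(nowcast_date: date, lag_days: int = DEFAULT_LAG_DAYS) -> date:
    """Return the date on which sequences are retrieved by default."""
    return nowcast_date + timedelta(days=lag_days)
```

For the 2024-10-09 round, `default_sequence_as_of(date(2024, 10, 9))` returns 2025-01-07.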
Example:
# Generate target data with an explicit sequence retrieval date
# (2025-01-07 is 2024-10-09 + 90 days, i.e. the default written out)
uv run --with-requirements src/requirements.txt \
src/get_target_data.py \
--nowcast-date=2024-10-09 \
--sequence-as-of=2025-01-07
Step 3: Batch Regenerate Multiple Rounds
If you need to regenerate target data for multiple rounds (e.g., after CladeTime update):
Option A: Shell Loop
cd /path/to/variant-nowcast-hub
# Regenerate 13 consecutive weekly rounds (a typical window for rolling evaluation)
for round_id in 2024-09-11 2024-09-18 2024-09-25 2024-10-02 2024-10-09 \
2024-10-16 2024-10-23 2024-10-30 2024-11-06 2024-11-13 \
2024-11-20 2024-11-27 2024-12-04; do
echo "Generating target data for round: $round_id"
uv run --with-requirements src/requirements.txt \
src/get_target_data.py \
--nowcast-date="$round_id"
done
Option B: Using GitHub Actions Workflow Dispatch
The run-post-submission-jobs.yaml workflow can be manually triggered for any past round:
- Go to: https://github.com/reichlab/variant-nowcast-hub/actions/workflows/run-post-submission-jobs.yaml
- Click "Run workflow"
- Enter the nowcast-date (e.g., 2024-10-09)
- Click "Run workflow"
What the workflow does:
- Runs get_location_date_counts.py to identify unscored location-dates
- Runs get_target_data.py for the specified round + 13 historical rounds
- Commits generated files to the repository
Step 4: Verify Generated Data
After running get_target_data.py, verify the output:
Check Oracle Output
# View oracle output for a round
uv run python -c "
import polars as pl
df = pl.read_parquet('target-data/oracle-output/2024-10-09-variant-nowcast-hub-oracle.parquet')
print(df.head())
print(f'Total rows: {len(df)}')
print(f'Locations: {df[\"location\"].n_unique()}')
print(f'Clades: {df[\"clade\"].unique().sort()}')
"
Expected output:
- Rows for all location-date-clade combinations with sequence counts
- 52 locations (50 states + DC + PR)
- Clades matching those in auxiliary-data/modeled-clades/2024-10-09.json
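The clade check in the last bullet can be automated. The sketch below assumes the round's JSON file has a top-level "clades" list; that key is an assumption based on the description of the file in this guide, so confirm it against an actual file first. The function name `compare_clades` is hypothetical.

```python
def compare_clades(observed_clades, round_config: dict) -> dict:
    """Compare clades found in the oracle output with the round configuration.

    Assumes round_config has a top-level "clades" list (unverified assumption).
    """
    expected = set(round_config["clades"])
    observed = set(observed_clades)
    return {
        "unexpected": sorted(observed - expected),  # in the data, not in the config
        "missing": sorted(expected - observed),     # in the config, not in the data
    }
```

To use it, load the configuration with `json.load` and pass the clade column as a list, for example `df["clade"].unique().to_list()` with polars. A non-empty "missing" list is not necessarily an error, since a modeled clade may simply have no sequences in the period.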
Check Time-Series Data
# View time-series data
uv run python -c "
import polars as pl
df = pl.read_parquet('target-data/time-series/2024-10-09-variant-nowcast-hub-time-series.parquet')
print(df.head())
print(f'Date range: {df[\"target_date\"].min()} to {df[\"target_date\"].max()}')
"
Expected output:
- Historical sequence counts for model training
- Date range extending back from nowcast_date
Step 5: Understanding Fallback Behavior
When Fallback Activates
The fallback triggers when:
- Nextstrain S3 doesn't have metadata_version.json for the requested date
- The URL is empty or invalid
Archive Search Strategy
- Try exact date match: Check for auxiliary-data/modeled-clades/YYYY-MM-DD.json
- Search up to 30 days back: If exact match not found, try each prior date
- Raise error if not found: No archive within 30-day window
Example timeline:
Requested date: 2024-10-12 (Saturday, not a round date)
↓
Try 2024-10-12.json → 404
↓
Try 2024-10-11.json → 404
↓
Try 2024-10-10.json → 404
↓
Try 2024-10-09.json → ✅ Found! (Wednesday round)
↓
Use metadata from 2024-10-09 archive
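The search strategy above can be sketched as a short backward scan. This is a minimal illustration of the described behavior, not CladeTime's actual implementation; the function name `find_nearest_archive` and the `archive_exists` callback are hypothetical stand-ins for whatever lookup the library performs.

```python
from datetime import date, timedelta
from typing import Callable


def find_nearest_archive(
    requested: date,
    archive_exists: Callable[[date], bool],
    max_days_back: int = 30,
) -> date:
    """Return the latest archive date on or before `requested`, within the window.

    `archive_exists` is a caller-supplied check (for example, an HTTP request
    for auxiliary-data/modeled-clades/YYYY-MM-DD.json).
    """
    # Offset 0 is the exact-date match; larger offsets walk back one day at a time.
    for offset in range(max_days_back + 1):
        candidate = requested - timedelta(days=offset)
        if archive_exists(candidate):
            return candidate
    raise ValueError(
        f"No archive found within {max_days_back} days of {requested.isoformat()}"
    )
```

With a set of known round dates, `find_nearest_archive(date(2024, 10, 12), known_dates.__contains__)` reproduces the timeline above and returns 2024-10-09.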
Archive Availability
Variant-nowcast-hub archives are available for rounds starting:
- First archive: September 4, 2024 (2024-09-04.json)
- Frequency: Weekly (every Wednesday)
- Location: auxiliary-data/modeled-clades/YYYY-MM-DD.json
Important: Dates before September 2024 will fail if Nextstrain S3 also lacks data.
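Because rounds are weekly on Wednesdays starting 2024-09-04, a list of round dates for batch regeneration can be generated rather than typed by hand. The sketch below assumes every Wednesday has a round, as stated above; the function name `round_dates` is hypothetical, and any skipped weeks in the real hub would need to be filtered out separately.

```python
from datetime import date, timedelta

FIRST_ARCHIVE = date(2024, 9, 4)  # first archived round, per this guide
WEDNESDAY = 2  # date.weekday() numbering: Monday is 0


def round_dates(start: date, count: int) -> list[date]:
    """Return `count` consecutive weekly round dates beginning at `start`."""
    if start < FIRST_ARCHIVE:
        raise ValueError(f"{start.isoformat()} is before the first archive")
    if start.weekday() != WEDNESDAY:
        raise ValueError(f"{start.isoformat()} is not a Wednesday")
    return [start + timedelta(weeks=i) for i in range(count)]
```

For example, `round_dates(date(2024, 9, 11), 13)` gives the 13 rounds from 2024-09-11 through 2024-12-04.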
Troubleshooting
Error: "No archive found within 30 days"
Cause: Requested date is too far from any archived round (e.g., before September 2024)
Solution:
- Use a date ≥ September 4, 2024
- Or ensure Nextstrain S3 still has metadata for that date
Error: "No version of sequences.fasta.zst found"
Cause: Nextstrain deleted historical sequence files (not just metadata)
Solution:
- Use more recent dates where sequence files still exist
- CladeTime fallback only covers metadata, not sequence files
Clades Don't Match Round Configuration
Cause: Using wrong reference tree version
Solution:
- Verify tree_as_of parameter matches round opening date
- Check auxiliary-data/modeled-clades/[round_id].json has correct metadata
- The script should automatically read this from the JSON file
Unexpected Clade Assignments
Cause: Reference tree version mismatch between round opening and evaluation
Solution:
- Always use tree_as_of from the round's modeled-clades metadata
- This ensures reproducibility even if Nextstrain updates clade definitions
Monitoring Logs
When running workflows, monitor logs for fallback activation:
Successful Fallback (Expected)
{"event": "Nextstrain S3 metadata not available, will use Hub fallback", "date": "2024-10-09", "level": "warning"}
{"event": "Attempting fallback to variant-nowcast-hub archives", "date": "2024-10-09", "level": "info"}
{"event": "Successfully retrieved metadata from Hub fallback", "level": "info"}
S3 Success (No Fallback Needed)
{"event": "Retrieved ncov metadata from S3", "level": "info"}
Fallback Failure (Action Required)
{"event": "Hub fallback failed", "error": "No archive found within 30 days", "level": "error"}
{"event": "Both S3 and Hub fallback failed to retrieve metadata", "level": "warn"}
Best Practices
1. Archive Maintenance
Never delete files in auxiliary-data/modeled-clades/:
- These are the fallback source for historical metadata
- Required for reproducible target data generation
- Since the October 2025 cleanup, CladeTime depends on them for historical dates
2. Testing New CladeTime Versions
Before deploying to production:
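# Back up the existing output before regenerating it
cp target-data/oracle-output/2024-09-11-variant-nowcast-hub-oracle.parquet \
target-data/oracle-output/2024-09-11-variant-nowcast-hub-oracle.parquet.backup
# Test with a known historical round
uv run --with-requirements src/requirements.txt \
src/get_target_data.py \
--nowcast-date=2024-09-11
# Compare output with the backup
# (Parquet is binary, so diff only reports whether the bytes differ)
diff target-data/oracle-output/2024-09-11-variant-nowcast-hub-oracle.parquet \
target-data/oracle-output/2024-09-11-variant-nowcast-hub-oracle.parquet.backup
3. Workflow Re-runs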
When re-running workflows for past rounds:
- Use the workflow dispatch feature rather than editing workflow schedules
- Specify the exact nowcast-date to avoid ambiguity
- Monitor logs for any fallback warnings
- Verify generated files have expected structure and content
4. Dependency Updates
When updating CladeTime:
- Check CHANGELOG.md for breaking changes
- Test with at least one historical round before batch regeneration
- Update src/requirements.txt with specific version (not just >=)
- Document the update in variant-nowcast-hub's commit message
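Before a batch regeneration, it may be worth confirming that the environment actually resolved to a CladeTime release with the fallback. The sketch below is a simplified version comparison that looks only at the leading numeric components, so a pre-release such as "2.0.0rc1" would be treated as 2.0.0; the function names are hypothetical.

```python
import re


def release_tuple(version_string: str, width: int = 3) -> tuple[int, ...]:
    """Parse the leading numeric part of a version string, e.g. '2.0.0' -> (2, 0, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"Cannot parse version: {version_string!r}")
    parts = [int(p) for p in match.group().split(".")]
    # Pad short versions such as '2.0' so tuple comparison behaves as expected.
    parts += [0] * (width - len(parts))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "2.0.0") -> bool:
    """Return True if the installed version is at least the minimum."""
    return release_tuple(installed) >= release_tuple(minimum)
```

The installed version string can be obtained from the standard library with `importlib.metadata.version("cladetime")` and passed to `meets_minimum`.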
Example: Complete Workflow for Round 2024-10-09
This example shows the complete process for generating target data for a specific round:
# 1. Ensure CladeTime v2.0.0+ is installed
cd /path/to/variant-nowcast-hub
uv pip list | grep cladetime
# Should show: cladetime 2.0.0 or higher
# 2. Verify round configuration exists
cat auxiliary-data/modeled-clades/2024-10-09.json
# Should show: clades list and metadata with reference tree version
# 3. Generate target data
uv run --with-requirements src/requirements.txt \
src/get_target_data.py \
--nowcast-date=2024-10-09
# 4. Verify oracle output was created
ls -lh target-data/oracle-output/2024-10-09-variant-nowcast-hub-oracle.parquet
# 5. Verify time-series data was created
ls -lh target-data/time-series/2024-10-09-variant-nowcast-hub-time-series.parquet
# 6. Quick data quality check
uv run python -c "
import polars as pl
# Check oracle output
oracle = pl.read_parquet('target-data/oracle-output/2024-10-09-variant-nowcast-hub-oracle.parquet')
print(f'Oracle rows: {len(oracle)}')
print(f'Locations: {oracle[\"location\"].n_unique()}')
print(f'Clades: {sorted(oracle[\"clade\"].unique())}')
# Check time-series
ts = pl.read_parquet('target-data/time-series/2024-10-09-variant-nowcast-hub-time-series.parquet')
print(f'Time-series rows: {len(ts)}')
print(f'Date range: {ts[\"target_date\"].min()} to {ts[\"target_date\"].max()}')
"
# 7. If everything looks good, commit the data
# git add target-data/oracle-output/2024-10-09-variant-nowcast-hub-oracle.parquet
# git add target-data/time-series/2024-10-09-variant-nowcast-hub-time-series.parquet
# git commit -m "Generate target data for round 2024-10-09"
# git push
Summary
With CladeTime v2.0.0, variant-nowcast-hub can:
- ✅ Generate target data for any round on or after September 4, 2024
- ✅ Work transparently despite Nextstrain's October 2025 cleanup
- ✅ Maintain reproducibility with versioned reference trees
- ✅ Continue automated workflows without code changes
The fallback mechanism ensures reliable target data generation for model evaluation while preserving the integrity of historical analyses.
Next Steps:
- Wait for CladeTime v2.0.0 release to PyPI
- Update src/requirements.txt in variant-nowcast-hub
- Test with a historical round (e.g., 2024-09-11)
- Resume normal workflow operations
Questions or Issues?
- CladeTime: https://github.com/reichlab/cladetime/issues
- variant-nowcast-hub: https://github.com/reichlab/variant-nowcast-hub/issues