Potential long-term fix: Use Hub archives as a fallback for historical metadata #815

@trobacker

Using CladeTime v2.0.0 in variant-nowcast-hub

Date: November 7, 2025
Context: Applies if CladeTime PR #181 (metadata fallback) is merged
Purpose: Guide for updating variant-nowcast-hub to use the fixed CladeTime version
Note: Experimenting with Claude to resolve this issue.


Background

The Problem

In October 2025, Nextstrain began deleting historical metadata_version.json files from S3, causing variant-nowcast-hub's target data generation workflows to fail with:

ValueError: No version of files/ncov/open/metadata_version.json found before [date]

This broke the ability to generate:

  • Oracle output: Gold standard evaluation data (90 days post-round)
  • Time-series data: Historical sequence counts for model training

The Solution

CladeTime v2.0.0 introduces automatic fallback to variant-nowcast-hub's own versioned metadata archives in auxiliary-data/modeled-clades/ when Nextstrain S3 doesn't have historical metadata files.

Key Features:

  • Transparent fallback - no code changes needed by users
  • Archives date back to September 2024 (first round: 2024-09-04)
  • Preserves reproducibility with versioned reference trees
  • Works seamlessly for both historical and current dates

Step 1: Update CladeTime Dependency

Once CladeTime v2.0.0 is released to PyPI, update variant-nowcast-hub's dependency:

Option A: Update src/requirements.txt

cd /path/to/variant-nowcast-hub

Edit src/requirements.txt:

- cladetime>=1.5.0
+ cladetime>=2.0.0

Option B: Use inline script metadata (recommended)

If your Python scripts use inline PEP 723 metadata, update the dependency there:

# /// script
# dependencies = [
#   "cladetime>=2.0.0",
#   "polars>=1.0.0",
#   ...
# ]
# ///

Step 2: Generate Target Data for a Specific Round

The primary script for generating target data is src/get_target_data.py. This script:

  • Reads round configuration from auxiliary-data/modeled-clades/[round_id].json
  • Uses CladeTime to assign clades with the correct reference tree version
  • Generates both oracle-output/ and time-series/ data

Manual Execution (Recommended for Testing)

# Navigate to variant-nowcast-hub repo
cd /path/to/variant-nowcast-hub

# Generate target data for a specific round (e.g., 2024-10-09)
uv run --with-requirements src/requirements.txt \
    src/get_target_data.py \
    --nowcast-date=2024-10-09

What this does:

  1. Reads auxiliary-data/modeled-clades/2024-10-09.json for reference tree version
  2. Downloads sequences from Nextstrain (as of nowcast_date + 90 days by default)
  3. Uses CladeTime with tree_as_of = round opening date (from modeled-clades metadata)
  4. Assigns clades using the reference tree that was current when the round opened
  5. Generates:
    • target-data/oracle-output/2024-10-09-variant-nowcast-hub-oracle.parquet
    • target-data/time-series/2024-10-09-variant-nowcast-hub-time-series.parquet
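
The two output locations in step 5 follow a fixed naming pattern, so a post-run check can be scripted. The sketch below rebuilds the paths from the pattern shown above; the function names and the repo_root parameter are assumptions made for illustration, not part of get_target_data.py.

```python
from datetime import date
from pathlib import Path


def expected_outputs(nowcast_date: date, repo_root: Path = Path(".")) -> dict[str, Path]:
    """Build the oracle and time-series paths for one round."""
    round_id = nowcast_date.isoformat()
    target = repo_root / "target-data"
    return {
        "oracle": target / "oracle-output"
        / f"{round_id}-variant-nowcast-hub-oracle.parquet",
        "time_series": target / "time-series"
        / f"{round_id}-variant-nowcast-hub-time-series.parquet",
    }


def missing_outputs(nowcast_date: date, repo_root: Path = Path(".")) -> list[str]:
    """Return the names of expected outputs that do not exist on disk."""
    outputs = expected_outputs(nowcast_date, repo_root)
    return [name for name, path in outputs.items() if not path.exists()]


if __name__ == "__main__":
    # Run from the repository root after generating a round
    print(missing_outputs(date(2024, 10, 9)))
```

An empty list means both files were written for that round.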

Testing the Fallback

To verify the fallback is working for historical dates:

# Test with a round before Nextstrain's October 2025 cleanup
uv run --with-requirements src/requirements.txt \
    src/get_target_data.py \
    --nowcast-date=2024-09-11

Expected behavior:

{"event": "Nextstrain S3 metadata not available, will use Hub fallback", "date": "2024-09-11", ...}
{"event": "Attempting fallback to variant-nowcast-hub archives", "date": "2024-09-11", ...}
{"event": "Successfully retrieved metadata from Hub fallback", ...}

The script falls back to the most recent archive in auxiliary-data/modeled-clades/ on or before the requested date (searching up to 30 days back).

Important Date Considerations

The script uses two key reference dates:

  1. tree_as_of: Reference tree date from when the round opened
    • Read from auxiliary-data/modeled-clades/[round_id].json metadata
    • Ensures consistent clade definitions for reproducibility
  2. sequence_as_of: Date to retrieve sequences (default: nowcast_date + 90 days)
    • Ensures nearly all sequences have been reported to Nextstrain
    • Can be overridden with the --sequence-as-of flag

Example:

# Generate target data with custom sequence retrieval date
uv run --with-requirements src/requirements.txt \
    src/get_target_data.py \
    --nowcast-date=2024-10-09 \
    --sequence-as-of=2025-01-07
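
The default sequence_as_of is plain date arithmetic, which makes it easy to sanity-check a value before passing it on the command line. A minimal sketch, assuming only the "nowcast_date + 90 days" rule stated above (the function and constant names are illustrative):

```python
from datetime import date, timedelta

# Default described above: sequences are retrieved as of nowcast_date + 90 days
DEFAULT_COLLECTION_DAYS = 90


def default_sequence_as_of(nowcast_date: date, days: int = DEFAULT_COLLECTION_DAYS) -> date:
    """Return the default sequence retrieval date for a round."""
    return nowcast_date + timedelta(days=days)


print(default_sequence_as_of(date(2024, 10, 9)))  # 2025-01-07
```

By this arithmetic, the 2025-01-07 value used in the example above coincides with the default for the 2024-10-09 round.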

Step 3: Batch Regenerate Multiple Rounds

If you need to regenerate target data for multiple rounds (e.g., after a CladeTime update):

Option A: Shell Loop

cd /path/to/variant-nowcast-hub

# Regenerate 13 consecutive weekly rounds (typical for rolling window evaluation)
for round_id in 2024-09-11 2024-09-18 2024-09-25 2024-10-02 2024-10-09 \
                2024-10-16 2024-10-23 2024-10-30 2024-11-06 2024-11-13 \
                2024-11-20 2024-11-27 2024-12-04; do
    echo "Generating target data for round: $round_id"
    uv run --with-requirements src/requirements.txt \
        src/get_target_data.py \
        --nowcast-date="$round_id"
done
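
Rather than typing each round by hand, the weekly dates can be generated. The sketch below reproduces the 13 dates from the loop above and builds the same uv command for each; the helper names are illustrative, and the commands are only printed, not executed.

```python
from datetime import date, timedelta


def weekly_rounds(first: date, count: int) -> list[date]:
    """Return `count` round dates spaced one week apart, starting at `first`."""
    return [first + timedelta(weeks=i) for i in range(count)]


def target_data_command(round_date: date) -> list[str]:
    """Build the get_target_data.py invocation used in this guide."""
    return [
        "uv", "run", "--with-requirements", "src/requirements.txt",
        "src/get_target_data.py",
        f"--nowcast-date={round_date.isoformat()}",
    ]


rounds = weekly_rounds(date(2024, 9, 11), 13)
for round_date in rounds:
    # Print rather than execute, so the list can be reviewed first
    print(" ".join(target_data_command(round_date)))
```

To actually run a round, each command list could be passed to subprocess.run(cmd, check=True) from the repository root.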

Option B: Using GitHub Actions Workflow Dispatch

The run-post-submission-jobs.yaml workflow can be manually triggered for any past round:

  1. Go to: https://github.com/reichlab/variant-nowcast-hub/actions/workflows/run-post-submission-jobs.yaml
  2. Click "Run workflow"
  3. Enter the nowcast-date (e.g., 2024-10-09)
  4. Click "Run workflow"

What the workflow does:

  1. Runs get_location_date_counts.py to identify unscored location-dates
  2. Runs get_target_data.py for the specified round + 13 historical rounds
  3. Commits generated files to the repository

Step 4: Verify Generated Data

After running get_target_data.py, verify the output:

Check Oracle Output

# View oracle output for a round
uv run python -c "
import polars as pl
df = pl.read_parquet('target-data/oracle-output/2024-10-09-variant-nowcast-hub-oracle.parquet')
print(df.head())
print(f'Total rows: {len(df)}')
print(f'Locations: {df[\"location\"].n_unique()}')
print(f'Clades: {df[\"clade\"].unique().sort()}')
"

Expected output:

  • Rows for all location-date-clade combinations with sequence counts
  • 52 locations (50 states + DC + PR)
  • Clades matching those in auxiliary-data/modeled-clades/2024-10-09.json
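
These expectations can be turned into an automated check. The sketch below works on plain Python sets so it does not depend on any particular DataFrame library; the function name is hypothetical, and the clade labels in the demo call are placeholders rather than real hub data.

```python
def check_oracle_summary(
    locations: set[str],
    clades: set[str],
    expected_clades: set[str],
    expected_location_count: int = 52,  # 50 states + DC + PR, per the list above
) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if len(locations) != expected_location_count:
        problems.append(
            f"expected {expected_location_count} locations, found {len(locations)}"
        )
    # Clades present in the output but absent from the round configuration
    unexpected = sorted(clades - expected_clades)
    if unexpected:
        problems.append(f"clades not in round configuration: {unexpected}")
    return problems


# Placeholder values for illustration only (not real hub data)
print(check_oracle_summary({"MA", "PR"}, {"clade_a", "clade_x"}, {"clade_a", "clade_b"}))
```

With polars, the input sets can be obtained with set(df["location"].unique().to_list()) and set(df["clade"].unique().to_list()); the expected clades would come from the round's modeled-clades JSON file.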

Check Time-Series Data

# View time-series data
uv run python -c "
import polars as pl
df = pl.read_parquet('target-data/time-series/2024-10-09-variant-nowcast-hub-time-series.parquet')
print(df.head())
print(f'Date range: {df[\"target_date\"].min()} to {df[\"target_date\"].max()}')
"

Expected output:

  • Historical sequence counts for model training
  • Date range extending back from nowcast_date

Step 5: Understanding Fallback Behavior

When Fallback Activates

The fallback triggers when:

  1. Nextstrain S3 doesn't have metadata_version.json for the requested date
  2. The URL is empty or invalid

Archive Search Strategy

  1. Try exact date match: Check for auxiliary-data/modeled-clades/YYYY-MM-DD.json
  2. Search up to 30 days back: If exact match not found, try each prior date
  3. Raise error if not found: No archive within 30-day window

Example timeline:

Requested date: 2024-10-12 (Saturday, no round on that date)
  ↓
Try 2024-10-12.json → 404
  ↓
Try 2024-10-11.json → 404
  ↓
Try 2024-10-10.json → 404
  ↓
Try 2024-10-09.json → ✅ Found! (Wednesday round)
  ↓
Use metadata from 2024-10-09 archive
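
The search strategy above can be sketched in a few lines. This is an illustration of the described behavior, not CladeTime's actual code: the function name, the use of a set of known archive dates (instead of HTTP requests), the LookupError, and treating the 30-day window as inclusive are all assumptions.

```python
from datetime import date, timedelta


def find_archive_date(
    requested: date,
    available: set[date],
    max_days_back: int = 30,
) -> date:
    """Return the newest archive date on or before `requested`, within the window."""
    for offset in range(max_days_back + 1):  # offset 0 is the exact-date match
        candidate = requested - timedelta(days=offset)
        if candidate in available:
            return candidate
    raise LookupError(f"No archive found within {max_days_back} days of {requested}")


# Weekly Wednesday archives starting 2024-09-04, as described in this guide
archives = {date(2024, 9, 4) + timedelta(weeks=i) for i in range(8)}
print(find_archive_date(date(2024, 10, 12), archives))  # 2024-10-09
```

The demo reproduces the timeline above: three misses (10-12, 10-11, 10-10) followed by a match on the Wednesday round.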

Archive Availability

Variant-nowcast-hub archives are available as follows:

  • First archive: September 4, 2024 (2024-09-04.json)
  • Frequency: Weekly (every Wednesday)
  • Location: auxiliary-data/modeled-clades/YYYY-MM-DD.json

Important: Dates before September 2024 will fail if Nextstrain S3 also lacks data.


Troubleshooting

Error: "No archive found within 30 days"

Cause: Requested date is too far from any archived round (e.g., before September 2024)

Solution:

  • Use a date ≥ September 4, 2024
  • Or ensure Nextstrain S3 still has metadata for that date

Error: "No version of sequences.fasta.zst found"

Cause: Nextstrain deleted historical sequence files (not just metadata)

Solution:

  • Use more recent dates where sequence files still exist
  • CladeTime fallback only covers metadata, not sequence files

Clades Don't Match Round Configuration

Cause: Using wrong reference tree version

Solution:

  • Verify tree_as_of parameter matches round opening date
  • Check auxiliary-data/modeled-clades/[round_id].json has correct metadata
  • The script should automatically read this from the JSON file

Unexpected Clade Assignments

Cause: Reference tree version mismatch between round opening and evaluation

Solution:

  • Always use tree_as_of from the round's modeled-clades metadata
  • This ensures reproducibility even if Nextstrain updates clade definitions

Monitoring Logs

When running workflows, monitor logs for fallback activation:

Successful Fallback (Expected)

{"event": "Nextstrain S3 metadata not available, will use Hub fallback", "date": "2024-10-09", "level": "warning"}
{"event": "Attempting fallback to variant-nowcast-hub archives", "date": "2024-10-09", "level": "info"}
{"event": "Successfully retrieved metadata from Hub fallback", "level": "info"}

S3 Success (No Fallback Needed)

{"event": "Retrieved ncov metadata from S3", "level": "info"}

Fallback Failure (Action Required)

{"event": "Hub fallback failed", "error": "No archive found within 30 days", "level": "error"}
{"event": "Both S3 and Hub fallback failed to retrieve metadata", "level": "warning"}
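
For long workflow logs, the three cases above can be told apart programmatically. The sketch below parses JSON log lines and classifies a run by the event strings quoted in this guide; the function name is illustrative, and the event text is assumed to match what the released CladeTime actually emits.

```python
import json

# Event strings as quoted in this guide (assumed to match the released logs)
FALLBACK_USED = "Successfully retrieved metadata from Hub fallback"
S3_USED = "Retrieved ncov metadata from S3"
FALLBACK_FAILED = "Hub fallback failed"


def classify_log(lines: list[str]) -> str:
    """Classify a run as 'failed', 'fallback', 's3', or 'unknown'."""
    events = set()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON output mixed into the log
        if isinstance(record, dict):
            events.add(record.get("event"))
    if FALLBACK_FAILED in events:
        return "failed"
    if FALLBACK_USED in events:
        return "fallback"
    if S3_USED in events:
        return "s3"
    return "unknown"


sample = [
    '{"event": "Nextstrain S3 metadata not available, will use Hub fallback", "date": "2024-10-09", "level": "warning"}',
    '{"event": "Successfully retrieved metadata from Hub fallback", "level": "info"}',
]
print(classify_log(sample))  # fallback
```

A result of "failed" corresponds to the action-required case; "fallback" and "s3" are both healthy outcomes.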

Best Practices

1. Archive Maintenance

Never delete files in auxiliary-data/modeled-clades/:

  • These are the fallback source for historical metadata
  • Required for reproducible target data generation
  • Since Nextstrain's October 2025 cleanup, CladeTime depends on them for historical dates

2. Testing New CladeTime Versions

Before deploying to production:

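# Back up the existing target data first, so there is something to compare against
cp target-data/oracle-output/2024-09-11-variant-nowcast-hub-oracle.parquet \
   target-data/oracle-output/2024-09-11-variant-nowcast-hub-oracle.parquet.backup

# Test with a known historical round
uv run --with-requirements src/requirements.txt \
    src/get_target_data.py \
    --nowcast-date=2024-09-11

# Compare the regenerated output with the backup
# (Parquet is binary, so diff only reports whether the files differ)
diff target-data/oracle-output/2024-09-11-variant-nowcast-hub-oracle.parquet \
     target-data/oracle-output/2024-09-11-variant-nowcast-hub-oracle.parquet.backup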

3. Workflow Re-runs

When re-running workflows for past rounds:

  • Use the workflow dispatch feature rather than editing workflow schedules
  • Specify exact nowcast-date to avoid ambiguity
  • Monitor logs for any fallback warnings
  • Verify generated files have expected structure and content

4. Dependency Updates

When updating CladeTime:

  • Check CHANGELOG.md for breaking changes
  • Test with at least one historical round before batch regeneration
  • Pin src/requirements.txt to a specific version (e.g., cladetime==2.0.0) rather than only a >= lower bound
  • Document the update in variant-nowcast-hub's commit message

Example: Complete Workflow for Round 2024-10-09

This example shows the complete process for generating target data for a specific round:

# 1. Ensure CladeTime v2.0.0+ is installed
cd /path/to/variant-nowcast-hub
uv pip list | grep cladetime
# Should show: cladetime 2.0.0 or higher

# 2. Verify round configuration exists
cat auxiliary-data/modeled-clades/2024-10-09.json
# Should show: clades list and metadata with reference tree version

# 3. Generate target data
uv run --with-requirements src/requirements.txt \
    src/get_target_data.py \
    --nowcast-date=2024-10-09

# 4. Verify oracle output was created
ls -lh target-data/oracle-output/2024-10-09-variant-nowcast-hub-oracle.parquet

# 5. Verify time-series data was created
ls -lh target-data/time-series/2024-10-09-variant-nowcast-hub-time-series.parquet

# 6. Quick data quality check
uv run python -c "
import polars as pl

# Check oracle output
oracle = pl.read_parquet('target-data/oracle-output/2024-10-09-variant-nowcast-hub-oracle.parquet')
print(f'Oracle rows: {len(oracle)}')
print(f'Locations: {oracle[\"location\"].n_unique()}')
print(f'Clades: {sorted(oracle[\"clade\"].unique())}')

# Check time-series
ts = pl.read_parquet('target-data/time-series/2024-10-09-variant-nowcast-hub-time-series.parquet')
print(f'Time-series rows: {len(ts)}')
print(f'Date range: {ts[\"target_date\"].min()} to {ts[\"target_date\"].max()}')
"

# 7. If everything looks good, commit the data
# git add target-data/oracle-output/2024-10-09-variant-nowcast-hub-oracle.parquet
# git add target-data/time-series/2024-10-09-variant-nowcast-hub-time-series.parquet
# git commit -m "Generate target data for round 2024-10-09"
# git push

Summary

With CladeTime v2.0.0, variant-nowcast-hub can:

  • ✅ Generate target data for any round on or after September 4, 2024
  • ✅ Work transparently despite Nextstrain's October 2025 cleanup
  • ✅ Maintain reproducibility with versioned reference trees
  • ✅ Continue automated workflows without code changes

The fallback mechanism ensures reliable target data generation for model evaluation while preserving the integrity of historical analyses.

Next Steps:

  1. Wait for CladeTime v2.0.0 release to PyPI
  2. Update src/requirements.txt in variant-nowcast-hub
  3. Test with a historical round (e.g., 2024-09-11)
  4. Resume normal workflow operations

Questions or Issues?
