Potential long-term fix: Use Hub archives as a fallback for historical metadata #815

# Using CladeTime v2.0.0 in variant-nowcast-hub

**Date**: November 7, 2025
**Context**: Assumes CladeTime PR #181 (metadata fallback) has been merged
**Purpose**: Guide for updating variant-nowcast-hub to use the fixed CladeTime version
**Note**: Experimenting with Claude to resolve this issue.

---

## Background

### The Problem

In October 2025, Nextstrain began deleting historical `metadata_version.json` files from S3, causing variant-nowcast-hub's target data generation workflows to fail with:

```
ValueError: No version of files/ncov/open/metadata_version.json found before [date]
```

This broke the ability to generate:
- **Oracle output**: Gold standard evaluation data (90 days post-round)
- **Time-series data**: Historical sequence counts for model training

### The Solution

CladeTime v2.0.0 introduces automatic fallback to variant-nowcast-hub's own versioned metadata archives in `auxiliary-data/modeled-clades/` when Nextstrain S3 doesn't have historical metadata files.

**Key Features**:
- Transparent fallback - no code changes needed by users
- Archives date back to September 2024 (first round: 2024-09-04)
- Preserves reproducibility with versioned reference trees
- Works seamlessly for both historical and current dates

---

## Step 1: Update CladeTime Dependency

Once CladeTime v2.0.0 is released to PyPI, update variant-nowcast-hub's dependency:

### Option A: Update `src/requirements.txt`

```bash
cd /path/to/variant-nowcast-hub
```

Edit `src/requirements.txt`:
```diff
- cladetime>=1.5.0
+ cladetime>=2.0.0
```

### Option B: Use inline script metadata (recommended)

If your Python scripts use inline PEP 723 metadata, update the dependency there:

```python
# /// script
# dependencies = [
#   "cladetime>=2.0.0",
#   "polars>=1.0.0",
#   ...
# ]
# ///
```

---

## Step 2: Generate Target Data for a Specific Round

The primary script for generating target data is `src/get_target_data.py`. This script:
- Reads round configuration from `auxiliary-data/modeled-clades/[round_id].json`
- Uses CladeTime to assign clades with the correct reference tree version
- Generates both `oracle-output/` and `time-series/` data

### Manual Execution (Recommended for Testing)

```bash
# Navigate to variant-nowcast-hub repo
cd /path/to/variant-nowcast-hub

# Generate target data for a specific round (e.g., 2024-10-09)
uv run --with-requirements src/requirements.txt \
    src/get_target_data.py \
    --nowcast-date=2024-10-09
```

**What this does**:
1. Reads `auxiliary-data/modeled-clades/2024-10-09.json` for reference tree version
2. Downloads sequences from Nextstrain (as of nowcast_date + 90 days by default)
3. Uses CladeTime with `tree_as_of` = round opening date (from modeled-clades metadata)
4. Assigns clades using the reference tree that was current when the round opened
5. Generates:
   - `target-data/oracle-output/2024-10-09-variant-nowcast-hub-oracle.parquet`
   - `target-data/time-series/2024-10-09-variant-nowcast-hub-time-series.parquet`
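The two output paths follow a fixed naming pattern, so a small helper can confirm that both files exist after a run. This is a minimal sketch; `expected_target_files` and `missing_target_files` are hypothetical names, and the paths are taken from the list above.

```python
from pathlib import Path


def expected_target_files(nowcast_date: str, hub_root: str = ".") -> dict[str, Path]:
    """Build the oracle and time-series paths for one round."""
    root = Path(hub_root) / "target-data"
    return {
        "oracle": root / "oracle-output"
        / f"{nowcast_date}-variant-nowcast-hub-oracle.parquet",
        "time_series": root / "time-series"
        / f"{nowcast_date}-variant-nowcast-hub-time-series.parquet",
    }


def missing_target_files(nowcast_date: str, hub_root: str = ".") -> list[str]:
    """Return the names of expected outputs that are not on disk."""
    files = expected_target_files(nowcast_date, hub_root)
    return [name for name, path in files.items() if not path.is_file()]


if __name__ == "__main__":
    missing = missing_target_files("2024-10-09")
    print("all outputs present" if not missing else f"missing: {missing}")
```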

### Testing the Fallback

To verify the fallback is working for historical dates:

```bash
# Test with a round before Nextstrain's October 2025 cleanup
uv run --with-requirements src/requirements.txt \
    src/get_target_data.py \
    --nowcast-date=2024-09-11
```

**Expected behavior**:
```
{"event": "Nextstrain S3 metadata not available, will use Hub fallback", "date": "2024-09-11", ...}
{"event": "Attempting fallback to variant-nowcast-hub archives", "date": "2024-09-11", ...}
{"event": "Successfully retrieved metadata from Hub fallback", ...}
```

If no archive exists for the exact date, the script falls back to the most recent earlier archive in `auxiliary-data/modeled-clades/`, searching up to 30 days back.

### Important Date Considerations

The script uses **two key reference dates**:

1. **`tree_as_of`**: Reference tree date from when the round opened
   - Read from `auxiliary-data/modeled-clades/[round_id].json` metadata
   - Ensures consistent clade definitions for reproducibility

2. **`sequence_as_of`**: Date to retrieve sequences (default: nowcast_date + 90 days)
   - Ensures nearly all sequences have been reported to Nextstrain
   - Can be overridden with `--sequence-as-of` flag

**Example**:
```bash
# Generate target data with an explicit sequence retrieval date
# (2025-01-07 is nowcast_date + 90 days, so this matches the default)
uv run --with-requirements src/requirements.txt \
    src/get_target_data.py \
    --nowcast-date=2024-10-09 \
    --sequence-as-of=2025-01-07
```
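The default `sequence_as_of` is simple date arithmetic. The sketch below reproduces it with the standard library; `default_sequence_as_of` is a hypothetical helper name, not a function from the hub scripts.

```python
from datetime import date, timedelta


def default_sequence_as_of(nowcast_date: str, days_after: int = 90) -> date:
    """Return the date that lies `days_after` days after the nowcast date."""
    return date.fromisoformat(nowcast_date) + timedelta(days=days_after)


print(default_sequence_as_of("2024-10-09"))  # 2025-01-07
```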

---

## Step 3: Batch Regenerate Multiple Rounds

If you need to regenerate target data for multiple rounds (e.g., after CladeTime update):

### Option A: Shell Loop

```bash
cd /path/to/variant-nowcast-hub

# Regenerate 13 consecutive rounds (typical for rolling window evaluation)
for round_id in 2024-09-11 2024-09-18 2024-09-25 2024-10-02 2024-10-09 \
                2024-10-16 2024-10-23 2024-10-30 2024-11-06 2024-11-13 \
                2024-11-20 2024-11-27 2024-12-04; do
    echo "Generating target data for round: $round_id"
    uv run --with-requirements src/requirements.txt \
        src/get_target_data.py \
        --nowcast-date="$round_id"
done
```
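Rather than typing the round dates by hand, the list can be generated. This sketch assumes rounds open every Wednesday with no skipped weeks; `weekly_round_ids` is a hypothetical helper name.

```python
from datetime import date, timedelta


def weekly_round_ids(first_round: str, count: int) -> list[str]:
    """Return `count` consecutive weekly round IDs starting at `first_round`."""
    start = date.fromisoformat(first_round)
    if start.weekday() != 2:  # Monday is 0, so Wednesday is 2
        raise ValueError(f"{first_round} is not a Wednesday")
    return [(start + timedelta(weeks=i)).isoformat() for i in range(count)]


if __name__ == "__main__":
    # Prints the same 13 dates used in the shell loop, one per line.
    print("\n".join(weekly_round_ids("2024-09-11", 13)))
```

The printed dates can be piped into a shell `while read` loop or used to drive the script from Python.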

### Option B: Using GitHub Actions Workflow Dispatch

The `run-post-submission-jobs.yaml` workflow can be manually triggered for any past round:

1. Go to: https://github.com/reichlab/variant-nowcast-hub/actions/workflows/run-post-submission-jobs.yaml
2. Click "Run workflow"
3. Enter the `nowcast-date` (e.g., `2024-10-09`)
4. Click "Run workflow"

**What the workflow does**:
1. Runs `get_location_date_counts.py` to identify unscored location-dates
2. Runs `get_target_data.py` for the specified round + 13 historical rounds
3. Commits generated files to the repository

---

## Step 4: Verify Generated Data

After running `get_target_data.py`, verify the output:

### Check Oracle Output

```bash
# View oracle output for a round (--with polars makes polars available to the one-off interpreter)
uv run --with polars python -c "
import polars as pl
df = pl.read_parquet('target-data/oracle-output/2024-10-09-variant-nowcast-hub-oracle.parquet')
print(df.head())
print(f'Total rows: {len(df)}')
print(f'Locations: {df[\"location\"].n_unique()}')
print(f'Clades: {df[\"clade\"].unique().sort()}')
"
```

**Expected output**:
- Rows for all location-date-clade combinations with sequence counts
- 52 locations (50 states + DC + PR)
- Clades matching those in `auxiliary-data/modeled-clades/2024-10-09.json`
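The expectations above can be turned into an explicit check. This is a sketch under stated assumptions: the modeled-clades JSON is assumed to expose its clade list under a top-level `clades` key, and `load_modeled_clades` and `summarize_oracle_check` are hypothetical helper names. The check works on plain Python lists, so it does not depend on polars.

```python
import json

EXPECTED_LOCATION_COUNT = 52  # 50 states + DC + PR, per the list above


def load_modeled_clades(path: str) -> set[str]:
    """Read the clade list from a round's JSON (assumes a top-level 'clades' key)."""
    with open(path, encoding="utf-8") as handle:
        return set(json.load(handle)["clades"])


def summarize_oracle_check(locations, clades, modeled_clades) -> dict:
    """Compare the values found in the oracle output with the round configuration."""
    seen_locations = set(locations)
    seen_clades = set(clades)
    modeled = set(modeled_clades)
    return {
        "location_count_ok": len(seen_locations) == EXPECTED_LOCATION_COUNT,
        "unexpected_clades": sorted(seen_clades - modeled),
        "missing_clades": sorted(modeled - seen_clades),
    }
```

With a polars DataFrame `df`, the inputs would be `df["location"].to_list()` and `df["clade"].to_list()`. If the hub uses a catch-all category that is not listed in the JSON, it will appear under `unexpected_clades` and should be reviewed rather than treated as an error.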

### Check Time-Series Data

```bash
# View time-series data (--with polars makes polars available to the one-off interpreter)
uv run --with polars python -c "
import polars as pl
df = pl.read_parquet('target-data/time-series/2024-10-09-variant-nowcast-hub-time-series.parquet')
print(df.head())
print(f'Date range: {df[\"target_date\"].min()} to {df[\"target_date\"].max()}')
"
```

**Expected output**:
- Historical sequence counts for model training
- Date range extending back from nowcast_date

---

## Step 5: Understanding Fallback Behavior

### When Fallback Activates

The fallback triggers when either of the following is true:
1. Nextstrain S3 doesn't have `metadata_version.json` for the requested date
2. The metadata URL resolved for that date is empty or invalid
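The two trigger conditions amount to a try-primary-then-fallback pattern. The sketch below illustrates that control flow only; it is not CladeTime's implementation, and `get_metadata_with_fallback`, `fetch_primary`, and `fetch_fallback` are hypothetical names. It assumes the primary lookup signals a missing version with `ValueError`, as in the error message shown in the Background section.

```python
from typing import Callable


def get_metadata_with_fallback(
    as_of: str,
    fetch_primary: Callable[[str], dict],
    fetch_fallback: Callable[[str], dict],
) -> tuple[dict, str]:
    """Return (metadata, source), where source is 's3' or 'hub'."""
    try:
        metadata = fetch_primary(as_of)
        if metadata:  # an empty result is treated like a missing one
            return metadata, "s3"
    except ValueError:
        pass  # primary source has no version for this date
    # Errors raised by the fallback propagate to the caller.
    return fetch_fallback(as_of), "hub"
```

Returning the source alongside the metadata makes it easy to log which path was taken.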

### Archive Search Strategy

1. **Try exact date match**: Check for `auxiliary-data/modeled-clades/YYYY-MM-DD.json`
2. **Search up to 30 days back**: If exact match not found, try each prior date
3. **Raise error if not found**: No archive within 30-day window

**Example timeline**:
```
Requested date: 2024-10-12 (Saturday, no round opened on this date)
  ↓
Try 2024-10-12.json → 404
  ↓
Try 2024-10-11.json → 404
  ↓
Try 2024-10-10.json → 404
  ↓
Try 2024-10-09.json → ✅ Found! (Wednesday round)
  ↓
Use metadata from 2024-10-09 archive
```
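The search strategy can be sketched as a loop over candidate dates. This version looks in a local checkout of `auxiliary-data/modeled-clades/` instead of making HTTP requests, so it illustrates the strategy rather than CladeTime's actual code; `find_nearest_archive` is a hypothetical name.

```python
from datetime import date, timedelta
from pathlib import Path


def find_nearest_archive(archive_dir: str, requested: str, max_days_back: int = 30) -> Path:
    """Return the archive for the requested date or the closest earlier one."""
    start = date.fromisoformat(requested)
    for offset in range(max_days_back + 1):  # offset 0 is the exact-date match
        candidate_date = start - timedelta(days=offset)
        candidate = Path(archive_dir) / f"{candidate_date.isoformat()}.json"
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(
        f"No archive found within {max_days_back} days of {requested}"
    )
```

For the timeline above, calling `find_nearest_archive("auxiliary-data/modeled-clades", "2024-10-12")` would return the path to `2024-10-09.json`, provided that file exists in the checkout.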

### Archive Availability

Variant-nowcast-hub archives are available for rounds starting:
- **First archive**: September 4, 2024 (2024-09-04.json)
- **Frequency**: Weekly (every Wednesday)
- **Location**: `auxiliary-data/modeled-clades/YYYY-MM-DD.json`

**Important**: Dates before September 2024 will fail if Nextstrain S3 also lacks data.

---

## Troubleshooting

### Error: "No archive found within 30 days"

**Cause**: Requested date is too far from any archived round (e.g., before September 2024)

**Solution**:
- Use a date ≥ September 4, 2024
- Or ensure Nextstrain S3 still has metadata for that date

### Error: "No version of sequences.fasta.zst found"

**Cause**: Nextstrain deleted historical sequence files (not just metadata)

**Solution**:
- Use more recent dates where sequence files still exist
- CladeTime fallback only covers metadata, not sequence files

### Clades Don't Match Round Configuration

**Cause**: Using wrong reference tree version

**Solution**:
- Verify `tree_as_of` parameter matches round opening date
- Check `auxiliary-data/modeled-clades/[round_id].json` has correct metadata
- The script should automatically read this from the JSON file

### Unexpected Clade Assignments

**Cause**: Reference tree version mismatch between round opening and evaluation

**Solution**:
- Always use `tree_as_of` from the round's modeled-clades metadata
- This ensures reproducibility even if Nextstrain updates clade definitions

---

## Monitoring Logs

When running workflows, monitor logs for fallback activation:

### Successful Fallback (Expected)
```json
{"event": "Nextstrain S3 metadata not available, will use Hub fallback", "date": "2024-10-09", "level": "warning"}
{"event": "Attempting fallback to variant-nowcast-hub archives", "date": "2024-10-09", "level": "info"}
{"event": "Successfully retrieved metadata from Hub fallback", "level": "info"}
```

### S3 Success (No Fallback Needed)
```json
{"event": "Retrieved ncov metadata from S3", "level": "info"}
```

### Fallback Failure (Action Required)
```json
{"event": "Hub fallback failed", "error": "No archive found within 30 days", "level": "error"}
{"event": "Both S3 and Hub fallback failed to retrieve metadata", "level": "warn"}
```
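When logs are long, a small script can report which of the three outcomes occurred. The sketch below matches on the event strings shown in the examples above; `classify_log` is a hypothetical helper, and lines that are not valid JSON are skipped.

```python
import json

FALLBACK_USED = "Successfully retrieved metadata from Hub fallback"
S3_USED = "Retrieved ncov metadata from S3"
FAILED = "Both S3 and Hub fallback failed to retrieve metadata"


def classify_log(lines) -> str:
    """Return 'failed', 'fallback', 's3', or 'unknown' for a run's log lines."""
    events = set()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON output mixed into the log
        if isinstance(record, dict):
            events.add(record.get("event"))
    # Failure takes precedence, then fallback, then the plain S3 path.
    if FAILED in events:
        return "failed"
    if FALLBACK_USED in events:
        return "fallback"
    if S3_USED in events:
        return "s3"
    return "unknown"
```

A typical use is `classify_log(open("run.log", encoding="utf-8"))`, where `run.log` is a placeholder for wherever the workflow output was saved.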

---

## Best Practices

### 1. Archive Maintenance

**Never delete** files in `auxiliary-data/modeled-clades/`:
- These are the fallback source for historical metadata
- Required for reproducible target data generation
- Since Nextstrain's October 2025 cleanup, CladeTime depends on them for historical dates

### 2. Testing New CladeTime Versions

Before deploying to production:
```bash
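# Back up the existing target data before it is overwritten
cp target-data/oracle-output/2024-09-11-variant-nowcast-hub-oracle.parquet \
   target-data/oracle-output/2024-09-11-variant-nowcast-hub-oracle.parquet.backup

# Test with a known historical round
uv run --with-requirements src/requirements.txt \
    src/get_target_data.py \
    --nowcast-date=2024-09-11

# Compare the regenerated file with the backup. Parquet is binary, so cmp
# only reports whether the bytes differ, not which rows changed.
cmp target-data/oracle-output/2024-09-11-variant-nowcast-hub-oracle.parquet \
    target-data/oracle-output/2024-09-11-variant-nowcast-hub-oracle.parquet.backup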
```

### 3. Workflow Re-runs

When re-running workflows for past rounds:
- Use the workflow dispatch feature rather than editing workflow schedules
- Specify exact `nowcast-date` to avoid ambiguity
- Monitor logs for any fallback warnings
- Verify generated files have expected structure and content

### 4. Dependency Updates

When updating CladeTime:
- Check CHANGELOG.md for breaking changes
- Test with at least one historical round before batch regeneration
- Once testing passes, pin `src/requirements.txt` to a specific version rather than leaving an open-ended `>=` range
- Document the update in variant-nowcast-hub's commit message

---

## Example: Complete Workflow for Round 2024-10-09

This example shows the complete process for generating target data for a specific round:

```bash
# 1. Ensure CladeTime v2.0.0+ is installed
cd /path/to/variant-nowcast-hub
uv pip list | grep cladetime
# Should show: cladetime 2.0.0 or higher

# 2. Verify round configuration exists
cat auxiliary-data/modeled-clades/2024-10-09.json
# Should show: clades list and metadata with reference tree version

# 3. Generate target data
uv run --with-requirements src/requirements.txt \
    src/get_target_data.py \
    --nowcast-date=2024-10-09

# 4. Verify oracle output was created
ls -lh target-data/oracle-output/2024-10-09-variant-nowcast-hub-oracle.parquet

# 5. Verify time-series data was created
ls -lh target-data/time-series/2024-10-09-variant-nowcast-hub-time-series.parquet

# 6. Quick data quality check
uv run --with polars python -c "
import polars as pl

# Check oracle output
oracle = pl.read_parquet('target-data/oracle-output/2024-10-09-variant-nowcast-hub-oracle.parquet')
print(f'Oracle rows: {len(oracle)}')
print(f'Locations: {oracle[\"location\"].n_unique()}')
print(f'Clades: {sorted(oracle[\"clade\"].unique())}')

# Check time-series
ts = pl.read_parquet('target-data/time-series/2024-10-09-variant-nowcast-hub-time-series.parquet')
print(f'Time-series rows: {len(ts)}')
print(f'Date range: {ts[\"target_date\"].min()} to {ts[\"target_date\"].max()}')
"

# 7. If everything looks good, commit the data
# git add target-data/oracle-output/2024-10-09-variant-nowcast-hub-oracle.parquet
# git add target-data/time-series/2024-10-09-variant-nowcast-hub-time-series.parquet
# git commit -m "Generate target data for round 2024-10-09"
# git push
```

---

## Summary

With CladeTime v2.0.0, variant-nowcast-hub can:
- ✅ Generate target data for any round ≥ September 2024
- ✅ Work transparently despite Nextstrain's October 2025 cleanup
- ✅ Maintain reproducibility with versioned reference trees
- ✅ Continue automated workflows without code changes

The fallback mechanism ensures reliable target data generation for model evaluation while preserving the integrity of historical analyses.

**Next Steps**:
1. Wait for CladeTime v2.0.0 release to PyPI
2. Update `src/requirements.txt` in variant-nowcast-hub
3. Test with a historical round (e.g., 2024-09-11)
4. Resume normal workflow operations

---

**Questions or Issues?**
- CladeTime: https://github.com/reichlab/cladetime/issues
- variant-nowcast-hub: https://github.com/reichlab/variant-nowcast-hub/issues

