Move Evan's weekly covid clade count data to the variant nowcast hub #446

@bsweger

Description

Background

Each Monday morning, this script in the get-covid-clade-counts repo creates a parquet file that summarizes SARS-CoV-2 sequences by clade, date, and location: https://github.com/reichlab/get-covid-clade-counts/blob/main/get_covid_clade_counts.py

The file is then written to the covid-clade-counts public S3 bucket, making it available to modelers. The bucket's contents can be browsed and downloaded via the AWS CLI, polars, pandas, or R.

For example, to list the files via the AWS CLI:

aws s3 ls covid-clade-counts/ --no-sign-request

For reference, here's a snippet of the data from 2025-04-07_covid_clade_counts.parquet:

┌─────────────┬────────────┬──────────┬───────┐
│ clade       ┆ date       ┆ location ┆ count │
│ ---         ┆ ---        ┆ ---      ┆ ---   │
│ str         ┆ date       ┆ str      ┆ u32   │
╞═════════════╪════════════╪══════════╪═══════╡
│ 21I         ┆ 2021-09-28 ┆ MD       ┆ 10    │
│ 23E         ┆ 2023-08-02 ┆ TN       ┆ 2     │
│ 23A         ┆ 2023-03-05 ┆ UT       ┆ 4     │
│ 21L         ┆ 2022-02-18 ┆ VA       ┆ 1     │
│ 22A         ┆ 2022-07-07 ┆ IN       ┆ 3     │
└─────────────┴────────────┴──────────┴───────┘

This process has been working well, but has a few issues:

  • Because the repo is otherwise inactive, GitHub disables the scheduled job every 60 days
  • The data are hard to find, since they live apart from the other variant-nowcast-hub data (keeping a separate repo was a deliberate decision that's no longer relevant)

The work

Assumption: unlike the covid-clade-counts repo, which only generates the files and writes them to S3 (it doesn't store them), we'll keep copies in both GitHub and S3 (since we're targeting the auxiliary-data folder). This makes the repo bigger, but lets us hook into the existing "sync to S3" process and doesn't require anyone to have AWS access.

We'd like to move this data to the auxiliary-data folder of the variant-nowcast-hub repo. There are two main parts to this work.

Move existing files

  1. Do a one-time copy or move of the files in covid-clade-counts to auxiliary-data/whatever-we-call-the-subfolder in the variant-nowcast-hub repo. Someone with the AWS CLI installed can do this from the root of the variant-nowcast-hub repo:

    aws s3 cp s3://covid-clade-counts/ ./auxiliary-data/new-folder --recursive --no-sign-request
  2. Create a PR for the new files and merge. This will trigger the "sync to S3" GitHub action and move the files to the covid-variant-nowcast-hub S3 bucket.
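Before opening the PR in step 2, it may be worth a quick sanity check that everything copied down matches the weekly naming scheme. A minimal sketch (the subfolder name is a placeholder for whatever the hub settles on):

```python
import re
from pathlib import Path

# Placeholder: substitute the subfolder name chosen in step 1.
folder = Path("auxiliary-data/new-folder")

# Weekly files are named like 2025-04-07_covid_clade_counts.parquet.
pattern = re.compile(r"^\d{4}-\d{2}-\d{2}_covid_clade_counts\.parquet$")

unexpected = [p.name for p in folder.glob("*") if not pattern.match(p.name)]
if unexpected:
    print("files that don't match the weekly naming scheme:", unexpected)
else:
    print("all files match")
```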

Move and schedule the script

Once the historic files are moved over:

  1. Move Evan's get_covid_clade_counts.py script to the src directory of variant-nowcast-hub
  2. Decide how to schedule/run it:
    • idea 1: Evan's script runs at 4:37 AM UTC each Monday, and our create_modeling_round workflow runs on Mondays at 3 AM UTC. We could merge Evan's script into our existing workflow, which would streamline operations (e.g., use the existing "open round" PR)
    • idea 2: Create a separate workflow in the variant-nowcast-hub to run Evan's script. This means an extra workflow and PR but would mitigate any risk that a problem with the script delays the "create clade list" job
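Idea 2 could look roughly like the sketch below. This is a hypothetical workflow file (the filename, script path, and carried-over 4:37 AM UTC cron are assumptions, not a final design):

```yaml
# .github/workflows/get-covid-clade-counts.yaml (hypothetical name)
name: Get COVID clade counts

on:
  schedule:
    - cron: "37 4 * * 1"  # Mondays at 4:37 AM UTC, matching the current job
  workflow_dispatch:       # allow manual runs

jobs:
  clade-counts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Run clade count script
        run: python src/get_covid_clade_counts.py  # path assumes step 1 above
      # A follow-up step would open a PR with the new parquet file; merging it
      # triggers the existing "sync to S3" action.
```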

Update documentation

Update src/README.md with information about this script (including the workflow section)
