Move Evan's weekly covid clade count data to the variant nowcast hub #446

@bsweger

Description

Background

Each Monday morning, this script in the get-covid-clade-counts repo creates a parquet file that summarizes SARS-CoV-2 sequences by clade, date, and location: https://github.com/reichlab/get-covid-clade-counts/blob/main/get_covid_clade_counts.py

The file is then written to the covid-clade-counts public S3 bucket, making it available to modelers. The bucket's contents can be browsed and downloaded via the AWS CLI, polars, pandas, or R.

For example, to list the files via the AWS CLI:

aws s3 ls covid-clade-counts/ --no-sign-request

For reference, here's a snippet of the data from 2025-04-07_covid_clade_counts.parquet:

┌─────────────┬────────────┬──────────┬───────┐
│ clade       ┆ date       ┆ location ┆ count │
│ ---         ┆ ---        ┆ ---      ┆ ---   │
│ str         ┆ date       ┆ str      ┆ u32   │
╞═════════════╪════════════╪══════════╪═══════╡
│ 21I         ┆ 2021-09-28 ┆ MD       ┆ 10    │
│ 23E         ┆ 2023-08-02 ┆ TN       ┆ 2     │
│ 23A         ┆ 2023-03-05 ┆ UT       ┆ 4     │
│ 21L         ┆ 2022-02-18 ┆ VA       ┆ 1     │
│ 22A         ┆ 2022-07-07 ┆ IN       ┆ 3     │
└─────────────┴────────────┴──────────┴───────┘

This process has been working well, but has a few issues:

  • Because the repo is otherwise inactive, GitHub disables the scheduled job every 60 days
  • The data are hard to find, since they live apart from the other variant-nowcast-hub data (keeping a separate repo was a deliberate decision that's no longer relevant)

The work

Assumption: unlike the covid-clade-counts repo, which only generates the files and writes them to S3 (it doesn't store them), we'll keep copies in both GitHub and S3 (since we're targeting the auxiliary-data folder). This makes the repo bigger, but lets us hook into the existing "sync to S3" process and doesn't require anyone to have AWS access.

We'd like to move this data to the auxiliary-data folder of the variant-nowcast-hub repo. There are two main parts to this work.

Move existing files

  1. Do a one-time copy or move of the files in covid-clade-counts to auxiliary-data/whatever-we-call-the-subfolder in the variant-nowcast-hub repo. Someone with the AWS CLI installed can do this from the root of the variant-nowcast-hub repo:

    aws s3 cp s3://covid-clade-counts/ ./auxiliary-data/new-folder --recursive --no-sign-request
  2. Create a PR for the new files and merge. This will trigger the "sync to S3" GitHub action and move the files to the covid-variant-nowcast-hub S3 bucket.
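Before opening the PR in step 2, it may be worth a quick sanity check that everything copied down matches the weekly naming scheme. A minimal sketch (the subfolder name is a placeholder for whatever the hub settles on):

```python
import re
from pathlib import Path

# Placeholder: substitute the subfolder name chosen in step 1.
folder = Path("auxiliary-data/new-folder")

# Weekly files are named like 2025-04-07_covid_clade_counts.parquet.
pattern = re.compile(r"^\d{4}-\d{2}-\d{2}_covid_clade_counts\.parquet$")

unexpected = [p.name for p in folder.glob("*") if not pattern.match(p.name)]
if unexpected:
    print("files that don't match the weekly naming scheme:", unexpected)
else:
    print("all files match")
```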

Move and schedule the script

Once the historic files are moved over:

  1. Move Evan's get_covid_clade_counts.py script to the src directory of variant-nowcast-hub
  2. Decide how to schedule/run it:
    • idea 1: Evan's script runs at 4:37 AM UTC each Monday, and our create_modeling_round workflow runs on Mondays at 3 AM UTC. We could merge Evan's script into our existing workflow, which would streamline operations (e.g., use the existing "open round" PR)
    • idea 2: Create a separate workflow in the variant-nowcast-hub to run Evan's script. This means an extra workflow and PR but would mitigate any risk that a problem with the script delays the "create clade list" job
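Idea 2 could look roughly like the sketch below. This is a hypothetical workflow file (the filename, script path, and carried-over 4:37 AM UTC cron are assumptions, not a final design):

```yaml
# .github/workflows/get-covid-clade-counts.yaml (hypothetical name)
name: Get COVID clade counts

on:
  schedule:
    - cron: "37 4 * * 1"  # Mondays at 4:37 AM UTC, matching the current job
  workflow_dispatch:       # allow manual runs

jobs:
  clade-counts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Run clade count script
        run: python src/get_covid_clade_counts.py  # path assumes step 1 above
      # A follow-up step would open a PR with the new parquet file; merging it
      # triggers the existing "sync to S3" action.
```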

Update documentation

Update src/README.md with information about this script (including the workflow section)
