Ingest workflow for CDC titer data #249

### Historical data

We currently receive weekly updates of the full database dump, which covers all tests since 2019-09-03. Fauna also holds older records that we should keep.

1. Export all data from fauna's cdc_tdb database

```shell
python3 ./tdb/download.py \
    --database cdc_tdb \
    --virus flu \
    --path data \
    --fstem all_cdc_tdb \
    --ftype json
```

2. Filter to all records before 2019-09-03

```shell
jq -rc '.[]| select(.assay_date < "2019-09-03")' data/all_cdc_tdb_titers.json \
    | augur curate passthru --output-metadata data/cdc_titers_pre_sept_2019.tsv
```

3. Save the historical data in the shared private GitHub repo as
`raw-data/CDC_titers_pre_sept_2019.tsv` (?)

### Proposed Workflow

1. Download the input files from the shared private GitHub repo. This will be a manual step at first. We might be able to automate it in the future, but that depends on whether we have permission to set up a token to fetch from the private repo automatically.

    - raw-data/CDC_titers_pre_sept_2019.tsv
    - raw-data/CDC_titers_sept_2019_onwards.tsv

2. Rename and subset columns in `raw-data/CDC_titers_sept_2019_onwards.tsv` to
the standard columns

    - subtype
    - virus_strain
    - serum_strain
    - serum_id
    - virus_passage
    - serum_passage
    - serum_host
    - titer
    - assay_type
    - assay_date
    - source
    - source_file
    - source_row
    - source_column - this will be empty for CDC since they already share a flat file
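
The renaming and subsetting in step 2 could be sketched with plain `awk`, as below. The raw header names (`Test Virus`, `Reference Serum`, `Titer Value`, `Lot`) are hypothetical placeholders, since the actual CDC column names are not listed here, and only three of the standard columns are shown to keep the example short.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy input with HYPOTHETICAL raw header names; replace with the real CDC headers.
workdir="$(mktemp -d)"
printf 'Test Virus\tReference Serum\tTiter Value\tLot\n' > "$workdir/raw.tsv"
printf 'A/Test/1/2020\tA/Ref/2/2019\t160\tX1\n' >> "$workdir/raw.tsv"

# The map lists "raw name=standard name" pairs in the desired output order.
# Raw columns that are not listed (here: Lot) are dropped.
awk -F'\t' -v OFS='\t' '
  BEGIN {
    n = split("Test Virus=virus_strain,Reference Serum=serum_strain,Titer Value=titer", pairs, ",")
    for (i = 1; i <= n; i++) { split(pairs[i], kv, "="); raw[i] = kv[1]; std[i] = kv[2] }
  }
  NR == 1 {
    # Record the position of every raw column, then emit the standard header.
    for (c = 1; c <= NF; c++) pos[$c] = c
    for (i = 1; i <= n; i++) {
      if (!(raw[i] in pos)) { print "missing column: " raw[i] > "/dev/stderr"; exit 1 }
      printf "%s%s", std[i], (i < n ? OFS : ORS)
    }
    next
  }
  # Emit the selected columns of each data row in the standard order.
  { for (i = 1; i <= n; i++) printf "%s%s", $(pos[raw[i]]), (i < n ? OFS : ORS) }
' "$workdir/raw.tsv" > "$workdir/standard.tsv"
```

Looking columns up by header name rather than by position means the script fails loudly if an expected column disappears from the weekly dump, instead of silently shifting values into the wrong field.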

3. Concatenate the 2 input files
4. Curate - this should be designed to be shared across all CCs' titers
    - pull out curations from [tdb/cdc_upload](https://github.com/nextstrain/fauna/blob/master/tdb/cdc_upload.py)
    - will eventually add curations from [tdb/elife_upload](https://github.com/nextstrain/fauna/blob/master/tdb/elife_upload.py) for other CCs
    - make sure to use the same strain name fixes as the sequence data ingest workflow
5. Split titer data by subtype, source, passage, assay_type, host
6. Upload split files to private S3 bucket
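
As a rough illustration of steps 3 and 5, the sketch below concatenates two TSVs and then splits the result into one file per group. It assumes both inputs already share identical columns in the same order. The toy data, the reduced column set, the choice of only two split keys (`subtype` and `assay_type`), and the `<subtype>_<assay_type>_titers.tsv` naming are all placeholders rather than a decided layout.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"
# Toy inputs with a reduced subset of the standard columns (illustrative only).
printf 'subtype\tassay_type\ttiter\n' > "$workdir/pre.tsv"
printf 'h3n2\thi\t80\n' >> "$workdir/pre.tsv"
printf 'subtype\tassay_type\ttiter\n' > "$workdir/post.tsv"
printf 'h3n2\tfra\t160\nh1n1pdm\thi\t320\n' >> "$workdir/post.tsv"

# Step 3: concatenate, keeping the header from the first file only.
# Refuse to continue if the headers differ, since rows would be misaligned.
if [ "$(head -n 1 "$workdir/pre.tsv")" != "$(head -n 1 "$workdir/post.tsv")" ]; then
  echo "headers differ; align columns before concatenating" >&2
  exit 1
fi
{ cat "$workdir/pre.tsv"; tail -n +2 "$workdir/post.tsv"; } > "$workdir/all.tsv"

# Step 5: write one file per (subtype, assay_type) pair, each with its own header.
# Values are used directly in file names, so they must not contain "/".
mkdir -p "$workdir/split"
awk -F'\t' -v outdir="$workdir/split" '
  NR == 1 { header = $0; for (c = 1; c <= NF; c++) pos[$c] = c; next }
  {
    out = outdir "/" $(pos["subtype"]) "_" $(pos["assay_type"]) "_titers.tsv"
    # The first row seen for a group writes the header to its file.
    if (!(out in seen)) { print header > out; seen[out] = 1 }
    print > out
  }
' "$workdir/all.tsv"
```

Extending the split to source, passage, and host would mean adding those columns to the output path in the same way. With many groups, some `awk` implementations may hit a limit on simultaneously open files, in which case sorting by the keys first and calling `close()` on each finished file is one option.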

