Ingest workflow for CDC titer data #249

@joverlee521

Historical data

We currently receive weekly updates of the full database dump, which covers all tests since 2019-09-03, but fauna holds older records that we should keep.

  1. Export all data from fauna's cdc_tdb database

    python3 ./tdb/download.py \
        --database cdc_tdb \
        --virus flu \
        --path data \
        --fstem all_cdc_tdb \
        --ftype json

  2. Filter to all records before 2019-09-03

    jq -rc '.[] | select(.assay_date < "2019-09-03")' data/all_cdc_tdb_titers.json \
        | augur curate passthru --output-metadata data/cdc_titers_pre_sept_2019.tsv

  3. Save the historical data in the shared private GitHub repo as
    raw-data/CDC_titers_pre_sept_2019.tsv (?)
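
For comparison, the date filter above can also be sketched in Python with only the standard library. This is a minimal sketch, not the agreed implementation: it assumes the fauna export is a JSON array of flat records and that `assay_date` is an ISO-formatted `YYYY-MM-DD` string, so plain string comparison follows date order. The names `CUTOFF`, `load_records`, `filter_historical`, and `write_tsv` are hypothetical.

```python
import csv
import json

CUTOFF = "2019-09-03"  # first assay date covered by the weekly CDC dump


def load_records(path):
    """Load the exported JSON array of titer records."""
    with open(path) as handle:
        return json.load(handle)


def filter_historical(records, cutoff=CUTOFF):
    """Keep records whose assay_date sorts strictly before the cutoff.

    Assumption of this sketch: records with a missing or non-string
    assay_date are dropped rather than kept.
    """
    kept = []
    for record in records:
        assay_date = record.get("assay_date")
        if isinstance(assay_date, str) and assay_date < cutoff:
            kept.append(record)
    return kept


def write_tsv(records, path):
    """Write records to a TSV whose header is the union of all record keys."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(records)
```

Usage would be along the lines of `write_tsv(filter_historical(load_records("data/all_cdc_tdb_titers.json")), "data/cdc_titers_pre_sept_2019.tsv")`. One difference worth checking before relying on either version: jq orders `null` before any string, so the jq filter keeps records that lack an `assay_date`, while this sketch drops them.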

Proposed Workflow

  1. Download input files from the shared private GitHub repo. This will be a manual step at first. We might be able to automate it in the future, but I think that depends on whether we have permission to set up a token to automatically fetch from the private repo.

    • raw-data/CDC_titers_pre_sept_2019.tsv
    • raw-data/CDC_titers_sept_2019_onwards.tsv
  2. Rename and subset columns in raw-data/CDC_titers_sept_2019_onwards.tsv to
    the standard columns

    • subtype
    • virus_strain
    • serum_strain
    • serum_id
    • virus_passage
    • serum_passage
    • serum_host
    • titer
    • assay_type
    • assay_date
    • source
    • source_file
    • source_row
    • source_column - this will be empty for CDC since they already share a flat file
  3. Concatenate the 2 input files

  4. Curate - this should be designed so it can be shared across titers from all CCs

    • pull out curations from tdb/cdc_upload
    • will eventually add curations from tdb/elife_upload for other CCs
    • make sure to use the same strain name fixes as the sequence data ingest workflow
  5. Split titer data by subtype, source, passage, assay_type, host

  6. Upload split files to private S3 bucket
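
Steps 2, 3, and 5 above could be sketched roughly as follows. This is an illustration under stated assumptions, not the planned implementation. The raw CDC column headers are not listed in this issue, so `EXAMPLE_COLUMN_MAP` uses placeholder names. The choice of `virus_passage` and `serum_host` for the "passage" and "host" split keys is also a guess. Curation (step 4) and the S3 upload (step 6) are left out.

```python
import csv
from collections import defaultdict

# Standard columns as listed in step 2 of the proposed workflow.
STANDARD_COLUMNS = [
    "subtype", "virus_strain", "serum_strain", "serum_id", "virus_passage",
    "serum_passage", "serum_host", "titer", "assay_type", "assay_date",
    "source", "source_file", "source_row", "source_column",
]

# Hypothetical mapping from raw CDC headers to standard names; the real
# headers are not given in this issue, so these keys are placeholders.
EXAMPLE_COLUMN_MAP = {"Test Virus": "virus_strain", "Titer": "titer"}

# Assumption: "passage" and "host" in step 5 refer to these columns.
SPLIT_KEYS = ("subtype", "source", "virus_passage", "assay_type", "serum_host")


def read_tsv(path):
    """Read a TSV file into a list of dict rows."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def standardize(rows, column_map):
    """Rename raw columns and keep only the standard ones (step 2)."""
    standardized = []
    for row in rows:
        renamed = {column_map.get(key, key): value for key, value in row.items()}
        # Columns absent from the input (e.g. source_column) become empty.
        standardized.append({col: renamed.get(col, "") for col in STANDARD_COLUMNS})
    return standardized


def concatenate(*tables):
    """Combine already-standardized tables into one list of rows (step 3)."""
    return [row for table in tables for row in table]


def split_rows(rows, keys=SPLIT_KEYS):
    """Group rows by their values for the split keys (step 5)."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[key] for key in keys)].append(row)
    return dict(groups)


def write_tsv(rows, path):
    """Write standardized rows to a TSV with the standard header."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=STANDARD_COLUMNS, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
```

In this sketch the historical file would also go through `standardize` with an empty mapping, so both inputs share the same header before `concatenate`. Each group returned by `split_rows` would then be written out with `write_tsv` under a file name derived from its key.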
