Open
Description
Historical data
We currently receive weekly updates of the full database dump, which covers all tests since 2019-09-03, but fauna holds older records that we should keep.
- Export all data from fauna's cdc_tdb database
python3 ./tdb/download.py \
--database cdc_tdb \
--virus flu \
--path data \
--fstem all_cdc_tdb \
--ftype json
- Filter to all records before 2019-09-03
jq -rc '.[]| select(.assay_date < "2019-09-03")' data/all_cdc_tdb_titers.json \
| augur curate passthru --output-metadata data/cdc_titers_pre_sept_2019.tsv
- Save the historical data in the shared private GitHub repo as raw-data/CDC_titers_pre_sept_2019.tsv (?)
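For reference, the date filter in the jq step above can be sketched in Python. This is a minimal sketch, assuming each exported record is a JSON object with an ISO-formatted `assay_date` string, as the jq string comparison implies. The function name `filter_before` and the `keep_undated` switch are hypothetical and not part of fauna. One thing worth checking: jq orders `null` before any string, so the jq command keeps records whose `assay_date` is missing or null (if any exist), whereas this sketch drops them unless told otherwise.

```python
CUTOFF = "2019-09-03"  # first date covered by the weekly dumps, per the text above


def filter_before(records, cutoff=CUTOFF, keep_undated=False):
    """Return the records whose assay_date sorts strictly before cutoff.

    ISO dates (YYYY-MM-DD) compare correctly as plain strings. Records with a
    missing or null assay_date are kept only when keep_undated is True
    (hypothetical switch; True mirrors what the jq filter does).
    """
    kept = []
    for record in records:
        assay_date = record.get("assay_date")
        if assay_date is None:
            if keep_undated:
                kept.append(record)
        elif assay_date < cutoff:
            kept.append(record)
    return kept


# Made-up records for illustration only.
sample = [
    {"virus_strain": "A/Example/1/2018", "assay_date": "2018-05-01"},
    {"virus_strain": "A/Example/2/2019", "assay_date": "2019-09-03"},
    {"virus_strain": "A/Example/3/2017", "assay_date": None},
]
historical = filter_before(sample)
```

Whether undated records belong in the historical file is an open question; the switch just makes the choice explicit.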
Proposed Workflow
- Download input files from the shared private GitHub repo. This will be a manual step at first. We might be able to automate it later, but that depends on whether we have permission to set up a token to fetch from the private repo automatically.
  - raw-data/CDC_titers_pre_sept_2019.tsv
  - raw-data/CDC_titers_sept_2019_onwards.tsv
- Rename and subset columns in raw-data/CDC_titers_sept_2019_onwards.tsv to the standard columns:
  - subtype
  - virus_strain
  - serum_strain
  - serum_id
  - virus_passage
  - serum_passage
  - serum_host
  - titer
  - assay_type
  - assay_date
  - source
  - source_file
  - source_row
  - source_column - this will be empty for CDC since they already share a flat file
- Concatenate the two input files
- Curate - this step should be designed so it can be shared across titers from all CCs
  - pull out curations from tdb/cdc_upload
  - will eventually add curations from tdb/elife_upload for other CCs
  - make sure to use the same strain name fixes as the sequence data ingest workflow
- Split titer data by subtype, source, passage, assay_type, host
- Upload the split files to the private S3 bucket
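The rename/subset, concatenate, and split steps above could look roughly like the following. This is only a sketch under stated assumptions: the raw column names in `COLUMN_MAP` are placeholders (the actual headers of the CDC flat file are not given here), the historical file is assumed to already use the standard column names, and `SPLIT_KEYS` guesses that "passage" and "host" mean `virus_passage` and `serum_host`. The helper names `read_tsv`, `standardize`, and `split_rows` are hypothetical.

```python
import csv
from collections import defaultdict

# Standard columns, as listed in the workflow above.
STANDARD_COLUMNS = [
    "subtype", "virus_strain", "serum_strain", "serum_id", "virus_passage",
    "serum_passage", "serum_host", "titer", "assay_type", "assay_date",
    "source", "source_file", "source_row", "source_column",
]

# PLACEHOLDER raw -> standard names; the real CDC headers are an assumption.
COLUMN_MAP = {"Subtype": "subtype", "Test Virus": "virus_strain", "Titer": "titer"}

# ASSUMPTION: "passage" and "host" refer to virus_passage and serum_host.
SPLIT_KEYS = ("subtype", "source", "virus_passage", "assay_type", "serum_host")


def read_tsv(path):
    """Read a TSV file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def standardize(rows, column_map):
    """Rename columns via column_map, then keep only the standard columns.

    Standard columns absent from a row are filled with an empty string.
    """
    standardized = []
    for row in rows:
        renamed = {column_map.get(name, name): value for name, value in row.items()}
        standardized.append({col: renamed.get(col, "") for col in STANDARD_COLUMNS})
    return standardized


def split_rows(rows, keys=SPLIT_KEYS):
    """Group rows by the tuple of values found in the given key columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[key] for key in keys)].append(row)
    return dict(groups)


# Made-up rows standing in for the two input files.
historical_rows = [{"subtype": "h3n2", "virus_strain": "A/Example/1/2018",
                    "titer": "160", "source": "cdc", "assay_type": "hi"}]
recent_rows = [{"Subtype": "h3n2", "Test Virus": "A/Example/2/2020",
                "Titer": "320", "Unused": "x"}]

# Concatenate after bringing both inputs to the same column set.
combined = standardize(historical_rows, {}) + standardize(recent_rows, COLUMN_MAP)
groups = split_rows(combined)
```

In the real workflow the rows would come from `read_tsv` on the two raw-data files, and the curation step would run between concatenation and splitting. Each group could then be written to its own file before upload.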
huddlej and j23414