Intro
- Problem statement:
  - NCBI records vary in quality
  - not available for download as a single data set
  - annotation is inconsistent or difficult to piece together
- Previous 16S data sets
  - RDP
  - Greengenes
  - NCBI BioProject?
  - SILVA
  - 16S-ITGDB - https://www.frontiersin.org/journals/bioinformatics/articles/10.3389/fbinf.2022.905489/full
  - GSR-DB - https://journals.asm.org/doi/10.1128/msystems.00950-23
- Summarize ya16sdb features
  - annotation
  - outlier detection (includes Plotly website)
  - sequence subsets by confidence
Methods
- ...
Results/Discussion
- Record counts in each category (16S genes, whole genomes, taxcheck pass vs. fail, RefSeq, reference sequences)
- Outlier detection and taxcheck outcomes for each subset
- Discrepancies between taxcheck and outlier detection
- Maybe: are there any predictors of outliers (e.g., by year, source, etc.)?
TODOs
- Start a group Zotero library (YM)
- Gather literature (group)
- Chris: begin methods in README or elsewhere in repo
- Create OneDrive doc for MS (NH)
- Start authoring problem statement (NH)