uniprot-pipeline

Imports external db ids, protein objects and sequences from UniProtKB.

PROTEIN LOADER

load protein_to_gene associations (into RGD_ASSOCIATIONS table)
load secondary uniprot xdb ids (into RGD_ACC_XDB table); secondary accession ids are imported as xdb ids with XDB_KEY=60
load protein sequences (into RGD_SEQUENCES table):
- the canonical protein sequences (available in the incoming data) are loaded with seq_type 'uniprot_seq'
- old protein sequences are stored with seq_type 'old_uniprot_seq'; one is created whenever the incoming sequence differs from the sequence in the database; taken together, the sequences of type 'old_uniprot_seq' are considered the 'consensus sequence history'
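
The sequence rule above can be sketched roughly as follows. This is an illustration only, not the pipeline's code: the `SeqRow` class, the `load_sequence` function and the in-memory list standing in for RGD_SEQUENCES are all hypothetical, and the sketch assumes that the superseded sequence is kept by relabelling it as 'old_uniprot_seq'.

```python
from dataclasses import dataclass


@dataclass
class SeqRow:
    """Hypothetical stand-in for one row of RGD_SEQUENCES."""
    protein_acc: str
    seq_type: str
    sequence: str


def load_sequence(rows, protein_acc, incoming_seq):
    """Apply the incoming canonical sequence for one protein.

    Returns 'inserted', 'unchanged' or 'replaced'.
    """
    current = next(
        (r for r in rows
         if r.protein_acc == protein_acc and r.seq_type == "uniprot_seq"),
        None,
    )
    if current is None:
        rows.append(SeqRow(protein_acc, "uniprot_seq", incoming_seq))
        return "inserted"
    if current.sequence == incoming_seq:
        return "unchanged"
    # Assumption: the previous sequence is preserved as history.
    current.seq_type = "old_uniprot_seq"
    rows.append(SeqRow(protein_acc, "uniprot_seq", incoming_seq))
    return "replaced"
```

Under these assumptions, loading the same sequence twice leaves the table unchanged, while loading a different one leaves a single 'uniprot_seq' row plus one more 'old_uniprot_seq' row.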

FILE PARSER

processing of AC lines:
- the primary UniProt acc id is the first acc id on the first AC line
- when a record has multiple AC lines, the other AC lines must be ignored
processing of OS lines (species):
- logic must be aware that an OS value can span multiple lines
processing of Gene3D accession ids:
- they should be loaded just as they are; in the past, however, Gene3D acc ids were in the format 'G3DSA:XX.XX.XX.XX'
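
A minimal sketch of the AC and OS rules above, assuming the UniProtKB flat-file layout in which each line starts with a two-letter code followed by three spaces. The function name `parse_record` and the sample accessions are made up for illustration; treating the remaining ids on the first AC line as secondary accessions is also an assumption.

```python
def parse_record(lines):
    """Extract primary acc id, secondary acc ids and species from one record."""
    primary = None
    secondary = []
    os_parts = []
    seen_ac = False
    for line in lines:
        code, value = line[:2], line[5:].strip()
        if code == "AC":
            if seen_ac:
                continue  # only the first AC line is used
            seen_ac = True
            ids = [acc.strip() for acc in value.split(";") if acc.strip()]
            if ids:
                primary, secondary = ids[0], ids[1:]
        elif code == "OS":
            # an OS value may continue over several OS lines
            os_parts.append(value)
    species = " ".join(os_parts).rstrip(".")
    return primary, secondary, species


record = [
    "ID   EXAMPLE_RAT  Reviewed;  100 AA.",
    "AC   P12345; Q00001;",
    "AC   Q00002;",
    "OS   Rattus norvegicus",
    "OS   (Rat).",
]
print(parse_record(record))
```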

added processing of chinchilla, bonobo, dog and squirrel

fixed matching logic to exclude self-matching (matching by UniProtKB ids brought in by UniProtKB pipeline in the past)

deletion of UniProt id results in insertion of an 'old_protein_id' alias

MATCHING

matching algorithm change:

incoming data is matched to RGD data in the order listed below; matching is terminated as soon as there is some matching data
OLD logic: RGD, GeneId, HGNC, MGI, UniProt, RefSeq Nucl, Ensembl
NEW logic: GeneId, HGNC, MGI, RefSeq Nucl, UniProt, Ensembl, UniProt GeneName
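
The "first source that matches wins" rule can be sketched as below. The order is taken from the NEW logic line; everything else (the `match_genes` function, the `lookup` callback and the shape of `incoming_ids`) is hypothetical and only illustrates the early termination.

```python
NEW_ORDER = [
    "GeneId", "HGNC", "MGI", "RefSeq Nucl",
    "UniProt", "Ensembl", "UniProt GeneName",
]


def match_genes(incoming_ids, lookup, order=NEW_ORDER):
    """Return (source, gene ids) for the first source that yields a match.

    incoming_ids: dict mapping a source name to a list of ids from the record.
    lookup: callable (source, id) -> set of matching RGD gene ids.
    """
    for source in order:
        hits = set()
        for acc in incoming_ids.get(source, []):
            hits |= lookup(source, acc)
        if hits:
            return source, hits  # stop at the first source with any match
    return None, set()
```

A result with more than one gene id corresponds to the 'multis' case, which needs separate handling.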

improved handling of multis:

where one UniProtKB id matches multiple genes in RGD, the previous logic did not properly handle up to 0.1% of UniProt ids

discontinued ids_merged.log

the ids_inserted and ids_deleted logs written until June 22, 2015 contain a lot of duplication: due to a code flaw, some entries were unnecessarily inserted, then deleted during the next pipeline run, and so on; therefore, as of June 22, 2015, these logs are discontinued, and inserted.log and deleted.log are used instead

PROTEIN DOMAINS

protein domains are identified by name and loaded into the database

if a protein has a given domain appearing multiple times, the occurrence number is added to the domain name; e.g. protein P58365 has the Cadherin domain appearing 27 times, so the domain names look like this: 'Cadherin 1', 'Cadherin 2', ... 'Cadherin 27'; we want to keep a clean domain name, and therefore we strip the last part and keep only the domain name, e.g. 'Cadherin'
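
One possible way to strip the occurrence number is a regular expression that removes a trailing whitespace-plus-digits suffix. This is a sketch, not the pipeline's implementation; `clean_domain_name` is a hypothetical helper.

```python
import re

# trailing occurrence number, e.g. the ' 27' in 'Cadherin 27'
_OCCURRENCE = re.compile(r"\s+\d+$")


def clean_domain_name(name):
    """Drop a trailing occurrence number from a domain name."""
    return _OCCURRENCE.sub("", name.strip())


print(clean_domain_name("Cadherin 27"))  # Cadherin
print(clean_domain_name("Cadherin"))     # Cadherin
```

Names in which digits are attached to a word (no preceding whitespace) are left untouched, but a domain whose real name ends in a separate number would also be shortened, so such names would need an exception list.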
