Modify SuppKG parser to better deal with fake UMLS IDs #220

@andrewsu

Description

We created an API for SuppKG in #55 and biothings/biothings_explorer#706. We previously noted that SuppKG created UMLS-like identifiers (which have the format "DCXXXXXXX" instead of "CXXXXXXX"). At the time, we decided to treat them as if they were UMLS IDs, but that is now producing some confusing results (e.g., https://github.com/NCATSTranslator/Feedback/issues/836), so it's time to adjust this behavior.

Vlado helped map these fake UMLS "DC" IDs to more common identifiers; the results are in supp_kg_chem_nodes.txt. To summarize, there were 56636 IDs for SuppKG nodes, 53707 of which start with "C" -- we assume these are valid UMLS IDs. Of the remaining 2928 IDs, which start with "DC", Vlado mapped 841 to CHEBI, CID, UNII, MESH, etc. In our parser script, let's replace the "DC" IDs of these 841 nodes with their mapped identifiers in our API. For the remaining 2087 nodes for which Vlado could not find mappings, let's delete the records that use those IDs from our API.
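A rough sketch of the replace-or-drop rule, to make the intended parser behavior concrete. This is not the actual parser code. It assumes the mapping file is tab-separated, with the node ID in the first column and the mapped identifiers pipe-separated in the last column (e.g. `CHEBI:123|CID:456`). The function names, the `subject`/`object` record keys, and the choice to take the first listed identifier when a node maps to several are all placeholders for illustration.

```python
def load_dc_mappings(lines):
    """Return {fake "DC" ID: [mapped identifiers]} for DC nodes with a mapping.

    Assumed layout: tab-separated, node ID first, pipe-separated
    prefixed identifiers last.
    """
    mappings = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if not cols[0].startswith("DC"):
            continue
        curies = [c for c in cols[-1].split("|") if ":" in c]
        if curies:
            mappings[cols[0]] = curies
    return mappings


def resolve_id(node_id, mappings):
    """Return the ID to use in the API, or None if the record should be dropped."""
    if not node_id.startswith("DC"):
        return node_id  # assumed to be a valid UMLS ID, left untouched
    curies = mappings.get(node_id)
    if not curies:
        return None  # fake ID with no mapping
    return curies[0]  # placeholder choice when several identifiers are mapped


def remap_records(records, mappings):
    """Yield records with DC IDs replaced; skip records using unmapped DC IDs."""
    for rec in records:
        subj = resolve_id(rec["subject"], mappings)  # hypothetical key
        obj = resolve_id(rec["object"], mappings)    # hypothetical key
        if subj is None or obj is None:
            continue
        yield {**rec, "subject": subj, "object": obj}
```

How to pick among multiple mapped identifiers (the 262 multi-mapped nodes) is left open here and would need to be decided separately.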

An analysis of the namespaces used for the 841 mapped IDs (262 of which are mapped to multiple identifiers):

$ grep '^D' supp_kg_chem_nodes.tsv  | gawkt '$3>0{print $NF}' | tr '|' '\n' | sed 's/:.*//' | sort | uniq -c | sort -k1nr
    626 CHEBI
    298 CID
    181 UNII
     78 MESH
     38 ChEMBL
     19 PHARMGKB.CHEMICAL
      6 CHEMBL.TARGET
      3 HMDB
      2 CAS
      2 DrugBank
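For anyone who would rather reproduce the tally from the parser side, here is a rough Python equivalent of the pipeline above. It assumes `gawkt` is a tab-delimited `gawk`, that the third column holds the number of mappings, and that the last column holds the pipe-separated identifiers. The function name is made up for illustration.

```python
from collections import Counter


def count_namespaces(lines):
    """Count identifier prefixes for rows whose ID starts with "D" and that have mappings."""
    counts = Counter()
    for line in lines:
        if not line.startswith("D"):
            continue
        cols = line.rstrip("\n").split("\t")
        try:
            n_mapped = int(cols[2])  # assumed: third column is the mapping count
        except (IndexError, ValueError):
            continue
        if n_mapped <= 0:
            continue
        for curie in cols[-1].split("|"):
            # keep only the namespace, i.e. everything before the first ":"
            counts[curie.split(":", 1)[0]] += 1
    return counts.most_common()
```

Called on the open file (`with open(path) as fh: count_namespaces(fh)`), it returns `(namespace, count)` pairs sorted by descending count.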

Metadata

Assignees

Labels

data source (Data source pending to create a new API), enhancement (New feature or request)

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
