Description
We created an API for SuppKG in #55 and biothings/biothings_explorer#706. We previously noted that SuppKG created UMLS-like identifiers (which have the format "DCXXXXXXX" instead of "CXXXXXXX"). At the time, we decided to treat them as if they were UMLS IDs, but that is now producing some confusing results (e.g., https://github.com/NCATSTranslator/Feedback/issues/836), so it's time to adjust this behavior.
Vlado helped map these fake UMLS "DC" IDs to more common identifiers, the results of which are in supp_kg_chem_nodes.txt. To summarize those results, there were 56636 IDs for SuppKG nodes, 53707 of which start with "C" -- we assume these are valid UMLS. Of the remaining 2928, whose IDs start with "DC", Vlado mapped 841 to CHEBI, CID, UNII, MESH, etc. In our parser script, let's replace the "DC" IDs of these 841 nodes with their mapped identifiers in our API. For the remaining 2087 nodes for which Vlado could not find mappings, let's delete the records that use those IDs from our API.
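A rough sketch of how the parser could apply this, not the actual parser code. It assumes the mapping file is tab-separated, with the SuppKG node ID in the first column and a `|`-separated list of mapped identifiers in the last column (empty when unmapped). The function names and the sample rows are made up for illustration. How to pick one identifier when a node has several mappings is not decided here, so the sketch simply returns all of them.

```python
import csv
import io


def load_dc_mappings(handle):
    """Return {DC id: [mapped CURIEs]} from a tab-separated mapping file.

    Assumed layout (hypothetical): first column = SuppKG node ID,
    last column = '|'-separated mapped identifiers, empty if unmapped.
    """
    mappings = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or not row[0].startswith("DC"):
            continue
        # Keep only values that look like prefixed identifiers (PREFIX:ID).
        mappings[row[0]] = [c for c in row[-1].split("|") if ":" in c]
    return mappings


def resolve_id(node_id, mappings):
    """Return the identifiers to use for a node, or [] if it should be dropped."""
    if not node_id.startswith("DC"):
        return [node_id]  # "C..." IDs are assumed to be valid UMLS
    return mappings.get(node_id, [])  # unmapped "DC" IDs resolve to nothing


# Made-up sample rows, only to show the behavior.
sample = (
    "C0000001\tsome name\t0\t\n"
    "DC0000002\tsome name\t2\tCHEBI:1|CID:2\n"
    "DC0000003\tsome name\t0\t\n"
)
mappings = load_dc_mappings(io.StringIO(sample))
kept = resolve_id("DC0000002", mappings)     # mapped: replaced
dropped = resolve_id("DC0000003", mappings)  # unmapped: record is deleted
```

A record would then be deleted whenever either of its nodes resolves to an empty list.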
An analysis of the namespaces used for the 841 (262 are mapped to multiple identifiers):
$ grep '^D' supp_kg_chem_nodes.tsv | gawkt '$3>0{print $NF}' | tr '|' '\n' | sed 's/:.*//' | sort | uniq -c | sort -k1nr
626 CHEBI
298 CID
181 UNII
78 MESH
38 ChEMBL
19 PHARMGKB.CHEMICAL
6 CHEMBL.TARGET
3 HMDB
2 CAS
2 DrugBank
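For anyone who wants to reproduce this count without the shell pipeline, here is a rough Python equivalent. It assumes `gawkt` is a tab-delimited `gawk` alias, that the third column is a numeric mapping count, and that the last column holds the `|`-separated identifiers. The function name is made up, and unlike the pipeline it skips empty entries.

```python
from collections import Counter


def count_namespaces(lines):
    """Count identifier prefixes for mapped rows whose ID starts with 'D'."""
    counts = Counter()
    for line in lines:
        if not line.startswith("D"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        try:
            n_mapped = float(fields[2])  # assumed: third column is a count
        except ValueError:
            continue
        if n_mapped <= 0:
            continue
        for curie in fields[-1].split("|"):
            if curie:
                # Everything before the first ':' is the namespace.
                counts[curie.split(":", 1)[0]] += 1
    return counts
```

Note that the counts above sum to 1253 rather than 841, because a node mapped to several identifiers is counted once per identifier.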