DiscOmicsPS

Introduction

DiscOmicsPS is a platform that enables the computational analysis of high-throughput omics data, focusing on the discovery of new biomarkers and therapeutic targets from the multiparametric analysis of biological and clinical correlates.

Note

Developed by Jure Tica and Athanasios Didangelos. The app is a prototype; use it at your own discretion.

DiscOmicsPS consists of three functional modules whose common objective is to perform high-throughput text-mining.

  1. Proteolysis: retrieves research publications that provide evidence for the proteolytic cleavage of each input gene.
  2. Biomarker: retrieves publications that implicate each of the input genes as a potential biomarker.
  3. Custom: can be fully customised by the user, and retrieves publications that link each of the input genes to any keywords of interest.

A published example obtained with DiscOmicsPS is illustrated below. The genes that are significantly regulated (down or up) after spinal cord injury are plotted in terms of their:

  1. Network centrality on the y-axis; a measure of biological importance.
  2. Druggability on the x-axis; number of drugs found to target the genes.
  3. Novelty as data point size; number of articles retrieved against the 'spinal cord injury' keyword in the Custom module.

Note

The druggability functionality is currently disabled because of recent changes to the DGIdb API. This will be reintroduced at a later stage if there is sufficient interest.

User Interface

After a query is completed, the DiscOmicsPS interface consists of:

  • Left: protein/protease tables, where a selection updates the central articles table and the details pane on the right.
  • Centre: articles table, where a selection updates the details pane on the right.
  • Right: protein and article details pane, showing details based on the last selection in the corresponding tables.

Right-clicking the table entries gives additional options.

DiscOmicsPS Queries

DiscOmicsPS retrieves its data through the REST APIs of HGNC, STRING, DGIdb, UniProt, Europe PMC, and PubMed.

Starting a query with a list of genes performs the following steps:

  1. Gene information retrieval for each gene.
  2. Gene nomenclature expansion with in-built algorithms.
  3. Literature queries for each of the input genes against the keywords of interest.
  4. Retrieved articles filtering with in-built algorithms.
  5. Protein-protein interaction networks retrieval.
  6. Gene 'novelty' score calculation.

Input Gene List

A new query can be started from the File -> New Search ... menu. DiscOmicsPS accepts a list of whitespace-separated gene identifiers (e.g. VCAN), UniProt identifiers (e.g. P13611) or Ensembl identifiers (e.g. ENSG00000038427). The input list can mix identifier types; each gene is treated separately.
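As a rough illustration of how a mixed input list might be split and typed, the sketch below classifies each whitespace-separated token with regular expressions. The patterns and the function names (`classify_identifier`, `parse_gene_list`) are assumptions made for this example and are not taken from the DiscOmicsPS source. The UniProt pattern follows the accession format published by UniProt, and the Ensembl pattern covers only human, mouse and rat gene IDs.

```python
import re

# Assumption: UniProt accession format as published by UniProt.
UNIPROT_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)
# Assumption: Ensembl gene IDs for human (ENSG), mouse (ENSMUSG)
# and rat (ENSRNOG), each followed by 11 digits.
ENSEMBL_RE = re.compile(r"^ENS(?:MUS|RNO)?G[0-9]{11}$")


def classify_identifier(token: str) -> str:
    """Return 'ensembl', 'uniprot' or 'symbol' for a single token."""
    if ENSEMBL_RE.match(token):
        return "ensembl"
    if UNIPROT_RE.match(token):
        return "uniprot"
    # Anything else is treated as a gene symbol.
    return "symbol"


def parse_gene_list(text: str) -> list[tuple[str, str]]:
    """Split the input on any whitespace and type each token separately."""
    return [(token, classify_identifier(token)) for token in text.split()]


if __name__ == "__main__":
    for token, kind in parse_gene_list("VCAN P13611\nENSG00000038427"):
        print(token, kind)
```

Falling back to "symbol" keeps the classifier permissive: a token that is not recognised as an accession is simply passed on as a gene name and can be rejected later if the nomenclature lookup finds nothing.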

DiscOmicsPS accepts human, rat and mouse genes; however, the use of human genes is highly recommended. Genes that are not human will not be included in the protein-protein interaction networks and will be assigned a network centrality score of zero in downstream steps.

An ideal gene list is neither too small (more than 30 genes) nor too large (fewer than 2,000 genes), and is derived from a high-quality high-throughput experiment with an appropriate statistical threshold for gene selection.

Tip

Online tools, such as DAVID, can be used to clean up gene lists and prepare them for DiscOmicsPS.

Additional search options include:

  • Including articles that do not pass filtering: by default these articles are omitted. A deeper search can also be performed by right-clicking the genes of interest in the Proteins Table after the query is completed (Recommended).
  • Performing the literature search in both EPMC and PubMed: by default only EPMC is searched.
  • Only including gene names in the literature search and omitting full names and aliases: by default all names retrieved from HGNC are included.
  • Supplementing pseudogenes with their corresponding non-pseudogene variants: by default this is disabled.

The Custom module keywords are defined in a simple pop-up window that shows when the Custom search option is enabled. Make sure to include all synonyms to broaden the scope of the search.

Step 1-2: Gene Information Retrieval & Expansion

Once the search is initiated, DiscOmicsPS retrieves standard gene nomenclature information using the HGNC API. HGNC approved and alias names and abbreviations are used for the search; past names and abbreviations are not included. The nomenclature of non-human genes is retrieved through the STRING API. Genes not found on either HGNC or STRING are omitted from the subsequent steps.

In the literature, proteins are often referred to with multiple synonyms, and naming conventions can be loose. DiscOmicsPS implements name-processing and text-handling algorithms to increase the retrieval of relevant articles and simultaneously reduce false positives.

The UniProt API is used to download GO annotations for each of the input genes. If the Proteolysis search is enabled, the STRING database is queried for experimentally confirmed physical interactions between the input genes and the proteases belonging to the selected protease family.

Example HGNC query URL for VCAN gene:
https://rest.genenames.org/fetch/symbol/VCAN

Example UniProt query URL for VCAN gene:
https://www.ebi.ac.uk/proteins/api/proteins/P13611

Example STRING query URL for VCAN:
https://string-db.org/api/json/resolve?identifier=ENSG00000038427&species=9606
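The three example URLs above can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and `fetch_json` assumes that each service honours an `Accept: application/json` header, which should be checked against the respective API documentation. The `species` value is an NCBI taxonomy ID (9606 for human, 10090 for mouse, 10116 for rat).

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

# Base URLs copied from the examples above.
HGNC_BASE = "https://rest.genenames.org/fetch/symbol/"
UNIPROT_BASE = "https://www.ebi.ac.uk/proteins/api/proteins/"
STRING_RESOLVE = "https://string-db.org/api/json/resolve"


def hgnc_url(symbol: str) -> str:
    """URL that fetches the HGNC record for a gene symbol."""
    return HGNC_BASE + quote(symbol, safe="")


def uniprot_url(accession: str) -> str:
    """URL that fetches the protein record for a UniProt accession."""
    return UNIPROT_BASE + quote(accession, safe="")


def string_resolve_url(identifier: str, species: int = 9606) -> str:
    """URL that resolves an identifier in STRING for one species."""
    params = {"identifier": identifier, "species": species}
    return STRING_RESOLVE + "?" + urlencode(params)


def fetch_json(url: str, timeout: float = 30.0):
    """Download and decode a JSON response (hypothetical helper).

    Assumption: the service returns JSON when asked via the Accept header.
    """
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    print(hgnc_url("VCAN"))
    print(uniprot_url("P13611"))
    print(string_resolve_url("ENSG00000038427"))
```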

Step 3: Literature Retrieval

The retrieved gene information is used to query the Europe PMC and PubMed APIs. By default, articles are searched by title and abstract. Each database query consists of multiple 'groups', and a retrieved article must contain at least one search term from each group.

The three modules of DiscOmicsPS implement different types of search query structures.

The Proteolysis module requires a positive hit to contain at least one gene identifier (group 1), at least one protease family identifier (group 2) and at least one verb that indicates proteolysis (group 3).

(geneID<sub>1</sub> OR geneID<sub>2</sub> OR ...) AND (pfID<sub>1</sub> OR pfID<sub>2</sub> OR ...) AND (verb<sub>1</sub> OR verb<sub>2</sub> OR ...)

The Biomarker module requires an article to contain at least one gene identifier (group 1) and the word biomarker or a synonym (group 2).

(geneID<sub>1</sub> OR geneID<sub>2</sub> OR ...) AND (marker<sub>1</sub> OR marker<sub>2</sub> OR ...)

In the Custom module, the first search group contains the gene identifiers (group 1) and all other search groups are defined by the user (groups 2, 3, ...).

(geneID<sub>1</sub> OR geneID<sub>2</sub> OR ...) AND (term<sub>1,1</sub> OR term<sub>1,2</sub> OR ...) AND (term<sub>2,1</sub> OR term<sub>2,2</sub> OR ...) AND ...

Example EPMC decoded query URL in Custom module with 'IgA nephropathy' keywords:

https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=json&resulttype=core&pageSize=1000&query=(TITLE:"cspg2" OR TITLE:"pg-m" OR TITLE:"versican" OR TITLE:"versican proteoglycan" OR TITLE:"vcan" OR ABSTRACT:"cspg2" OR ABSTRACT:"pg-m" OR ABSTRACT:"versican" OR ABSTRACT:"versican proteoglycan" OR ABSTRACT:"vcan") AND (TITLE:"iga nephropathy" OR TITLE:"nephropathy iga" OR ABSTRACT:"iga nephropathy" OR ABSTRACT:"nephropathy iga")
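The grouped structure of the query above (terms joined with OR inside a group, groups joined with AND) can be reproduced with a few lines of Python. This is a sketch under the assumption that every term is lower-cased and searched in both the TITLE and ABSTRACT fields, as in the example URL; the function names are hypothetical and the actual DiscOmicsPS query builder may differ.

```python
from urllib.parse import urlencode

# Endpoint and fixed parameters copied from the example URL above.
EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_group(terms, fields=("TITLE", "ABSTRACT")):
    """OR together every term in every field: (TITLE:"a" OR ... OR ABSTRACT:"a" ...)."""
    clauses = [f'{field}:"{term.lower()}"' for field in fields for term in terms]
    return "(" + " OR ".join(clauses) + ")"


def build_query(groups):
    """AND the groups together, so a hit needs one term from each group."""
    return " AND ".join(build_group(terms) for terms in groups)


def build_search_url(groups, page_size=1000):
    """Encoded Europe PMC search URL for the given groups of terms."""
    params = {
        "format": "json",
        "resulttype": "core",
        "pageSize": page_size,
        "query": build_query(groups),
    }
    return EPMC_SEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    gene_names = ["CSPG2", "PG-M", "versican", "versican proteoglycan", "VCAN"]
    keywords = ["IgA nephropathy", "nephropathy IgA"]
    print(build_query([gene_names, keywords]))
    print(build_search_url([gene_names, keywords]))
```

The Proteolysis and Biomarker modules fit the same shape: they would pass three groups (gene, protease family, verb) or two groups (gene, biomarker synonyms) to `build_query`.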

Step 4: Article Post-Processing

The retrieved articles are then passed through the DiscOmicsPS article post-processing algorithms that eliminate false positives by scanning the article abstract and title for the keywords of interest.
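The in-built filtering algorithms are not described in detail here, so the following is only a minimal sketch of one plausible check: an article passes when its title and abstract contain at least one term from every search group as a whole word or phrase. The boundary rule (no letter or digit directly before or after the term) and the function names are assumptions made for this example.

```python
import re


def contains_term(text: str, term: str) -> bool:
    """True if `term` occurs in `text` as a whole word or phrase.

    Assumption: a match must not be directly preceded or followed by a
    letter or digit, so 'MMP1' does not match inside 'MMP12'.
    """
    pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def passes_filter(title: str, abstract: str, groups) -> bool:
    """True if every group has at least one term in the title or abstract."""
    text = f"{title} {abstract}"
    return all(
        any(contains_term(text, term) for term in group)
        for group in groups
    )


if __name__ == "__main__":
    groups = [["VCAN", "versican"], ["IgA nephropathy"]]
    print(passes_filter("Versican expression in IgA nephropathy", "", groups))
    print(passes_filter("Scanning microscopy in IgA nephropathy", "", groups))
```

A check of this kind removes hits where the search engine matched a fragment of a longer token, which is one common source of false positives for short gene symbols.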

The number of articles remaining after filtering is recorded and presented to the user in the ‘Total Hits’ column of the ‘Protein Table’. By default, only the 50 most recent articles that pass filtering are shown; this limit can be changed in the File -> Settings menu.

In the Proteolysis module, the article abstracts are also scanned for protease subtypes belonging to the selected protease family, in addition to the filtering. Various protease nomenclature formats are supported. The algorithms also recognise concatenated protease subtypes; e.g. matrix metalloprotease-1, -2, -4, and -12.
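To illustrate how a concatenated list such as "matrix metalloprotease-1, -2, -4, and -12" could be expanded into individual subtypes, here is a small regular-expression sketch. It is an assumption-laden example rather than the DiscOmicsPS algorithm: it handles only numeric subtypes, and it requires continuation items to carry a leading hyphen so that unrelated numbers in the sentence are not picked up.

```python
import re


def find_subtypes(text: str, family_name: str) -> list[str]:
    """Return numeric subtypes mentioned after `family_name`, in order.

    Assumptions: subtypes are plain integers; the first number may follow
    the name directly or after a space or hyphen; further list items look
    like ', -2' or ' and -12'.
    """
    head = re.escape(family_name) + r"s?[\s-]*(\d+)"
    tail = r"((?:(?:\s*,\s*|\s+)(?:(?:and|or)\s+)?-\d+)*)"
    pattern = re.compile(head + tail, flags=re.IGNORECASE)

    subtypes = []
    for match in pattern.finditer(text):
        numbers = [match.group(1)] + re.findall(r"-(\d+)", match.group(2))
        for number in numbers:
            if number not in subtypes:  # keep first mention only
                subtypes.append(number)
    return subtypes


if __name__ == "__main__":
    sentence = "Cleaved by matrix metalloprotease-1, -2, -4, and -12 in vitro."
    print(find_subtypes(sentence, "matrix metalloprotease"))
    print(find_subtypes("Both MMP-2 and MMP9 were detected.", "MMP"))
```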

In the Biomarker module, the articles are also scanned for tissues and body fluids, including urine, blood, saliva, and one additional term that the user can define.

Step 5: Retrieval of Protein Interaction Networks

The STRING API is queried for a protein interaction network of the input genes. The node degree centrality scores are computed for each input gene, where the score assigned to each node is the sum of the weights of all its edges normalised to the logarithm of the size of the network.
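The centrality definition above translates directly into code. The sketch below assumes that the network is given as a list of weighted edges, that the network size is the number of input genes (including those without any edge), and that the natural logarithm is used; none of these details is specified by the description, so treat them as illustrative choices.

```python
import math


def degree_centrality(edges, nodes):
    """Weighted degree of each node divided by log(network size).

    edges: iterable of (gene_a, gene_b, weight) tuples.
    nodes: all genes in the network; genes without edges score 0.0.
    """
    nodes = list(nodes)
    totals = {node: 0.0 for node in nodes}
    for gene_a, gene_b, weight in edges:
        # Each edge contributes its weight to both of its end points.
        totals[gene_a] = totals.get(gene_a, 0.0) + weight
        totals[gene_b] = totals.get(gene_b, 0.0) + weight

    # Guard: log(1) is 0, so skip normalisation for a single-node network.
    norm = math.log(len(nodes)) if len(nodes) > 1 else 1.0
    return {node: totals[node] / norm for node in nodes}


if __name__ == "__main__":
    edges = [("A", "B", 0.9), ("A", "C", 0.4)]
    for gene, score in degree_centrality(edges, ["A", "B", "C", "D"]).items():
        print(gene, round(score, 3))
```

In this example gene D has no interactions and therefore scores zero, which mirrors how genes absent from the interaction network are treated.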

Step 6: Gene ‘Novelty’ Scoring

Each gene is assigned scores for its novelty and network centrality. These metrics are scaled to fall within a desired range and have a balanced effect on the compound score. The network degree centrality score is scaled by dividing it by the logarithm of the total number of nodes in the network. The number of retrieved articles is scaled by taking the logarithm. In this way, each additional article has a smaller incremental effect on the overall score. Raw values can also be exported by right-clicking the respective table where these are shown.

Score interfaces can be accessed from the Summary menu for each of the search modules (Proteolysis, Biomarker and Custom). The image below shows the Summary Score interface, where different metrics can be calculated and plotted. For example:

  • P/C metric: the number of proteolysis articles divided by the number of custom articles. A high score corresponds to a protein that can likely be cleaved by the selected family of proteases, and is novel in the context of the Custom module keywords.
  • B/C metric: the number of biomarker articles divided by the number of custom articles. A high score corresponds to a protein with good biomarker potential that is novel in the context of the Custom module keywords.
  • 1/C metric: the reciprocal of the number of custom articles. A high score corresponds to a protein that is novel in the context of the Custom module keywords.
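The exact scaling used by DiscOmicsPS is not spelled out here, so the following is a hedged sketch of how log-scaled article counts could be combined into the three metrics. Two choices are assumptions made purely for illustration: counts are scaled with `log(1 + n)` so that zero articles is well defined, and 1 is added to the denominator so that a gene with no custom articles does not cause a division by zero.

```python
import math


def log_scale(count: int) -> float:
    """Log-scale an article count; assumption: log(1 + n) to allow n = 0."""
    return math.log1p(count)


def ratio_metric(numerator_count: int, custom_count: int) -> float:
    """P/C or B/C style metric on log-scaled counts.

    Assumption: 1 is added to the denominator to avoid division by zero.
    """
    return log_scale(numerator_count) / (1.0 + log_scale(custom_count))


def novelty_metric(custom_count: int) -> float:
    """1/C style metric: fewer custom articles means higher novelty."""
    return 1.0 / (1.0 + log_scale(custom_count))


if __name__ == "__main__":
    # Hypothetical article counts: (proteolysis, biomarker, custom).
    counts = {"GENE_A": (12, 3, 0), "GENE_B": (12, 3, 150)}
    for gene, (p, b, c) in counts.items():
        print(gene, ratio_metric(p, c), ratio_metric(b, c), novelty_metric(c))
```

With these choices, two genes with the same number of proteolysis articles are ranked by how little has been published on them in the custom context, and each additional article changes the score less than the previous one.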

Timing and Computational Resources

Sets of up to 100 genes usually take between 2 and 15 minutes to analyse. Large and demanding gene lists could take up to an hour. The run time depends on the size of the list, the search modules enabled, the type of keywords used in the Custom search, the speed of the internet connection and the computer's specifications. Gene sets that are too demanding could fail to finish because local memory is exhausted.

The search can be made lighter in the following ways:

  • Disable the Biomarker search module.
  • Make the Custom keywords less general.
  • Reduce the size of the input gene list.
  • Use a faster and more stable internet connection.
  • Decrease the number of retrieved articles in the File -> Settings menu.
