Graphical Models for Mixed Multi Modal Data

GaryBAYLOR edited this page Mar 16, 2017 · 9 revisions

Background

Graphical models provide a powerful and flexible framework for understanding complex multivariate data. These models, sometimes also referred to as network models, capture dependencies in multivariate data, allowing statisticians to discover underlying connections among measured variables. These models have been widely used in applied statistics and machine learning, with particular success in genetics, neuroscience, and finance. Perhaps the clearest sign of their importance to modern data science is the CRAN Task View devoted to “gRaphical Models in R” [1].

Much of the early work on graphical models assumed that the data followed a multivariate Gaussian distribution. In this case, the graphical model is fully specified by the (inverse) covariance matrix of the normal distribution. While mathematically tractable and easy to fit, this assumption is often violated by real-world data sets. In particular, it implies that each variable has a univariate Gaussian distribution, which is implausible for binary or count data. To address this limitation, Yang et al. developed mixed or multi-modal graphical models, which allow each variable to follow a different type of distribution [2,3]. These models allow for the joint analysis of data of different types (e.g. continuous data, binary data, count data, proportions) in a single graphical model.
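
As a brief illustration of the Gaussian case described above, the relationship between the inverse covariance matrix and the graph can be written in standard notation. The symbols \mu, \Sigma, \Theta, s, and t are introduced here for illustration and do not come from the proposal. An entry of \Theta is zero exactly when the corresponding pair of variables is conditionally independent given all the others, which is what a missing edge in the graph encodes.

```latex
p(x) \propto \exp\!\Big( -\tfrac{1}{2} (x - \mu)^{\top} \Theta \, (x - \mu) \Big),
\qquad \Theta = \Sigma^{-1},
\qquad
\Theta_{st} = 0 \iff X_s \perp X_t \mid X_{\setminus \{s,t\}} .
```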

In this project, we propose a new package to make mixed graphical models readily available to a wide audience. The proposed package will allow for fitting, simulating from, and visualizing mixed graphical models. We anticipate that having an easy-to-use R package will increase adoption of these powerful new models.

[1] https://CRAN.R-project.org/view=gR

[2] Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, and Zhandong Liu. “Graphical Models via Univariate Exponential Family Distributions.” Journal of Machine Learning Research 16 (2015), 3813-3847. https://www.jmlr.org/papers/volume16/yang15a/yang15a.pdf

[3] See also arXiv 1411.0288 and references therein.

Related work

A number of existing R packages [1] provide estimation for Gaussian graphical models. In particular, the huge package [2] provides similar functionality to our proposed package, but only for Gaussian graphical models and minor extensions thereof. The XMRF package [3], developed by two of the mentors, fits graphical models with an arbitrary distribution for the nodes, but does not allow for mixed data types across different nodes. No existing software, for R or otherwise, currently handles mixed graphical models.

The visualization component of the proposed package will build upon the RCytoscape Bioconductor package [4] and the underlying Cytoscape visualization library. While the glmnet package [5] is widely used for L1-penalized regression, the solutions it provides are not accurate to high precision and are not sufficiently robust for use in mixed graphical models; furthermore, it is written in an obscure Fortran dialect. Michael Kane’s pirls [6] library is written in standard C++ but inherits many of the weaknesses of the algorithms used by glmnet.

[1] E.g. glasso, QUIC, and GGMselect

[2] https://CRAN.R-project.org/package=huge

[3] https://CRAN.R-project.org/package=XMRF; http://dx.doi.org/10.1186/s12918-016-0313-0

[4] http://bioconductor.org/packages/RCytoscape

[5] https://CRAN.R-project.org/package=glmnet

[6] http://github.com/kaneplusplus/pirls

Details of your coding project

  • Fitting: Specialized algorithms are required to fit mixed graphical models efficiently, accurately, and robustly. These methods require solving a very large number of L1-penalized GLMs, so we will implement a high-performance solver in C++. Further computational gains can be achieved by parallelizing the fitting process: the edges of the graph are estimated node-wise, giving an embarrassingly parallel algorithm. To take advantage of this, the package will use the flexible foreach framework, which will allow the end-user to seamlessly select from a wide range of parallel computing strategies. Estimated time: 5 weeks.
  • Visualization: Graphical models naturally lend themselves to elegant visualizations. The package will provide visualizations using the Cytoscape graph visualization library, building upon the Bioconductor RCytoscape package. If time allows, interactive visualizations based on the Cytoscape.js library will also be implemented. Estimated time: 3 weeks.
  • Sampling: Two key steps in any statistical analysis are i) model checking; and ii) providing an accurate measure of the variability of the estimated model. The ability to simulate data from a model is essential for both of these steps. Straightforward Gibbs sampling techniques can be used to generate data from mixed graphical models, but these iterative algorithms are often slow when implemented in pure R. The package will contain a high-performance C++ sampler to generate synthetic data from arbitrary mixed graphical models. Estimated time: 4 weeks.
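
To make the sampling component more concrete, the following is a minimal sketch of a Gibbs sampler for a small pairwise model that mixes Gaussian and Bernoulli nodes. It is an illustration under stated assumptions, not the proposed package's design: the names NodeType, gibbs_sample, theta, and Theta are hypothetical, Gaussian nodes are assumed to have unit conditional variance, and Bernoulli nodes use 0/1 coding. Each node is redrawn from its conditional distribution given the current values of all other nodes; successive draws are correlated because no thinning is applied.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical node types for this sketch (not from any existing package).
enum class NodeType { Gaussian, Bernoulli };

// Gibbs sampler for a pairwise model with density proportional to
//   exp( sum_s theta[s] x_s + sum_{s<t} Theta[s][t] x_s x_t - sum_{Gaussian s} x_s^2 / 2 ).
// Assumptions: Theta is symmetric with a zero diagonal, and the
// Gaussian-Gaussian block is weak enough for the model to be normalizable.
std::vector<std::vector<double>> gibbs_sample(
        const std::vector<NodeType>& types,
        const std::vector<double>& theta,
        const std::vector<std::vector<double>>& Theta,
        std::size_t n_samples, std::size_t burn_in, unsigned seed) {
    const std::size_t p = types.size();
    assert(theta.size() == p && Theta.size() == p);

    std::mt19937 rng(seed);
    std::normal_distribution<double> std_normal(0.0, 1.0);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    std::vector<double> x(p, 0.0);  // current state of the chain
    std::vector<std::vector<double>> draws;
    draws.reserve(n_samples);

    for (std::size_t sweep = 0; sweep < burn_in + n_samples; ++sweep) {
        for (std::size_t s = 0; s < p; ++s) {
            // Natural parameter of node s given all other nodes.
            double eta = theta[s];
            for (std::size_t t = 0; t < p; ++t) {
                if (t != s) eta += Theta[s][t] * x[t];
            }
            if (types[s] == NodeType::Gaussian) {
                x[s] = eta + std_normal(rng);  // N(eta, 1)
            } else {
                const double prob_one = 1.0 / (1.0 + std::exp(-eta));
                x[s] = (unif(rng) < prob_one) ? 1.0 : 0.0;
            }
        }
        if (sweep >= burn_in) draws.push_back(x);  // keep one draw per sweep
    }
    return draws;
}
```

For example, with one Gaussian and one Bernoulli node joined by a positive edge weight, the Gaussian node's draws should be shifted upward whenever the Bernoulli node equals 1. Other node types (e.g. Poisson counts) would need their own conditional draws and normalizability conditions, which this sketch does not cover.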

Expected Impact

The proposed package will make mixed graphical models widely available for the first time. By providing robust and efficient tools for fitting, visualization, and simulation, the package will allow the use of mixed graphical models throughout the data analysis process.

Mentors

  • Dr. Genevera Allen [Theory and Algorithms]

    Departments of Statistics and ECE, Rice University; Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine and Texas Children’s Hospital; http://www.stat.rice.edu/~gallen

  • Dr. Zhandong Liu [Implementation]

    Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine and Texas Children’s Hospital; http://www.liuzlab.org/

  • Michael Weylandt [Implementation]

    Department of Statistics, Rice University

Tests

Potential applicants must:

  1. Implement L1-penalized Poisson (log-linear) regression in portable, standard C++;
  2. Wrap their implementation using Rcpp;
  3. Test their implementation using testthat;
  4. Package their implementation and pass R CMD check on at least two of the three major platforms: Windows, macOS, and Linux (Debian/Ubuntu).
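
As a rough illustration of what the first requirement involves, the sketch below fits an L1-penalized Poisson regression using proximal gradient descent with backtracking. This is a simple technique chosen for brevity; it is not the algorithm used by glmnet and not necessarily the one the mentors have in mind. All names (Matrix, PoissonLassoFit, soft_threshold, poisson_loss, fit_poisson_lasso) are hypothetical. The objective is assumed to be the average negative Poisson log-likelihood plus lambda times the L1 norm of the coefficients, with an unpenalized intercept and no standardization of the predictors.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // n rows, p columns

// Hypothetical result type for this sketch.
struct PoissonLassoFit {
    double intercept;
    std::vector<double> beta;
};

// Soft-thresholding: the proximal operator of t * |z|.
double soft_threshold(double z, double t) {
    if (z > t) return z - t;
    if (z < -t) return z + t;
    return 0.0;
}

// Smooth part of the objective: (1/n) * sum_i [exp(eta_i) - y_i * eta_i].
double poisson_loss(const Matrix& X, const std::vector<double>& y,
                    double b0, const std::vector<double>& beta) {
    double loss = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        double eta = b0;
        for (std::size_t j = 0; j < beta.size(); ++j) eta += X[i][j] * beta[j];
        loss += std::exp(eta) - y[i] * eta;
    }
    return loss / static_cast<double>(X.size());
}

// Minimizes poisson_loss + lambda * ||beta||_1 (intercept unpenalized).
PoissonLassoFit fit_poisson_lasso(const Matrix& X, const std::vector<double>& y,
                                  double lambda, int max_iter = 10000,
                                  double tol = 1e-10) {
    const std::size_t n = X.size();
    assert(n > 0 && y.size() == n && lambda >= 0.0);
    const std::size_t p = X[0].size();
    double b0 = 0.0, step = 1.0;
    std::vector<double> beta(p, 0.0), grad(p), cand(p);
    for (int it = 0; it < max_iter; ++it) {
        // Gradient of the smooth part at the current iterate.
        double g0 = 0.0;
        for (std::size_t j = 0; j < p; ++j) grad[j] = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            double eta = b0;
            for (std::size_t j = 0; j < p; ++j) eta += X[i][j] * beta[j];
            const double resid = (std::exp(eta) - y[i]) / static_cast<double>(n);
            g0 += resid;
            for (std::size_t j = 0; j < p; ++j) grad[j] += resid * X[i][j];
        }
        const double f_old = poisson_loss(X, y, b0, beta);
        double cand0 = b0;
        // Backtracking: halve the step until the quadratic upper bound holds.
        for (;;) {
            cand0 = b0 - step * g0;
            double d = cand0 - b0;
            double bound = f_old + g0 * d + d * d / (2.0 * step);
            for (std::size_t j = 0; j < p; ++j) {
                cand[j] = soft_threshold(beta[j] - step * grad[j], step * lambda);
                d = cand[j] - beta[j];
                bound += grad[j] * d + d * d / (2.0 * step);
            }
            if (poisson_loss(X, y, cand0, cand) <= bound + 1e-12 || step < 1e-12) break;
            step *= 0.5;
        }
        // Stop when the iterate has essentially stopped moving.
        double change = std::fabs(cand0 - b0);
        for (std::size_t j = 0; j < p; ++j)
            change = std::fmax(change, std::fabs(cand[j] - beta[j]));
        b0 = cand0;
        beta = cand;
        if (change < tol) break;
    }
    return PoissonLassoFit{b0, beta};
}
```

With a sufficiently large lambda every coefficient is shrunk exactly to zero and the intercept converges to the log of the mean response; with lambda equal to zero the sketch reduces to plain gradient descent on the unpenalized Poisson likelihood. A submission would still need to be wrapped with Rcpp and tested with testthat, as the remaining items require.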

Solutions of Tests

Numerical results will be compared against glmnet::glmnet(..., family='poisson'). Mentors will check that the package passes R CMD check without any WARNING(s) or ERROR(s).

Test Solution: https://github.com/GaryBAYLOR/testRepo.git
