Graphical Models for Mixed Multi Modal Data

michaelweylandt edited this page Feb 10, 2017 · 9 revisions

Background

Graphical models provide a powerful and flexible framework for understanding complex multivariate data. These models, sometimes also referred to as network models, capture dependencies in multivariate data, allowing statisticians to discover underlying connections among measured variables. These models have been widely used in modern applied statistics and machine learning, with particular success in genetics, neuroscience, and finance. Perhaps the clearest sign of their importance to modern data science is the CRAN Task View devoted to “gRaphical Models in R” [1].

Classical approaches to graphical models rely heavily on properties of the multivariate Gaussian and are not applicable to non-Gaussian data. To address this significant limitation, Yang et al. developed mixed or multi-modal graphical models, which allow for arbitrary distributions for each node [2,3]. These models allow for the joint analysis of data of different modalities (e.g. continuous data, binary data, count data, proportions) in a single model.

In this project, we propose a new package to make mixed graphical models readily available to a wide audience. The proposed package will allow for fitting, simulating from, and visualizing mixed graphical models. We anticipate that having an easy-to-use R package will increase adoption of these powerful new models.

[1] https://CRAN.R-project.org/view=gR

[2] Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, and Zhandong Liu. “Graphical Models via Univariate Exponential Family Distributions.” Journal of Machine Learning Research 16 (2015), 3813-3847. https://www.jmlr.org/papers/volume16/yang15a/yang15a.pdf

[3] See also arXiv:1411.0288 and references therein.

Related work

A number of existing packages [1] provide graphical model estimation in R, but for Gaussian graphical models only. The huge package [2] provides functionality similar to that of our proposed package, though, again, only for Gaussian graphical models and minor extensions thereof. The XMRF package [3], developed by two of the mentors, fits graphical models with an arbitrary distribution for the nodes, but does not allow for mixed data types. No existing software, for R or otherwise, currently allows for mixed graphical models.

The visualization component of the proposed package will build upon the RCytoscape Bioconductor package [4]. While the glmnet package [5] is widely used for L1-penalized regression, the solutions it provides are not accurate to high precision and are not sufficiently robust for use in mixed graphical models; furthermore, it is written in an obscure Fortran dialect. Michael Kane’s pirls [6] library is written in standard C++ but inherits many of the algorithmic weaknesses of glmnet.

[1] E.g. glasso, QUIC, and GGMselect

[2] https://CRAN.R-project.org/package=huge

[3] https://CRAN.R-project.org/package=XMRF

[4] http://bioconductor.org/packages/RCytoscape

[5] https://CRAN.R-project.org/package=glmnet

[6] http://github.com/kaneplusplus/pirls

Details of your coding project

  • Fitting: Specialized algorithms are required to fit mixed graphical models efficiently, accurately, and robustly. These methods require solving a very large number of L1-penalized GLMs, so we will implement a high-performance L1-penalized GLM solver in C++ using Rcpp. The graph edges are estimated node-wise (one penalized regression per node), giving a highly parallelizable algorithm; to let users take advantage of this parallelism, our fitting routines will be built around the flexible foreach framework. Estimated time: 5 weeks.
  • Visualization: Graphical models naturally lend themselves to elegant visualizations. The package will provide visualizations using the Cytoscape graph visualization library, building upon the Bioconductor RCytoscape package. Estimated time: 3 weeks.
  • Sampling: Two key steps in any statistical analysis are i) model checking; and ii) providing an accurate measure of the variability of the estimated model. The ability to generate data from the model is essential for both of these steps. Gibbs samplers can be used to generate data from mixed graphical models. We will develop a high-performance C++ sampling routine that generates data efficiently from arbitrary mixed graphical models. Estimated time: 4 weeks.
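
As an illustration of the sampling component, the following is a minimal sketch of a Gibbs sampler for a pairwise model that mixes unit-variance Gaussian nodes and Bernoulli nodes. It is not the proposed package's implementation: the names NodeType, MixedModel, and gibbs_sample are hypothetical, only two node types are covered, and the caller is assumed to supply a symmetric edge matrix for which the joint distribution is normalizable. Each sweep redraws every node from its conditional distribution, whose natural parameter is the node parameter plus the edge-weighted sum of the neighbouring values.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical types for this sketch; not taken from any existing package.
enum class NodeType { Gaussian, Bernoulli };

struct MixedModel {
  std::vector<NodeType> type;                   // distribution of each node
  std::vector<double> theta_node;               // node-wise parameters
  std::vector<std::vector<double>> theta_edge;  // symmetric edge weights (diagonal ignored)
};

// Draws n_samples states after burn_in full sweeps. Normalizability of the
// joint distribution is assumed, not checked.
std::vector<std::vector<double>> gibbs_sample(const MixedModel& m,
                                              int n_samples, int burn_in,
                                              unsigned seed) {
  const std::size_t p = m.type.size();
  assert(m.theta_node.size() == p && m.theta_edge.size() == p);
  assert(n_samples >= 0 && burn_in >= 0);

  std::mt19937 rng(seed);
  std::normal_distribution<double> std_normal(0.0, 1.0);
  std::uniform_real_distribution<double> unif(0.0, 1.0);

  std::vector<double> x(p, 0.0);  // current state, started at zero
  std::vector<std::vector<double>> out;
  out.reserve(static_cast<std::size_t>(n_samples));

  for (int sweep = 0; sweep < burn_in + n_samples; ++sweep) {
    for (std::size_t s = 0; s < p; ++s) {
      // Natural parameter of node s given all other nodes.
      double eta = m.theta_node[s];
      for (std::size_t t = 0; t < p; ++t) {
        if (t != s) eta += m.theta_edge[s][t] * x[t];
      }
      if (m.type[s] == NodeType::Gaussian) {
        x[s] = eta + std_normal(rng);  // N(eta, 1)
      } else {
        const double prob = 1.0 / (1.0 + std::exp(-eta));  // logistic link
        x[s] = (unif(rng) < prob) ? 1.0 : 0.0;
      }
    }
    if (sweep >= burn_in) out.push_back(x);
  }
  return out;
}
```

For a two-node model with one Gaussian and one Bernoulli node joined by a positive edge, a call such as gibbs_sample(model, 5000, 500, 1u) returns 5000 states in which the binary node takes only the values 0 and 1 and the Gaussian node is shifted upward whenever the binary node is 1.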

Expected Impact

The proposed package will make mixed graphical models widely available for the first time. By providing robust and efficient tools for fitting, visualization, and simulation, the package will allow the use of mixed graphical models throughout the data analysis process.

Mentors

  • Dr. Genevera Allen [Theory and Algorithms]

    Departments of Statistics and ECE, Rice University; Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine and Texas Children’s Hospital; http://www.stat.rice.edu/~gallen

  • Dr. Zhandong Liu [Implementation]

    Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine and Texas Children’s Hospital; http://www.liuzlab.org/

  • Michael Weylandt [Implementation]

    Department of Statistics, Rice University

Tests

Potential applicants must:

  1. Implement L1-penalized Poisson (log-linear) regression in portable, standard C++;
  2. Wrap their implementation using Rcpp;
  3. Test their implementation using testthat;
  4. Package their implementation and pass R CMD check on at least two of Windows, macOS, and Linux (Debian/Ubuntu).
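
To give a sense of the scale of the first test, here is a minimal sketch of an L1-penalized Poisson regression solver in standard C++. It is one possible approach, not a reference solution: it uses proximal gradient descent with backtracking rather than any particular algorithm the mentors may have in mind, and the names PoissonLassoFit, soft_threshold, poisson_loss, and fit_poisson_lasso are hypothetical. The objective is assumed to be (1/n) * sum_i [exp(eta_i) - y_i * eta_i] + lambda * ||beta||_1 with eta_i = intercept + x_i' beta and an unpenalized intercept. Predictors are used as given, whereas glmnet standardizes them by default, so a numerical comparison would need matching settings.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch; not part of any existing package.
struct PoissonLassoFit {
  double intercept;
  std::vector<double> beta;
};

// Soft-thresholding: the proximal operator of t * |z|.
inline double soft_threshold(double z, double t) {
  if (z > t) return z - t;
  if (z < -t) return z + t;
  return 0.0;
}

// Smooth part of the objective: (1/n) * sum_i [exp(eta_i) - y_i * eta_i].
double poisson_loss(const std::vector<std::vector<double>>& X,
                    const std::vector<double>& y,
                    double b0, const std::vector<double>& b) {
  double loss = 0.0;
  for (std::size_t i = 0; i < X.size(); ++i) {
    double eta = b0;
    for (std::size_t j = 0; j < b.size(); ++j) eta += X[i][j] * b[j];
    loss += std::exp(eta) - y[i] * eta;
  }
  return loss / static_cast<double>(X.size());
}

// Proximal gradient descent with backtracking; the intercept is unpenalized.
PoissonLassoFit fit_poisson_lasso(const std::vector<std::vector<double>>& X,
                                  const std::vector<double>& y, double lambda,
                                  int max_iter = 5000, double tol = 1e-10) {
  assert(!X.empty() && X.size() == y.size() && lambda >= 0.0);
  const std::size_t n = X.size(), p = X[0].size();
  double b0 = 0.0, step = 1.0;
  std::vector<double> b(p, 0.0), grad(p), b_new(p);

  for (int it = 0; it < max_iter; ++it) {
    // Gradient of the smooth loss at the current iterate.
    double g0 = 0.0;
    std::fill(grad.begin(), grad.end(), 0.0);
    for (std::size_t i = 0; i < n; ++i) {
      double eta = b0;
      for (std::size_t j = 0; j < p; ++j) eta += X[i][j] * b[j];
      const double r = (std::exp(eta) - y[i]) / static_cast<double>(n);
      g0 += r;
      for (std::size_t j = 0; j < p; ++j) grad[j] += r * X[i][j];
    }
    const double f_old = poisson_loss(X, y, b0, b);

    // Halve the step until the quadratic upper bound on the loss holds.
    double b0_new = b0;
    for (;;) {
      b0_new = b0 - step * g0;
      double d = b0_new - b0;
      double bound = f_old + g0 * d + d * d / (2.0 * step);
      for (std::size_t j = 0; j < p; ++j) {
        b_new[j] = soft_threshold(b[j] - step * grad[j], step * lambda);
        d = b_new[j] - b[j];
        bound += grad[j] * d + d * d / (2.0 * step);
      }
      if (poisson_loss(X, y, b0_new, b_new) <= bound + 1e-12) break;
      step *= 0.5;
    }

    // Stop once no coefficient moves by more than tol.
    double change = std::fabs(b0_new - b0);
    for (std::size_t j = 0; j < p; ++j) {
      change = std::max(change, std::fabs(b_new[j] - b[j]));
    }
    b0 = b0_new;
    b = b_new;
    if (change < tol) break;
  }
  return PoissonLassoFit{b0, b};
}
```

A quick sanity check for an implementation of this kind: when lambda is large enough, every penalized coefficient is exactly zero and the intercept equals the log of the mean response. Wrapping the function with Rcpp and testing it with testthat, as the remaining steps require, is left out of this sketch.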

Solutions of Tests

Numerical results will be compared against glmnet::glmnet(..., family='poisson'). Mentors will check that the package passes R CMD check.
