Graphical Models for Mixed Multi Modal Data
Graphical models provide a powerful and flexible framework for understanding complex multivariate data. These models, sometimes also referred to as network models, capture dependencies in multivariate data, allowing statisticians to discover underlying connections among measured variables. These models have been widely used in modern applied statistics and machine learning, with particular success in genetics, neuroscience, and finance. Perhaps the clearest sign of their importance to modern data science is the CRAN Task View devoted to “gRaphical Models in R” [1].
Classical approaches to graphical models rely heavily on properties of the multivariate Gaussian and are not applicable to non-Gaussian data. To address this significant limitation, Yang et al. developed mixed or multi-modal graphical models, which allow for arbitrary distributions for each node [2,3]. These models allow for the joint analysis of data of different modalities (e.g. continuous data, binary data, count data, proportions) in a single model.
In this project, we propose a new package to make mixed graphical
models readily available to a wide audience. The proposed package will
allow for fitting, simulating from, and visualizing mixed graphical
models. We anticipate that having an easy-to-use R package will
increase adoption of these powerful new models.
[1] https://CRAN.R-project.org/view=gR
[2] Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, and Zhandong Liu. “Graphical Models via Univariate Exponential Family Distributions.” Journal of Machine Learning Research 16 (2015), 3813-3847. https://www.jmlr.org/papers/volume16/yang15a/yang15a.pdf
[3] See also arXiv:1411.0288 and references therein.
A number of existing packages [1] provide graphical model estimation
in R for Gaussian graphical models only. The huge package [2] provides
similar functionality to our proposed package, though, again, only for
Gaussian graphical models and minor extensions thereof. The XMRF
package [3], developed by two of the mentors, fits graphical models
with an arbitrary distribution for the nodes, but does not allow for
mixed data types. No existing software, for R or otherwise, currently
allows for mixed graphical models.
The visualization component of the proposed package will build upon
the RCytoscape Bioconductor package [4]. While the glmnet package [5]
is widely used for L1-penalized regression, the solutions it provides
are not accurate to high precision and are not sufficiently robust for
use in mixed graphical models; furthermore, it is written in an
obscure Fortran dialect. Michael Kane’s pirls library [6] is written
in standard C++ but inherits many of the algorithmic weaknesses of
glmnet.
[1] E.g. glasso, QUIC, and GGMselect
[2] https://CRAN.R-project.org/package=huge
[3] https://CRAN.R-project.org/package=XMRF
[4] http://bioconductor.org/packages/RCytoscape
[5] https://CRAN.R-project.org/package=glmnet
[6] http://github.com/kaneplusplus/pirls
- Fitting: Specialized algorithms are required to fit mixed graphical
  models efficiently, accurately, and robustly. These methods require
  solving a very large number of L1-penalized GLMs, so we will
  implement a high-performance L1-penalized GLM solver in C++ using
  Rcpp. The graph edges are estimated node-wise, giving a highly
  parallelizable algorithm; to let users take advantage of this
  parallelism, our fitting routines will be built around the flexible
  foreach framework. Estimated time: 5 weeks.
- Visualization: Graphical models naturally lend themselves to
  elegant visualizations. The package will provide visualizations
  using the cytoscape graph visualization library, building upon the
  Bioconductor RCytoscape package. Estimated time: 3 weeks.
- Sampling: Two key steps in any statistical analysis are i) model
  checking; and ii) providing an accurate measure of the variability
  of the estimated model. The ability to generate data from the model
  is essential for both of these steps. Gibbs samplers can be used to
  generate data from mixed graphical models. A high-performance C++
  sampling routine will be developed, which allows for efficient data
  generation from arbitrary mixed graphical models. Estimated time:
  4 weeks.
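
As a rough illustration of the sampling step, the sketch below runs a Gibbs sampler on a deliberately simplified special case: a pairwise model in which every node is binary, so each node's conditional distribution is Bernoulli with a logistic link. A mixed model would instead use a different univariate exponential-family conditional for each node type. The function names gibbs_sweep and gibbs_sample, the row-major layout of theta, and the all-zero starting state are assumptions made for this example and are not part of the proposed package.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not from any existing package): one full Gibbs sweep
// over a binary pairwise model. theta is a symmetric p x p matrix in row-major
// order; theta[j*p + j] is the bias of node j and theta[j*p + k] is the weight
// of edge (j, k). Each node is redrawn from its Bernoulli conditional
//   P(x_j = 1 | rest) = 1 / (1 + exp(-(theta_jj + sum_{k != j} theta_jk x_k))).
void gibbs_sweep(std::vector<int>& x, const std::vector<double>& theta,
                 std::mt19937& rng) {
    const std::size_t p = x.size();
    assert(theta.size() == p * p);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    for (std::size_t j = 0; j < p; ++j) {
        double eta = theta[j * p + j];
        for (std::size_t k = 0; k < p; ++k) {
            if (k != j) eta += theta[j * p + k] * x[k];
        }
        const double prob_one = 1.0 / (1.0 + std::exp(-eta));
        x[j] = (unif(rng) < prob_one) ? 1 : 0;
    }
}

// Run burn_in sweeps from the all-zero state, then record one state per sweep.
std::vector<std::vector<int>> gibbs_sample(const std::vector<double>& theta,
                                           std::size_t p, std::size_t n_samples,
                                           std::size_t burn_in, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<int> x(p, 0);
    for (std::size_t b = 0; b < burn_in; ++b) gibbs_sweep(x, theta, rng);
    std::vector<std::vector<int>> samples;
    samples.reserve(n_samples);
    for (std::size_t s = 0; s < n_samples; ++s) {
        gibbs_sweep(x, theta, rng);
        samples.push_back(x);
    }
    return samples;
}
```

A call such as gibbs_sample(theta, p, 1000, 100, 1) would return 1000 states after 100 burn-in sweeps. Successive states are correlated, so thinning or longer runs may be needed before using the draws for model checking.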
The proposed package will make mixed graphical models widely available for the first time. By providing robust and efficient tools for fitting, visualization, and simulation, the package will allow the use of mixed graphical models throughout the data analysis process.
- Dr. Genevera Allen [Theory and Algorithms]
Departments of Statistics and ECE, Rice University; Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine and Texas Children’s Hospital; http://www.stat.rice.edu/~gallen
- Dr. Zhandong Liu [Implementation]
Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine and Texas Children’s Hospital; http://www.liuzlab.org/
- Michael Weylandt [Implementation]
Department of Statistics, Rice University
Potential applicants must:
- Implement L1-penalized Poisson (log-linear) regression in portable, standard C++;
- Wrap their implementation using Rcpp;
- Test their implementation using testthat;
- Package their implementation and pass R CMD check on at least two of Windows, macOS, and Linux (Debian/Ubuntu).
Numerical results will be compared against glmnet::glmnet(..., family='poisson'). Mentors will check that the package passes R CMD check.
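
One possible starting point for the first requirement, sketched under stated assumptions, is proximal gradient descent (ISTA) with backtracking on the objective (1/n) * sum_i [exp(eta_i) - y_i * eta_i] + lambda * ||beta||_1, where eta_i = b0 + x_i' beta and the intercept b0 is left unpenalized. This is not the algorithm glmnet uses (glmnet is based on coordinate descent), and the names PoissonFit, soft_threshold, poisson_nll, and fit_l1_poisson are hypothetical. The scaling of the likelihood by 1/n is an assumption chosen to resemble glmnet's convention; because glmnet standardizes predictors by default, a numerical comparison would likely also need standardize = FALSE and a matching lambda.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch.
struct PoissonFit {
    double intercept;
    std::vector<double> beta;
};

// Proximal operator of t * |z| (soft-thresholding).
inline double soft_threshold(double z, double t) {
    if (z > t) return z - t;
    if (z < -t) return z + t;
    return 0.0;
}

// Smooth part of the objective: mean Poisson negative log-likelihood with the
// log(y!) term dropped. X is n x p in row-major order.
double poisson_nll(const std::vector<double>& X, const std::vector<double>& y,
                   std::size_t n, std::size_t p, double b0,
                   const std::vector<double>& beta) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double eta = b0;
        for (std::size_t j = 0; j < p; ++j) eta += X[i * p + j] * beta[j];
        total += std::exp(eta) - y[i] * eta;
    }
    return total / static_cast<double>(n);
}

PoissonFit fit_l1_poisson(const std::vector<double>& X,
                          const std::vector<double>& y, std::size_t n,
                          std::size_t p, double lambda,
                          std::size_t max_iter = 10000, double tol = 1e-10) {
    assert(X.size() == n * p && y.size() == n && lambda >= 0.0);
    PoissonFit fit{0.0, std::vector<double>(p, 0.0)};
    std::vector<double> grad(p), cand(p);
    double step = 1.0;
    for (std::size_t it = 0; it < max_iter; ++it) {
        // Gradient of the smooth part at the current iterate.
        double g0 = 0.0;
        std::fill(grad.begin(), grad.end(), 0.0);
        for (std::size_t i = 0; i < n; ++i) {
            double eta = fit.intercept;
            for (std::size_t j = 0; j < p; ++j) eta += X[i * p + j] * fit.beta[j];
            const double r = std::exp(eta) - y[i];
            g0 += r;
            for (std::size_t j = 0; j < p; ++j) grad[j] += r * X[i * p + j];
        }
        g0 /= static_cast<double>(n);
        for (std::size_t j = 0; j < p; ++j) grad[j] /= static_cast<double>(n);
        const double f_old = poisson_nll(X, y, n, p, fit.intercept, fit.beta);
        // Small slack so rounding noise near convergence does not shrink the step.
        const double slack = 1e-12 * (1.0 + std::fabs(f_old));

        // Backtracking: halve the step until the quadratic upper bound holds.
        double cand0 = fit.intercept;
        while (true) {
            cand0 = fit.intercept - step * g0;  // intercept: plain gradient step
            const double d0 = cand0 - fit.intercept;
            double bound = f_old + g0 * d0 + d0 * d0 / (2.0 * step);
            for (std::size_t j = 0; j < p; ++j) {
                cand[j] = soft_threshold(fit.beta[j] - step * grad[j], step * lambda);
                const double d = cand[j] - fit.beta[j];
                bound += grad[j] * d + d * d / (2.0 * step);
            }
            const double f_new = poisson_nll(X, y, n, p, cand0, cand);
            if (f_new <= bound + slack || step < 1e-16) break;
            step *= 0.5;
        }

        // Stop once no coefficient moves by more than tol.
        double max_change = std::fabs(cand0 - fit.intercept);
        for (std::size_t j = 0; j < p; ++j)
            max_change = std::max(max_change, std::fabs(cand[j] - fit.beta[j]));
        fit.intercept = cand0;
        fit.beta = cand;
        if (max_change < tol) break;
    }
    return fit;
}
```

Proximal gradient is used here only because it is short and easy to verify. It converges slowly on ill-conditioned designs, so a submission aiming for high precision would probably want a second-order or coordinate-wise method instead.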