Graphical Models for Mixed Multi Modal Data
Graphical models provide a powerful and flexible framework for understanding complex multivariate data. These models, sometimes also referred to as network models, capture dependencies in multivariate data, allowing statisticians to discover underlying connections among measured variables. These models have been widely used in applied statistics and machine learning, with particular success in genetics, neuroscience, and finance. Perhaps the clearest sign of their importance to modern data science is the CRAN Task View devoted to “gRaphical Models in R” [1].
Much of the early work on graphical models assumed that the data followed a multivariate Gaussian distribution. In this case, the graphical model is fully specified by the inverse covariance (precision) matrix of the normal distribution. While mathematically tractable and easy to fit, this assumption is often violated by real-world data sets: in particular, it requires each variable to have a univariate Gaussian distribution, which is implausible for binary, count, or proportion data. To address this limitation, Yang et al. developed mixed or multi-modal graphical models, which allow each variable to follow its own univariate exponential family distribution [2,3]. These models allow for the joint analysis of data of different types (e.g. continuous data, binary data, count data, proportions) in a single graphical model.
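To make the Gaussian case concrete, the standard result can be written as follows. The notation here (mean $\mu$, covariance $\Sigma$, precision matrix $\Theta$) is ours and is not taken from the text above.

$$
X \sim \mathcal{N}(\mu, \Sigma), \qquad \Theta = \Sigma^{-1}, \qquad
p(x) \propto \exp\left\{ -\tfrac{1}{2} (x - \mu)^\top \Theta \, (x - \mu) \right\},
$$

$$
X_i \perp X_j \mid X_{\setminus \{i,j\}} \iff \Theta_{ij} = 0 .
$$

The edges of a Gaussian graphical model are therefore exactly the nonzero off-diagonal entries of $\Theta$, which is why estimating the graph reduces to estimating a sparse precision matrix.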
In this project, we propose a new package to make mixed graphical models readily available to a wide audience. The proposed package will allow for fitting, simulating from, and visualizing mixed graphical models. We anticipate that having an easy-to-use `R` package will increase adoption of these powerful new models.
[1] https://CRAN.R-project.org/view=gR
[2] Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, and Zhandong Liu. “Graphical Models via Univariate Exponential Family Distributions.” Journal of Machine Learning Research 16 (2015), 3813-3847. https://www.jmlr.org/papers/volume16/yang15a/yang15a.pdf
[3] See also arXiv:1411.0288 and references therein.
A number of existing `R` packages [1] provide estimation for Gaussian graphical models. In particular, the `huge` package [2] provides similar functionality to our proposed package, but only for Gaussian graphical models and minor extensions thereof. The `XMRF` package [3], developed by two of the mentors, fits graphical models with an arbitrary distribution for the nodes, but does not allow for mixed data types across different nodes. To our knowledge, no existing software, for `R` or otherwise, currently handles mixed graphical models.
The visualization component of the proposed package will build upon the `RCytoscape` Bioconductor package [4] and the underlying `Cytoscape` visualization library. While the `glmnet` package [5] is widely used for L1-penalized regression, the solutions it provides are not accurate to high precision and are not sufficiently robust for use in mixed graphical models; furthermore, it is written in an obscure Fortran dialect. Michael Kane’s `pirls` library [6] is written in standard `C++` but inherits many of the weaknesses of the algorithms used by `glmnet`.
[1] E.g. glasso, QUIC, and GGMselect
[2] https://CRAN.R-project.org/package=huge
[3] https://CRAN.R-project.org/package=XMRF; http://dx.doi.org/10.1186/s12918-016-0313-0
[4] http://bioconductor.org/packages/RCytoscape
[5] https://CRAN.R-project.org/package=glmnet
[6] http://github.com/kaneplusplus/pirls
- Fitting: Specialized algorithms are required to fit mixed graphical models efficiently, accurately, and robustly. These methods require solving a very large number of L1-penalized GLMs, so we will implement a high-performance solver in `C++`. Further computational gains can be achieved by parallelizing the fitting process: the edges of the graph are estimated node-wise, giving an embarrassingly parallel algorithm. To take advantage of this, the package will use the flexible `foreach` framework, which will allow the end-user to seamlessly select from a wide range of parallel computing strategies. Estimated time: 5 weeks.
- Visualization: Graphical models naturally lend themselves to elegant visualizations. The package will provide visualizations using the `Cytoscape` graph visualization library, building upon the Bioconductor `RCytoscape` package. If time allows, interactive visualizations based on the `Cytoscape.js` library will also be implemented. Estimated time: 3 weeks.
- Sampling: Two key steps in any statistical analysis are i) model checking; and ii) providing an accurate measure of the variability of the estimated model. The ability to simulate data from a model is essential for both of these steps. Straightforward Gibbs sampling techniques can be used to generate data from mixed graphical models, but these iterative algorithms are often slow when implemented in pure `R`. The package will contain a high-performance `C++` sampler to generate synthetic data from arbitrary mixed graphical models. Estimated time: 4 weeks.
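The Gibbs sampling idea mentioned under Sampling can be sketched in `C++` as follows. This is a minimal illustration under stated assumptions, not the design of the proposed package: it handles only unit-variance Gaussian nodes and Bernoulli nodes in a pairwise model, it assumes each node's conditional natural parameter is `theta[s]` plus the edge-weighted sum of the other nodes, and it assumes the caller supplies parameters for which the joint distribution is well defined. The names `MixedModel`, `gibbs_sweep`, and `gibbs_sample` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Node families covered by this sketch (an assumption made for brevity).
enum class NodeType { Gaussian, Bernoulli };

// Hypothetical container for a pairwise mixed graphical model.
struct MixedModel {
    std::vector<NodeType> type;             // family of each node
    std::vector<double> theta;              // node-wise parameters
    std::vector<std::vector<double>> edge;  // symmetric edge weights
};

// Conditional natural parameter of node s given all other nodes.
double natural_parameter(const MixedModel& m, const std::vector<double>& x,
                         std::size_t s) {
    double eta = m.theta[s];
    for (std::size_t t = 0; t < x.size(); ++t) {
        if (t != s) eta += m.edge[s][t] * x[t];
    }
    return eta;
}

// One Gibbs sweep: redraw every node from its conditional distribution.
void gibbs_sweep(const MixedModel& m, std::vector<double>& x, std::mt19937& rng) {
    for (std::size_t s = 0; s < x.size(); ++s) {
        const double eta = natural_parameter(m, x, s);
        if (m.type[s] == NodeType::Gaussian) {
            // Unit-variance Gaussian node: conditional mean equals eta.
            std::normal_distribution<double> draw(eta, 1.0);
            x[s] = draw(rng);
        } else {
            // Bernoulli node: conditional log-odds equal eta.
            std::bernoulli_distribution draw(1.0 / (1.0 + std::exp(-eta)));
            x[s] = draw(rng) ? 1.0 : 0.0;
        }
    }
}

// Draw n samples after burn_in sweeps, keeping one state every thin sweeps.
std::vector<std::vector<double>> gibbs_sample(const MixedModel& m, std::size_t n,
                                              std::size_t burn_in, std::size_t thin,
                                              unsigned seed) {
    assert(m.theta.size() == m.type.size() && m.edge.size() == m.type.size());
    assert(thin >= 1);
    std::mt19937 rng(seed);
    std::vector<double> x(m.type.size(), 0.0);
    for (std::size_t i = 0; i < burn_in; ++i) gibbs_sweep(m, x, rng);
    std::vector<std::vector<double>> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < thin; ++k) gibbs_sweep(m, x, rng);
        out.push_back(x);
    }
    return out;
}
```

Each sweep touches every node once, so the cost per sample grows with the number of nodes times the thinning interval; this inner loop is the part that tends to be slow in interpreted code.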
The proposed package will make mixed graphical models widely available for the first time. By providing robust and efficient tools for fitting, visualization, and simulation, the package will allow the use of mixed graphical models throughout the data analysis process.
- Dr. Genevera Allen [Theory and Algorithms]
Departments of Statistics and ECE, Rice University; Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine and Texas Children’s Hospital; http://www.stat.rice.edu/~gallen
- Dr. Zhandong Liu [Implementation]
Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine and Texas Children’s Hospital; http://www.liuzlab.org/
- Michael Weylandt [Implementation]
Department of Statistics, Rice University
Potential applicants must:
- Implement L1-penalized Poisson (log-linear) regression in portable, standard `C++`;
- Wrap their implementation using `Rcpp`;
- Test their implementation using `testthat`;
- Package their implementation and pass `R CMD check` on at least two of the three major platforms: Windows, macOS, and Linux (Debian/Ubuntu).
Numerical results will be compared against `glmnet::glmnet(..., family='poisson')`. Mentors will check that the package passes `R CMD check` without any WARNING(s) or ERROR(s).
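For orientation, one way the core of such a solver might look in standard `C++` is sketched below. It is not a reference solution and not the algorithm the mentors have in mind: it uses plain proximal gradient descent (ISTA) with backtracking, chosen here only because it is short. It assumes the objective is the mean Poisson negative log-likelihood plus `lambda` times the L1 norm of the coefficients, with an unpenalized intercept and no standardization of the columns; whether that matches the scaling used by `glmnet` should be verified before comparing numbers. The names `PoissonFit` and `fit_l1_poisson` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major design matrix: X[i] holds the covariates of observation i.
using Matrix = std::vector<std::vector<double>>;

struct PoissonFit {
    double intercept;          // unpenalized
    std::vector<double> beta;  // L1-penalized coefficients
};

// Soft-thresholding: the proximal operator of t * |z|.
double soft_threshold(double z, double t) {
    if (z > t) return z - t;
    if (z < -t) return z + t;
    return 0.0;
}

double linear_predictor(const std::vector<double>& row, const PoissonFit& f) {
    double eta = f.intercept;
    for (std::size_t j = 0; j < row.size(); ++j) eta += row[j] * f.beta[j];
    return eta;
}

// Smooth part of the objective: (1/n) * sum_i [exp(eta_i) - y_i * eta_i].
double poisson_loss(const Matrix& X, const std::vector<double>& y,
                    const PoissonFit& f) {
    double loss = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        const double eta = linear_predictor(X[i], f);
        loss += std::exp(eta) - y[i] * eta;
    }
    return loss / static_cast<double>(X.size());
}

PoissonFit fit_l1_poisson(const Matrix& X, const std::vector<double>& y,
                          double lambda, int max_iter = 10000, double tol = 1e-9) {
    assert(!X.empty() && X.size() == y.size() && lambda >= 0.0);
    const std::size_t n = X.size(), p = X[0].size();
    const double inv_n = 1.0 / static_cast<double>(n);
    PoissonFit fit{0.0, std::vector<double>(p, 0.0)};
    double step = 1.0;
    for (int it = 0; it < max_iter; ++it) {
        // Gradient of the smooth part at the current iterate.
        double g0 = 0.0;
        std::vector<double> g(p, 0.0);
        for (std::size_t i = 0; i < n; ++i) {
            const double r = std::exp(linear_predictor(X[i], fit)) - y[i];
            g0 += r * inv_n;
            for (std::size_t j = 0; j < p; ++j) g[j] += r * X[i][j] * inv_n;
        }
        const double f_old = poisson_loss(X, y, fit);
        PoissonFit cand = fit;
        double change = 0.0;
        // Backtracking: halve the step until the quadratic upper bound holds.
        for (int ls = 0; ls < 60; ++ls) {
            cand.intercept = fit.intercept - step * g0;
            double d = cand.intercept - fit.intercept;
            double lin = g0 * d, sq = d * d;
            change = std::fabs(d);
            for (std::size_t j = 0; j < p; ++j) {
                cand.beta[j] = soft_threshold(fit.beta[j] - step * g[j], step * lambda);
                d = cand.beta[j] - fit.beta[j];
                lin += g[j] * d;
                sq += d * d;
                change = std::max(change, std::fabs(d));
            }
            const double bound = f_old + lin + sq / (2.0 * step);
            // Small slack guards against round-off near convergence.
            if (poisson_loss(X, y, cand) <= bound + 1e-12 * (1.0 + std::fabs(f_old))) break;
            step *= 0.5;
        }
        fit = cand;
        if (change < tol) break;  // largest coordinate update is negligible
    }
    return fit;
}
```

Two sanity checks follow from the optimality conditions under these assumptions: with a very large `lambda` every coefficient is exactly zero and the intercept equals the log of the mean response, and with a single binary covariate the penalized fit has a closed form that a `testthat` suite could assert against.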
Test Solution https://github.com/GaryBAYLOR/testRepo.git
Test Solution https://github.com/Xia-Zhang/Poisson-Regression
Test Solution https://github.com/aditya2410/POISSON_Regresssion