Efficient Second-Order Optimization Solvers for Sparse Learning in R
Background
Computational algorithms for sparse learning problems are mostly first-order methods, especially in high dimensions. The computational cost of evaluating Hessian matrices in high dimensions prevents these algorithms from exploiting second-order information to accelerate convergence.
In this project, we propose to develop efficient second-order proximal Newton solvers for sparse learning problems in high dimensions, e.g., sparse generalized linear model and undirected graphical model estimation.
Current packages for regularized generalized linear models and undirected graphical models do not provide any module for post-regularization inference, such as p-values for the estimated model parameters. We plan to build the most efficient available inference procedure into our solver.
Our solver will also be applicable to sparse semiparametric generalized linear model estimation, where the link function is unknown. Semiparametric generalized linear model estimation was proposed in [1] but no efficient computational solver has been developed.
The new solvers and modules will either be integrated into two existing popular packages, HUGE (originally developed by Tuo Zhao, Primary Mentor) and PICASSO (originally developed by Xingguo Li, Secondary Mentor), or delivered as a new R package.
Related Work
Solvers for sparse generalized linear model estimation using first-order algorithms usually have sublinear convergence rates, and their computational cost per iteration scales linearly with the dimension. The R packages gcdnet [2] and SPArse Modeling Software (SPAMS) [3] fall into this category; they are implemented with generalized coordinate descent and the fast iterative shrinkage-thresholding algorithm (FISTA), respectively. We found that these packages do not scale efficiently to high-dimensional problems.
Second-order algorithms perform better empirically. For example, the R packages glmnet [4] and dglars [5] are faster than first-order solvers for sparse generalized linear models, and the state-of-the-art package for sparse undirected graphical model estimation, QUIC [6], is also a second-order algorithm. However, these solvers do not have an efficient active set strategy to exploit the sparse structure of the problem. We aim to find a better active set strategy to scale second-order Newton-type algorithms to high-dimensional sparse learning problems.
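To make the active set idea concrete, here is a minimal R sketch of a single proximal Newton step for L1-regularized logistic regression. It is our own illustration rather than code from any package above: the logistic loss is replaced by its local quadratic (IRLS) model, and the resulting weighted lasso subproblem is solved by coordinate descent restricted to an active set of coordinates that are currently nonzero or violate the KKT condition.

```r
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

prox_newton_step <- function(X, y, beta, lambda, n_cd = 10) {
  n   <- nrow(X)
  eta <- as.vector(X %*% beta)
  p   <- plogis(eta)                       # fitted probabilities
  w   <- pmax(p * (1 - p), 1e-5)           # Hessian diagonal, kept away from 0
  z   <- eta + (y - p) / w                 # working response of the quadratic model
  grad <- as.vector(crossprod(X, p - y)) / n
  # Active set: nonzero coordinates plus KKT violators.
  active <- which(beta != 0 | abs(grad) > lambda)
  r <- z - eta                             # residual, updated in place below
  for (sweep in seq_len(n_cd)) {           # coordinate descent on the subproblem
    for (j in active) {
      rj  <- r + X[, j] * beta[j]          # partial residual without coordinate j
      num <- sum(w * X[, j] * rj) / n
      den <- sum(w * X[, j]^2) / n
      bj  <- soft_threshold(num, lambda) / den
      r   <- r - X[, j] * (bj - beta[j])
      beta[j] <- bj
    }
  }
  beta
}

# Toy usage: a few outer Newton steps recover a sparse coefficient vector.
set.seed(1)
X <- matrix(rnorm(200 * 50), 200, 50)
y <- rbinom(200, 1, plogis(X[, 1] - X[, 2]))
beta <- rep(0, 50)
for (k in 1:5) beta <- prox_newton_step(X, y, beta, lambda = 0.05)
which(beta != 0)
```

A production solver would add a line search, convergence checks, and pathwise warm starts over a decreasing sequence of lambda values; the sketch only shows where the active set enters.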
Project Details
The student developer will implement an efficient proximal Newton solver with novel active set updating strategies, written in R with Rcpp, for a class of sparse learning problems including parametric and semiparametric sparse generalized linear model estimation and sparse undirected graphical model estimation. We aim to improve on existing solvers in terms of CPU time and estimation accuracy on both real and synthetic datasets. A post-regularization statistical inference module will also be implemented. Detailed documentation describing the algorithm and its theoretical properties will be provided in the vignettes.
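The final package interface is not yet fixed; as a hypothetical sketch of the Rcpp layer, the inner coordinate-descent loop from the step above could be moved to C++ roughly as follows (cd_update and all argument names are placeholders, not a committed API):

```r
library(Rcpp)

cppFunction('
NumericVector cd_update(NumericMatrix X, NumericVector r,
                        NumericVector w, NumericVector beta,
                        IntegerVector active, double lambda) {
  int n = X.nrow();
  for (int a = 0; a < active.size(); ++a) {
    int j = active[a] - 1;                     // convert 1-based R index
    double num = 0.0, den = 0.0;
    for (int i = 0; i < n; ++i) {
      double rj = r[i] + X(i, j) * beta[j];    // partial residual
      num += w[i] * X(i, j) * rj;
      den += w[i] * X(i, j) * X(i, j);
    }
    num /= n; den /= n;
    double bj = 0.0;                           // soft-thresholding
    if (num >  lambda) bj = (num - lambda) / den;
    if (num < -lambda) bj = (num + lambda) / den;
    for (int i = 0; i < n; ++i)                // update residual in place
      r[i] -= X(i, j) * (bj - beta[j]);
    beta[j] = bj;
  }
  return beta;
}
')
```

Because Rcpp vectors are thin proxies over R memory, r and beta are updated in place, so the residual vector never needs to be copied inside the loop.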
Expected Impact
The delivered solvers and function modules will provide a tool that can solve a wide class of sparse learning problems under the same algorithmic framework and achieve state-of-the-art performance on each individual problem in terms of CPU time and estimation error. It will also be the first R package that provides computational tools for regularized semiparametric generalized linear model estimation.
Test
The student developer is expected to complete the following tests; a sketch of the Easy and Medium tests follows the list.
Easy: Benchmark L1-regularized logistic regression on two existing solvers, using any simulated dataset, with respect to CPU time.
Medium: Add pairwise correlation between the columns of the dataset and check how the existing solvers perform with highly correlated feature columns.
Hard: Download and preprocess a real dataset for binary classification. Remove missing values and constant columns. Report benchmarking results for L1-regularized logistic regression on two existing solvers in terms of CPU time.
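Here is one way the Easy and Medium tests might look, taking glmnet [4] and gcdnet [2] as the two existing solvers and the microbenchmark package for timing; the problem sizes, correlation level, and solver settings are illustrative assumptions, not prescribed values.

```r
library(glmnet)          # solver 1 [4]
library(gcdnet)          # solver 2 [2]
library(microbenchmark)  # timing utility (assumed available)

# Simulate a logistic regression design with Toeplitz correlation rho^|i-j|
# between columns; rho = 0 gives independent columns.
sim_data <- function(n, p, rho = 0) {
  Sigma <- rho^abs(outer(1:p, 1:p, "-"))
  X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
  beta <- c(rep(1, 5), rep(0, p - 5))          # 5 true signals
  y <- rbinom(n, 1, plogis(as.vector(X %*% beta)))
  list(X = X, y = y)
}

bench <- function(rho, n = 500, p = 2000) {
  d <- sim_data(n, p, rho)
  microbenchmark(
    glmnet = glmnet(d$X, d$y, family = "binomial"),
    gcdnet = gcdnet(d$X, 2 * d$y - 1, method = "logit"),  # gcdnet takes +/-1 labels
    times = 5
  )
}

bench(rho = 0)    # Easy test: independent columns
bench(rho = 0.8)  # Medium test: highly correlated columns
```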
Student and Mentors
Prospective Student Developer: Jason Ge, Ph.D. Student, Princeton University
Primary Mentor: Tuo Zhao, Assistant Professor, Georgia Tech
Secondary Mentor: Xingguo Li, Ph.D. Student, University of Minnesota
References
[1] Ning, Yang, Tianqi Zhao, and Han Liu. "A likelihood ratio framework for high dimensional semiparametric regression." arXiv preprint arXiv:1412.2295 (2014).
[2] Yang, Yi, et al. "Package 'gcdnet'." (2012).
[3] Mairal, Julien, et al. "SPAMS: Sparse Modeling Software." WILLOW, INRIA 2 (2011).
[4] Friedman, Jerome, Trevor Hastie, and Rob Tibshirani. "glmnet: Lasso and elastic-net regularized generalized linear models." R package version 2.0-5 (2016).
[5] Augugliaro, Luigi, Angelo M. Mineo, and Ernst C. Wit. "dglars: An R Package to Estimate Sparse Generalized Linear Models." Journal of Statistical Software 59.1 (2014): 1-40.
[6] Hsieh, Cho-Jui, et al. "BIG & QUIC: Sparse inverse covariance estimation for a million variables." Advances in Neural Information Processing Systems. 2013.