PRIMAL: PaRametric sImplex Method for spArse Learning
Background
A broad class of sparse learning approaches can be formulated as high dimensional linear programming (LP) problems. More specifically, many sparse learning methods can be written as the following LP:
min_x ||x||_1   s.t.   ||Ax - b||_inf <= lambda,     (1)
where lambda is the tuning parameter. Examples include the Dantzig selector for linear regression [2], sparse quantile regression [8], and sparse support vector machines [3]. These methods have been widely used in machine learning for high dimensional data analysis [8, 9, 10]. Despite their popularity, however, their software implementations in R are quite limited.
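To make formulation (1) concrete, here is a minimal sketch (not part of the proposed package) that solves (1) for a single lambda with the generic LP solver in the CRAN package lpSolve. Splitting x = u - v with u, v >= 0 turns ||x||_1 into the linear objective sum(u + v), and the infinity-norm constraint into 2n linear inequalities:

```r
library(lpSolve)  # generic simplex-based LP solver from CRAN

solve_l1_lp <- function(A, b, lambda) {
  n <- nrow(A); p <- ncol(A)
  obj <- rep(1, 2 * p)                        # minimize sum(u) + sum(v) = ||x||_1
  con <- rbind(cbind(A, -A), cbind(-A, A))    # A(u - v) <= b + lambda and -A(u - v) <= lambda - b
  rhs <- c(b + lambda, lambda - b)
  sol <- lp("min", obj, con, rep("<=", 2 * n), rhs)  # u, v >= 0 is implicit in lpSolve
  sol$solution[1:p] - sol$solution[(p + 1):(2 * p)]  # recover x = u - v
}
```

Note that a generic solver like this must be rerun from scratch for every value of lambda, which is exactly the inefficiency this project addresses.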
This GSOC project aims to develop a new R package -- PaRametric sImplex Method for spArse Learning (PRIMAL) -- with three key features: 1) a highly efficient optimization engine that solves (1) based on the parametric simplex method and can handle large scale sparse learning problems; 2) a user interface for a large collection of LP-based sparse learning methods; 3) beyond the estimation procedures, additional functional modules such as data-dependent model selection and model visualization.
Related Work
Several general methods for solving linear programs (LPs) already have corresponding R packages. In particular, the interior point method [5] provably solves LPs in polynomial time; however, its total computational cost grows cubically with the scale of the problem, which does not scale to high dimensions. Moreover, its per-iteration computation cannot exploit the sparsity of the underlying model to speed up the computation: because the interior point method uses a log barrier to handle the constraints, it cannot yield sparse iterates. The simplex method, on the other hand, has stood the test of many practical problems by finding optimal solutions efficiently, even though its worst-case iteration complexity has been shown in the literature to scale exponentially with the problem size.
These methods, though popular, are usually designed to solve (1) for a single value of the regularization parameter. This is not satisfactory, since an appropriate choice of the tuning parameter is usually unknown a priori, so one expects an algorithm to produce multiple solutions over a reasonable range of tuning parameter values. With a generic solver, each value of the tuning parameter requires solving a linear program from scratch, which is often very inefficient for high dimensional problems.
Recently, a variant of the simplex method -- the Parametric Simplex Method (PSM) [6] -- was proposed to solve LP-based sparse learning problems efficiently. Unlike existing general-purpose methods, PSM naturally obtains the complete solution path over all values of the regularization parameter in a single run, so it handles parameter tuning efficiently. Furthermore, PSM provides a high precision dual certificate stopping criterion and yields sparse solutions within very few iterations, and this solution sparsity significantly reduces the computational cost per iteration. A sketch of the underlying parametric formulation is given below.
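As a sketch of the idea (following standard textbook presentations of the parametric simplex method; the perturbation vectors below are generic notation, not taken from [6] verbatim), PSM solves a family of LPs whose data are perturbed linearly in the parameter:

```latex
% Parametric LP family solved by PSM. For lambda large enough a trivial
% basis is optimal; each pivot then carries the optimal basis to the next
% interval of lambda values, so a single pass of pivots traces out the
% entire solution path as lambda decreases.
\max_{x \ge 0} \; (c + \lambda \bar{c})^{\top} x
\quad \text{subject to} \quad A x \le b + \lambda \bar{b}
```

In the sparse learning setting, the regularization parameter in (1) plays the role of lambda, which is why one run of PSM recovers the whole regularization path.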
Project Details
We aim to develop an efficient R package for a large class of LP-based sparse learning methods, using the Parametric Simplex Method (PSM) as the core optimization algorithm. The package has two key parts: a PSM engine and a user interface for the LP-based statistical methods.

The PSM engine is the core optimization engine, written in C++, that solves the regularized LP problem efficiently. Unlike a traditional generic LP solver, the PSM engine is specially designed for statistical methods with a regularization parameter that requires tuning, so it handles the tuning process efficiently. To achieve efficient and reliable matrix and vector computation, we will use the Eigen library, an open-source C++ library for linear algebra.

The other key part of the package is the user interface for the LP-based statistical methods. More specifically, we will implement user interfaces for the Dantzig selector, the sparse support vector machine, and differential graph estimation, all of which are important methods in high-dimensional statistical learning. In addition, visualization functions for plotting the solution path will be implemented; a hypothetical usage sketch is given below.
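To illustrate the intended workflow, here is a hypothetical usage sketch. All function and argument names are illustrative placeholders for the planned package, not a committed API:

```r
# Hypothetical usage of the planned PRIMAL package; names are placeholders.
library(PRIMAL)

fit <- dantzig(X, y)       # full solution path over all lambda values via the PSM engine
plot(fit)                  # visualize the coefficient paths across lambda
coef(fit, lambda = 0.1)    # extract the estimate at a chosen lambda
```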
Expected Impact
The delivered R package will provide an efficient tool for solving a wide class of sparse learning problems. It will be the first package that handles parameter tuning using the parametric simplex method. Furthermore, the package contains user interfaces for the Dantzig selector, the sparse support vector machine, and differential graph estimation, all important methods in high-dimensional statistical learning. The Dantzig selector addresses sparse linear regression, one of the fundamental problems in machine learning, while the sparse support vector machine addresses sparse classification. Differential graph estimation aims to identify the difference between two undirected graphs [7] and has wide applications in biological and medical research [9, 10].
Test
Students, please complete one or more of the following tests before contacting the mentors listed below.
- Easy: Based on [2], implement the Dantzig selector in R using any existing LP solver, and test it on synthetic data.
- Medium: Add pairwise correlation between the columns of the dataset and check how the Dantzig selector performs with highly correlated feature columns (see the data-generation sketch after this list).
- Hard: Since the core code should be implemented in C/C++, write a simple R package implementing matrix multiplication, with the main code in C/C++.
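For the Medium test, one simple way to generate a design with controlled pairwise correlation is a Toeplitz covariance, as in this minimal sketch (the parameter choices are illustrative, not part of the test specification):

```r
# Generate a design matrix whose columns j and k have correlation rho^|j - k|,
# plus a sparse linear model, for stress-testing the Dantzig selector.
set.seed(1)
n <- 100; p <- 200; rho <- 0.9
Sigma <- rho^abs(outer(1:p, 1:p, "-"))     # Toeplitz correlation matrix
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)  # rows ~ N(0, Sigma)
beta <- c(rep(1, 5), rep(0, p - 5))        # 5 nonzero coefficients
y <- X %*% beta + rnorm(n, sd = 0.5)
```

Increasing rho toward 1 makes neighboring columns nearly collinear, which is the regime where support recovery typically degrades.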
Mentors
Primary Mentor: Xingguo Li, Postdoctoral Fellow, Princeton University
Secondary Mentor: Tuo Zhao, Assistant Professor, Georgia Tech
References
[1] Dantzig, G. Linear Programming and Extensions. Princeton University Press, 2016.
[2] Candes, E. and Tao, T. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313-2351, 2007.
[3] Wang, L. The L1 penalized LAD estimator for high dimensional linear regression. Journal of Multivariate Analysis, 120:135-151, 2013.
[4] Bradley, P. S. and Mangasarian, O. L. Feature selection via concave minimization and support vector machines. ICML, 98:82-90, 1998.
[5] Mehrotra, S. On the implementation of a primal-dual interior point method. SIAM Journal on Optimization, 2(4):575-601, 1992.
[6] Pang, H., Liu, H., Vanderbei, R., and Zhao, T. Parametric simplex method for sparse learning. Advances in Neural Information Processing Systems, 2017.
[7] Zhao, S. D., Cai, T. T., and Li, H. Direct estimation of differential networks. Biometrika, 2014.
[8] Belloni, A. and Chernozhukov, V. ℓ1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics, 39(1):82-130, 2011.
[9] Hudson, N. J., Reverter, A., and Dalrymple, B. P. A differential wiring analysis of expression data correctly identifies the gene containing the causal mutation. PLoS Computational Biology, 5(5):e1000382, 2009.
[10] Bandyopadhyay, S., Mehta, M., Kuo, D., et al. Rewiring of genetic networks in response to DNA damage. Science, 330(6009):1385-1389, 2010.