
Commit 77bee7a

committed
Lecture 11: Sparse regression
1 parent dc4b978 commit 77bee7a


2 files changed: +328 -0


docs/make.jl

Lines changed: 1 addition & 0 deletions
@@ -102,6 +102,7 @@ lecture_10 = joinpath.("./lecture_10/", [
 lecture_11 = joinpath.("./lecture_11/", [
     "monte.md",
     "glm.md",
+    "sparse.md",
 ])

 lecture_12 = joinpath.("./lecture_12/", [

docs/src/lecture_11/sparse.md

Lines changed: 327 additions & 0 deletions
@@ -0,0 +1,327 @@
# Linear regression with sparse constraints

The standard regression problem reads

```math
\operatorname{minimize}_w\qquad \sum_{i=1}^n(w^\top x_i - y_i)^2.
```

Often, a regularization term is added. There are two possibilities. [Ridge regression](https://en.wikipedia.org/wiki/Ridge_regression) adds the weighted squared ``l_2``-norm penalization term to the objective:

```math
\operatorname{minimize}_w\qquad \sum_{i=1}^n(w^\top x_i - y_i)^2 + \frac{\mu}{2}\|w\|_2^2.
```

[LASSO](https://en.wikipedia.org/wiki/Lasso_(statistics)) adds the weighted ``l_1``-norm penalization term to the objective:

```math
\operatorname{minimize}_w\qquad \sum_{i=1}^n(w^\top x_i - y_i)^2 + \mu \|w\|_1.
```

Both approaches try to keep the norm of the parameters ``w`` small to prevent overfitting. The first approach results in a simpler numerical method, while the second one induces sparsity. Before we turn to these two topics, we briefly mention the matrix eigendecomposition, which plays a crucial part in the numerical computations.

## Theory of matrix eigendecomposition

Consider a square matrix ``A\in \mathbb R^{n\times n}`` with real-valued entries. If there exist ``\lambda\in\mathbb R`` and ``v\in\mathbb R^n`` such that

```math
Av = \lambda v,
```

we say that ``\lambda`` is an eigenvalue of ``A`` and ``v`` is the corresponding eigenvector.

For the rest of this section, we assume that ``A`` is a symmetric matrix. Then its eigenvectors are perpendicular to each other. We create the diagonal matrix ``\Lambda`` with the eigenvalues on its diagonal and the matrix ``Q`` whose columns are the corresponding (normalized) eigenvectors. Then we have

```math
A = Q\Lambda Q^\top
```

and for any real number ``\mu``, we also have

```math
A + \mu I = Q(\Lambda + \mu I) Q^\top
```

Since the eigenvectors are orthonormal, ``Q`` is an orthogonal matrix and therefore ``Q^{-1} = Q^\top``. This implies that we can easily invert the matrix ``A + \mu I`` by

```math
(A + \mu I)^{-1} = Q (\Lambda + \mu I)^{-1} Q^\top.
```

Because ``\Lambda + \mu I`` is a diagonal matrix, its inverse is simple to compute.
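
As a quick sanity check, we can verify both identities numerically on a small random symmetric matrix; the matrix size and the value of ``\mu`` below are arbitrary choices for illustration.

```julia
using LinearAlgebra

A = Symmetric(randn(5, 5))   # a small random symmetric matrix
μ = 0.1

E = eigen(A)                 # eigendecomposition A = QΛQᵀ
Q, λ = E.vectors, E.values

# A + μI reconstructed from the eigendecomposition; the norm should be ≈ 0
norm(A + μ*I - Q * Diagonal(λ .+ μ) * Q')

# (A + μI)⁻¹ computed by inverting only the diagonal part; again ≈ 0
norm(inv(A + μ*I) - Q * Diagonal(1 ./ (λ .+ μ)) * Q')
```
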
## Theory of ridge regression

The optimality condition for the ridge regression reads

```math
X^\top (Xw - y) + \mu w = 0.
```

Therefore, the optimal solution satisfies

```math
w = (X^\top X + \mu I)^{-1}X^\top y.
```

For ``\mu=0``, we obtain the classical result for linear regression. Since ``X^\top X`` is symmetric, we can compute its eigendecomposition

```math
X^\top X = Q\Lambda Q^\top.
```
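
Combining this with the inverse formula from the previous section, we obtain

```math
(X^\top X + \mu I)^{-1} = \bigl(Q(\Lambda + \mu I)Q^\top\bigr)^{-1} = Q(\Lambda + \mu I)^{-1}Q^\top.
```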

The formula for the optimal weights then simplifies into

```math
w = Q(\Lambda+\mu I)^{-1} Q^\top X^\top y.
```

Since this formula uses only matrix-vector multiplications and the inversion of a diagonal matrix, we can employ it to quickly compute the solution for multiple values of ``\mu``.

## Theory of LASSO

Unlike ridge regression, LASSO does not have a closed-form solution. However, since it is a structured convex problem, it can be solved by the [ADMM algorithm](https://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf). This is a primal-dual algorithm, which employs the original primal variable ``w``, the auxiliary primal variable ``z`` and the dual variable ``u`` with the iterative updates:

```math
\begin{aligned}
w^{k+1} &= (X^\top X + \rho I)^{-1}(X^\top y + \rho z^k - \rho u^k), \\
z^{k+1} &= S_{\mu / \rho}(w^{k+1} + u^k), \\
u^{k+1} &= u^k + w^{k+1} - z^{k+1}.
\end{aligned}
```

Here, ``\rho > 0`` is an arbitrary number and

```math
S_\eta(z) = \max\{z - \eta, 0\} - \max\{-z -\eta, 0\}
```

is the so-called soft thresholding operator. Since these updates must be performed many times, it is a good idea to reuse the same eigendecomposition of ``X^\top X`` as in the case of ridge regression when inverting ``X^\top X + \rho I``.
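
For example, for ``\eta = 1`` we have

```math
S_1(3) = 2, \qquad S_1(0.5) = 0, \qquad S_1(-3) = -2,
```

so every component whose magnitude does not exceed the threshold ``\eta`` is set exactly to zero. This is the mechanism behind the sparsity induced by the ``l_1``-norm penalization.
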
## Ridge regression

We will randomly generate ``10000`` samples in ``\mathbb R^{1000}``.

```@example sparse
using LinearAlgebra
using Random
using Plots

n = 10000
m = 1000

Random.seed!(666)
X = randn(n, m)

nothing # hide
```

The true dependence involves only the first two features and reads

```math
y = 10x_1 + x_2.
```

This is a natural problem for sparse models because most of the weights should be zero. We generate the labels and add noise to them.

```@example sparse
y = 10*X[:,1] + X[:,2] + randn(n)

nothing # hide
```

The first exercise compares both approaches to solving the ridge regression.

```@raw html
<div class = "exercise-body">
<header class = "exercise-header">Exercise:</header><p>
```

Implement both methods of the `ridge_reg` function. Verify that they return the same result.

**Hint**: The eigendecomposition can be computed by `eigen(A)`; the eigenvalues and eigenvectors are then accessed by `eigen(A).values` and `eigen(A).vectors`.

**Hint**: The identity matrix is implemented by `I` in the `LinearAlgebra` package.

```@raw html
</p></div>
<details class = "solution-body">
<summary class = "solution-header">Solution:</summary><p>
```

The simple implementation of the solution is the same as in the case of linear regression. We only need to add `μ*I`.

```@example sparse
ridge_reg(X, y, μ) = (X'*X + μ*I) \ (X'*y)

nothing # hide
```

We first compute the eigendecomposition and save it into `eigen_dec`. Then we extract the eigenvectors and eigenvalues. We also transpose the matrix ``Q`` and save it into `Q_inv` so that we do not have to compute it repeatedly.

```@example sparse
eigen_dec = eigen(X'*X)
Q = eigen_dec.vectors
Q_inv = Matrix(Q')
λ = eigen_dec.values

nothing # hide
```

The more sophisticated way of solving the ridge regression uses only matrix-vector multiplications and the inversion of the diagonal matrix ``\Lambda + \mu I``. We need to place the parentheses properly, so that the multiplication starts from the right; this avoids the matrix-matrix multiplications which would occur if we started from the left. Since the matrix ``\Lambda + \mu I`` is diagonal, its inverse is the diagonal matrix formed from the inverted diagonal entries.

```@example sparse
ridge_reg(X, y, μ, Q, Q_inv, λ) = Q * (Diagonal(1 ./ (λ .+ μ)) * (Q_inv * (X'*y)))

nothing # hide
```

When we compare both solutions, we see that they are the same.

```@example sparse
w1 = ridge_reg(X, y, 10)
w2 = ridge_reg(X, y, 10, Q, Q_inv, λ)

norm(w1 - w2)
```

```@raw html
</p></details>
```

To test the speed, we use the `BenchmarkTools` package. The second option is significantly faster. The price to pay is the need to pre-compute the matrix decomposition.

```julia
julia> using BenchmarkTools

julia> @time ridge_reg(X, y, 10);
  0.204391 seconds (9 allocations: 22.912 MiB)

julia> @btime ridge_reg(X, y, 10, Q, Q_inv, λ);
  6.251 ms (5 allocations: 39.69 KiB)
```

Now we create multiple values of ``\mu`` and compute the ridge regression for all of them. Since broadcasting would otherwise iterate over all arguments, we fix all of them except `μs` by wrapping them in `Ref`.

```@example sparse
μs = range(0, 1000; length=50)

ws = hcat(ridge_reg.(Ref(X), Ref(y), μs, Ref(Q), Ref(Q_inv), Ref(λ))...)

plot(μs, abs.(ws');
    label="",
    yscale=:log10,
    xlabel="mu",
    ylabel="weights: log scale",
)

savefig("Sparse1.svg") # hide
```

![](Sparse1.svg)

The regularization seems to have only a small effect on the solution.

## LASSO

For LASSO, we define the soft thresholding operator.

```@example sparse
S(x, η) = max(x-η, 0) - max(-x-η, 0)

nothing # hide
```

Then we define the iterative updates from ADMM. It is important that the function accepts initial values for ``w^0``, ``z^0`` and ``u^0``. If they are not provided, they are initialized by zeros of the correct dimension. We should implement a proper termination condition but, for simplicity, we run ADMM for a fixed number of iterations.

```@example sparse
function lasso(X, y, μ, Q, Q_inv, λ;
        max_iter = 100,
        ρ = 1e3,
        w = zeros(size(X,2)),
        u = zeros(size(X,2)),
        z = zeros(size(X,2)),
    )

    for i in 1:max_iter
        w = Q * (Diagonal(1 ./ (λ .+ ρ)) * (Q_inv * (X'*y + ρ*(z-u))))  # primal update
        z = S.(w + u, μ / ρ)                                            # soft thresholding
        u = u + w - z                                                   # dual update
    end
    return w, u, z
end

nothing # hide
```

Finally, we compute the solution for all regularization parameters ``\mu``. The second line in the loop body says: if ``i=1``, run LASSO without initial values, and if ``i>1``, run it with the initial values from the previous iteration. Since variables introduced inside the loop body are freshly allocated in each iteration, we need to declare `w`, `u` and `z` as global variables so that their values carry over to the next iteration.

```@example sparse
ws = zeros(size(X,2), length(μs))

for (i, μ) in enumerate(μs)
    global w, u, z
    w, u, z = i > 1 ? lasso(X, y, μ, Q, Q_inv, λ; w, u, z) : lasso(X, y, μ, Q, Q_inv, λ)
    ws[:,i] = w
end
```

When we plot the parameter values, we see that they are significantly smaller than for the ridge regression. This is precisely what we meant when we mentioned that ``l_1``-norm regularization induces sparsity.

```@example sparse
plot(μs, abs.(ws');
    label="",
    yscale=:log10,
    xlabel="mu",
    ylabel="weights: log scale",
)
```

```@raw html
<div class = "info-body">
<header class = "info-header">Warm start</header><p>
```

The technique of starting from a previously computed value is called a warm start (or hot start). It is commonly used when some parameter changes only slightly; then the solution also changes only slightly, and the previous solution is close to the new one. Therefore, we initialize the algorithm from the old solution.

```@raw html
</p></div>
```
