Commit aca5c93

Small changes in admonitions

1 parent 8105147 commit aca5c93

File tree: 11 files changed, +391 -424 lines changed


docs/src/lecture_08/exercises.md

Lines changed: 51 additions & 54 deletions
@@ -38,43 +38,29 @@ w = log_reg(X, y, zeros(size(X,2)))
 σ(z) = 1/(1+exp(-z))
 ```
 
+# [Exercises](@id l8-exercises)
 
+!!! homework "Homework: Data normalization"
+    Data are often normalized. Each feature subtracts its mean and then divides the result by its standard deviation. The normalized features have zero mean and unit standard deviation. This may help in several cases:
+    - When each feature has a different order of magnitude (such as millimetres and kilometres); the gradient would then ignore the feature with the smaller values.
+    - When problems such as vanishing gradients are present (we will elaborate on this in Exercise 4).
 
+    Write a function ```normalize``` which takes a dataset as input and normalizes it. Then train the same classifier as we did for [logistic regression](@ref log-reg), using both the original and the normalized dataset. Which differences did you observe when
+    - the logistic regression is optimized via gradient descent?
+    - the logistic regression is optimized via Newton's method?
+    Do you have any intuition as to why?
 
-
-# [Exercises](@id l8-exercises)
+    Write a short report (in LaTeX) summarizing your findings.
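As an illustrative sketch for the homework above (not from the commit itself), a `normalize` function in Julia could look as follows; only the name `normalize` comes from the assignment, the rest assumes samples in rows and features in columns, as for the `X` used in the lecture:

```julia
using Statistics

# Normalize each column (feature) to zero mean and unit standard deviation.
# Constant columns (for example an intercept column of ones) are only centred,
# to avoid division by zero; in practice you may want to skip such columns.
function normalize(X::AbstractMatrix)
    μ = mean(X; dims=1)
    s = std(X; dims=1)
    s = map(x -> x > 0 ? x : one(x), s)
    return (X .- μ) ./ s
end
```

One could then compare, for instance, `log_reg(X, y, zeros(size(X,2)))` with `log_reg(normalize(X), y, zeros(size(X,2)))` under both optimizers.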

 ```@raw html
-<div class="admonition is-category-homework">
-<header class="admonition-header">Homework: Data normalization</header>
+<div class="admonition is-category-exercise">
+<header class="admonition-header">Exercise 1:</header>
 <div class="admonition-body">
 ```
-Data are often normalized. Each feature subtracts its mean and then divides the result by its standard deviation. The normalized features have zero mean and unit standard deviation. This may help in several cases:
-- When each feature has a different order of magnitude (such as millimetres and kilometres). Then the gradient would ignore the feature with the smaller values.
-- When problems such as vanishing gradients are present (we will elaborate on this in Exercise 4).
-
-Write function ```normalize``` which takes as an input a dataset and normalizes it. Then train the same classifier as we did for [logistic regression](@ref log-reg). Use the original and normalized dataset. Which differences did you observe when
-- the logistic regression is optimized via gradient descent?
-- the logistic regression is optimized via Newton's method?
-Do you have any intuition as to why?
-
-Write a short report (in LaTeX) summarizing your findings.
-```@raw html
-</div></div>
-```
-
-
-
-
-
 
+The logistic regression on the iris dataset failed in 6 out of 100 samples. But the visualization shows the failure only in 5 cases. How is it possible?
 
 ```@raw html
-<div class="admonition is-category-exercise">
-<header class="admonition-header">Exercise 1</header>
-<div class="admonition-body">
-```
-The logistic regression on the iris dataset failed in 6 out of 100 samples. But the visualization shows the failure only in 5 cases. How is it possible?```@raw html
 </div></div>
 <details class = "solution-body">
 <summary class = "solution-header">Solution:</summary><p>
@@ -107,27 +93,28 @@ As we can see, there are three samples with the same data. Two of them have labe
 </p></details>
 ```
 
-
-
-
-
-
-
 ```@raw html
 <div class="admonition is-category-exercise">
 <header class="admonition-header">Exercise 2: Disadvantages of the sigmoid function</header>
 <div class="admonition-body">
 ```
-Show that Newton's method fails when started from the vector ``(1,2,3)``. Can you guess why it happened? What are the consequences for optimization? Is gradient descent going to suffer from the same problems?```@raw html
+
+Show that Newton's method fails when started from the vector ``(1,2,3)``. Can you guess why it happened? What are the consequences for optimization? Is gradient descent going to suffer from the same problems?
+
+```@raw html
 </div></div>
 <details class = "solution-body">
 <summary class = "solution-header">Solution:</summary><p>
 ```
+
 First, we run the logistic regression as before, only with a different starting point
+
 ```@example ex_log
 log_reg(X, y, [1;2;3])
 ```
+
 This resulted in NaNs. When something fails, it may be a good idea to run a step-by-step analysis. In this case, we will run the first iteration of Newton's method
+
 ```@repl ex_log
 w = [1;2;3];
 X_mult = [row*row' for row in eachrow(X)];
@@ -136,21 +123,29 @@ grad = X'*(y_hat.-y) / size(X,1)
 hess = y_hat.*(1 .-y_hat).*X_mult |> mean
 w -= hess \ grad
 ```
+
 Starting from the bottom, we can see that even though we started with a relatively small ``w``, the next iterate is four orders of magnitude larger. This happened because the Hessian ```hess``` is much smaller than the gradient ```grad```, which indicates some kind of numerical instability. The prediction ```y_hat``` should lie in the interval ``[0,1]``, but it seems to be almost always close to 1. Let us verify this by showing the extrema of ```y_hat```
+
 ```@example ex_log
 extrema(y_hat)
 ```
+
 They are indeed too large.
 
 Now we explain the reason. We know that the prediction equals
+
 ```math
 \hat y_i = \sigma(w^\top x_i),
 ```
+
 where ``\sigma`` is the sigmoid function. Since the minimum of ``w^\top x_i``
+
 ```@example ex_log
 minimum(X*[1;2;3])
 ```
+
 is large, all ``w^\top x_i`` are large. But plotting the sigmoid function
+
 ```@example ex_log
 xs = -10:0.01:10
 plot(xs, σ, label="", ylabel="Sigmoid function")
@@ -163,73 +158,75 @@ savefig("sigmoid.svg") # hide
 it is clear that all ``w^\top x_i`` hit the part of the sigmoid which is flat. This means that the derivative is almost zero, and the Hessian is an "even smaller" zero. The ratio of the gradient to the Hessian is then huge.
 
 Gradient descent will probably run into the same difficulty. Since the gradient will be too small, it will take a huge number of iterations to escape the flat region of the sigmoid. This is a known problem of the sigmoid function, and it is also the reason why it was replaced in neural networks by other activation functions.
+
 ```@raw html
 </p></details>
 ```
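As a small illustrative aside (not from the commit itself), the saturation argument can be made concrete through the derivative of the sigmoid, ``\sigma'(z) = \sigma(z)(1-\sigma(z))``, from which both the gradient and the Hessian are built; the sample inputs below are arbitrary and only show the order of magnitude:

```julia
σ(z) = 1/(1+exp(-z))
dσ(z) = σ(z)*(1-σ(z))    # derivative of the sigmoid

# At z = 0 the derivative is 0.25; for the large values of w'x_i seen above
# it is practically zero, so both first- and second-order information vanish.
dσ.([0, 5, 10, 20])
```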

-
-
-
-
-
-
-
-
-
-
 ```@raw html
 <div class="admonition is-category-exercise">
-<header class="admonition-header">Exercise 3 (theory)</header>
+<header class="admonition-header">Exercise 3 (theory):</header>
 <div class="admonition-body">
 ```
+
 Show the details for the derivation of the loss function of the logistic regression.
+
 ```@raw html
 </div></div>
 <details class = "solution-body">
 <summary class = "solution-header">Solution:</summary><p>
 ```
+
 Since ``\hat y`` equals the probability of predicting ``1``, we have
+
 ```math
 \hat y = \frac{1}{1+e^{-w^\top x}}
 ```
+
 Then the cross-entropy loss reduces to
+
 ```math
 \begin{aligned}
 \operatorname{loss}(y,\hat y) &= - y\log \hat y - (1-y)\log(1-\hat y) \\
 &= y\log(1+e^{-w^\top x}) - (1-y)\log(e^{-w^\top x}) + (1-y)\log(1+e^{-w^\top x}) \\
 &= \log(1+e^{-w^\top x}) + (1-y)w^\top x.
 \end{aligned}
 ```
+
 Then it remains to sum this term over all samples.
+
 ```@raw html
 </p></details>
 ```
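As a quick numerical sanity check of this simplification (an illustrative sketch, not from the commit; the function names are made up), the two forms of the loss can be compared on random inputs:

```julia
# Cross-entropy written with ŷ = σ(wᵀx) versus the simplified form derived above.
σ(z) = 1/(1+exp(-z))
loss_ce(y, z) = -y*log(σ(z)) - (1-y)*log(1-σ(z))
loss_simple(y, z) = log(1+exp(-z)) + (1-y)*z

zs = randn(10)
maximum(abs.(loss_ce.(0, zs) .- loss_simple.(0, zs)))   # ≈ 0 up to rounding
maximum(abs.(loss_ce.(1, zs) .- loss_simple.(1, zs)))   # ≈ 0 up to rounding
```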

-
-
-
-
-
-
 ```@raw html
 <div class="admonition is-category-exercise">
-<header class="admonition-header">Exercise 4 (theory)</header>
+<header class="admonition-header">Exercise 4 (theory):</header>
 <div class="admonition-body">
 ```
-Show that if the Newton's method converged for the logistic regression, then it found a point globally minimizing the logistic loss. ```@raw html
+
+Show that if Newton's method converged for the logistic regression, then it found a point globally minimizing the logistic loss.
+
+```@raw html
 </div></div>
 <details class = "solution-body">
 <summary class = "solution-header">Solution:</summary><p>
 ```
+
 We derived that the Hessian of the objective function for logistic regression is
+
 ```math
 \nabla^2 L(w) = \frac 1n \sum_{i=1}^n \hat y_i(1-\hat y_i)x_i x_i^\top.
 ```
+
 For any vector ``a``, we have
+
 ```math
 a^\top x_i x_i^\top a = (x_i^\top a)^\top (x_i^\top a) = \|x_i^\top a\|^2 \ge 0,
 ```
+
 which implies that ``x_i x_i^\top`` is a positive semidefinite matrix (it is known as a rank-1 matrix because its rank is always 1 whenever ``x_i`` is a non-zero vector). Since ``\hat y_i(1-\hat y_i)\ge 0``, it follows that ``\nabla^2 L(w)`` is a positive semidefinite matrix. If the Hessian of a function is positive semidefinite everywhere, the function is convex. Since Newton's method found a stationary point, this point is a global minimum.
+
 ```@raw html
 </p></details>
-```
+```
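The positive semidefiniteness can also be checked numerically by evaluating the Hessian formula above at an arbitrary ``w``; the snippet below is an illustrative sketch (not from the commit) with made-up data, using 100 samples and 3 features to match the dimensions used in the lecture:

```julia
using LinearAlgebra, Statistics

σ(z) = 1/(1+exp(-z))

X = hcat(randn(100, 2), ones(100))   # made-up data incl. an intercept column
w = randn(3)                         # arbitrary point at which to test the Hessian

y_hat = σ.(X*w)
hess = mean(ŷ*(1-ŷ)*x*x' for (ŷ, x) in zip(y_hat, eachrow(X)))

minimum(eigvals(Symmetric(hess)))    # non-negative up to rounding: the Hessian is PSD
```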
