docs/src/lecture_08/exercises.md

```@example ex_log
w = log_reg(X, y, zeros(size(X,2)))
σ(z) = 1/(1+exp(-z))
```
# [Exercises](@id l8-exercises)

!!! homework "Homework: Data normalization"
    Data are often normalized: from each feature we subtract its mean and divide the result by its standard deviation, so that the normalized features have zero mean and unit standard deviation. A minimal sketch of this transformation is shown below the homework. Normalization may help in several cases:
    - When the features have different orders of magnitude (such as millimetres and kilometres). The gradient would then be dominated by the features with large values and effectively ignore those with small values.
    - When problems such as vanishing gradients are present (we will elaborate on this in Exercise 4).

    Write a function ```normalize``` which takes a dataset as input and normalizes it. Then train the same classifier as we did for [logistic regression](@ref log-reg) on both the original and the normalized dataset. What differences do you observe when
    - the logistic regression is optimized via gradient descent?
    - the logistic regression is optimized via Newton's method?

    Do you have any intuition as to why?

    Write a short report (in LaTeX) summarizing your findings.
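A minimal sketch of the transformation described above, for illustration only (the homework asks you to write and test your own version). It assumes the dataset is a matrix with samples in rows and uses the standard library ```Statistics``` package:

```julia
using Statistics

# Column-wise normalization: subtract each feature's mean and divide by its
# standard deviation. Constant features (e.g., a bias column of ones) have zero
# standard deviation and would need special treatment.
function normalize(X::AbstractMatrix)
    μ = mean(X; dims=1)
    s = std(X; dims=1)
    return (X .- μ) ./ s
end
```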
```@raw html
<div class="admonition is-category-exercise">
<header class="admonition-header">Exercise 1</header>
<div class="admonition-body">
```
The logistic regression on the iris dataset failed in 6 out of 100 samples, yet the visualization shows only 5 misclassified points. How is that possible?
```@raw html
</div></div>
<details class = "solution-body">
<summary class = "solution-header">Solution:</summary><p>
```
As we can see, there are three samples with the same data but different labels. Two of the misclassified samples therefore lie exactly on top of each other, so the 6 misclassified samples show up as only 5 points in the visualization.
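One way to locate such duplicated samples, sketched here with Base Julia only (the original solution may use a different approach):

```julia
# Count how many times each row of X occurs; rows that occur more than once are
# drawn on top of each other in the scatter plot.
row_counts = Dict{Vector{eltype(X)},Int}()
for row in eachrow(X)
    key = collect(row)
    row_counts[key] = get(row_counts, key, 0) + 1
end
filter(p -> p.second > 1, row_counts)   # duplicated rows and their counts
```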
```@raw html
</p></details>
```
```@raw html
<div class="admonition is-category-exercise">
<header class="admonition-header">Exercise 2: Disadvantages of the sigmoid function</header>
<div class="admonition-body">
```
Show that Newton's method fails when started from the vector ``(1,2,3)``. Can you guess why it happened? What are the consequences for optimization? Is gradient descent going to suffer from the same problems?
```@raw html
</div></div>
<details class = "solution-body">
<summary class = "solution-header">Solution:</summary><p>
```
First, we run the logistic regression as before, only with a different starting point
```@example ex_log
log_reg(X, y, [1;2;3])
```
This resulted in NaNs. When something fails, it may be a good idea to run a step-by-step analysis. In this case, we will run the first iteration of Newton's method
```@repl ex_log
w = [1;2;3];
X_mult = [row*row' for row in eachrow(X)];
y_hat = σ.(X*w);
grad = X'*(y_hat.-y) / size(X,1)
hess = y_hat.*(1 .-y_hat).*X_mult |> mean
w -= hess \ grad
```
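For reference, the last line above implements one Newton step,

```math
w^{k+1} = w^k - \bigl(\nabla^2 L(w^k)\bigr)^{-1} \nabla L(w^k),
```

so whenever the Hessian is much smaller than the gradient, the step becomes very large.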
Starting from the bottom, we can see that even though we started with a relatively small ``w``, the next iterate is four orders of magnitude larger. This happened because the Hessian ```hess``` is much smaller than the gradient ```grad```, which indicates some kind of numerical instability. The prediction ```y_hat``` should lie in the interval ``[0,1]``, but it seems to be almost always close to 1. Let us verify this by showing the extrema of ```y_hat```
```@example ex_log
extrema(y_hat)
```
Both the minimum and the maximum of ```y_hat``` are indeed very close to one.
Now we explain the reason. We know that the prediction equals
```math
\hat y_i = \sigma(w^\top x_i),
```
where ``\sigma`` is the sigmoid function. Since the minimum of ``w^\top x_i``
```@example ex_log
minimum(X*[1;2;3])
```
is large, all ``w^\top x_i`` are large. We can plot the sigmoid function to see where these values land.
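A minimal sketch of such a plot, assuming the ```Plots``` package is available (the lecture may use a different plotting setup):

```julia
using Plots  # assumed here; any plotting package would do

z_data = X*[1;2;3]                            # the values wᵀxᵢ for all samples
zs = range(-10, maximum(z_data); length=300)  # grid covering these values
plot(zs, σ.(zs); label="sigmoid", xlabel="z", ylabel="σ(z)")
scatter!(z_data, σ.(z_data); label="w'x_i")   # all samples lie on the flat part
```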
From such a plot it is clear that all ``w^\top x_i`` hit the part of the sigmoid which is flat. This means that the derivative is almost zero there, and the Hessian is even closer to zero. The ratio of the gradient and the Hessian is therefore huge.
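Recall that the derivative of the sigmoid satisfies

```math
\sigma'(z) = \sigma(z)\bigl(1-\sigma(z)\bigr),
```

so for large ``z``, where ``\sigma(z)`` is close to one, the derivative is close to zero.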
Gradient descent will probably run into the same difficulty. Since the gradient is too small, it will take a huge number of iterations to escape the flat region of the sigmoid. This is a well-known problem of the sigmoid function, and it is also the reason why it was largely replaced in neural networks by other activation functions.
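For reference, the Hessian of the logistic regression loss, matching the ```hess``` computation above, can be written as

```math
\nabla^2 L(w) = \frac{1}{n} \sum_{i=1}^n \hat y_i (1-\hat y_i)\, x_i x_i^\top.
```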
For any vector ``a``, we have

```math
a^\top x_i x_i^\top a = (x_i^\top a)^\top (x_i^\top a) = \|x_i^\top a\|^2 \ge 0,
```
which implies that ``x_i x_i^\top`` is a positive semidefinite matrix (it is known as a rank-one matrix, since its rank equals 1 whenever ``x_i`` is a non-zero vector). Since ``\hat y_i(1-\hat y_i)\ge 0``, it follows that ``\nabla^2 L(w)`` is a positive semidefinite matrix. If the Hessian of a function is positive semidefinite everywhere, the function is convex. Since Newton's method found a stationary point, this point is a global minimum.
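The last step uses the first-order characterization of convexity: if ``w^*`` is a stationary point of the convex function ``L``, then for every ``w``

```math
L(w) \ge L(w^*) + \nabla L(w^*)^\top (w - w^*) = L(w^*).
```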