```@example glm
function plot_histogram(xs, f; kwargs...)
    # body omitted in this excerpt
end
```
# [Linear regression revisited](@id statistics)
This section revisits linear regression. The classical statistical approach derives the same formulation for linear regression as the optimization approach. Besides point estimates for the parameters, it also computes their confidence intervals and can test whether some parameters can be omitted from the model. We start with hypothesis testing and then continue with regression.
Julia provides lots of statistical packages. They are summarized at the [JuliaStats](https://juliastats.org/) webpage. This section will give a brief introduction to many of them.
## Theory of hypothesis testing
Hypothesis testing verifies whether data satisfy a given null hypothesis ``H_0``. Most tests make some assumptions about the data, such as normality. Under the validity of the null hypothesis, the test derives that some transformation of the data follows a known distribution. It then constructs a confidence interval of this distribution and checks whether the transformed variable lies inside it. If it lies outside, the test rejects the null hypothesis; otherwise, it fails to reject it. The latter is different from confirming the null hypothesis. Hypothesis testing is like a grumpy professor during exams: he never acknowledges that a student knows the topic sufficiently, but he often makes it clear that the student does not.
For a test statistic ``T`` with an observed value ``t``, the two-sided ``p``-value equals

```math
p = 2\min\{\mathbb P(T\le t \mid H_0), \mathbb P(T\ge t\mid H_0)\}.
```
If the ``p``-value is smaller than a given threshold, usually ``5\%``, the null hypothesis is rejected. In the opposite case, it is not rejected. The ``p``-value is a measure of the probability that an observed difference could have occurred just by random chance.
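As a sketch of this computation in Julia (assuming the test statistic follows the Student's t-distribution with `n - 1` degrees of freedom; the helper `p_value` is illustrative, not part of the lecture):

```julia
using Distributions

# Two-sided p-value of a test statistic t under the Student's
# t-distribution with n - 1 degrees of freedom.
function p_value(t, n)
    d = TDist(n - 1)
    return 2 * min(cdf(d, t), 1 - cdf(d, t))
end

p_value(0.0, 10)   # a statistic of zero cannot be more extreme: p = 1
p_value(2.5, 10)   # a large statistic gives a small p-value
```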
## Hypothesis testing
We first randomly generate data from the normal distribution with zero mean.
```@example glm
using Random
using Distributions

Random.seed!(666) # fix the seed for reproducibility
xs = randn(1000)

nothing # hide
```
The following exercise performs the ``t``-test to check whether the data come from a distribution with zero mean.
Use the ``t``-test to verify whether the samples were generated from a distribution with zero mean.
**Hints:**

- The Student's distribution is invoked by `TDist()`.
- The probability ``\mathbb P(T\le t)`` equals the [distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function) ``F(t)``, which can be evaluated by `cdf`.
```@raw html
</div></div>
<details class = "solution-body">
<summary class = "solution-header">Solution:</summary><p>
```
The ``p``-value is significantly larger than ``5\%``. Therefore, we cannot reject the null hypothesis that the data come from a distribution with zero mean.

```@raw html
</p></details>
```
Even though the computation of the ``p``-value is simple, we can use the [HypothesisTests](https://juliastats.org/HypothesisTests.jl/stable/) package. When we run the test, it gives the same results as our manual computation.
```@example glm
using HypothesisTests

OneSampleTTest(xs)
```
## Theory of generalized linear models
The statistical approach to linear regression is different from the machine-learning one, even though it assumes the same linear prediction function ``w^\top x``.
Since the density is the derivative of the distribution function, the term ``f(y_i \mid x_i)`` describes the density of observing the label ``y_i`` given the sample ``x_i``. The product of these densities over all samples is the likelihood; instead of the likelihood itself, its logarithm is often maximized. Since the logarithm is an increasing function, these two formulations are equivalent.
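Written out, the two equivalent maximum-likelihood formulations can be sketched as:

```math
\operatorname*{maximize}_w\quad \prod_{i=1}^n f(y_i \mid x_i)
\qquad\Longleftrightarrow\qquad
\operatorname*{maximize}_w\quad \sum_{i=1}^n \log f(y_i \mid x_i).
```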
#### Case 1: Linear regression
The first case considers the identity link ``g(z)=z`` and ``y \mid x`` with the [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) ``N(\mu_i, \sigma^2)``. Then the conditional expectation equals ``\mu_i = g^{-1}(w^\top x_i) = w^\top x_i``, and we need to solve the corresponding maximum likelihood problem.
Since we maximize with respect to ``w``, most terms behave like constants, and the problem reduces to the least-squares problem. This is precisely linear regression as derived in the previous lectures.
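As a sketch of this reduction (reconstructed from the normal log-density; the constant ``\log(\sqrt{2\pi}\sigma)`` and the factor ``\tfrac{1}{2\sigma^2}`` do not affect the optimal ``w``):

```math
\operatorname*{maximize}_w\quad \sum_{i=1}^n \left( -\log\left(\sqrt{2\pi}\sigma\right) - \frac{(y_i - w^\top x_i)^2}{2\sigma^2} \right)
\qquad\Longleftrightarrow\qquad
\operatorname*{minimize}_w\quad \sum_{i=1}^n \left( y_i - w^\top x_i \right)^2.
```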
#### Case 2: Poisson regression
The second case considers the logarithmic link ``g(z)=\log z`` and ``y \mid x`` with the [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution) ``Po(\lambda)``. The inverse function of ``g`` is ``g^{-1}(z)=e^z``. Since the Poisson distribution has non-negative discrete values with probabilities ``\mathbb P(Y=k) = \frac{1}{k!}\lambda^k e^{-\lambda}``, the labels ``y_i`` must also be non-negative integers. The same formula for the conditional expectation as before yields ``\lambda_i = \mathbb E(y_i \mid x_i) = g^{-1}(w^\top x_i) = e^{w^\top x_i}``.
By using the formula for ``\lambda_i`` and getting rid of constants, we transform the maximum likelihood problem into an equivalent optimization problem.
This function is similar to the one derived for logistic regression.
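As a sketch, substituting ``\lambda_i = e^{w^\top x_i}`` into the Poisson log-likelihood and dropping the ``\log(y_i!)`` terms, which do not depend on ``w``, gives:

```math
\operatorname*{maximize}_w\quad \sum_{i=1}^n \left( y_i\, w^\top x_i - e^{w^\top x_i} \right).
```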
## Linear models
We will use the [Employment and Wages in Spain](https://vincentarelbundock.github.io/Rdatasets/doc/plm/Snmesp.html) dataset because it is slightly larger than the iris dataset. It contains 5904 observations of wages from 738 companies in Spain from 1983 to 1990. We will estimate the dependence of wages on other factors such as employment or cash flow. We first load the dataset and transform the original log-wages into non-normalized wages. We use base ``2`` to obtain relatively small numbers.
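A minimal sketch of the log-wage transformation (the values below are hypothetical; in the lecture, the transformation is applied to a column of the loaded dataset):

```julia
# Hypothetical base-2 log-wages; the real values come from the dataset.
logW = [3.0, 4.0, 5.0]

# Non-normalized wages obtained by inverting the base-2 logarithm.
W = 2 .^ logW
```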
```@example glm
model = lm(@formula(W ~ 1 + N + Y + I + K + F), wages)
```
The table shows the parameter values and their confidence intervals. Besides that, it also tests the null hypothesis ``H_0: w_j = 0``, namely whether some of the regression coefficients can be omitted. The ``t``-statistic is in column `t`, while its ``p``-value is in column `Pr(>|t|)`. The next exercise checks whether we can achieve the same results with fewer features.
Check that the solutions computed by hand and by `lm` are the same.
Then remove the feature with the highest ``p``-value and check whether there is any performance drop. The performance is usually evaluated by the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) denoted by ``R^2\in[0,1]``; higher values indicate a better model.
**Hint**: Use functions `coef` and `r2`.
```@raw html
</div></div>
<details class = "solution-body">
<summary class = "solution-header">Solution:</summary><p>
```
Since we observe only a small performance drop, we could omit this feature without a significant loss of the model's prediction capability.

```@raw html
</p></details>
```
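The by-hand solution from the exercise can be sketched via the normal equations (the toy data below are illustrative; in the lecture, the design matrix and labels would be built from the `wages` columns used in the formula):

```julia
using LinearAlgebra

# Toy design matrix with an intercept column and one feature.
X = [ones(5) collect(1.0:5.0)]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

# Least squares via the normal equations: (XᵀX) w = Xᵀy.
w = (X' * X) \ (X' * y)
```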
The core assumption of this approach is that ``y`` follows the normal distribution. We use the `predict` function to compute predictions and then the `plot_histogram` function written earlier to plot the histogram together with the density of the normal distribution. For the normal distribution, we need to specify the correct mean and variance.
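A sketch of this check, with simulated values standing in for `predict(model)` (the names below are illustrative, not the lecture's code):

```julia
using Distributions, Statistics

# Simulated predictions standing in for predict(model).
y_hat = 10 .+ 2 .* randn(1000)

# Normal distribution with the matching mean and standard deviation.
d = Normal(mean(y_hat), std(y_hat))

# Density values on a grid, as would be overlaid on the histogram.
grid = range(minimum(y_hat), maximum(y_hat); length = 100)
densities = pdf.(d, grid)
```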
The result is expected. The ``p``-value is close to ``1\%``, which means that we reject the null hypothesis that the data follow the normal distribution, even though they are not entirely far from it.
## Generalized linear models
While linear models do not transform the labels, generalized linear models transform them by the link function. Moreover, they allow label distributions other than the normal one. Therefore, we need to specify the link function ``g`` and the distribution of ``y \mid x``.
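In this notation, the conditional expectation of the label is linked to the linear predictor by the standard GLM relation (consistent with the two cases derived above):

```math
\mathbb E(y \mid x) = g^{-1}(w^\top x).
```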
We repeat the same example with the link function ``g(z) = \sqrt{z}`` and the [inverse Gaussian](https://en.wikipedia.org/wiki/Inverse_Gaussian_distribution) distribution.

```@example glm
model = glm(@formula(W ~ 1 + N + Y + I + K + F), wages, InverseGaussian(), SqrtLink())
```
The following exercise plots the predictions for the generalized linear model.