
Commit 08ee7cd

Lecture 11: Polishing touches
1 parent 77bee7a commit 08ee7cd

File tree

2 files changed: +127 -79 lines changed

docs/src/lecture_11/glm.md

Lines changed: 50 additions & 38 deletions
@@ -2,8 +2,11 @@
 function plot_histogram(xs, f; kwargs...)
     plt = histogram(xs;
         label="Sampled density",
-        nbins=100,
+        xlabel = "x",
+        ylabel = "pdf(x)",
+        nbins = 85,
         normalize = :pdf,
+        opacity = 0.5,
         kwargs...
     )

@@ -19,7 +22,7 @@ end

 # [Linear regression revisited](@id statistics)

-This section revisits the linear regression. The classical statistical approach uses a different approach, but derives the same formulation for linear regression. Besides point estimates for parameters, this approach also computes their confidence intervals and is able to test whether some parameters can be omitted from the model. We will start with hypothesis testing and then continue with regression.
+This section revisits linear regression. The classical statistical approach derives the same formulation for linear regression as the optimization approach. Besides point estimates for parameters, it also computes their confidence intervals and can test whether some parameters can be omitted from the model. We will start with hypothesis testing and then continue with regression.

 Julia provides lots of statistical packages. They are summarized at the [JuliaStats](https://juliastats.org/) webpage. This section will give a brief introduction to many of them.

@@ -28,18 +31,18 @@ Julia provides lots of statistical packages. They are summarized at the [JuliaSt

 ## Theory of hypothesis testing

-Hypothesis testing verifies whether data satisfy a given null hypothesis ``H_0``. Under some assumptions and the validity of the null hypothesis, the test derives that a transformation of the data follows some distribution. Then it constructs a confidence interval of this distribution and checks whether the trasnformed variable lies into the confidence interval. If it lies ourside of it, the test rejects the null hypothesis. In the opposite case, it fails to reject the null hypothesis. The latter is different from confirming the null hypothesis. Hypothesis is like a grumpy professor who tells whether a student knows his lecture. He never acknowledges that the student knows it sufficiently, but he is often clear that the student does not know it.
+Hypothesis testing verifies whether data satisfy a given null hypothesis ``H_0``. Most tests need some assumptions about the data, such as normality. Under the validity of the null hypothesis, the test derives that a transformation of the data follows some distribution. Then it constructs a confidence interval of this distribution and checks whether the transformed variable lies in this confidence interval. If it lies outside of it, the test rejects the null hypothesis. In the opposite case, it fails to reject the null hypothesis. The latter is different from confirming the null hypothesis. Hypothesis testing is like a grumpy professor during exams. He never acknowledges that a student knows the topic sufficiently, but he is often clear that the student does not know it.

-An example is the one-sided [Student's t-test](https://en.wikipedia.org/wiki/Student's_t-test) that verifies that a dataset has mean ``\mu``. It can be generalized to compare the mean (performance) of two datasets. Under some assumption, it derives that
+An example is the one-sided [Student's t-test](https://en.wikipedia.org/wiki/Student's_t-test) that verifies that a one-dimensional dataset has mean ``\mu``. It can be generalized to compare the means (performance) of two datasets. Under some assumptions, it derives that

 ```math
-t = \sqrt{n}\frac{\overline X - \mu}{\hat\sigma}
+t = \sqrt{n}\frac{\hat \mu - \mu}{\hat\sigma}
 ```

-follows the [Student's distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution) with ``n-1`` degress of freedom. Here, the dataset ``X`` has size ``n``, its mean is ``\overline X`` and its standard deviation ``\hat \sigma``. Instead of computing the confidence interval, the usual way is to define the ``p``-value
+follows the [Student's distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution) with ``n-1`` degrees of freedom. Here, ``n`` is the number of datapoints. Their mean is ``\hat \mu``, and their standard deviation is ``\hat \sigma``. Instead of computing the confidence interval, the usual way is to define the [``p``-value](https://en.wikipedia.org/wiki/P-value)

 ```math
-p = 2\min\{\mathbb P(T\le t), \mathbb P(T\ge t)\}
+p = 2\min\{\mathbb P(T\le t \mid H_0), \mathbb P(T\ge t\mid H_0)\}
 ```

 If the ``p``-value is smaller than a given threshold, usually ``5\%``, the null hypothesis is rejected. In the opposite case, it is not rejected. The ``p``-value is a measure of the probability that an observed difference could have occurred just by random chance.
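As a quick reference, the two formulas above translate directly into Julia. This is a minimal sketch assuming the Distributions and Statistics packages; the lecture carries out the same computation on generated data in the hunks below.

```julia
using Distributions, Statistics

xs = randn(100)          # a sample of size n = 100
μ₀ = 0.0                 # hypothesised mean under H₀

# t statistic: t = √n (μ̂ - μ₀) / σ̂
t = sqrt(length(xs)) * (mean(xs) - μ₀) / std(xs)

# two-sided p-value from the Student's distribution with n-1 degrees of freedom
prob = cdf(TDist(length(xs) - 1), t)
p = 2 * min(prob, 1 - prob)
```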
@@ -51,7 +54,7 @@ If the ``p``-value is smaller than a given threshold, usually ``5\%``, the null



-We first randomly generate data with from the normal distribution with zero mean.
+We first randomly generate data from the normal distribution with zero mean.

 ```@example glm
 using Random
@@ -67,7 +70,7 @@ xs = randn(n)
 nothing # hide
 ```

-The following exercise checks performs the ``t``-test to check whether it comes from a distribution with zero mean.
+The following exercise performs the ``t``-test to check whether the data come from a distribution with zero mean.



@@ -99,10 +102,10 @@ t = mean(xs) / std(xs) * sqrt(n)

 prob = cdf(TDist(n-1), t)
 p = 2*min(prob, 1-prob)
-
-nothing # hide
 ```

+The ``p``-value is significantly larger than ``5\%``. Therefore, we cannot reject the null hypothesis, which is fortunate because the data were generated from the normal distribution with zero mean.
+
 ```@raw html
 </p></details>
 ```
@@ -131,25 +134,25 @@ OneSampleTTest(xs)

 ## Theory of generalized linear models

-The statistical approach to linear regression is different from the one from machine learning. It also assumes linear prediction function
+The statistical approach to linear regression differs from the machine-learning one. It also assumes a linear prediction function:

 ```math
 \operatorname{predict}(w;x) = w^\top x.
 ```

-Then it considers some invertible link function ``g:\mathbb R\to \mathbb R`` and assumes that ``y`` conditioned on ``x`` follows an apriori specified distribution with density ``f``. The parameters of this distribution are uknown, but the distribution should satisfy the conditional expectation ``E(y\mid x) = g^{-1}(w^\top x)``. The goal is to find the weights ``w`` by the [maximum likelihood estimate](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation). This technique maximizes the likelihood function
+Then it considers some invertible link function ``g:\mathbb R\to \mathbb R`` and assumes that ``y`` conditioned on ``x`` follows an a priori specified distribution with density ``f``. The parameters of this distribution are unknown, but the distribution should satisfy the conditional expectation ``E(y\mid x) = g^{-1}(w^\top x)``. The goal is to find the weights ``w`` by the [maximum likelihood estimate](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation). This technique maximizes the likelihood function:

 ```math
 \operatorname{maximize}\qquad \prod_{i=1}^n f(y_i).
 ```

-Since the density is the derivative of the distribution function, the term ``f(y_i)`` describes the "probability" of ``y_i`` under density ``f``. If ``y_i`` are independent, then the product is the joint probability for all samples. Therefore, maximizing the the likelihood function amounts to finding the parameters of the apriori specified distribution such that the observed samples ``y_i`` have the highest probability to follow this distriubtion. Since these distributions are usually taken from the [exponential family](https://en.wikipedia.org/wiki/Exponential_family), the log-likelihood
+Since the density is the derivative of the distribution function, the term ``f(y_i)`` describes the "probability" of ``y_i`` under density ``f``. If the ``y_i`` are independent, then the product is the joint probability of all samples. Therefore, maximizing the likelihood function amounts to finding the parameters of the a priori specified distribution such that the observed samples ``y_i`` have the highest probability. Since these distributions are usually taken from the [exponential family](https://en.wikipedia.org/wiki/Exponential_family), the log-likelihood

 ```math
 \operatorname{maximize}\qquad \sum_{i=1}^n \log f(y_i).
 ```

-is often maximized. Since logarithm is an increasing function, these two approaches are equivalent.
+is often maximized. Since the logarithm is an increasing function, these two formulations are equivalent.


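To make the maximum-likelihood idea concrete, the following sketch maximizes the log-likelihood of a ``Normal(\mu, \sigma)`` model for a one-dimensional sample by numerical optimization. It is illustrative only and not part of the lecture code; it assumes the Optim package. The maximizer recovers the sample mean and the uncorrected sample standard deviation.

```julia
using Distributions, Optim, Statistics

ys = randn(1000) .+ 3.0   # a sample whose true mean is 3

# negative log-likelihood of Normal(μ, σ); σ is parameterized as exp(θ[2]) to stay positive
nll(θ) = -sum(logpdf.(Normal(θ[1], exp(θ[2])), ys))

res = Optim.optimize(nll, [0.0, 0.0], LBFGS())
μ̂ = Optim.minimizer(res)[1]
σ̂ = exp(Optim.minimizer(res)[2])

# both differences are numerically zero
(μ̂ - mean(ys), σ̂ - std(ys; corrected=false))
```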
@@ -164,20 +167,20 @@ The first case considers ``g(z)=z`` to be the identity function and ``y\mid x``
 w^\top x_i = g^{-1}(w^\top x_i) = \mathbb E(y_i \mid x_i) = \mu_i,
 ```

-and, therefore, we need the solve the following optimization problem.
+and, therefore, we need to solve the following optimization problem:

 ```math
 \operatorname{maximize}_w\qquad \sum_{i=1}^n \log \left(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(\frac{-(y_i - w^\top x_i)^2}{2\sigma^2}\right)\right).
 ```

-Since we maximize with respect to ``w``, most terms behave like constants and this optimization problem is equivalent to
+Since we maximize with respect to ``w``, most terms behave like constants, and this optimization problem is equivalent to


 ```math
-\operatorname{minimize}\qquad \sum_{i=1}^n (y_i - w^\top x_i)^2.
+\operatorname{minimize}_w\qquad \sum_{i=1}^n (y_i - w^\top x_i)^2.
 ```

-This is precisely the linear regression as derived in the previous lectures.
+This is precisely linear regression as derived in the previous lectures.


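For completeness, the equivalence can be checked numerically on synthetic data. This is a sketch with made-up sizes, weights, and noise level; the closed-form minimizer is the normal-equations solution `(X'*X) \ (X'*y)`, which the lecture applies to the wages dataset below.

```julia
using LinearAlgebra, Random

Random.seed!(0)
X = hcat(ones(100), randn(100, 2))   # design matrix with an intercept column
w_true = [1.0, 2.0, -0.5]
y = X * w_true + 0.1 * randn(100)    # noisy linear labels

w_ls = (X' * X) \ (X' * y)           # closed-form minimizer of ∑ᵢ (yᵢ - w'xᵢ)²
norm(w_ls - w_true)                  # small, since the noise is small
```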
@@ -186,10 +189,10 @@ This is precisely the linear regression as derived in the previous lectures.
 The second case considers ``g(z)=\log z`` to be the logarithm function and ``y\mid x`` with the [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution) ``Po(\lambda)``. The inverse function to ``g`` is ``g^{-1}(z)=e^z``. Since the Poisson distribution has non-negative discrete values with probabilities ``\mathbb P(Y=k) = \frac{1}{k!}\lambda^ke^{-\lambda}``, labels ``y_i`` must also be non-negative integers. The same formula for the conditional expectation as before yields:

 ```math
-e^{w^\top x_i} = g^{-1}(w^\top x_i) = \mathbb E(y_i \mid x_i) = \lambda_i,
+e^{w^\top x_i} = g^{-1}(w^\top x_i) = \mathbb E(y_i \mid x_i) = \lambda_i.
 ```

-Plugging this term into the maximizing the log-likelihood function, results in the following optimization problem:
+Plugging this term into the log-likelihood function results in the following optimization problem:

 ```math
 \operatorname{maximize}_w\qquad \sum_{i=1}^n\log\left( \frac{1}{y_i!}\lambda_i^{y_i} e^{-\lambda_i}\right).
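Written out in code, this objective simplifies because the ``\log(y_i!)`` term is constant in ``w``. A minimal sketch of the resulting negative log-likelihood with ``\lambda_i = e^{w^\top x_i}``, illustrative only and not the lecture's implementation:

```julia
# log f(yᵢ) = yᵢ(w'xᵢ) - exp(w'xᵢ) - log(yᵢ!); the last term does not depend on w
function poisson_nll(w, X, y)
    z = X * w                       # linear predictors, λᵢ = exp(zᵢ)
    return sum(exp.(z) .- y .* z)   # minimize over w to obtain the MLE
end
```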
@@ -206,7 +209,7 @@ This function is similar to the one derived for logistic regression.

 ## Linear models

-We will use the [Employment and Wages in Spain](https://vincentarelbundock.github.io/Rdatasets/doc/plm/Snmesp.html) dataset because it is slightly larger that iris. It contains 5904 observations of wages from 738 companies in Spain during 1983 to 1990. We will estimate the dependance of wages on other factors such as employment or cash flow. We first load the dataset and transform the original log-wages into normal wages. We use base ``2`` to obtain relatively small numbers.
+We will use the [Employment and Wages in Spain](https://vincentarelbundock.github.io/Rdatasets/doc/plm/Snmesp.html) dataset because it is slightly larger than the iris dataset. It contains 5904 observations of wages from 738 companies in Spain from 1983 to 1990. We will estimate the dependence of wages on other factors such as employment or cash flow. We first load the dataset and transform the original log-wages into non-normalized wages. We use base ``2`` to obtain relatively small numbers.

 ```@example glm
 using RDatasets
@@ -217,7 +220,7 @@ wages.W = 2. .^ (wages.W)
 nothing # hide
 ```

-We can use the already known precedure to compute the best fit.
+We can use the already known procedure to compute the best fit.

 ```@example glm
 X = Matrix(wages[:, [:N, :Y, :I, :K, :F]])
@@ -229,15 +232,15 @@ w0 = (X'*X) \ (X'*y)
 nothing # hide
 ```

-Another possibility is to use the packege [GLM](https://juliastats.org/GLM.jl/stable/) and its command `lm` for linear models.
+Another possibility is to use the package [GLM](https://juliastats.org/GLM.jl/stable/) and its command `lm` for linear models.

 ```@example glm
 using GLM

 model = lm(@formula(W ~ 1 + N + Y + I + K + F), wages)
 ```

-The table shows the value of the parameters and their confidence intervals. Besides that, it also tests the null hypothesis ``H_0: w_j = 0`` whether some of the regression coefficient can be omitted. The ``t`` statistics is in column `t`, while its ``p``-value in column `Pr(>|t|)`. The next exercise checks whether we can achieve the same results with fewer features.
+The table shows the parameter values and their confidence intervals. Besides that, it also tests the null hypothesis ``H_0: w_j = 0``, that is, whether each regression coefficient can be omitted. The ``t``-statistic is in column `t`, while its ``p``-value is in column `Pr(>|t|)`. The next exercise checks whether we can achieve the same results with fewer features.


@@ -250,9 +253,9 @@ The table shows the value of the parameters and their confidence intervals. Besi
 <header class = "exercise-header">Exercise:</header><p>
 ```

-Check that the results by computing the solution by hand and by `lm` are the same.
+Check that the solutions computed by hand and by `lm` are the same.

-Then remove the feature with the highest ``p``-value and observe whether there was any performance drop. The performance is usually evaluated by the [coeffient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) denote by ``R^2\in[0,1]``. Its higher values indicate better model.
+Then remove the feature with the highest ``p``-value and observe whether there is any performance drop. The performance is usually evaluated by the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) denoted by ``R^2\in[0,1]``. Higher values indicate a better model.

 **Hint**: Use functions `coef` and `r2`.

@@ -268,18 +271,20 @@ Since the parameters for both approaches are almost the same, the approaches giv
 norm(coef(model) - w0)
 ```

-The ``p``-value for column ``K`` is ``3.3\%``. We define the reduced model without this feature.
+The table before this exercise shows that the ``p``-value for feature ``K`` is ``3.3\%``. We define the reduced model without this feature.

 ```@example glm
 model_red = lm(@formula(W ~ 1 + N + Y + I + F), wages)
 ```

-Since we observe only a small performance drop, we could omit this feature without a change to prediction capability of the model.
+Now we compare the performance of both models.

 ```@example glm
-r2(model) - r2(model_red)
+(r2(model), r2(model_red))
 ```

+Since we observe only a small performance drop, we could omit this feature without changing the model's prediction capability.
+
 ```@raw html
 </p></details>
 ```
@@ -291,28 +296,35 @@ r2(model) - r2(model_red)



-The core assumption of this approach is that ``y`` has normal distribution. We use the `predict` function to predict the data and then use the `plot_histogram` function written earlier to plot the histogram and a density of the normal distribution. For the normal distribution, we need to specify the correct mean and variance.
+The core assumption of this approach is that ``y`` follows the normal distribution. We use the `predict` function for predictions and then use the `plot_histogram` function written earlier to plot the histogram and a density of the normal distribution. For the normal distribution, we need to specify the correct mean and variance.

 ```@example glm
 y_hat = predict(model)

 plot_histogram(y_hat, x -> pdf(Normal(mean(y_hat), std(y_hat)), x))
 ```

-The distribution resembles the normal distribution, but there are some differences. We can use the more formal [Kolmogorov-Smirnov](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) test to verify whether a sample comes from a distribution.
+Another possibility would be the `fit` function from the `Distributions` package.
+
+```@example glm
+plot_histogram(y_hat, x -> pdf(fit(Normal, y_hat), x))
+```
+
+The results look identical. The distribution resembles the normal distribution, but there are some differences. We can use the more formal [Kolmogorov-Smirnov test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test), which verifies whether a sample comes from a given distribution.

 ```@example glm
 test_normality = ExactOneSampleKSTest(y_hat, Normal(mean(y_hat), std(y_hat)))
 ```

-The result is expected. The ``p``-value is close to ``1\%``, which means that we would reject the hypothesis that the data come from the normal distribution even though it is not entirely far away.
+The result is expected. The ``p``-value is close to ``1\%``, which means that we reject the null hypothesis that the data follow the normal distribution, even though they are not far from it.



 ## Generalized linear models

+While linear models do not transform the labels, generalized linear models transform them by the link function. Moreover, they allow choosing a distribution other than the normal one for the labels. Therefore, we need to specify the link function ``g`` and the distribution of ``y \mid x``.

-For the generalized linear models, we need to specify the link function ``g`` and the distribution of ``y \mid x``. For the former we choose ``g(z) = \sqrt{z}`` and for the latter the [inverse Gaussian](https://en.wikipedia.org/wiki/Inverse_Gaussian_distribution) distribution. Since we want to use the generalized linear model, we replace `lm` by `glm`.
+We repeat the same example with the link function ``g(z) = \sqrt{z}`` and the [inverse Gaussian](https://en.wikipedia.org/wiki/Inverse_Gaussian_distribution) distribution for the labels. Since we want to use the generalized linear model, we replace `lm` by `glm`.

 ```@example glm
 model = glm(@formula(W ~ 1 + N + Y + I + K + F), wages, InverseGaussian(), SqrtLink())
@@ -325,22 +337,22 @@ model = glm(@formula(W ~ 1 + N + Y + I + K + F), wages, InverseGaussian(), SqrtL



-The next exercise plots the prediction for the linear model.
+The following exercise plots the predictions for the generalized linear model.

 ```@raw html
 <div class = "exercise-body">
 <header class = "exercise-header">Exercise:</header><p>
 ```

-Create the scatter plot of predictions and labels. You are not allowed to use the `predict` function.
+Create the scatter plot of predictions and labels. Do not use the `predict` function.

 ```@raw html
 </p></div>
 <details class = "solution-body">
 <summary class = "solution-header">Solution:</summary><p>
 ```

-Due to construction of generalized linear models, the prediction equals ``g^{-1}(w^\top x)``. We save it into ``\hat y``.
+Due to the construction of the generalized linear model, the prediction equals ``g^{-1}(w^\top x)``. We save it into ``\hat y``.

 ```@example glm
 g_inv(z) = z^2
@@ -353,7 +365,7 @@ nothing # hide
 The scatter plot is now simple.

 ```@example glm
-scatter(y, predict(model);
+scatter(y, y_hat;
     label="",
     xlabel="Label",
     ylabel="Prediction",
