Linear Regression

Marie-Pierre Etienne

https://marieetienne.github.io/TopicsInStatistics/

2024-10-09

Introduction

Lecture objectives

  • Present the classical statistical version of the linear model (a bit different from the machine learning approach)

  • Equivalence between Ordinary Least Squares estimation and Gaussian maximum likelihood estimation

  • A reminder about statistical testing and an overview of the classical tests in linear modelling

  • Some diagnostics regarding the model assumptions

  • A tutorial with Python

Regression analysis context

We focus on a quantitative variable \(Y\) and are interested in

  • understanding the relationship between this variable and other variables of interest, named covariates (explanatory variables, regressors, features),
  • selecting the meaningful covariates,
  • predicting the value of \(Y\) given the values of the covariates.

Examples

  • Economic: \(Y\) is the inflation rate and the covariates are interest rates, unemployment rates, and government spending.

  • Marketing: \(Y\) is the value of a customer purchase, explained by age, salary, exposure to marketing campaigns, …

  • Environmental Science: \(Y\) is the daily ozone concentration at a given site, which might depend on industrial activity, population density, and meteorological factors.

  • Agriculture: \(Y\) is the crop yield, which might be explained by soil quality, rainfall, previous production, the quantity of fertilizers, …

Regression Model

Multiple Linear Regression Model

\[Y_{k} = \beta_0 +\beta_1 x_{k}^{1} + \beta_2 x_{k}^{2} + \ldots + \beta_p x_{k}^{p} + E_{k},\quad E_{k}\overset{ind}{\sim}\mathcal{N}(0, \sigma^2),\]

where:

  • \(x_{k}^{l}\) is the value of explanatory variable \(l\) for observation \(k\),
  • \(k=1,\ldots,n\) is the index of the individual and \(n\) is the total number of individuals,
  • \(\beta_0\) is the intercept,
  • \(\beta_l\) represents the effect of variable \(X^{l}\) on the dependent variable,
  • \(\sigma^2\) is the variance.

Equivalent Notation

\[Y_{k} \overset{ind}{\sim}\mathcal{N}(\beta_0 +\beta_1 x_{k}^{1} + \beta_2 x_{k}^{2} + \ldots + \beta_p x_{k}^{p} , \sigma^2).\]

Number of Parameters

  • \(p+1\) parameters for the mean \((\beta_0, \beta_1, \ldots, \beta_p)\),
  • 1 parameter for the variance \(\sigma^2\) (the noise).
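
To make the generative model concrete, here is a minimal simulation sketch (the sample size, coefficients and noise level below are arbitrary choices for illustration, not values from the lecture):

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2                       # n individuals, p covariates (arbitrary)
beta = np.array([1.0, 2.0, -0.5])   # (beta_0, beta_1, beta_2), arbitrary values
sigma = 0.3                         # noise standard deviation

x = rng.uniform(size=(n, p))                 # covariate values x_k^l
mean = beta[0] + x @ beta[1:]                # beta_0 + beta_1 x_k^1 + beta_2 x_k^2
Y = mean + rng.normal(scale=sigma, size=n)   # add the Gaussian noise E_k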

Matrix form

\[\bf{Y = X\beta + E}\]

\[Y=\begin{pmatrix} Y_{1}\\ Y_{2}\\ \vdots\\ Y_{k}\\ \vdots\\ Y_{n}\end{pmatrix}, \quad {\bf{X}} ={\begin{pmatrix} 1 & x_1^{1} & x_1^{2} & \ldots &x_1^{p}\\ 1 & x_2^{1} & x_2^{2} & \ldots &x_2^{p}\\ \vdots & \vdots& \vdots && \vdots\\ 1 & x_k^{1} & x_k^{2} & \ldots &x_k^{p}\\ \vdots & \vdots& \vdots && \vdots\\ 1 & x_n^{1} & x_n^{2} & \ldots &x_n^{p}\\ \end{pmatrix}},\quad {\bf{\beta}} =\begin{pmatrix} \beta_0\\ \beta_1\\ \beta_2\\ \vdots\\ \beta_p\\ \end{pmatrix}, \quad{\bf{E}} = \begin{pmatrix} E_{1}\\ E_{2}\\ \vdots\\ E_{k}\\ \vdots\\ E_{n}\\ \end{pmatrix}.\]

Python requirement

conda create --name topicsinstats python=3.11 pip 
conda activate topicsinstats
pip install -r requirements.txt

Palmer penguins companion data set

Let us go back to our dear penguins. We want to explain the flipper length by the body mass and the bill length.

Write the first rows of \(Y\) and \(X\), and specify the dimensions of the different objects.

import pandas as pd
import seaborn as sns 
from palmerpenguins import load_penguins
import statsmodels.api as sm

penguins = load_penguins() # load dataset
penguins = penguins.dropna() # drop na
penguins[['flipper_length_mm', 'body_mass_g', 'bill_length_mm']] # select the variables of interest
     flipper_length_mm  body_mass_g  bill_length_mm
0                181.0       3750.0            39.1
1                186.0       3800.0            39.5
2                195.0       3250.0            40.3
4                193.0       3450.0            36.7
5                190.0       3650.0            39.3
..                 ...          ...             ...
339              207.0       4000.0            55.8
340              202.0       3400.0            43.5
341              193.0       3775.0            49.6
342              210.0       4100.0            50.8
343              198.0       3775.0            50.2

[333 rows x 3 columns]

\[ Y = \begin{pmatrix} 181 \\ 186 \\ 195 \\ 193 \\ 190 \\ \vdots \end{pmatrix}, \quad X = \begin{pmatrix} 1 & 3750 & 39.10 \\ 1 & 3800 & 39.50 \\ 1 & 3250 & 40.30 \\ 1 & 3450 & 36.70 \\ 1 & 3650 & 39.30 \\ & \vdots& \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \end{pmatrix}. \]

Y = penguins['flipper_length_mm']  # response / dependent variable
X = penguins[['body_mass_g', 'bill_length_mm']]   # regressors / features
print(X.head())
print("\n")

X = sm.add_constant(X)          # add the intercept column to the regressors
print(X.head())
   body_mass_g  bill_length_mm
0       3750.0            39.1
1       3800.0            39.5
2       3250.0            40.3
4       3450.0            36.7
5       3650.0            39.3

   const  body_mass_g  bill_length_mm
0    1.0       3750.0            39.1
1    1.0       3800.0            39.5
2    1.0       3250.0            40.3
4    1.0       3450.0            36.7
5    1.0       3650.0            39.3
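
A quick sanity check of the dimensions, using the Y and X just built:

print(Y.shape)   # (333,): n = 333 observations
print(X.shape)   # (333, 3): n rows and p + 1 = 3 columns (const, body_mass_g, bill_length_mm)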

Quick Reminder on Gaussian Multivariate Random Vector

A Gaussian Multivariate Random Vector \(\mathbf{U} \in \mathbb{R}^n\) is defined by the following properties:

  1. Mean Vector \(\boldsymbol{\mu} = \mathbb{E}[\mathbf{U}] \in \mathbb{R}^n\):
    \[\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix}\]

  2. Covariance Matrix \(\boldsymbol{\Sigma} = \mathbb{E}[(\mathbf{U} - \boldsymbol{\mu})(\mathbf{U} - \boldsymbol{\mu})^\top] \in \mathbb{R}^{n \times n}\):
    \[\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \dots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \dots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \dots & \sigma_{nn} \end{bmatrix}\]

  3. Probability Density Function (PDF) for \(\mathbf{U}\): \[f(\mathbf{U}) = \frac{1}{(2\pi)^{n/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{U} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{U} - \boldsymbol{\mu})\right)\]
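
These quantities are directly available numerically; a minimal sketch with scipy.stats.multivariate_normal (the mean vector and covariance matrix below are arbitrary illustrations):

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])                    # arbitrary mean vector
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])               # arbitrary covariance matrix

mvn = multivariate_normal(mean=mu, cov=Sigma)
print(mvn.rvs(size=3, random_state=0))       # three draws from N(mu, Sigma)
print(mvn.pdf(mu))                           # density evaluated at the mean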

Key Properties

  • Marginals: Each component \(U_i\) of \(\mathbf{U}\) is normally distributed: \(U_i \sim \mathcal{N}(\mu_i, \sigma_{ii})\).

  • Independence: If the covariance matrix \(\boldsymbol{\Sigma}\) is diagonal, then all components are independent.

  • Linear Combinations: Any linear combination of the components of a Gaussian vector is also Gaussian.

\[c^\intercal \mathbf{U} \sim \mathcal{N} \left ( c^\intercal \boldsymbol{\mu}, c^\intercal \boldsymbol{\Sigma} c \right).\]
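
A quick Monte Carlo check of this last property (c, mu and Sigma are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
c = np.array([2.0, -1.0])

U = rng.multivariate_normal(mu, Sigma, size=100_000)  # draws from N(mu, Sigma)
Z = U @ c                                             # c^T U for each draw
print(Z.mean(), c @ mu)          # empirical vs theoretical mean
print(Z.var(), c @ Sigma @ c)    # empirical vs theoretical variance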

Cochran’s Theorem

Theorem

Let \(\mathbf{X} \in \mathbb{R}^n\) be a multivariate Gaussian vector such that:

\[\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_n),\] let \(Q_1, \cdots, Q_k\) be quadratic forms in \(\mathbf{X}\) with ranks \(r_1,\cdots,r_k\), and suppose that \(Q_1 +\cdots+ Q_k= \mathbf{X}^\intercal \mathbf{X}\).

Then \(Q_1, \cdots, Q_k\) are independently distributed as \(\chi^2\) with \(r_1, \cdots, r_k\) degrees of freedom respectively if and only if \(\sum_{i=1}^k r_i = n\).

Useful Application

Consider \(P\) the orthogonal projector onto some subspace \(V \subset \R^n\) of dimension \(p\), so that \(P=P^\intercal\) and \(P^2=P\).

  • \((X-PX) \perp PX\), since \(P (X-PX) = PX - P^2 X = 0\),
  • \[\begin{align} X^\intercal X & = \left ((X-PX) + PX \right)^\intercal \left ((X-PX) + PX \right) \cr & = (X-PX)^\intercal(X-PX) + (PX)^\intercal (PX) + (X-PX)^\intercal PX + (PX)^\intercal (X-PX)\cr & =(X-PX)^\intercal(X-PX) + (PX)^\intercal (PX), \end{align}\]

and \(rank(P^\intercal P)=p\), \(rank((I-P)^\intercal (I-P))=n-p\), so that \[\norm{PX}^2 \sim \chi^2(p), \quad \norm{X-PX}^2 \sim \chi^2(n-p), \quad \norm{PX}^2 \indep \norm{X-PX}^2.\]
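
A simulation sketch of this application (dimensions are arbitrary): project standard Gaussian vectors onto the subspace spanned by the first \(p\) coordinates and compare the two squared norms with the announced \(\chi^2\) distributions.

import numpy as np

rng = np.random.default_rng(2)
n, p, n_sim = 10, 3, 50_000

# orthogonal projector onto the span of the first p coordinates
P = np.zeros((n, n))
P[:p, :p] = np.eye(p)

X = rng.normal(size=(n_sim, n))        # each row is a draw from N(0, I_n)
PX = X @ P.T                           # projections PX
q1 = (PX ** 2).sum(axis=1)             # ||PX||^2
q2 = ((X - PX) ** 2).sum(axis=1)       # ||X - PX||^2

print(q1.mean(), p)                    # a chi2(p) variable has mean p
print(q2.mean(), n - p)                # a chi2(n - p) variable has mean n - p
print(np.corrcoef(q1, q2)[0, 1])       # close to 0, consistent with independence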

Statistical inference

Parameter estimation

How to pick a good value for \(\beta\) in

\[Y = X \beta + E\]

\[\begin{align} \mathbb{E}(Y) &= X\beta \cr X^\intercal \mathbb{E}(Y) &= X^\intercal X\beta \cr \left( X^\intercal X\right)^{-1} X^\intercal \mathbb{E}(Y) &= \beta \mbox{ if } X \mbox{ is full rank} \cr \end{align}\]

A good guess would be \(\hat{\beta} = \left( X^\intercal X\right)^{-1} X^\intercal Y\)

  • Find \(\hat{\beta}\) so that \(X\hat{\beta}\) is as close as possible to \(Y\), i.e.
  • find the element of \(Im(X)\) closest to \(Y\),
  • that is, take \(X\hat{\beta}\) to be the orthogonal projection of \(Y\) onto \(Im(X)\).

Then \(Y - X\hat{\beta}\) is orthogonal to any vector \(Xv\) of \(Im(X)\), \[\begin{align} \forall v\in \R^{p+1}, \quad <Xv, Y-X\hat{\beta}> & = 0\cr \forall v\in \R^{p+1}, \quad v^\intercal X^\intercal (Y-X\hat{\beta}) & = 0\cr \forall v\in \R^{p+1}, \quad v^\intercal X^\intercal Y & = v^\intercal X^\intercal X\hat{\beta}\cr X^\intercal Y & = X^\intercal X\hat{\beta}\cr \left( X^\intercal X\right)^{-1} X^\intercal Y & = \hat{\beta} \mbox{ if } X \mbox{ is full rank} \cr \end{align}\]

Ordinary Least Squares estimation: \(\hat{\beta}\) is chosen so that \(\norm{Y-X\hat{\beta}}^2\) is minimal.

The log-likelihood of the model \(\ell(\beta, \sigma^2; Y)\) is given by

\[\ell(\beta, \sigma^2; Y) = \text{const} - \frac{n}{2} \log{\sigma^2} -\frac{1}{2\sigma^2} (\mathbf{Y} - X\beta)^\intercal (\mathbf{Y} - X\beta).\] By maximizing the log-likelihood (a matrix derivation), we get that the maximum likelihood estimator (MLE) \(\hat{\beta}\) is the solution to \[X^\intercal Y = X^\intercal X \hat{\beta}.\]

The MLE for \(\sigma^2\) is \[\widehat{\sigma^2}_{MLE} = \frac{1}{n} (\mathbf{Y} - X\hat{\beta})^\intercal (\mathbf{Y} - X\hat{\beta})\]

Estimators properties

As \(\hat{\beta} = \left(X^\intercal X\right)^{-1} X^\intercal Y\) is a linear combination of a Gaussian vector, \[\hat{\beta} \sim\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma} ),\] with

  • \(\boldsymbol{\mu} = \left(X^\intercal X\right)^{-1} X^\intercal \mathbb{E}\left(Y\right)=\left(X^\intercal X\right)^{-1} X^\intercal X \beta = \beta\),

  • and \(\boldsymbol{\Sigma} = \left(X^\intercal X\right)^{-1} X^\intercal \sigma^2 I_n \left( \left(X^\intercal X\right)^{-1} X^\intercal\right)^\intercal = \sigma^2 \left(X^\intercal X\right)^{-1}.\)

\(\mathbb{E}[\hat{\beta}]=\beta\): the estimator is unbiased.

Recall that \(X\hat{\beta}\) is the orthogonal projection of \(Y\) on \(Im(X)\): \(X\hat{\beta} = PY\), where the projector is \(P = X \left(X^\intercal X\right)^{-1} X^\intercal.\)

And \(\norm{Y-X\beta}^2 = \norm{Y-PY}^2 + \norm{PY-X\beta}^2\), so following Cochran's theorem,

\[\frac{1}{\sigma^2}\norm{Y-PY}^2\sim\chi^2(n-rank(X)), \quad \frac{1}{\sigma^2}\norm{PY-X\beta}^2\sim\chi^2(rank(X)), \quad \norm{Y-PY}^2\indep \norm{PY-X\beta}^2.\] As \(\mathbb{E}\left(\frac{1}{n} \norm{Y-PY}^2 \right) = \frac{n-rank(X)}{n}\sigma^2 \ne \sigma^2\), the MLE of \(\sigma^2\) is biased.

We prefer the unbiased version \[\widehat{\sigma}^2 = \frac{1}{n-rank(X)} \norm{Y-PY}^2.\]

Illustration on the Palmer penguins

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.anova import anova_lm

n = X.shape[0]

# linear regression model
lm1 = sm.OLS(Y, X)  # OLS: Ordinary Least Squares
fit1 = lm1.fit()    # fit the model
print(fit1.summary())

X_numpy = X.to_numpy()
Y_numpy = Y.to_numpy()
p = np.linalg.matrix_rank(X_numpy)                # rank of the design matrix

XT_X = X_numpy.T.dot(X_numpy)                     # X^T X
XT_X_inv = np.linalg.inv(XT_X)                    # (X^T X)^{-1}
hat_beta = XT_X_inv.dot(X_numpy.T.dot(Y_numpy))   # (X^T X)^{-1} X^T Y
print(hat_beta)

P = X_numpy.dot(XT_X_inv.dot(X_numpy.T))          # projector onto Im(X)
R = Y_numpy - P.dot(Y_numpy)                      # residuals Y - PY
print(R[:5])
## to compare with
print(fit1.resid)

hat_sigma2 = R.T.dot(R)/(n-p)                     # unbiased estimate of sigma^2
                            OLS Regression Results                            
==============================================================================
Dep. Variable:      flipper_length_mm   R-squared:                       0.791
Model:                            OLS   Adj. R-squared:                  0.790
Method:                 Least Squares   F-statistic:                     626.3
Date:                Wed, 09 Oct 2024   Prob (F-statistic):          4.51e-113
Time:                        06:59:32   Log-Likelihood:                -1090.1
No. Observations:                 333   AIC:                             2186.
Df Residuals:                     330   BIC:                             2198.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const            122.2687      2.864     42.697      0.000     116.635     127.902
body_mass_g        0.0130      0.001     24.034      0.000       0.012       0.014
bill_length_mm     0.5440      0.080      6.822      0.000       0.387       0.701
==============================================================================
Omnibus:                        8.498   Durbin-Watson:                   1.854
Prob(Omnibus):                  0.014   Jarque-Bera (JB):                8.830
Skew:                          -0.393   Prob(JB):                       0.0121
Kurtosis:                       2.867   Cond. No.                     3.49e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.49e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
[1.22268670e+02 1.30173337e-02 5.44035274e-01]
[-11.35545088  -7.22393167   8.50037362   5.85543388  -1.16252457]
0     -11.355451
1      -7.223932
2       8.500374
4       5.855434
5      -1.162525
         ...    
339     2.304827
340    11.806861
341    -5.393255
342     6.723270
343    -0.719676
Length: 333, dtype: float64
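
The hand-made computations can be checked against the fitted model: in statsmodels, fit1.params holds \(\hat{\beta}\) and fit1.scale the unbiased variance estimate \(\widehat{\sigma}^2\).

print(np.allclose(hat_beta, fit1.params.values))  # same beta estimate
print(np.isclose(hat_sigma2, fit1.scale))         # same unbiased sigma^2 estimate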

Regression tests

Test on a linear combination of parameter

To test whether or not a linear combination \(c^\intercal \beta\) of the parameters equals \(0\), we decide between two hypotheses \[H_0 = \left \lbrace c^\intercal \beta = 0\right\rbrace\quad \mbox{vs} \quad H_1 = \left \lbrace c^\intercal \beta \ne 0\right\rbrace.\]

Since \(c^\intercal \hat{\beta} \sim \mathcal{N}\left(c^\intercal \beta, \sigma^2 c^\intercal \left(X^\intercal X\right)^{-1} c\right)\) and \(\sigma^{-2}\norm{Y-PY}^2 \sim \chi^2(n-p),\) then

\[\sqrt{n-p}\frac{c^\intercal\hat{\beta} - c^\intercal{\beta}}{\norm{Y-PY} \sqrt{c^\intercal \left(X^\intercal X\right)^{-1} c}} \sim \mathcal{T}(n-p).\] So that under assumption \(H_0\) \[\sqrt{n-p}\frac{c^\intercal\hat{\beta}}{\norm{Y-PY} \sqrt{c^\intercal \left(X^\intercal X\right)^{-1} c}} \sim \mathcal{T}(n-p).\]

Describe this test when \(p=3\) and \(c = (0, 0, 1)^\intercal\)

  • \(c^\intercal \beta = \beta_2\)

  • Under the assumption \(H_0 = \left\lbrace \beta_2=0 \right\rbrace\), \[\frac{\hat{\beta}_2}{ \sqrt{ \widehat{\sigma}^2 \left(X^\intercal X\right)^{-1}_{33} }} \sim \mathcal{T}(n-p),\] as checked numerically below.
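
On the penguin regression this statistic can be computed by hand, reusing hat_beta, hat_sigma2, XT_X_inv, n and p from the illustration above, and compared with the bill_length_mm row of the summary:

from scipy import stats

t_stat = hat_beta[2] / np.sqrt(hat_sigma2 * XT_X_inv[2, 2])  # c = (0, 0, 1)^T
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - p)           # two-sided p-value
print(t_stat, p_value)   # should match t = 6.822 and P>|t| in the summary above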

Nested model test

A model \(M_1\) is said to be nested within a more complex model \(M_2\) if all the parameters in \(M_1\) are also present in \(M_2\), but \(M_2\) has additional parameters that \(M_1\) does not.

  • \(M_2:\quad Y_k = \beta_0 + \beta_1 x_{k1} + \beta_2 x_{k2} + E_k\) and \(M_1:\quad Y_k = \beta_0 + \beta_1 x_{k1} + E_k\),

But also

  • \(M_2:\quad Y_k = \beta_0 + \beta_1 x_{k1} + \beta_2 x_{k2} + E_k\) and \(M_1:\quad Y_k = \beta_0 + \beta_1 (x_{k1} + x_{k2}) + E_k\).

Let \(M_1\) and \(M_2\) be two linear models, with \(M_1\) nested in \(M_2\):

\[M_1:\quad Y = X_1 \beta^1 + E^1; \quad \quad M_2:\quad Y = X_2 \beta^2 + E^2;\] with \(Im(X_1) \subset Im(X_2).\)

Applying Cochran's theorem, under \(H_0=\left\lbrace X_1\beta^1 =X_2 \beta^2\right\rbrace,\)

\(\sigma^{-2}\norm{Y - P_2 Y}^2 \sim \chi^2\left(n-rank(X_2)\right)\), \(\sigma^{-2}\norm{P_2 Y - P_1 Y }^2 \sim \chi^2(rank(X_2)-rank(X_1))\) and \(\norm{P_2 Y - P_1 Y }^2 \indep \norm{Y- P_2 Y}^2.\)

Under \(H_0=\left\lbrace X_1\beta^1 =X_2 \beta^2\right\rbrace,\) writing \(p_i = rank(X_i)\), \[\frac{\frac{\norm{P_2 Y - P_1 Y}^2}{p_2-p_1}}{\frac{\norm{Y- P_2 Y}^2}{n-p_2}}\sim \mathcal{F}_{p_2-p_1, n- p_2}.\]

## fit the null model 
X0 = X[['const']]
lm0 = sm.OLS(Y, X0)  # OLS: Ordinary Least Squares
fit0 = lm0.fit()    # fit the model

anova_results = anova_lm(fit0, fit1)
print(anova_results)
   df_resid           ssr  df_diff       ss_diff           F         Pr(>F)
0     332.0  65218.636637      0.0           NaN         NaN            NaN
1     330.0  13598.383598      2.0  51620.253039  626.349572  4.508888e-113
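
The F statistic of this table can be recomputed directly from the residual sums of squares, following the formula above (fit.ssr is \(\norm{Y - PY}^2\) for each fit):

rss1, rss2 = fit0.ssr, fit1.ssr           # ||Y - P_1 Y||^2 and ||Y - P_2 Y||^2
df_num = fit0.df_resid - fit1.df_resid    # p_2 - p_1
F = ((rss1 - rss2) / df_num) / (rss2 / fit1.df_resid)
print(F)   # matches the F column: about 626.35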

What about categorical variables?

Does the flipper length vary between species?

The model

\[Y_{sk} = \mu + \alpha_s + E_{sk},\] where \(s\) stands for the species and \(k\) for the individual within species \(s\).

  • Transform into dummy variables:
penguins_dummies = pd.get_dummies(penguins, 
columns=['species'],  dtype="int")
print(penguins_dummies.iloc[:, [1] + list(range(7,10))].head())
   bill_length_mm  species_Adelie  species_Chinstrap  species_Gentoo
0            39.1               1                  0               0
1            39.5               1                  0               0
2            40.3               1                  0               0
4            36.7               1                  0               0
5            39.3               1                  0               0
  • What is the rank of the design matrix defined this way?
penguins_dummies_c = pd.get_dummies(penguins,
columns=['species'],  dtype="int", drop_first = True)
penguins_dummies_c.iloc[:, [1] + list(range(7,9))].head()
   bill_length_mm  species_Chinstrap  species_Gentoo
0            39.1                  0               0
1            39.5                  0               0
2            40.3                  0               0
4            36.7                  0               0
5            39.3                  0               0

Be very careful with the interpretation

X_unid = penguins_dummies[['species_Adelie', 'species_Gentoo', 'species_Chinstrap']]
X_unid = sm.add_constant(X_unid)          # Add the constant to the regressor
p_unid = np.linalg.matrix_rank(X_unid)
print(p_unid)


X = penguins_dummies_c[[ 'species_Gentoo', 'species_Chinstrap']]
X = sm.add_constant(X)          # Add the constant to the regressor
p = np.linalg.matrix_rank(X)
print(p)

Y = penguins_dummies['flipper_length_mm']
X_unid = X_unid[['const', 'species_Adelie', 'species_Chinstrap', 'species_Gentoo']]
# linear regression model
lm1 = sm.OLS(Y, X_unid)  # OLS: Ordinary Least Squares
fit1 = lm1.fit()    # fit the model
print(fit1.summary())
3
3
                            OLS Regression Results                            
==============================================================================
Dep. Variable:      flipper_length_mm   R-squared:                       0.775
Model:                            OLS   Adj. R-squared:                  0.773
Method:                 Least Squares   F-statistic:                     567.4
Date:                Wed, 09 Oct 2024   Prob (F-statistic):          1.59e-107
Time:                        06:59:33   Log-Likelihood:                -1103.0
No. Observations:                 333   AIC:                             2212.
Df Residuals:                     330   BIC:                             2223.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const               150.7904      0.289    522.250      0.000     150.222     151.358
species_Adelie       39.3123      0.486     80.950      0.000      38.357      40.268
species_Chinstrap    45.0331      0.641     70.266      0.000      43.772      46.294
species_Gentoo       66.4449      0.520    127.769      0.000      65.422      67.468
==============================================================================
Omnibus:                        1.423   Durbin-Watson:                   2.268
Prob(Omnibus):                  0.491   Jarque-Bera (JB):                1.470
Skew:                           0.156   Prob(JB):                        0.479
Kurtosis:                       2.904   Cond. No.                     2.10e+15
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.03e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
The corresponding ANOVA table for the species effect:

                  sum_sq     df           F         PR(>F)
C(species)  50525.883615    2.0  567.406992  1.587418e-107
Residual    14692.753022  330.0         NaN            NaN
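
For comparison, a sketch fitting the identifiable parameterisation built with drop_first (the X with Adelie as the reference level, defined above): the intercept is then the Adelie mean and each coefficient is a difference from it.

lm2 = sm.OLS(Y, X)   # X = const + species_Gentoo + species_Chinstrap
fit2 = lm2.fit()
print(fit2.params)
# const             -> mean flipper length for Adelie (the reference level)
# species_Gentoo    -> difference between the Gentoo and Adelie means
# species_Chinstrap -> difference between the Chinstrap and Adelie means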

Quality of the estimation

Confidence interval

\[\hat{\beta} \sim \mathcal{N}(\beta, \sigma^2 \left ( X^\intercal X\right )^{-1} ),\]

So that \[ \sigma^{-1} \left ( X^\intercal X\right )^{1/2} (\hat{\beta}-\beta) \sim \mathcal{N}(0, I_p)\]
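
From this distribution, replacing \(\sigma^2\) by \(\widehat{\sigma}^2\) gives Student-based confidence intervals, coordinate by coordinate: \(\hat{\beta}_j \pm t_{1-\alpha/2}(n-p)\sqrt{\widehat{\sigma}^2 \left(X^\intercal X\right)^{-1}_{jj}}\). A sketch reusing the quantities from the first penguin regression, to be compared with the [0.025, 0.975] columns of its summary:

from scipy import stats

alpha = 0.05
t_q = stats.t.ppf(1 - alpha / 2, df=n - p)       # Student quantile
se = np.sqrt(hat_sigma2 * np.diag(XT_X_inv))     # standard errors of hat_beta
print(np.column_stack([hat_beta - t_q * se, hat_beta + t_q * se]))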

Prediction interval
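
For a new individual with covariate row \(x_0\), the prediction interval must account for both the estimation error of \(x_0^\intercal \hat{\beta}\) and the noise \(E\): \[\hat{Y}_0 \pm t_{1-\alpha/2}(n-p)\, \widehat{\sigma} \sqrt{1 + x_0^\intercal \left(X^\intercal X\right)^{-1} x_0}.\] A sketch with statsmodels' get_prediction, on a hypothetical penguin (the covariate values below are invented for illustration); the mean_ci columns give the confidence interval for the mean response and the obs_ci columns the prediction interval:

# refit the body-mass / bill-length model (X and Y were reassigned in the previous section)
X_reg = sm.add_constant(penguins[['body_mass_g', 'bill_length_mm']])
fit_reg = sm.OLS(penguins['flipper_length_mm'], X_reg).fit()

# hypothetical new penguin: 4000 g body mass, 45 mm bill
x0 = pd.DataFrame({'const': [1.0], 'body_mass_g': [4000.0], 'bill_length_mm': [45.0]})
print(fit_reg.get_prediction(x0).summary_frame(alpha=0.05))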