https://marieetienne.github.io/TopicsInStatistics/
2024-10-09
\[ \newcommand\R{{\mathbb{R}}} \newcommand\norm[1]{\lVert#1\rVert} \newcommand{\indep}{\perp \!\!\! \perp} \]
Present the classical statistical version of the linear model (a bit different from the machine learning approach)
Equivalence between Ordinary Least Squares estimation and maximum likelihood estimation under the Gaussian assumption
A reminder about statistical testing and an overview of the classical tests in linear modelling
Some diagnostics regarding the assumptions
A tutorial with Python
We focus on a quantitative variable \(Y\) and are interested in explaining it with covariates, for example:
Economics: \(Y\) is the inflation rate and the covariates are interest rates, unemployment rates, and government spending.
Marketing: \(Y\) is the value of a customer purchase, explained by age, salary, exposure to marketing campaigns, …
Environmental science: \(Y\) is the daily ozone concentration at a given site, which might depend on industrial activity, population density, and meteorological factors.
Agriculture: \(Y\) is the crop yield, which might be explained by soil quality, rainfall, previous production, the quantity of fertilizers, …
\[Y_{k} = \beta_0 +\beta_1 x_{k}^{1} + \beta_2 x_{k}^{2} + \ldots + \beta_p x_{k}^{p} + E_{k},\quad E_{k}\overset{ind}{\sim}\mathcal{N}(0, \sigma^2),\]
where
- \(x_{k}^{l}\) is the value of explanatory variable \(l\) for observation \(k\),
- \(k=1,\ldots,n\) is the index of the individual and \(n\) is the total number of individuals,
- \(\beta_0\) is the intercept,
- \(\beta_l\) represents the effect of variable \(X^{l}\) on the dependent variable,
- \(\sigma^2\) is the variance of the error term.
\[Y_{k} \overset{ind}{\sim}\mathcal{N}(\beta_0 +\beta_1 x_{k}^{1} + \beta_2 x_{k}^{2} + \ldots + \beta_p x_{k}^{p} , \sigma^2).\]
\[\bf{Y = X\beta + E}\]
\[Y=\begin{pmatrix} Y_{1}\\ Y_{2}\\ \vdots\\ Y_{k}\\ \vdots\\ Y_{n}\end{pmatrix}, \quad {\bf{X}} ={\begin{pmatrix} 1 & x_1^{1} & x_1^{2} & \ldots &x_1^{p}\\ 1 & x_2^{1} & x_2^{2} & \ldots &x_2^{p}\\ \vdots & \vdots& \vdots && \vdots\\ 1 & x_k^{1} & x_k^{2} & \ldots &x_k^{p}\\ \vdots & \vdots& \vdots && \vdots\\ 1 & x_n^{1} & x_n^{2} & \ldots &x_n^{p}\\ \end{pmatrix}},\quad {\bf{\beta}} =\begin{pmatrix} \beta_0\\ \beta_1\\ \beta_2\\ \vdots\\ \beta_p\\ \end{pmatrix}, \quad{\bf{E}} = \begin{pmatrix} E_{1}\\ E_{2}\\ \vdots\\ E_{k}\\ \vdots\\ E_{n}\\ \end{pmatrix}.\]
Let us go back to our dear penguins. We want to explain the flipper length by the body mass and the bill length.
Write the first rows of \(Y\) and \(X\), and specify the dimensions of the different objects.
flipper_length_mm body_mass_g bill_length_mm
0 181.0 3750.0 39.1
1 186.0 3800.0 39.5
2 195.0 3250.0 40.3
4 193.0 3450.0 36.7
5 190.0 3650.0 39.3
.. ... ... ...
339 207.0 4000.0 55.8
340 202.0 3400.0 43.5
341 193.0 3775.0 49.6
342 210.0 4100.0 50.8
343 198.0 3775.0 50.2
[333 rows x 3 columns]
\[ Y = \begin{pmatrix} 181 \\ 186 \\ 195 \\ 193 \\ 190 \\ \vdots \end{pmatrix}, \quad X = \begin{pmatrix} 1 & 3750 & 39.10 \\ 1 & 3800 & 39.50 \\ 1 & 3250 & 40.30 \\ 1 & 3450 & 36.70 \\ 1 & 3650 & 39.30 \\ & \vdots& \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \end{pmatrix}. \]
body_mass_g bill_length_mm
0 3750.0 39.1
1 3800.0 39.5
2 3250.0 40.3
4 3450.0 36.7
5 3650.0 39.3
const body_mass_g bill_length_mm
0 1.0 3750.0 39.1
1 1.0 3800.0 39.5
2 1.0 3250.0 40.3
4 1.0 3450.0 36.7
5 1.0 3650.0 39.3
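For reference, here is a minimal sketch of how these objects can be built with pandas and statsmodels; the palmerpenguins package and the variable names are assumptions, not taken from the original code:

import statsmodels.api as sm
from palmerpenguins import load_penguins   # assumed data source

# Load the penguins data and keep complete cases only (333 rows remain)
penguins = load_penguins().dropna()

# Response vector Y and explanatory variables
Y = penguins['flipper_length_mm']
X = penguins[['body_mass_g', 'bill_length_mm']]

# Add the intercept column of ones: the 'const' column shown above
X = sm.add_constant(X)
print(X.head())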
A Gaussian Multivariate Random Vector \(\mathbf{U} \in \mathbb{R}^n\) is defined by the following properties:
Mean Vector \(\boldsymbol{\mu} = \mathbb{E}[\mathbf{U}] \in \mathbb{R}^n\):
\[\boldsymbol{\mu} = \begin{bmatrix}
\mu_1 \\
\mu_2 \\
\vdots \\
\mu_n
\end{bmatrix}\]
Covariance Matrix \(\boldsymbol{\Sigma} = \mathbb{E}[(\mathbf{U} - \boldsymbol{\mu})(\mathbf{U} - \boldsymbol{\mu})^\top] \in \mathbb{R}^{n \times n}\):
\[\boldsymbol{\Sigma} = \begin{bmatrix}
\sigma_{11} & \sigma_{12} & \dots & \sigma_{1n} \\
\sigma_{21} & \sigma_{22} & \dots & \sigma_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{n1} & \sigma_{n2} & \dots & \sigma_{nn}
\end{bmatrix}\]
Probability Density Function (PDF) for \(\mathbf{U}\) (when \(\boldsymbol{\Sigma}\) is invertible): \[f(\mathbf{U}) = \frac{1}{(2\pi)^{n/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{U} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{U} - \boldsymbol{\mu})\right)\]
Marginals: Each component \(U_i\) of \(\mathbf{U}\) is normally distributed: \(U_i \sim \mathcal{N}(\mu_i, \sigma_{ii})\).
Independence: If the covariance matrix \(\boldsymbol{\Sigma}\) is diagonal, then all components are independent.
Linear Combinations: Any linear combination of the components of a Gaussian vector is also Gaussian.
\[c^\intercal \mathbf{U} \sim \mathcal{N} \left ( c^\intercal \mathbb{E}(\mathbf{U}), c^\intercal \boldsymbol{\Sigma} c \right).\]
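A quick simulation check of this property (a sketch; the values of \(\boldsymbol{\mu}\), \(\boldsymbol{\Sigma}\) and \(c\) are arbitrary and not part of the original material):

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
c = np.array([1.0, 2.0, -1.0])

# Simulate a large sample of Gaussian vectors U ~ N(mu, Sigma)
U = rng.multivariate_normal(mu, Sigma, size=100000)

# The empirical mean and variance of c'U should be close to c'mu and c'Sigma c
print(np.mean(U @ c), c @ mu)
print(np.var(U @ c), c @ Sigma @ c)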
Theorem (Cochran)
Let \(\mathbf{X} \in \mathbb{R}^n\) be a multivariate Gaussian vector such that:
\[\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_n),\] let \(Q_1, \cdots, Q_k\) be quadratic forms in \(\mathbf{X}\) with ranks \(r_1,\cdots,r_k\), and suppose that \(Q_1 +\cdots+ Q_k= \mathbf{X}^\intercal \mathbf{X}.\)
Then \(Q_1, \cdots, Q_k\) are independently distributed as \(\chi^2\) with \(r_1, \cdots, r_k\) degrees of freedom respectively if and only if \(\sum_{i=1}^k r_i =n\).
Consider \(P\) the orthogonal projector on some subspace \(V \subset \R^n\) of dimension \(p\), so that \(P=P^\intercal\) and \(P^2=P\).
- \((X-PX) \perp PX\), since \(P (X-PX) = PX - P^2 X = 0,\)
- \[\begin{align} X^\intercal X & = \left ((X-PX) + PX \right)^\intercal \left ((X-PX) + PX \right) \cr & = (X-PX)^\intercal(X-PX) + (PX)^\intercal (PX) + (X-PX)^\intercal PX + (PX)^\intercal (X-PX)\cr & =(X-PX)^\intercal(X-PX) + (PX)^\intercal (PX), \end{align}\]
- \(rank(P^\intercal P)=p\) and \(rank((I-P)^\intercal (I-P))=n-p,\) so that \[\norm{PX}^2 \sim \chi^2(p), \quad \norm{X-PX}^2 \sim \chi^2(n-p), \quad \norm{PX}^2 \indep \norm{X-PX}^2.\]
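This decomposition can be illustrated numerically; a small sketch (the subspace is spanned by an arbitrary matrix A, and the dimensions n_sim = 10, p_sim = 2 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
n_sim, p_sim = 10, 2

# Orthogonal projector onto the column span of an arbitrary n_sim x p_sim matrix A
A = rng.standard_normal((n_sim, p_sim))
P_sim = A @ np.linalg.inv(A.T @ A) @ A.T

# Each row of Z plays the role of a vector X ~ N(0, I_n)
Z = rng.standard_normal((5000, n_sim))
q1 = np.sum((Z @ P_sim) ** 2, axis=1)          # ||PX||^2 for each draw
q2 = np.sum((Z - Z @ P_sim) ** 2, axis=1)      # ||X - PX||^2 for each draw

# A chi-square mean equals its degrees of freedom: compare with p_sim and n_sim - p_sim
print(q1.mean(), p_sim)
print(q2.mean(), n_sim - p_sim)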
How to pick a good value for \(\beta\) in
\[Y = X \beta + E\]
\[\begin{align} \mathbb{E}(Y) &= X\beta \cr X^\intercal \mathbb{E}(Y) &= X^\intercal X\beta \cr \left( X^\intercal X\right)^{-1} X^\intercal \mathbb{E}(Y) &= \beta \quad \mbox{if } X \mbox{ is full rank.} \cr \end{align}\]
A good guess would be \(\hat{\beta} = \left( X^\intercal X\right)^{-1} X^\intercal Y\)
Then \(X\hat{\beta}\) is the orthogonal projection of \(Y\) onto \(Im(X)=\left\lbrace Xv,\ v\in\R^{p+1}\right\rbrace\), i.e. \(Y - X\hat{\beta}\) is orthogonal to any vector of \(Im(X)\): \[\begin{align} \forall v\in \R^{p+1}, \quad \langle Xv, Y-X\hat{\beta}\rangle & = 0\cr \forall v\in \R^{p+1}, \quad v^\intercal X^\intercal (Y-X\hat{\beta}) & = 0\cr \forall v\in \R^{p+1}, \quad v^\intercal X^\intercal Y & = v^\intercal X^\intercal X\hat{\beta}\cr X^\intercal Y & = X^\intercal X\hat{\beta}\cr \left( X^\intercal X\right)^{-1} X^\intercal Y & = \hat{\beta} \quad \mbox{if } X \mbox{ is full rank.} \cr \end{align}\]
Ordinary Least Squares estimation: \(\hat{\beta}\) is chosen so that \(\norm{Y-X\hat{\beta}}^2\) is minimal.
The log-likelihood of the model \(\ell(\beta, \sigma^2; Y)\) is given by
\[\ell(\beta, \sigma^2; Y) = \mbox{const} - \frac{n}{2} \log{\sigma^2} -\frac{1}{2\sigma^2} (\mathbf{Y} - X\beta)^\intercal (\mathbf{Y} - X\beta).\] Maximizing the log-likelihood (a matrix derivation) shows that the maximum likelihood estimator (MLE) \(\hat{\beta}\) is a solution to \[X^\intercal Y = X^\intercal X \hat{\beta}.\]
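Indeed, differentiating with respect to \(\beta\) and setting the gradient to zero gives \[\frac{\partial \ell}{\partial \beta}(\beta, \sigma^2; Y) = \frac{1}{\sigma^2}\, X^\intercal (\mathbf{Y} - X\beta) = 0 \quad \Longleftrightarrow \quad X^\intercal \mathbf{Y} = X^\intercal X \beta,\] which are exactly the normal equations obtained above from the orthogonality argument.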
The MLE for \(\sigma^2\) is \[\widehat{\sigma^2}_{MLE} = \frac{1}{n} (\mathbf{Y} - X\hat{\beta})^\intercal (\mathbf{Y} - X\hat{\beta})\]
As \(\hat{\beta} = \left(X^\intercal X\right)^{-1} X^\intercal Y\) is a linear transformation of the Gaussian vector \(Y\), \[\hat{\beta} \sim\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma} ),\] with
\(\boldsymbol{\mu} = \left(X^\intercal X\right)^{-1} X^\intercal \mathbb{E}\left(Y\right)=\left(X^\intercal X\right)^{-1} X^\intercal X \beta = \beta\),
and \(\boldsymbol{\Sigma} = \left(X^\intercal X\right)^{-1} X^\intercal \sigma^2 I_n \left( \left(X^\intercal X\right)^{-1} X^\intercal\right)^\intercal = \sigma^2 \left(X^\intercal X\right)^{-1}.\)
\(\mathbb{E}(\hat{\beta})=\beta\): the estimator is unbiased.
Recall that \(X\hat{\beta} = PY\) is the orthogonal projection of \(Y\) onto \(Im(X)\), where the projector \(P\) is defined by \(P = X \left(X^\intercal X\right)^{-1} X^\intercal.\)
By Pythagoras, \(\norm{Y-X\beta}^2 = \norm{Y-PY}^2 + \norm{PY-X\beta}^2,\) and following Cochran's theorem
\[\frac{1}{\sigma^2}\norm{Y-PY}^2\sim\chi^2(n-rank(X)), \quad \frac{1}{\sigma^2}\norm{PY-X\beta}^2\sim\chi^2(rank(X)), \quad \norm{Y-PY}^2\indep \norm{PY-X\beta}^2.\] As \(\mathbb{E}\left(\frac{1}{n} \norm{Y-PY}^2 \right) = \frac{n-rank(X)}{n}\sigma^2\), the MLE estimator of \(\sigma^2\) is biased.
We prefer the unbiased version \[\widehat{\sigma}^2 = \frac{1}{n-rank(X)} \norm{Y-PY}^2.\]
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.anova import anova_lm

n = X.shape[0]                      # number of observations

# Linear regression model fitted with statsmodels
lm1 = sm.OLS(Y, X)                  # OLS: Ordinary Least Squares
fit1 = lm1.fit()                    # fit the model
print(fit1.summary())

# Same quantities computed by hand from the matrix formulas
X_numpy = X.to_numpy()
Y_numpy = Y.to_numpy()
p = np.linalg.matrix_rank(X_numpy)  # rank of the design matrix
XT_X = X_numpy.T.dot(X_numpy)
XT_X_inv = np.linalg.inv(XT_X)
hat_beta = XT_X_inv.dot(X_numpy.T.dot(Y_numpy))   # (X'X)^{-1} X'Y
print(hat_beta)

P = X_numpy.dot(XT_X_inv.dot(X_numpy.T))          # projector onto Im(X)
R = Y_numpy - P.dot(Y_numpy)                      # residuals Y - PY
print(R[:5])
## to compare with
print(fit1.resid)
hat_sigma2 = R.T.dot(R) / (n - p)                 # unbiased estimate of sigma^2
OLS Regression Results
==============================================================================
Dep. Variable: flipper_length_mm R-squared: 0.791
Model: OLS Adj. R-squared: 0.790
Method: Least Squares F-statistic: 626.3
Date: Wed, 09 Oct 2024 Prob (F-statistic): 4.51e-113
Time: 06:59:32 Log-Likelihood: -1090.1
No. Observations: 333 AIC: 2186.
Df Residuals: 330 BIC: 2198.
Df Model: 2
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
const 122.2687 2.864 42.697 0.000 116.635 127.902
body_mass_g 0.0130 0.001 24.034 0.000 0.012 0.014
bill_length_mm 0.5440 0.080 6.822 0.000 0.387 0.701
==============================================================================
Omnibus: 8.498 Durbin-Watson: 1.854
Prob(Omnibus): 0.014 Jarque-Bera (JB): 8.830
Skew: -0.393 Prob(JB): 0.0121
Kurtosis: 2.867 Cond. No. 3.49e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.49e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
[1.22268670e+02 1.30173337e-02 5.44035274e-01]
[-11.35545088 -7.22393167 8.50037362 5.85543388 -1.16252457]
0 -11.355451
1 -7.223932
2 8.500374
4 5.855434
5 -1.162525
...
339 2.304827
340 11.806861
341 -5.393255
342 6.723270
343 -0.719676
Length: 333, dtype: float64
To test whether or not a given linear combination \(c^\intercal\beta\) of the parameters equals \(0\), we would like to decide between the two hypotheses \[H_0 = \left \lbrace c^\intercal \beta = 0\right\rbrace\quad \mbox{vs} \quad H_1 = \left \lbrace c^\intercal \beta \ne 0\right\rbrace.\]
Since \(c^\intercal \hat{\beta} \sim \mathcal{N}\left(c^\intercal \beta, \sigma^2 c^\intercal \left(X^\intercal X\right)^{-1} c\right)\) and \(\sigma^{-2}\norm{Y-PY}^2 \sim \chi^2(n-p),\) then
\[\sqrt{n-p}\frac{c^\intercal\hat{\beta} - c^\intercal{\beta}}{\norm{Y-PY} \sqrt{c^\intercal \left(X^\intercal X\right)^{-1} c}} \sim \mathcal{T}(n-p),\] where \(p = rank(X)\). So that under \(H_0\), \[\sqrt{n-p}\frac{c^\intercal\hat{\beta}}{\norm{Y-PY} \sqrt{c^\intercal \left(X^\intercal X\right)^{-1} c}} \sim \mathcal{T}(n-p).\]
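This Student statistic is what the 't' and 'P>|t|' columns of the summary table report. A minimal sketch, reusing fit1, hat_beta, XT_X_inv, R, n and p from the code above, with the contrast \(c = (0, 1, 0)^\intercal\) (testing the body_mass_g coefficient) chosen purely for illustration:

import numpy as np

c = np.array([0.0, 1.0, 0.0])   # contrast selecting the body_mass_g coefficient

# The test as implemented by statsmodels
print(fit1.t_test(c))

# The same statistic computed from the formula above
t_stat = np.sqrt(n - p) * (c @ hat_beta) / (np.linalg.norm(R) * np.sqrt(c @ XT_X_inv @ c))
print(t_stat)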
A model \(M_1\) is said to be nested within a more complex model \(M_2\) if all the parameters in \(M_1\) are also present in \(M_2\), but \(M_2\) has additional parameters that \(M_1\) does not.
This can also be expressed in terms of the design matrices.
Let \(M_1\) and \(M_2\) be two linear models, with \(M_1\) nested in \(M_2\):
\[M_1:\quad Y = X_1 \beta^1 + E^1; \quad \quad M_2:\quad Y = X_2 \beta^2 + E^2;\] with \(Im(X_1) \subset Im(X_2).\)
Applying Cochran's theorem, with \(P_i\) the orthogonal projector onto \(Im(X_i)\) and \(p_i = rank(X_i)\), under \(H_0=\left\lbrace X_1\beta^1 =X_2 \beta^2\right\rbrace,\)
\(\sigma^{-2}\norm{Y - P_2 Y}^2 \sim \chi^2\left(n-p_2\right)\), \(\sigma^{-2}\norm{P_2 Y - P_1 Y }^2 \sim \chi^2(p_2-p_1)\) and \(\norm{P_2 Y - P_1 Y }^2 \indep \norm{Y- P_2 Y}^2.\)
Under \(H_0\), therefore, \[\frac{\frac{\norm{P_2 Y - P_1 Y}^2}{p_2-p_1}}{\frac{\norm{Y- P_2 Y}^2}{n-p_2}}\sim \mathcal{F}_{p_2-p_1, n- p_2}.\]
df_resid ssr df_diff ss_diff F Pr(>F)
0 332.0 65218.636637 0.0 NaN NaN NaN
1 330.0 13598.383598 2.0 51620.253039 626.349572 4.508888e-113
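The table above is a nested-model comparison between the intercept-only model and the full model. A minimal sketch of how such a table can be obtained with anova_lm (imported above), assuming fit1 is the model with body mass and bill length fitted earlier and taking the intercept-only model as the reduced model:

# Reduced model M1: intercept only
fit0 = sm.OLS(Y, X[['const']]).fit()

# Nested-model F-test of H0: the body_mass_g and bill_length_mm coefficients are zero
print(anova_lm(fit0, fit1))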
Does the flipper length vary between species?
The model
\[Y_{sk} = \mu + \alpha_s + E_{sk},\] where \(s\) stands for the species and \(k\) indexes the individuals within species \(s\).
bill_length_mm species_Adelie species_Chinstrap species_Gentoo
0 39.1 1 0 0
1 39.5 1 0 0
2 40.3 1 0 0
4 36.7 1 0 0
5 39.3 1 0 0
bill_length_mm species_Chinstrap species_Gentoo
0 39.1 0 0
1 39.5 0 0
2 40.3 0 0
4 36.7 0 0
5 39.3 0 0
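The two encodings above can be produced with pandas; a minimal sketch, assuming the cleaned data frame penguins from earlier and that the results correspond to the penguins_dummies and penguins_dummies_c objects used in the code below:

import pandas as pd

# One indicator column per species (over-parameterized once a constant is added)
penguins_dummies = pd.get_dummies(penguins, columns=['species'], dtype=int)

# Reference coding: drop the first level, so Adelie becomes the reference species
penguins_dummies_c = pd.get_dummies(penguins, columns=['species'], drop_first=True, dtype=int)

print(penguins_dummies[['bill_length_mm', 'species_Adelie', 'species_Chinstrap', 'species_Gentoo']].head())
print(penguins_dummies_c[['bill_length_mm', 'species_Chinstrap', 'species_Gentoo']].head())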
Be very careful with the interpretation when the design matrix is not full rank.
# Over-parameterized design: constant + one indicator per species
X_unid = penguins_dummies[['species_Adelie', 'species_Gentoo', 'species_Chinstrap']]
X_unid = sm.add_constant(X_unid)        # add the constant to the regressors
p_unid = np.linalg.matrix_rank(X_unid)  # rank 3 although X_unid has 4 columns
print(p_unid)

# Full-rank design: constant + indicators for Gentoo and Chinstrap (Adelie is the reference)
X = penguins_dummies_c[['species_Gentoo', 'species_Chinstrap']]
X = sm.add_constant(X)                  # add the constant to the regressors
p = np.linalg.matrix_rank(X)
print(p)

Y = penguins_dummies['flipper_length_mm']
X_unid = X_unid[['const', 'species_Adelie', 'species_Chinstrap', 'species_Gentoo']]

# Linear regression model on the over-parameterized design
lm1 = sm.OLS(Y, X_unid)                 # OLS: Ordinary Least Squares
fit1 = lm1.fit()                        # fit the model
print(fit1.summary())
3
3
OLS Regression Results
==============================================================================
Dep. Variable: flipper_length_mm R-squared: 0.775
Model: OLS Adj. R-squared: 0.773
Method: Least Squares F-statistic: 567.4
Date: Wed, 09 Oct 2024 Prob (F-statistic): 1.59e-107
Time: 06:59:33 Log-Likelihood: -1103.0
No. Observations: 333 AIC: 2212.
Df Residuals: 330 BIC: 2223.
Df Model: 2
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
const 150.7904 0.289 522.250 0.000 150.222 151.358
species_Adelie 39.3123 0.486 80.950 0.000 38.357 40.268
species_Chinstrap 45.0331 0.641 70.266 0.000 43.772 46.294
species_Gentoo 66.4449 0.520 127.769 0.000 65.422 67.468
==============================================================================
Omnibus: 1.423 Durbin-Watson: 2.268
Prob(Omnibus): 0.491 Jarque-Bera (JB): 1.470
Skew: 0.156 Prob(JB): 0.479
Kurtosis: 2.904 Cond. No. 2.10e+15
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.03e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
sum_sq df F PR(>F)
C(species) 50525.883615 2.0 567.406992 1.587418e-107
Residual 14692.753022 330.0 NaN NaN
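The analysis-of-variance table above can be obtained with the formula interface of statsmodels; a minimal sketch, assuming the cleaned data frame penguins introduced earlier (the variable name fit_species is ours):

import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# One-way ANOVA table with the formula interface; C(species) treats species as a factor
fit_species = smf.ols('flipper_length_mm ~ C(species)', data=penguins).fit()
print(anova_lm(fit_species, typ=2))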
\[\hat{\beta} \sim \mathcal{N}(\beta, \sigma^2 \left ( X^\intercal X\right )^{-1} ),\]
so that \[ \sigma^{-1} \left ( X^\intercal X\right )^{1/2} (\hat{\beta}-\beta) \sim \mathcal{N}(0, I_p).\]
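These distributions are what the standard errors and confidence intervals of the summary tables are built on. A quick numerical check for the first model (a sketch reusing hat_sigma2 and XT_X_inv computed earlier):

import numpy as np

# Estimated covariance matrix of beta_hat: hat_sigma2 * (X'X)^{-1}
cov_beta = hat_sigma2 * XT_X_inv

# Standard errors are the square roots of its diagonal; they should match the
# 'std err' column (const, body_mass_g, bill_length_mm) of the first summary table
print(np.sqrt(np.diag(cov_beta)))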