Regression is a method for studying the relationship between $y$ and $X$, where $X$ is generally considered to be a $k$-dimensional vector,
\begin{eqnarray}
X=\left[
\begin{array}{c}
x_1\\
\vdots\\
x_k
\end{array}
\right]
\end{eqnarray}
and $y$ is a scalar.
Least Squares
In the most basic form of the problem, we are given a sample consisting of $n$ data points
\begin{eqnarray}
(y_1, X_1), \ldots, (y_n,X_n)
\end{eqnarray}
At this stage, we make no assumptions about these data points, i.e., we do not assume that they are generated by a particular process or that they satisfy any statistical properties. The problem is to find the $\beta_1,\ldots,\beta_k$ in the model
\begin{eqnarray}
y = \beta_1 x_1 +\ldots+\beta_k x_k
\end{eqnarray}
that minimize the squared error
\begin{eqnarray}
L(\beta) = \sum_{i=1}^n (y_i - \beta_1 x_{1,i} - \ldots - \beta_k x_{k,i})^2.
\end{eqnarray}
It is convenient to redefine this problem in matrix notation. If we define
\begin{eqnarray}
\beta=\left[
\begin{array}{c}
\beta_1\\
\vdots\\
\beta_k
\end{array}
\right],\quad \mathbb{X}=\left[
\begin{array}{ccc}
x_{1,1}&\ldots&x_{k,1}\\
\vdots&\ddots&\vdots\\
x_{1,n}&\ldots&x_{k,n}
\end{array}
\right],\quad Y=\left[
\begin{array}{c}
y_1\\
\vdots\\
y_n
\end{array}
\right]
\end{eqnarray}
we can express the squared error $L(\beta)$ more compactly as
\begin{eqnarray}
L(\beta)&=&(Y-\mathbb{X}\beta)^T (Y-\mathbb{X}\beta)\\
&=&Y^TY-\beta^T \mathbb{X}^T Y - Y^T \mathbb{X}\beta + \beta^T \mathbb{X}^T \mathbb{X} \beta
\end{eqnarray}
Note that the terms $\beta^T \mathbb{X}^T Y$ and $Y^T \mathbb{X}\beta$ are scalars and they are transposes of each other, hence they must be equal. Therefore,
\begin{eqnarray}
L(\beta)= Y^TY-2\beta^T \mathbb{X}^T Y + \beta^T \mathbb{X}^T \mathbb{X} \beta
\end{eqnarray}
To minimize $L$, we differentiate with respect to $\beta$ and equate the result to zero, as usual:
\begin{eqnarray}
\frac{\partial L(\beta)}{\partial \beta} = -2\mathbb{X}^T Y + 2\mathbb{X}^T \mathbb{X} \beta=0
\end{eqnarray}
or
\[
\boxed{\hat{\beta}=(\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T Y}
\]
Note that for $\mathbb{X}^T \mathbb{X}$ to be invertible, $\mathbb{X}$ must have full column rank. This assertion can be proven via the SVD.
The matrix $H=\mathbb{X}(\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T$, which maps $Y$ to the fitted values $\hat{Y}=\mathbb{X}\hat{\beta}$, is known as the “hat matrix”. It is an orthogonal projection matrix onto the column space of $\mathbb{X}$.
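As a concrete illustration, here is a minimal numerical sketch of the boxed formula on a small made-up data set (the arrays below are arbitrary, not from the text). In practice one would solve the system with `np.linalg.lstsq` rather than forming $(\mathbb{X}^T \mathbb{X})^{-1}$ explicitly, but both are shown:
```python
import numpy as np

# Synthetic example: n = 5 observations, k = 2 regressors (made-up numbers).
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0],
              [1.0, 7.0],
              [1.0, 9.0]])          # design matrix, one row per observation
Y = np.array([2.1, 2.9, 5.2, 6.8, 9.1])

# beta_hat = (X^T X)^{-1} X^T Y  -- the boxed least-squares solution.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

# Equivalent and numerically preferable: a least-squares solver.
beta_hat_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Orthogonal projection onto the column space of X ("hat matrix"):
# H = X (X^T X)^{-1} X^T, so that Y_hat = H Y.
H = X @ np.linalg.inv(X.T @ X) @ X.T
Y_hat = H @ Y

print(beta_hat, beta_hat_lstsq, Y_hat)
```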
Least Squares: Nonlinear Models
Linear models may not be appropriate for some data. For such data, we can use nonlinear models. But these models must still be linear in $\beta$. We will illustrate this with an example: Let us try to fit the nonlinear model
\begin{eqnarray}
y^2= \beta_1 + \beta_2 x_1 x_2 + \beta_3 x_2^2
\end{eqnarray}
If we have $n$ data points
\begin{eqnarray}
(y_1, x_{1,1}, x_{2,1}), \ldots, (y_n, x_{1,n}, x_{2,n})
\end{eqnarray}
we form the matrices
\begin{eqnarray}
\beta=\left[
\begin{array}{c}
\beta_1\\
\beta_2\\
\beta_3
\end{array}
\right],\quad \mathbb{X}=\left[
\begin{array}{ccc}
1 & x_{1,1}x_{2,1} & x_{2,1}^2\\
\vdots&\vdots&\vdots\\
1 & x_{1,n}x_{2,n} & x_{2,n}^2
\end{array}
\right],\quad Y=\left[
\begin{array}{c}
y_1^2\\
\vdots\\
y_n^2
\end{array}
\right]
\end{eqnarray}
The rest of the analysis is the same.
The only restriction in LS analysis is that the model must be linear in $\beta$. It may be nonlinear in $X$.
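As a sketch of how such a model is fit in practice, the following assembles the design matrix for $y^2=\beta_1+\beta_2 x_1 x_2+\beta_3 x_2^2$ from hypothetical raw arrays `x1`, `x2`, `y` (made-up data, not from the text):
```python
import numpy as np

# Hypothetical raw data arrays of length n.
x1 = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
x2 = np.array([1.0, 0.8, 1.2, 0.9, 1.1])
y  = np.array([1.1, 1.3, 2.0, 1.7, 2.2])

# Columns of the design matrix: 1, x1*x2, x2^2 (linear in beta, nonlinear in X).
Xmat = np.column_stack([np.ones_like(x1), x1 * x2, x2**2])
Ysq  = y**2                      # the "response" in this model is y^2

beta_hat, *_ = np.linalg.lstsq(Xmat, Ysq, rcond=None)
print(beta_hat)                  # estimates of beta_1, beta_2, beta_3
```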
OLS Estimators
Up till now, we haven't assumed anything about how the data points $(y_i, X_i)$ are generated. Now we will assume that the data are generated by a stochastic model satisfying the following conditions:
- The model is linear in parameters $\beta_i$, plus a random variable, $\varepsilon$, which is called the “error” or “noise”. Some examples are
\begin{eqnarray}
y_i=\beta_1 x_{1,i} + \ldots + \beta_k x_{k,i} + \varepsilon_i
\end{eqnarray}
or,
\begin{eqnarray}
y_i=\beta_1 \cos(x_{1,i} x_{2,i}) + \ldots + \beta_k x_{k,i}^2 + \varepsilon_i
\end{eqnarray}
- The matrix $\mathbb{X}$ has full rank. In other words, no column of $\mathbb{X}$ (i.e., no regressor) may be a linear combination of the other columns. This condition is known as “no perfect collinearity” and is necessary for the existence of $(\mathbb{X}^T \mathbb{X})^{-1}$. The matrix $\mathbb{X}$ can be deterministic or stochastic. For the deterministic case, we may consider a controlled experiment where, for example, we measure the response to doses of 1, 2, 3, 4 and 5 grams of a certain medicine. For the stochastic case, we might try to estimate a stock market index from inflation and GDP.
- The noise $\varepsilon_i$ has zero mean, independent of $X_k$: $E(\varepsilon_i|X_k)=0$. This is known as “strict exogeneity”. Note that strict exogeneity rules out the presence of lagged dependent variables among the explanatory variables, as $y_{i-1}$ will necessarily correlate with $\varepsilon_{i-1}$.
- Homoskedasticity: the variance of the noise $\varepsilon_i$ is constant, independent of $X_k$: $E(\varepsilon_i^2|X_k)=\sigma^2$.
- The noise is uncorrelated across observations, so that together with homoskedasticity $E(\varepsilon_i \varepsilon_j )=\sigma^2 \delta_{ij}$.
These assumptions about the model are collectively known as the Gauss-Markov assumptions.
Note: If we define
\begin{eqnarray}
\varepsilon =\left[
\begin{array}{c}
\varepsilon_1\\
\vdots\\
\varepsilon_n
\end{array}
\right]
\end{eqnarray}
then the last two Gauss-Markov conditions can be compressed into a single formula as
\begin{eqnarray}
E(\varepsilon \varepsilon^T)=\sigma^2 I
\end{eqnarray}
Assume that we have a sample of $n$ data points produced in accordance with Gauss-Markov assumptions. The problem is to find an estimator
which will estimate the coefficients $\beta$ from the sample.
Theorem: With the Gauss-Markov assumptions, the mean and variance of the OLS estimator $\hat{\beta}=(\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T Y$ are:
\[
\boxed{E(\hat{\beta})= \beta \qquad
V(\hat{\beta})= \sigma^2 (\mathbb{X}^T \mathbb{X})^{-1} }
\]
Proof: Note that, by the model assumptions written in matrix form, $Y=\mathbb{X}\beta+\varepsilon$. Substituting,
\begin{eqnarray}
\hat{\beta}&=&(\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T Y\\
&=&(\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T(\mathbb{X}\beta+\varepsilon)\\
&=&\beta+(\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T\varepsilon
\end{eqnarray}
Hence
\begin{eqnarray}
E(\hat{\beta})&=&E[\beta+(\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T\varepsilon]\\
&=& \beta + (\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^TE(\varepsilon)\\
&=& \beta
\end{eqnarray}
which proves that LS is unbiased. Note that, from the expansion above,
\begin{eqnarray}
\hat{\beta}-\beta=(\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T\varepsilon
\end{eqnarray}
Then
\begin{eqnarray}
V(\hat{\beta})&=&E[(\hat{\beta}-\beta)(\hat{\beta}-\beta)^T]\\
&=& E[(\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T\varepsilon \varepsilon^T\mathbb{X}(\mathbb{X}^T \mathbb{X})^{-1}]\\
&=& (\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^TE[\varepsilon \varepsilon^T]\mathbb{X}(\mathbb{X}^T \mathbb{X})^{-1}
\end{eqnarray}
As $E[\varepsilon \varepsilon^T]=\sigma^2 I$ by Gauss-Markov assumptions, we get
\begin{eqnarray}
V(\hat{\beta})=\sigma^2 (\mathbb{X}^T \mathbb{X})^{-1}
\end{eqnarray}
which gives the desired result.
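Both formulas can be checked by simulation. The following is a minimal sketch assuming a fixed design matrix and, for concreteness, normally distributed noise (which satisfies the Gauss-Markov conditions); all the specific numbers are arbitrary:
```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma = 50, 3, 0.5
beta_true = np.array([1.0, -2.0, 0.5])

X = np.column_stack([np.ones(n), rng.uniform(0, 1, size=(n, k - 1))])  # fixed design
XtX_inv = np.linalg.inv(X.T @ X)

# Draw many samples Y = X beta + eps and estimate beta each time.
reps = 20000
betas = np.empty((reps, k))
for r in range(reps):
    eps = rng.normal(0.0, sigma, size=n)      # zero mean, homoskedastic, uncorrelated
    Y = X @ beta_true + eps
    betas[r] = XtX_inv @ X.T @ Y

print(betas.mean(axis=0))            # should be close to beta_true (unbiasedness)
print(np.cov(betas, rowvar=False))   # should be close to sigma^2 (X^T X)^{-1}
print(sigma**2 * XtX_inv)
```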
Theorem (Gauss-Markov): If a model satisfies the Gauss-Markov assumptions, the OLS estimator $\hat{\beta}=(\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T Y$ of its parameters $\beta$ is BLUE (i.e., the best linear unbiased estimator). By best, we mean efficient (i.e., with the smallest variance).
Proof: LS is linear in $Y$. We have already proved the unbiasedness of LS in the above theorem. What remains is to prove that LS is the best, i.e., that among all linear unbiased estimators, LS has the least variance. Consider another linear estimator
\begin{eqnarray}
\tilde{\beta} = CY
\end{eqnarray}
and unbiased
\begin{eqnarray}
E(\tilde{\beta}) = E(CY)=\beta
\end{eqnarray}
Expanding this formula
\begin{eqnarray}
E(CY)&=&E[C(\mathbb{X}\beta + \varepsilon )]\\
&=& C \mathbb{X}\beta \\
&=&\beta
\end{eqnarray}
As $\beta$ is completely general, this gives us an intermediate result:
\begin{eqnarray}
C \mathbb{X} = I
\end{eqnarray}
Now, let us compute $V[\tilde{\beta}]$, using the intermediate result $C\mathbb{X}=I$ whenever necessary
\begin{eqnarray}
V[\tilde{\beta}]&=&E[(\tilde{\beta}-\beta)(\tilde{\beta}-\beta)^T]\\
&=&E[(CY-\beta)(CY-\beta)^T]\\
&=&E[CYY^TC^T]-E[CY \beta^T]-E[\beta Y^T C^T] +\beta \beta^T\\
&=&E[C(\mathbb{X}\beta + \varepsilon )(\mathbb{X}\beta + \varepsilon )^TC^T]-\beta\beta^T\\
&=&C\mathbb{X}\beta \beta^T \mathbb{X}^T C^T + \sigma^2 CC^T -\beta\beta^T\\
&=&\sigma^2 CC^T
\end{eqnarray}
Now write $C=(\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T +D$. Then, using $C\mathbb{X}=I$,
\begin{eqnarray}
C\mathbb{X}\beta &=& [(\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T +D]\mathbb{X}\beta\\
&=& \beta + D\mathbb{X}\beta = \beta
\end{eqnarray}
Hence we have $D\mathbb{X}\beta=0$. As $\beta$ is general, we have $D\mathbb{X}=0$. Then
\begin{eqnarray}
CC^T &=& ((\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T +D)((\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T +D)^T\\
&=& ((\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T +D)(\mathbb{X}(\mathbb{X}^T \mathbb{X})^{-1} + D^T)\\
&=& (\mathbb{X}^T \mathbb{X})^{-1} +DD^T
\end{eqnarray}
where the cross terms vanish because $D\mathbb{X}=0$ (and hence $\mathbb{X}^T D^T=0$). Since $DD^T$ is positive semidefinite, $V[\tilde{\beta}]=\sigma^2 CC^T = \sigma^2(\mathbb{X}^T \mathbb{X})^{-1} + \sigma^2 DD^T \geq V[\hat{\beta}]$. This proves the theorem. $\Box$
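To see the theorem in action numerically, one can compare OLS against some other linear unbiased estimator. The sketch below uses, as a hypothetical competitor, an estimator that simply discards half of the sample; it is still linear in $Y$ and unbiased (its $C$ satisfies $C\mathbb{X}=I$), but its variance is larger:
```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 40, 1.0
beta_true = np.array([2.0, -1.0])
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])   # fixed design

# OLS uses all rows; the competing linear unbiased estimator uses only the first n//2 rows.
C_ols  = np.linalg.inv(X.T @ X) @ X.T
Xh     = X[: n // 2]
C_half = np.hstack([np.linalg.inv(Xh.T @ Xh) @ Xh.T, np.zeros((2, n - n // 2))])

reps = 20000
est_ols  = np.empty((reps, 2))
est_half = np.empty((reps, 2))
for r in range(reps):
    Y = X @ beta_true + rng.normal(0, sigma, n)
    est_ols[r]  = C_ols @ Y
    est_half[r] = C_half @ Y

# Both are (approximately) unbiased, but OLS has the smaller variance per coordinate.
print(est_ols.mean(axis=0), est_half.mean(axis=0))
print(est_ols.var(axis=0), est_half.var(axis=0))
```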
Recall the bias-variance dilemma: the LS estimator has the lowest variance among unbiased estimators, but it may be possible to find biased estimators with still lower variance.
The main drawback of OLS estimators is that they are very sensitive to outliers. There are many methods to make them robust. One such method is M-estimation. Another useful method is ANOVA.
Omitted Variable Bias
Distribution of OLS Estimators
In order to find confidence intervals or perform hypothesis tests on OLS estimators, we need to know more than the mean and variance formulas: we need to know the distribution of $\hat{\beta}$. There are two important cases:
- We assume the noise term $\varepsilon$ is normally distributed, i.e., $\varepsilon \sim N(0,\sigma^2)$. This assumption allows us to compute the finite-sample distribution of $\hat{\beta}$.
- We do not assume any particular distribution for $\varepsilon$, but assume a large sample size (i.e., $n\rightarrow\infty$). In this case, we can characterize the distribution of $\hat{\beta}$ via the central limit theorem.
Below, we will investigate both cases.
Normally distributed noise
Note that in this case we have
\begin{eqnarray}
Y | \mathbb{X} \sim N(\mathbb{X}\beta,\sigma^2 I)
\end{eqnarray}
Theorem: Subject to all the assumptions given above, the OLS estimator is exactly (in finite samples) normally distributed, with
\begin{eqnarray}
\hat{\beta} | \mathbb{X} \sim N(\beta, \sigma^2 (\mathbb{X}^T \mathbb{X})^{-1} )
\end{eqnarray}
Proof: $\hat{\beta}$ is a linear function of $Y$, which is normally distributed; a linear transformation of a normal random vector is again normal, with the mean and variance computed in the previous theorem.
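Given this result, confidence intervals for the coefficients follow directly from the diagonal of $\sigma^2(\mathbb{X}^T \mathbb{X})^{-1}$. A minimal sketch, assuming for simplicity that $\sigma^2$ is known (in practice it is estimated from the residuals, which leads to $t$-based intervals):
```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 30, 0.7
beta_true = np.array([1.5, 0.8])
X = np.column_stack([np.ones(n), rng.uniform(0, 5, n)])
Y = X @ beta_true + rng.normal(0, sigma, n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
se = sigma * np.sqrt(np.diag(XtX_inv))      # standard errors from sigma^2 (X^T X)^{-1}

# 95% confidence intervals using the normal quantile 1.96 (sigma assumed known).
for j, (b, s) in enumerate(zip(beta_hat, se)):
    print(f"beta_{j+1}: {b:.3f}  95% CI [{b - 1.96*s:.3f}, {b + 1.96*s:.3f}]")
```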
OLS estimation by Gradient Descent
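The squared-error loss can also be minimized iteratively instead of through the closed-form solution, using the gradient $\partial L/\partial \beta = -2\mathbb{X}^T Y + 2\mathbb{X}^T \mathbb{X}\beta$ derived earlier. A minimal sketch with a fixed step size (the step size and iteration count below are arbitrary choices):
```python
import numpy as np

def ols_gradient_descent(X, Y, step=1e-3, iters=5000):
    """Minimize L(beta) = ||Y - X beta||^2 by gradient descent."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -2 * X.T @ Y + 2 * X.T @ X @ beta   # gradient of the squared error
        beta -= step * grad
    return beta

# Example: approximately the same answer as the closed form, reached iteratively.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(20), rng.uniform(0, 1, 20)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, 20)
print(ols_gradient_descent(X, Y))
print(np.linalg.inv(X.T @ X) @ X.T @ Y)
```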
Stochastic Gradient Descent
Mini-Batches
Model Selection
In the regression model
\begin{eqnarray}
y= \beta_1 x_1 + \beta_2 x_2 + \ldots+\beta_k x_k
\end{eqnarray}
choosing the number $k$ is important. Generally, as we add more and more independent variables to the model, bias decreases and variance increases; this is called overfitting. On the other hand, too few covariates give high bias; this is called underfitting. Good predictions result from achieving a good balance between bias and variance.
$R^2$ test
The $R^2$ statistic measures how much of the variability in the data can be explained by the linear model:
\begin{eqnarray}
R^2=\frac{Var(mean)-Var(line)}{Var(mean)}
\end{eqnarray}
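Here, Var(mean) is the variability of the $y_i$ around their mean (the total sum of squares) and Var(line) is the remaining variability around the fitted line (the residual sum of squares). A minimal sketch of the computation on made-up data:
```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 50)
y = 3.0 + 0.7 * x + rng.normal(0, 1.0, 50)       # made-up data

X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

var_mean = np.sum((y - y.mean())**2)             # variability around the mean
var_line = np.sum((y - y_hat)**2)                # variability around the fitted line
R2 = (var_mean - var_line) / var_mean
print(R2)
```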