Multiple Linear Regression

Introduction

Simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable. However, in practice we often have more than one predictor. In general, the response \(y\) may be related to \(k\) regressor or predictor variables. Instead of fitting a separate simple linear regression model for each predictor, a better approach is to extend the simple linear regression model so that it can directly accommodate multiple predictors. We can do this by giving each predictor a separate slope coefficient in a single model.

We assume that the relationship between features and the target is approximately linear, i.e., the conditional mean $E[Y \vert X=\mathbf{x}]$ can be expressed as a weighted sum of the features $\mathbf{x}$. This setup allows the target value to deviate from its expected value on account of observation noise. Next, we can impose the assumption that any such noise is well behaved, following a Gaussian distribution.

In general, suppose that we have \(k\) distinct predictors. Then the multiple linear regression model takes the form:

\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_kX_k + \epsilon \tag{1}\]

where \(X_j\) represents the \(j\)th predictor, and \(\beta_j\) quantifies the association between that variable and the response \(Y\).

This model describes a hyperplane in the $k$-dimensional space of the regressor variables $X_j$. The parameter $\beta_j$ represents the expected change in the response $Y$ per unit change in $X_j$ when all of the remaining regressor variables $X_i$, $i \neq j$, are held constant. For this reason, the parameters $\beta_j$, $j = 1, 2, \ldots, k$, are often called partial regression coefficients.
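
As a concrete (hypothetical) illustration, the sketch below simulates observations from a model with $k = 2$ regressors using NumPy; holding $x_2$ fixed, a one-unit change in $x_1$ shifts the expected response by $\beta_1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" coefficients: intercept, beta_1, beta_2
beta = np.array([2.0, 1.5, -0.7])

n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
noise = rng.normal(0, 1.0, n)          # epsilon ~ N(0, sigma^2)

# Y = beta_0 + beta_1*X_1 + beta_2*X_2 + epsilon
y = beta[0] + beta[1] * x1 + beta[2] * x2 + noise

# Holding x2 fixed, a one-unit increase in x1 shifts E[Y] by beta_1 = 1.5
```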

Models that are more complex in structure than Eq. (1) may often still be analyzed by multiple linear regression techniques. For example, consider the cubic polynomial model:

\[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon \tag{2}\]

If we let \(X_1 = X\), \(X_2 = X^2\), and \(X_3 = X^3\), then Eq. (2) can be written as:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon\]

which is a multiple linear regression model with three regressor variables.
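
As a minimal sketch of this device (hypothetical data, NumPy only), we can build the columns $X$, $X^2$, $X^3$ explicitly and fit them by ordinary least squares as if they were three unrelated regressors:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 50)
y = 1.0 + 0.5 * x - 2.0 * x**2 + 0.3 * x**3 + rng.normal(0, 0.2, 50)

# Treat X, X^2, X^3 as three separate regressors X_1, X_2, X_3
X = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Ordinary least squares on the augmented design matrix
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # estimates of beta_0, beta_1, beta_2, beta_3
```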

Models that include interaction effects may also be analyzed by multiple linear regression methods. For example, suppose that the model is:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_{12} X_1 X_2 + \epsilon \tag{3}\]

If we let \(X_3 = X_1 X_2\) and \(\beta_3 = \beta_{12}\), then Eq. (3) can be written as:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon\]

which is a linear regression model.

Note: Although this model is a linear regression model, the shape of the surface that is generated by the model is not linear. In general, any regression model that is linear in the parameters (the $\beta$'s) is a linear regression model, regardless of the shape of the surface that it generates.
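
In the same spirit, the interaction model of Eq. (3) amounts to adding one more column, $X_1 X_2$, to the design matrix; a sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.uniform(0, 5, 80)
x2 = rng.uniform(0, 5, 80)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(0, 0.3, 80)

# X_3 = X_1 * X_2 turns the interaction model into a linear regression
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # estimates of beta_0, beta_1, beta_2, beta_12
```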

The Normal Distribution and Squared Loss

One way to motivate linear regression with squared loss is to assume that observations arise from noisy measurements, where the noise $\epsilon$ follows the normal distribution $\mathcal{N}(0, \sigma^2)$. Thus, we can write out the likelihood of seeing a particular $y_i$ for a given $\mathbf{x}_i$ via

\[P(y_i \mid \mathbf{x}_i) = \frac{1}{\sigma \sqrt{2 \pi}} \exp{\left( - \frac{1}{2\sigma^2} \left(y_i - \mathbf{\beta}^T\mathbf{x}_i \right)^2 \right)}\]

According to the principle of maximum likelihood, the best values of parameters $\mathbf{\beta}$ are those that maximize the likelihood of the entire dataset:

\[P(y \mid \mathbf{X}) = \prod_{i=1}^{n} P(y_i \mid \mathbf{x}_i)\]

The equality follows since all pairs $(\mathbf{x}_i, y_i)$ were drawn independently of each other. We can equivalently minimize the negative log-likelihood, which we can express as follows:

\[- \log P(y \mid \mathbf{X}) = \sum_{i=1}^{n} \left[ \frac{1}{2}\log{(2 \pi \sigma^2)} + \frac{1}{2\sigma^2}\left(y_i - \mathbf{\beta}^T\mathbf{x}_i \right)^2 \right]\]

If we assume that $\sigma$ is fixed, we can ignore the first term, because it does not depend on $\mathbf{\beta}$. The second term is identical to the squared error loss introduced earlier, except for the multiplicative constant $\frac{1}{\sigma^2}$. Fortunately, the solution does not depend on $\sigma$ either. It follows that minimizing the mean squared error is equivalent to the maximum likelihood estimation of a linear model under the assumption of additive Gaussian noise.
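
This equivalence is easy to check numerically: minimizing the negative log-likelihood with a generic optimizer recovers, up to numerical tolerance, the same coefficients as least squares. A sketch using NumPy and SciPy (hypothetical data, with $\sigma$ held fixed at 1):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 1.0, n)

sigma = 1.0  # assumed known and fixed

def neg_log_likelihood(beta):
    resid = y - X @ beta
    return n * 0.5 * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_mle, beta_ols, atol=1e-4))   # expected: True
```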

Estimating the Coefficients by the Method of Least Squares

We may write the sample regression model corresponding to Eq. (1) as

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik} + \epsilon_i, \quad i = 1, 2, \ldots, n \tag{4}\]

or

\[y_i = \beta_0 + \sum_{j=1}^{k}\beta_j x_{ij} + \epsilon_i, \quad i = 1, 2, \ldots, n \tag{5}\]

The least-squares function is

\[S = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij}\right)^2 \tag{6}\]

The function \(S\) must be minimized with respect to \(\beta_0\), \(\beta_1\), …, \(\beta_k\). The least-squares estimators of \(\beta_0\), \(\beta_1\), …, \(\beta_k\) must satisfy:

\[\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{ij}\right) = 0 \tag{7}\]

and

\[\frac{\partial S}{\partial \beta_j} = -2 \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{ij}\right)x_{ij} = 0, \quad j = 1, 2, \ldots, k \tag{8}\]

Simplifying Eq. (7) and Eq. (8), we obtain the least-squares normal equations. Note that there are \(p = k + 1\) normal equations, one for each of the unknown regression coefficients. The solution to the normal equations will be the least-squares estimators \(\hat{\beta}_0\), \(\hat{\beta}_1\), …, \(\hat{\beta}_k\).
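
A useful way to read the normal equations is that the fitted residuals are orthogonal to every column of the design matrix: Eq. (7) forces the residuals to sum to zero, and Eq. (8) forces them to be orthogonal to each regressor. A quick numerical check (hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # p = k + 1 columns
y = X @ rng.normal(size=k + 1) + rng.normal(0, 1.0, n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat

# Eq. (7): residuals sum to zero; Eq. (8): residuals are orthogonal to each x_j
print(np.allclose(X.T @ e, 0.0, atol=1e-8))   # expected: True
```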

It is more convenient to deal with multiple regression models if they are expressed in matrix notation. This allows a very compact display of the model, data, and results. In matrix notation, the model given by Eq. (1) is:

\[Y = X\beta + \epsilon \tag{9}\]

where:

\(Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},\) \(X = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1k} \\ 1 & x_{21} & x_{22} & \dots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nk} \end{bmatrix},\) \(\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix},\) \(\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}\)

In general, \(Y\) is an \(n \times 1\) vector of the observations, \(X\) is an \(n \times p\) matrix of the levels of the regressor variables, \(\beta\) is a \(p \times 1\) vector of the regression coefficients, and \(\epsilon\) is an \(n \times 1\) vector of random errors.

We wish to find the vector of least-squares estimators, \(\hat{\beta}\), that minimizes:

\[S(\beta) = \sum_{i=1}^{n} \epsilon_i^2 = \epsilon^T\epsilon = (y - X\beta)^T(y - X\beta)\]

Note that \(S(\beta)\) may be expressed as:

\[S(\beta) = y^Ty - \beta^TX^Ty - y^TX\beta + \beta^T X^T X \beta\]

Since \(\beta^T X^Ty\) is a \(1 \times 1\) matrix, that is, a scalar, and its transpose \((\beta^T X^Ty)^T = y^TX\beta\) is the same scalar, we can write:

\[S(\beta) = y^Ty - 2 \beta^T X^Ty + \beta^TX^TX\beta\]

The least-squares estimators must satisfy:

\[\frac{\partial S}{\partial \beta} = -2X^Ty + 2X^TX\hat{\beta} = 0\]

which simplifies to:

\[X^TX\hat{\beta} = X^Ty \tag{10}\]

Equation (10) gives the least-squares normal equations; it is the matrix analogue of the scalar normal equations obtained from Eqs. (7) and (8).

To solve the normal equations, multiply both sides of (10) by the inverse of \(X^TX\). Thus, the least-squares estimator of \(\beta\) is:

\[\hat{\beta} = (X^TX)^{-1}X^Ty \tag{11}\]

provided that the inverse matrix \((X^TX)^{-1}\) exists.

The $(X^TX)^{-1}$ matrix will always exist if the regressors are linearly independent, that is, if no column of the $X$ matrix is a linear combination of the other columns.
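
Eq. (11) translates directly into NumPy. The sketch below (hypothetical data) forms the estimator both by explicitly inverting \(X^TX\) and, preferably, by solving the normal equations with np.linalg.solve, which is numerically more stable:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])    # n x p design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.5, n)

# beta_hat = (X^T X)^{-1} X^T y, assuming X^T X is invertible
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically preferable equivalent: solve the normal equations directly
beta_hat_solve = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_hat, beta_hat_solve))   # expected: True
```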

The vector of fitted values \(\hat{y}\) corresponding to the observed values \(y\) is:

\[\hat{y} = X\hat{\beta} = X(X^TX)^{-1}X^Ty = Hy \tag{12}\]

The \(n \times n\) matrix \(H = X(X^TX)^{-1}X^T\) is usually called the hat matrix. It maps the vector of observed values into a vector of fitted values.

The difference between the observed value \(y_i\) and the corresponding fitted value \(\hat{y}_i\) is the residual \(e_i = y_i - \hat{y}_i\). The \(n\) residuals may be conveniently written in matrix notation as:

\[e = y - \hat{y} \tag{13}\]

There are several other ways to express the vector of residuals \(e\) that will prove useful, including:

\[e = y - X\hat{\beta} = y - Hy = (I - H) y\]
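
These identities are straightforward to verify numerically: in the sketch below (hypothetical data), $H$ is symmetric and idempotent, and $(I - H)y$ reproduces $y - X\hat{\beta}$:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, -1.5]) + rng.normal(0, 1.0, n)

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix, n x n
y_hat = H @ y                              # fitted values, Eq. (12)
e = (np.eye(n) - H) @ y                    # residuals, e = (I - H) y

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(H, H.T))                 # H is symmetric
print(np.allclose(H @ H, H))               # H is idempotent
print(np.allclose(e, y - X @ beta_hat))    # e = y - X beta_hat
```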

Properties of the Least-Squares Estimators

\[E(\hat{\beta}) = E \left[(X^TX)^{-1}X^Ty \right] = E \left[(X^TX)^{-1}X^T(X\beta+\epsilon) \right] = E \left[(X^TX)^{-1}X^TX\beta + (X^TX)^{-1}X^T\epsilon \right] = \beta\]

since \(E(\epsilon) = 0\) and \((X^TX)^{-1}X^TX = I\). Thus, \(\hat{\beta}\) is an unbiased estimator of \(\beta\) if the model is correct.

\[Var(\hat{\beta}) = Var \left[(X^TX)^{-1}X^Ty \right] = (X^TX)^{-1}X^T Var(y) \left[(X^TX)^{-1}X^T \right] ^T = \sigma^2 (X^TX)^{-1}X^TX(X^TX)^{-1} = \sigma^2 (X^TX)^{-1}\]

Here \(Var(y) = Var(\epsilon) = \sigma^2 I\), since the errors are assumed uncorrelated with constant variance. Therefore, if we let \(C = (X^TX)^{-1}\), the variance of \(\hat{\beta}_j\) is \(\sigma^2C_{jj}\), and the covariance between \(\hat{\beta}_i\) and \(\hat{\beta}_j\) is \(\sigma^2C_{ij}\).
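
Both properties can be illustrated by simulation: with the design matrix held fixed and fresh noise drawn in each replication, the average of $\hat{\beta}$ approaches $\beta$ and its sample covariance approaches $\sigma^2 (X^TX)^{-1}$. A sketch with hypothetical parameters:

```python
import numpy as np

rng = np.random.default_rng(7)
n, sigma = 60, 0.8
beta = np.array([1.0, 2.0, -1.0])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # fixed design

estimates = []
for _ in range(5000):
    y = X @ beta + rng.normal(0, sigma, n)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
estimates = np.array(estimates)

print(estimates.mean(axis=0))              # close to beta (unbiasedness)
print(np.cov(estimates, rowvar=False))     # close to sigma^2 (X^T X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))
```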

Estimation of \(\sigma^2\)

As in simple linear regression, we may develop an estimator of \(\sigma^2\) from the residual sum of squares:

\[SS_{\text{Res}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 = e^Te\]

Substituting \(e = y -X\hat{\beta}\), we have:

\[SS_{\text{Res}} = (y - X\hat{\beta})^T(y - X\hat{\beta}) = y^Ty - \hat{\beta}^TX^Ty - y^TX\hat{\beta} + \hat{\beta}^TX^TX\hat{\beta} = y^Ty -2 \hat{\beta}^TX^Ty + \hat{\beta}^TX^TX\hat{\beta}\]

Since \(X^TX\hat{\beta} = X^Ty\), this becomes:

\[SS_{\text{Res}} = y^Ty - \hat{\beta}^TX^Ty\]

The residual sum of squares has \(n − p\) degrees of freedom associated with it since \(p\) parameters are estimated in the regression model. The residual mean square is

\[MS_{\text{Res}} = \frac{SS_{\text{Res}}}{n - p} = \hat{\sigma}^2\]
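
In code, the estimator is just the residual sum of squares divided by $n - p$; in the sketch below (hypothetical data with true $\sigma = 0.5$), $\hat{\sigma}^2$ should come out near $0.25$:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(0, 0.5, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

ss_res = e @ e                      # SS_Res = e^T e
sigma2_hat = ss_res / (n - p)       # MS_Res, estimator of sigma^2
print(sigma2_hat)                   # roughly 0.25
```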

The Bias-Variance Trade-off

We can write the mean squared error as the sum of the squared bias, the variance of the estimator, and the irreducible error.

\[\begin{eqnarray} \text{MSE}(\hat{f}, f) & = & E[(\hat{f} - f)^2] \\ & = & E[\hat{f}^2] + E[f^2] - 2 E[\hat{f} f] \\ & = & \text{Var}(\hat{f}) + E^2(\hat{f}) + \text{Var}(f) + E^2(f) - 2 E[\hat{f}] E[f] \\ & = & \left( E[\hat{f}] - E[f] \right)^2 + \text{Var}(\hat{f}) + \text{Var}(f) \\ & = & \text{bias}^2(\hat{f}) + \text{Var}(\hat{f}) + \text{Var}(f) \end{eqnarray}\]

The cross term simplifies because the estimate $\hat{f}$, which depends only on the training data, is independent of the noise in $f$ at the evaluation point, so $E[\hat{f} f] = E[\hat{f}] E[f]$. We refer to this decomposition as the bias-variance trade-off: the mean squared error splits into three sources of error, the error from high bias, the error from high variance, and the irreducible error. Bias error is commonly seen in a simple model (such as a linear regression model), which cannot capture complex relations between the features and the outputs; if a model suffers from high bias, we say it is underfitting or lacks flexibility. High variance usually results from an overly complex model that overfits the training data; such a model is sensitive to small fluctuations in the data, and we say it is overfitting or lacks generalization. The irreducible error results from the noise in $f$ itself.
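
The decomposition can be made concrete with a small simulation: fit an underfitting (degree-1) and a much more flexible (degree-9) polynomial to repeated noisy samples of the same underlying curve and compare the bias and variance of the prediction at a fixed test point. A sketch with a hypothetical target function:

```python
import numpy as np

rng = np.random.default_rng(9)
f = lambda x: np.sin(2 * np.pi * x)       # hypothetical true function
x_train = np.linspace(0, 1, 20)
x_test = 0.3                              # single test point, for clarity
sigma = 0.3

def bias_variance(degree, n_reps=2000):
    preds = []
    for _ in range(n_reps):
        y_train = f(x_train) + rng.normal(0, sigma, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds.append(np.polyval(coefs, x_test))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x_test)) ** 2
    return bias2, preds.var()

for d in (1, 9):
    b2, var = bias_variance(d)
    print(f"degree {d}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

The simple (degree-1) fit should show a large squared bias and small variance, while the flexible fit shows the reverse, matching the discussion above.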

Multicollinearity