Linear Regression

Regression Model (1)

Posted by Fan Gong on Dec 12, 2019

Linear regression is usually the very first model people learn in the machine learning field. It tries to find the linear relationship between the independent and dependent variables. This post works as my review notes for capturing its important concepts.

1. Linear Regression

1.1 Model Assumptions

  1. The independent and dependent variables tend to have a linear relationship
  2. The features are not correlated with each other
  3. There should be no correlation between the error terms
  4. The error terms must have constant variance
  5. The error terms should follow a normal distribution

What if these assumptions get violated:

  1. If you fit a linear model to a non-linear, non-additive data set, the regression algorithm will fail to capture the trend mathematically, resulting in an inaccurate model
  2. Multicollinearity: It becomes difficult to find out which variable is actually contributing to the prediction of the response variable; it also produces large standard errors for the related independent variables (which also means wider confidence intervals)
  3. Autocorrelation: The presence of correlation in the error terms drastically reduces the model's accuracy. This usually occurs in time series models where the next instant depends on the previous instant. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard errors.
  4. Heteroskedasticity: Generally, non-constant variance arises in the presence of outliers or extreme leverage values. These values get too much weight, thereby disproportionately influencing the model's performance. When this phenomenon occurs, the confidence interval for out-of-sample prediction tends to be unrealistically wide or narrow.
  5. Non-normality of errors: OLS inference will not work well, and confidence intervals may become too wide or narrow

I think among all these assumptions, the normality of the error term is a relatively weak one (a small diagnostic sketch follows).
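To spot-check these assumptions in practice, here is a minimal sketch using statsmodels and scipy on simulated data (the simulated dataset, the chosen tests, and all variable names are my own illustrative assumptions, not part of the original notes):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

# Simulated data purely for illustration
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=200)

X = sm.add_constant(x)            # add the intercept column
results = sm.OLS(y, X).fit()
resid = results.resid

# Assumption 5 (normality of errors): Shapiro-Wilk test
w_stat, w_pvalue = stats.shapiro(resid)
print("Shapiro-Wilk p-value:", w_pvalue)

# Assumption 3 (no autocorrelation): Durbin-Watson close to 2 is good
print("Durbin-Watson:", durbin_watson(resid))

# Assumption 4 (constant variance): Breusch-Pagan test
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)
```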

1.2 Parameter Estimation

\[y_i = b_1x_i+ b_0 + \epsilon_i\]

There are two ways to estimate the parameters:

  • Ordinary least squares: \(\arg\min_{b_0 , b_1} \sum(y_i-\hat{y_i})^2\). If we deem it as a function of the parameters, we can take the partial derivative with respect to each parameter and set it equal to 0 to get the optimal estimates.
  • MLE: It is then an inference problem. We know that if the error term \(\epsilon \sim N(0,\sigma^2)\), then \(y_i \sim N(b_1x_i+ b_0, \sigma^2)\). So the pdf will be: \[p(y|x,b_0,b_1,\sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(y_i-b_0-b_1x_i)^2}{2\sigma^2}}\] Since the data are known, we can deem this pdf as the likelihood function of the parameters. Taking the log and dropping the constant terms, our goal is then to: \[\arg \max_{b_0,b_1} \sum_{i=1}^n -(y_i-b_0-b_1x_i)^2\] which is equivalent to: \[\arg \min_{b_0,b_1} \sum_{i=1}^n (y_i-b_0-b_1x_i)^2\]

So we can see that when the error term \(\epsilon\) follows a normal distribution, MLE is exactly the same as OLS. However, if the errors are not normal, MLE (based on the correct error distribution) should perform better. This situation happens when certain assumptions of OLS are not satisfied, for example when the error terms are not i.i.d. normal.
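As a quick numerical check of this equivalence, the sketch below (on simulated Gaussian data, which is my own illustrative setup) fits the same line once with the OLS normal equations and once by numerically maximizing the Gaussian log-likelihood; the two estimates should agree up to optimization tolerance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=100)

# OLS via the normal equations: beta = (X'X)^{-1} X'y
X = np.column_stack([np.ones_like(x), x])
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# MLE: minimize the negative Gaussian log-likelihood over (b0, b1, log_sigma)
def neg_log_lik(params):
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)
    mu = b0 + b1 * x
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                   - (y - mu) ** 2 / (2 * sigma ** 2))

beta_mle = minimize(neg_log_lik, x0=[0.0, 0.0, 0.0]).x[:2]

print("OLS estimate:", beta_ols)   # (intercept, slope)
print("MLE estimate:", beta_mle)   # should match OLS up to numerical tolerance
```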

1.3 Advantages/Disadvantages

Pros:

  • Linear regression is an extremely simple method. It is very easy to use, understand, and explain.
  • The best-fit line is the line with minimum squared error from all the points, and it is efficient to compute

Cons:

  • Linear regression only models relationships between dependent and independent variables that are linear. It assumes there is a straight-line relationship between them, which is sometimes incorrect.
  • Linear regression is very sensitive to the anomalies/outliers in the data
  • If the number of parameters is greater than the number of samples, the model starts to fit noise rather than the true relationship

1.4 Parameter Confidence Interval

I reviewed the inference part in this post, but I think I did not write a complete one. So here I am trying to re-write it and also combine it with the linear regression knowledge.

The intuition behind a confidence interval is that we want to know how confident we are about our estimate. So we want to find a statistic that has a known distribution and involves both the true value and our estimate, such that after transformation we can get (taking \(\mu\) as an example): \[P(\mu \in [\hat{\mu}-s, \hat{\mu}+s]) \ge 1-\alpha\]

a). Confidence Interval for the mean of a Gaussian

Suppose we have \(n\) observations \(x_i\) where \(x_1,...,x_n \sim N(\mu,\sigma^2)\), and our goal is to estimate \(\mu\).

  • If \(\sigma\) is known, the MLE of \(\mu\) is the sample mean: \[\hat{\mu}=\frac{1}{n}\sum_{i=1}^n x_i \sim N(\mu, \frac{\sigma^2}{n})\] and we can easily find: \[\frac{\sqrt{n}(\hat{\mu}-\mu)}{\sigma} \sim N(0,1)\] so the \(1-\alpha\) CI comes from: \[P(\frac{\sqrt{n}(\hat{\mu}-\mu)}{\sigma} \in [-s,s]) = 1-\alpha\] After transformation we get: \[P(\mu \in [\hat{\mu}-\frac{s\sigma}{\sqrt{n}} ,\hat{\mu}+\frac{s\sigma}{\sqrt{n}}]) = 1-\alpha\]

  • If \(\sigma\) is unknown, we first use MLE to get the sample standard deviation: \[\hat{\sigma}= \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2}\] Then we need to characterize the distribution of \(\frac{\sqrt{n}(\hat{\mu}-\mu)}{\hat{\sigma}}\). If we divide both the numerator and denominator by \(\sigma\), the numerator follows a standard normal distribution, while the scaled denominator satisfies \(\frac{n\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-1)\). So overall the statistic follows a \(t(n-1)\) distribution, and we can use the same approach as above to calculate the CI (a numerical sketch follows this list).
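Here is a short numerical sketch of the unknown-\(\sigma\) case using scipy's t distribution (the simulated data and the 95% level are illustrative assumptions; it uses the \(n-1\) denominator for the sample standard deviation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=30)    # true mu = 5 in this simulation
n = len(x)
alpha = 0.05

mu_hat = x.mean()
s = x.std(ddof=1)                              # sample std with n-1 denominator
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # t(n-1) critical value
half_width = t_crit * s / np.sqrt(n)

print(f"{1 - alpha:.0%} CI for mu: "
      f"[{mu_hat - half_width:.3f}, {mu_hat + half_width:.3f}]")
```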

b). Confidence Interval for Linear Regression

Using the logic above, we can also get the confidence intervals for all the parameter estimates in linear regression. I will not write down the math here, but it is easy to show that each parameter estimate follows either a normal or a t distribution, depending on whether the error term \(\epsilon\) has a known or unknown \(\sigma\).
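For instance, a minimal sketch with statsmodels, whose conf_int method returns these t-based intervals for the fitted coefficients (the simulated data and the 95% level are assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 1.5 * x + 0.5 + rng.normal(0, 2, size=100)

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# Since sigma is estimated from the residuals, the intervals are t-based
print(results.conf_int(alpha=0.05))   # rows: intercept, slope
```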

1.5 Bias-variance Trade-off

We know that for all machine learning models we need to minimize a loss; suppose we choose the squared loss here: \[L(y, \hat{y}) = \frac{1}{n}\sum(y_i-\hat{y_i})^2 = E[(y - \hat{y})^2]\]

If we use the trick of adding the mean value of \(\hat{y}\) and then subtracting it, we get: \[E[(y - \hat{y})^2] = E[(y-\bar{\hat{y}}+\bar{\hat{y}} -\hat{y})^2]\] Expanding, the cross term vanishes because \(E[\hat{y}] = \bar{\hat{y}}\), so we get: \[E[(y - \hat{y})^2] = E[(y-\bar{\hat{y}})^2] + E[(\hat{y}-\bar{\hat{y}})^2] = Bias^2 + Variance\]

Bias: The deviation of the average prediction from the true value; to lower bias we tend to use a more complicated model.

Variance: The variance of the predictions; to lower variance we tend to use a simpler model.

So you can see the trade-off here, and a good model will need to have both small bias and variance.
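A small simulation sketch of this trade-off: repeatedly fit an overly simple and an overly flexible polynomial to fresh noisy samples from the same curve, then compare the average squared bias and the average variance of the predictions. The true function, noise level, and polynomial degrees below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
true_f = np.sin                       # the "true" relationship in this simulation
x_grid = np.linspace(0, 3, 50)        # points at which we measure bias/variance

def bias_variance(degree, n_repeats=500, n_train=30, noise=0.3):
    preds = np.empty((n_repeats, len(x_grid)))
    for r in range(n_repeats):        # many training sets from the same process
        x = rng.uniform(0, 3, size=n_train)
        y = true_f(x) + rng.normal(0, noise, size=n_train)
        coefs = np.polyfit(x, y, deg=degree)
        preds[r] = np.polyval(coefs, x_grid)
    avg_pred = preds.mean(axis=0)
    bias_sq = np.mean((avg_pred - true_f(x_grid)) ** 2)  # average squared bias
    variance = np.mean(preds.var(axis=0))                # average variance
    return bias_sq, variance

for degree in (1, 6):
    b, v = bias_variance(degree)
    print(f"degree={degree}  bias^2={b:.4f}  variance={v:.4f}")
```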

1.6 Some statistics to evaluate the performance

Coefficient of determination (\(R^2\))

\[R^2 = 1 - \frac{SS_{res}}{SS_{tot}}=\frac{SS_{reg}}{SS_{tot}}\] Where \(SS_{res}=\sum(y_i - \hat{y_i})^2\), \(SS_{reg}=\sum(\hat{y_i}-\bar{y})^2\), \(SS_{tot}=\sum(y_i - \bar{y})^2\)

So we can deem it as the percentage of variance explained by the model. But a high \(R^2\) may result from overfitting; also, as we increase the number of features, \(R^2\) will always increase because more of the variance gets explained.

Adjusted \(R^2\): To solve the problem that \(R^2\) tends to favor more complicated models, adjusted \(R^2\) penalizes you for adding independent variables. \[Adjusted \ R^2 = 1 - \frac{\frac{SS_{res}}{n-p-1}}{\frac{SS_{tot}}{n-1}}\]
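A short sketch computing \(R^2\) and adjusted \(R^2\) directly from the sums of squares above (the data are simulated, and \(p\) counts the predictors excluding the intercept):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(0, 1, size=n)

# OLS fit with an intercept via the normal equations
X1 = np.column_stack([np.ones(n), X])
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
y_hat = X1 @ beta

ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))
print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")
```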

AIC and BIC

AIC and BIC are penalized-likelihood criteria.

Suppose that we have a statistical model of some data. Let \(p\) be the number of estimated parameters in the model, \( {\hat {L}}\) be the maximum value of the likelihood function for the model, and \(n\) be the number of data points. Then the AIC and BIC values of the model are the following: \[{AIC} \,=\,2p-2\ln({\hat {L}})\] \[{BIC} \,=\,\ln(n)\,p-2\ln({\hat {L}})\]

So despite various subtle theoretical differences, their only difference in practice is the size of the penalty; BIC penalizes model complexity more heavily.
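Here is a sketch of computing AIC and BIC for a Gaussian linear model, where the maximized log-likelihood is evaluated at the MLE variance \(SS_{res}/n\). Counting the error variance in \(p\) (as done below) is one common convention; some software counts only the regression coefficients.

```python
import numpy as np

def aic_bic(y, y_hat, num_params):
    """AIC/BIC for a Gaussian linear model; num_params counts every
    estimated parameter (here: intercept, slope, and the error variance)."""
    n = len(y)
    sigma2_mle = np.sum((y - y_hat) ** 2) / n
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2_mle) + 1)  # maximized log-likelihood
    aic = 2 * num_params - 2 * log_lik
    bic = np.log(n) * num_params - 2 * log_lik
    return aic, bic

# Toy usage on simulated data (illustrative only)
rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=50)
X1 = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(aic_bic(y, X1 @ beta, num_params=3))   # b0, b1, sigma^2
```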

2. Regression with Ridge and Lasso

Ridge uses L2 penalty when fitting the model and lasso uses L1 penalty.

  • For ridge: \[\hat{\beta_{ridge}}=\arg\min \sum_{i=1}^{n} L(y_i,\hat{y_i})+ \lambda \sum_{j=1}^p \beta_j^2\]

  • For lasso: \[\hat{\beta_{lasso}}=\arg\min \sum_{i=1}^{n} L(y_i,\hat{y_i})+ \lambda \sum_{j=1}^p |\beta_j|\]

\(\lambda\) is the tuning parameter that decides how much we want to penalize the flexibility of our model. As \(\lambda\) increases, the impact of the shrinkage penalty grows, and the regression coefficient estimates approach zero. Selecting a good value of \(\lambda\) is critical, and cross validation comes in handy for this purpose (see the sketch below); the penalty terms themselves are the L2 (ridge) and L1 (lasso) norms of the coefficient vector.
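Here is a sketch of choosing \(\lambda\) (called alpha in scikit-learn) by cross validation with RidgeCV and LassoCV; the alpha grid, fold count, and simulated sparse data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(7)
n, p = 200, 10
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0] + [0.0] * (p - 2))   # only two real signals
y = X @ true_beta + rng.normal(0, 1, size=n)

alphas = np.logspace(-3, 3, 50)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("ridge alpha:", ridge.alpha_, "coefficients:", np.round(ridge.coef_, 2))
print("lasso alpha:", lasso.alpha_, "coefficients:", np.round(lasso.coef_, 2))
# Lasso tends to set the irrelevant coefficients exactly to zero;
# ridge only shrinks them toward zero.
```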

Comparison:

  • Ridge will perform better when the response is a function of many predictors, all with coefficients of roughly equal size; Ridge regression also has substantial computational advantages.
  • Lasso will perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero
  • Neither method addresses the imbalanced data problem