Residuals in Regression Analysis

Assumptions about Regression Analysis, How to check for those assumptions and Improve our Model.

Yash Agarwal
5 min read · Feb 18, 2023
Photo by Ian Schneider on Unsplash

Setup

It’s exactly what the above image says: “Passion Led Us Here”. Even though I know Linear Regression and use it quite often, I can’t help but read about it over and over again in the hope that I might find something more in it, and that’s exactly what happened when I was going through it once again.

The concepts I came across this time were the assumptions behind Regression Analysis, how to check for those assumptions, and what to do if they are violated.

What surprised me the most is that in all these years, not only I but all the other Data Scientists I have worked with have had 2 things in common when it comes to Regression Analysis: 1) We all diss it, give up on it after trying it once, say it is not that good, and move on to the next algorithm immediately. 2) No one ever spoke about the concept of residuals. So either no one knows about these concepts, no one cares enough to apply them, or maybe they are not that useful in real applications — honestly, I only just learned about them myself.

So when I read these things, I once again felt fascinated, and the need to write this story down in the hope of providing a new perspective on an overlooked algorithm.

Requirements

To understand this story completely you should possess some basic knowledge of the following:

  • Machine Learning ( specifically Regression Analysis)
  • Statistics
  • Probability

If you do not possess basic knowledge of these topics, it might be difficult for you to understand everything completely, but I will try my level best to keep it as simple as possible :D

What is Regression Analysis

Well, in statistical modelling, Regression Analysis is a set of statistical processes for estimating the relationship between a dependent variable and one or more independent variables. This allows researchers to estimate the conditional expectation (focus on the word conditional) of the dependent variable when the independent variables take on a given set of values.

Regression Analysis is primarily used for 2 purposes:

  • Prediction and forecasting
  • Inferring causal relationships between the dependent and independent variables

To use it for prediction, which is what we do in Machine Learning, we must carefully justify why the existing relationship between the dependent and independent variables has predictive power in a new context, or why the relationship has a causal interpretation. So we operate under some assumptions about the variables and their relationship, and then we can be statistically confident in our predictions.

The general equation of Regression Analysis is

y = f(x,theta) + e

where :

  • y = dependent variable
  • x = independent variable
  • theta = the unknown parameters. Estimating these parameters leads us to a relationship between x and y.
  • f(x, theta) = the function/model you choose to define the relationship between x and y, parameterised by theta. This function generally takes the form of the Linear or Logistic Regression equation you are already aware of.
  • e = the error terms
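As a concrete (toy) sketch of this equation, here is how theta could be estimated for a simple linear f using ordinary least squares in NumPy. The data and the "true" parameters below are made up purely for illustration:

```python
import numpy as np

# Made-up data: the "true" relationship is y = 2x + 1, with e ~ N(0, 1).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)   # independent variable
e = rng.normal(0, 1, size=100)     # error terms
y = 2 * x + 1 + e                  # dependent variable

# For a linear f(x, theta) = theta[0] + theta[1] * x, estimate theta
# by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # roughly [1, 2]: the estimated intercept and slope
```

The estimated theta recovers the intercept and slope we baked in, up to noise — which is exactly what "estimating the unknown parameters" means here.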

Assumption about Regression Analysis

There are a lot of assumptions behind regression analysis, but I will highlight the main ones here:

  • The sample is representative of the population at large.
  • The independent variables are measured with no error.
  • Error terms should have constant variance.
  • The residuals are uncorrelated with one another. If this is violated, it will affect the regression coefficients (the unknown parameters theta).
  • There must be no correlation among the independent variables. If the variables are correlated, it becomes extremely difficult for the model to determine the true effect of each independent variable on the dependent variable.
  • The error terms must possess a normal distribution (and hence, conditional on the independent variables, so does the dependent variable).
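The no-correlation assumption among the independent variables is one you can check numerically before fitting anything. Here is a minimal sketch, assuming NumPy, on made-up data where one variable is deliberately constructed from another:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)  # deliberately built from x1
x3 = rng.normal(size=200)                    # genuinely independent

# Pairwise correlations among the independent variables;
# off-diagonal entries near +/-1 flag multicollinearity.
corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
print(corr.round(2))
```

Here the x1–x2 correlation comes out close to 1, so a model including both would struggle to separate their individual effects on the dependent variable.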

Residual Analysis

In this section let’s talk about Residuals and their role in catching some of the violated assumptions and how we can resolve those.

What are Residuals?

Well, to put it simply, a residual is the difference between the original value and the estimated value.

residual = y - prediction
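In code, this is nothing more than an elementwise subtraction. A tiny sketch with made-up numbers, assuming NumPy:

```python
import numpy as np

# Made-up observations and the fitted values a model produced for them.
y = np.array([3.1, 4.9, 7.2, 9.0])
prediction = np.array([3.0, 5.0, 7.0, 9.0])

residual = y - prediction  # one residual per observation
print(residual)
```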

Residuals vs. Fits Plot

When conducting a residual analysis, a “residuals versus fits plot” is the most frequently created plot. It is a scatter plot with residuals on the y axis and fitted values (estimated responses) on the x axis. The plot is used to detect non-linearity, unequal error variances, and outliers.

In coding terms: plot(prediction, residuals), with the fitted values on the x axis.

Ideally, this plot shouldn’t show any pattern. If you see a funnel-shaped pattern, it suggests your data suffers from heteroskedasticity, i.e. the error terms have non-constant variance.
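To see what such a plot looks like in code, here is a sketch using NumPy and Matplotlib on synthetic data where the noise deliberately grows with x, so the plot shows the funnel shape described above (all names and numbers are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
# Heteroskedastic noise: the error spread grows with x -> funnel shape.
y = 2 * x + 1 + rng.normal(0.0, 0.2 * x)

# Fit a straight line, then compute fitted values and residuals.
X = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ theta
residuals = y - fitted

# Fitted values on the x axis, residuals on the y axis.
plt.scatter(fitted, residuals)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```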

Image from HackerEarth

The image in the bottom left is how a normal residuals vs. fits plot should look: the points don’t show any pattern, i.e. they don’t form any odd shape like a funnel or a log curve. They are all bounded between an upper and a lower limit, that band stays nearly constant across all the fitted values, and the points scatter randomly around the zero line.

The plot on the right clearly shows a non-linear curve, which suggests non-linearity in the dataset as well.

Image from HackerEarth

Fix the Violations

So, as far as I know, not all violations of the assumptions can be detected by a residuals vs. fits plot, but the following 2 can be detected and corrected:

  • Heteroskedasticity: If the plot is funnel shaped, the error terms have non-constant variance. In this case, transform your dependent variable using something like a square root or log transform, or switch to weighted least squares.
  • Non-linearity in the dataset: If the plot is non-linear, try transforming your independent variables, again using something like a square root, log, or square (a polynomial term).
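As a rough illustration of the first fix, here is a sketch (assuming NumPy; all data synthetic) where a log transform of the dependent variable stabilises the residual variance. The noise is multiplicative, so the raw residuals fan out at high fitted values, while the log-transformed ones do not:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=500)
# Multiplicative noise: the spread of y grows with its mean (funnel shape).
y = np.exp(0.5 * x) * rng.lognormal(0.0, 0.3, size=500)

def residual_spread(target):
    """Fit a straight line, then compare residual spread at low vs high x."""
    X = np.column_stack([np.ones_like(x), x])
    theta, *_ = np.linalg.lstsq(X, target, rcond=None)
    res = target - X @ theta
    return np.std(res[x < 5]), np.std(res[x >= 5])

print(residual_spread(y))          # spread is far larger on the high end
print(residual_spread(np.log(y)))  # after the log transform: roughly equal
```

Note that here the log transform helps twice: it linearises the exponential trend and equalises the error variance, which is why it is such a common first remedy.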

So I will end this story here for now. Hopefully you learned something new and can start using these concepts when applying Linear and Logistic Regression, instead of jumping to other algorithms immediately — give them a wholehearted chance of succeeding.
