As this introductory slide depicts, linear regression is a way to explain the relationship between a dependent variable and one or more explanatory variables using a straight line. It is a special case of regression analysis. … Linear regression can be used to fit a predictive model to a set of observed values (data points).

Now, what is regression analysis?

In statistics, regression analysis is a process for estimating the relationships among variables.

More specifically, regression analysis helps in understanding how the typical value of the dependent variable changes when any one of the independent variables is changed while the other independent variables are held fixed.

Regression is widely used for prediction and forecasting.

Regression is also used to understand which of the independent variables are related to the dependent variable, and to explore the forms of these relationships.

So, what is linear regression in simple words?

Suppose we have a set of input variables, X1, X2, X3, and so on up to Xn. These variables can be of different types: continuous, discrete, or categorical.

Also, we have an output variable Y. This variable can only be a continuous variable. What our regression model is trying to do is find a relationship between the input variables and the output variable, so that we can obtain an estimated value of the output from the inputs. If we have one input variable, we call our model a simple linear regression. On the other hand, if we have more than one input variable, we call our model multiple linear regression (it is often loosely called multivariate linear regression).
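To make the distinction concrete, here is a minimal sketch that fits both kinds of model with ordinary least squares in NumPy. The data values and the names x1, x2, y_simple and y_multi are made up for illustration; they do not come from the slides.

```python
import numpy as np

# Hypothetical data: five observations, generated from known coefficients
# so that the fitted values are easy to check.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Simple linear regression: one input variable (X1).
y_simple = 3.0 + 2.0 * x1
A_simple = np.column_stack([np.ones_like(x1), x1])  # intercept column, then X1
coef_simple, *_ = np.linalg.lstsq(A_simple, y_simple, rcond=None)

# Multiple linear regression: more than one input variable (X1 and X2).
y_multi = 1.0 + 2.0 * x1 - 0.5 * x2
A_multi = np.column_stack([np.ones_like(x1), x1, x2])
coef_multi, *_ = np.linalg.lstsq(A_multi, y_multi, rcond=None)

print("simple:  ", coef_simple)   # intercept and one slope
print("multiple:", coef_multi)    # intercept and two slopes
```

The only difference between the two fits is the number of input columns placed next to the intercept column.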

Now, we want to explain what the term “linear” means in linear regression.

The first thing to note about linearity is that the conditional expectation of Y, which is the expected value of Y given some Xs, may be a nonlinear function of the Xs, like the one we have here, and the model is still considered a linear regression model!

That’s because the meaning of linearity in linear regression is not connected with the variables. It’s connected to the parameters themselves.

So we may end up with an equation that is quadratic, exponential, or even cubic in the Xs, but all of these will still be considered linear regression. The reason is that all the betas appear to the power of one, so there is no nonlinearity in the betas.
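As a small illustration of this point, the sketch below fits a curve that is quadratic in x using the same least-squares machinery as for a straight line, because each beta still only multiplies a column of known numbers. The data and the chosen coefficients are hypothetical.

```python
import numpy as np

# Hypothetical data lying exactly on a curve that is quadratic in x.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 0.5 * x + 2.0 * x**2

# The curve is nonlinear in x, but linear in the betas:
#   y = beta0 * 1 + beta1 * x + beta2 * x^2
# so the columns 1, x, x^2 form an ordinary design matrix.
design = np.column_stack([np.ones_like(x), x, x**2])
betas, *_ = np.linalg.lstsq(design, y, rcond=None)

print(betas)  # estimates of beta0, beta1, beta2
```

If a beta were itself squared or placed inside an exponent, the problem could no longer be written as a design matrix times a vector of betas, and this approach would not apply.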

By contrast, the following equation denotes a nonlinear regression model, because one of the betas, beta2, is raised to the power of 2.

To sum up this point, we now have two different types of regression: linear regression, where all the betas appear to the power of one, and nonlinear regression, where at least one beta enters the equation nonlinearly, for example raised to a power other than one.

Now, suppose we have a simple regression problem where we want to find the estimated price of a house based on its size. Our data points are shown as red crosses, and every point has a house size on the X-axis and a price on the Y-axis.

What we are trying to do is feed our training data to our learning algorithm. The algorithm then tries to find a hypothesis for the relationship between the size of a house and its price. This relationship is described by this straight line. Our line simply gives an estimated price for every input size.
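In code, such a hypothesis is just a function from size to estimated price. The sketch below assumes a straight line with an intercept beta0 and a slope beta1; the parameter values and house sizes are invented for illustration and are not taken from the slide.

```python
import numpy as np

def hypothesis(size, beta0, beta1):
    """Estimated price for a given house size, using a straight line."""
    return beta0 + beta1 * size

# Hypothetical parameters and house sizes (units are arbitrary here).
sizes = np.array([50.0, 80.0, 120.0])
estimated_prices = hypothesis(sizes, beta0=20.0, beta1=1.5)

print(estimated_prices)  # one estimated price per input size
```

Learning then amounts to choosing beta0 and beta1 so that these estimates land as close as possible to the actual prices.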

Now, we want to highlight a very important term in machine learning in general, and in linear regression in particular: the cost function.

The cost function answers an important question: how can we find a better line to explain the relationship? The answer is simple: the best line should have the minimum difference between the estimated prices and the actual ones. Once we express this idea mathematically, we can improve our line a lot.

Let’s try to make sense of this mathematical notation. Our cost function sums up the errors at all data points. The error term is nothing but the difference between the estimated price given by the hypothesis function h and the actual price given by the y term.

Remember that we will have differences on both sides of the line, and positive and negative differences would cancel each other out if we simply added them up, which would distort the calculation of our line a lot. That’s why we need our difference to be directionally agnostic, and this can be achieved by squaring it, because the square of any quantity is never negative.
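A short sketch shows why the squaring matters. It assumes the squared-error cost described here, with the (1/2m) factor in front, and a straight-line hypothesis with parameters named beta0 and beta1; the three data points are hypothetical.

```python
import numpy as np

def cost(beta0, beta1, x, y):
    """Squared-error cost: (1 / 2m) * sum of squared differences."""
    m = len(x)
    errors = (beta0 + beta1 * x) - y   # estimated minus actual
    return np.sum(errors ** 2) / (2 * m)

# Hypothetical data lying exactly on the line y = x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

# A flat line h(x) = 2 misses two of the three points...
flat_errors = (2.0 + 0.0 * x) - y      # differences: +1, 0, -1
signed_sum = np.sum(flat_errors)       # ...yet the signed errors cancel out

cost_flat = cost(2.0, 0.0, x, y)       # squaring exposes the misfit
cost_perfect = cost(0.0, 1.0, x, y)    # the true line has zero cost

print(signed_sum, cost_flat, cost_perfect)
```

Summing the raw differences would rate the flat line as perfect, while the squared version correctly ranks it below the line that passes through every point.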

So, what is the purpose of the leading factor (1/2m)?

Well, in the process of minimizing our cost function we differentiate it. Differentiating the squared difference brings down a factor of 2, which cancels the 1/2 in (1/2m). What remains is (1/m), which averages over the data points, keeping in mind that we have a total of m data points.
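Written out, the cancellation looks as follows. This assumes the usual squared-error cost with a straight-line hypothesis h; the symbols beta0 and beta1 and the superscript (i) for the i-th data point are notation chosen here and may differ from the slide.

```latex
\begin{align*}
h(x) &= \beta_0 + \beta_1 x \\
J(\beta_0, \beta_1) &= \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2 \\
\frac{\partial J}{\partial \beta_0}
  &= \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h(x^{(i)}) - y^{(i)} \right)
   = \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) \\
\frac{\partial J}{\partial \beta_1}
  &= \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h(x^{(i)}) - y^{(i)} \right) x^{(i)}
   = \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x^{(i)}
\end{align*}
```

The factor 1/2 is a convenience: it does not change which line minimizes the cost, it only keeps the derivatives free of a stray 2.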

Now, let’s watch this minimization process closely. In this graph, we can see that every hypothesis (h) has some cost value. We repeat the process iteratively until we reach the minimum possible cost value, and the corresponding line (h) will be the best-fitting line.
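One common way to carry out this kind of iterative minimization is gradient descent. The narration does not name the method, so treat the sketch below as one possible illustration rather than the procedure shown on the slide. The learning rate, the number of steps, the data and the names beta0 and beta1 are all assumptions.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.1, steps=2000):
    """Fit h(x) = beta0 + beta1 * x by repeatedly stepping downhill on the cost."""
    m = len(x)
    beta0, beta1 = 0.0, 0.0            # arbitrary starting hypothesis
    costs = []
    for _ in range(steps):
        errors = (beta0 + beta1 * x) - y
        costs.append(np.sum(errors ** 2) / (2 * m))   # cost of the current line
        grad0 = np.sum(errors) / m                    # derivative w.r.t. beta0
        grad1 = np.sum(errors * x) / m                # derivative w.r.t. beta1
        beta0 -= learning_rate * grad0
        beta1 -= learning_rate * grad1
    return beta0, beta1, costs

# Hypothetical data lying exactly on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

beta0, beta1, costs = gradient_descent(x, y)
print(beta0, beta1)          # should approach the true intercept and slope
print(costs[0], costs[-1])   # cost at the start versus at the end
```

Each pass evaluates the cost of the current line and nudges the parameters in the direction that lowers it. A learning rate that is too large for the data can make the cost grow instead of shrink, so the value used here is specific to this small example.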

In this graph, you can see a more complex cost function. This may explain how the process of minimizing the cost function can be extremely difficult in some cases.