Simple Linear Regression Fundamentals and Modeling in Python

Kerem Kargın
5 min read · Mar 25, 2021

In this blog post, I will first explain the basics of Simple Linear Regression. Then, we’ll build the model in Python using a dataset. Finally, we’ll evaluate the model by calculating the mean squared error. Let’s go step by step.

Resource: https://en.wikipedia.org/wiki/Linear_regression

What is Simple Linear Regression?

Simple Linear Regression is a statistical method that helps us describe and analyze the relationship between two variables, one dependent and one independent. In another source, it is defined as follows:

Simple linear regression is used to estimate the relationship between two quantitative variables. You can use simple linear regression when you want to know:

- How strong the relationship is between two variables.

- The value of the dependent variable at a certain value of the independent variable.

As you can see from the definitions, if we want to make a Simple Linear Regression calculation, we must have one dependent and one independent variable.

For example, suppose we want to explain a basketball player’s salary by the player’s percentage of successful shots. The salary is the dependent variable, and the shooting percentage is the independent variable: the salary may increase or decrease depending on the shooting percentage during the season.

The main purpose of Simple Linear Regression is to find the linear function that expresses the relationship between the dependent and the independent variable. By finding this linear function, we model the relationship between the variables. Modeling means expressing the relationships between concepts mathematically.

Resource: https://towardsdatascience.com/how-are-logistic-regression-ordinary-least-squares-regression-related-1deab32d79f5?gi=a006f2d79fb4

One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.

The other variable, denoted y, is regarded as the response, outcome, or dependent variable.

Simple Linear Regression makes some assumptions about the data. These are:

  • Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t change significantly across the values of the independent variable.
  • Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among observations.
  • Normality: the errors (residuals) of the model follow a normal distribution.

Linear regression makes one additional assumption:

  • The relationship between the independent and dependent variable is linear: the line of best fit through the data points is straight (rather than a curve or some sort of grouping factor).

Let’s now try to understand the math behind Simple Linear Regression.

Simple Linear Regression Model

We know that Y is the dependent variable and X is the independent variable. Accordingly, the mathematical expression of the Simple Linear Regression model is as follows.

Simple Linear Regression Formula: Yi = β0 + β1·Xi + ϵi
  • Yi → The observed value of the dependent variable for observation i.
  • Xi → The value of the independent variable for observation i.
  • β0 → The intercept, a parameter estimated from the data set. It is the point where the Simple Linear Regression line intersects the Y axis.
  • β1 → The slope of the Simple Linear Regression line, also a parameter estimated from the data set.
  • ϵi → The error term, the part of Yi that the line does not explain.

For the model to give the best possible fit, the coefficients β0 and β1 must take their optimal values. The model equation describes a line in the coordinate plane. The optimal β0 and β1 are given by the following formulas.

Formulas of β0 and β1
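
For reference, the standard ordinary least squares estimates can be written as below. The hats mark estimated values, n is the number of observations, and the bars denote sample means; these symbols are notation assumed here, not taken from the text above.

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
```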

The formula for β1 is obtained by taking the derivatives of the sum of squared errors with respect to the parameters and setting them to zero. If you want to see the derivation, you can review the notes below.

http://users.stat.ufl.edu/~winner/qmb3250/notespart2.pdf
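
The closed-form least-squares estimates can also be computed by hand with NumPy. The following is a minimal sketch: the sample values are made up for illustration, and the variable names are my own.

```python
import numpy as np

# Made-up sample values, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_mean, y_mean = x.mean(), y.mean()

# Slope: sum of cross-deviations divided by the sum of squared deviations of x.
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept: the fitted line passes through the point (x_mean, y_mean).
beta_0 = y_mean - beta_1 * x_mean

print(f"beta_0 = {beta_0:.3f}, beta_1 = {beta_1:.3f}")
```

A library routine such as np.polyfit(x, y, 1) should return the same slope and intercept, which makes it a convenient cross-check.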

In short, our aim is to choose β0 and β1 so that the sum of squared errors is as small as possible. Now let’s create a Simple Linear Regression model in Python on a data set.

Modeling with Python

Now let’s build a Simple Linear Regression model on a sample data set. Then let’s calculate the square root of the model’s Mean Squared Error. This will give us the model error.

First of all, let’s import the necessary libraries.

Then we read the sample data set from local storage into a DataFrame. The first column of the data set holds index data that we don’t need. We exclude it from the DataFrame with the iloc indexer.

Let’s examine the data in the DataFrame.
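
The loading steps above can be sketched as follows. To keep the example runnable without the file, a small in-memory CSV stands in for the data set: its values are made up, and the Radio and Newspaper column names are assumptions. With the real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file; the values are made up.
# The empty first header cell mimics the unnamed index column.
csv_text = """,TV,Radio,Newspaper,Sales
1,200.0,35.0,60.0,20.5
2,45.0,40.0,45.0,10.0
3,20.0,45.0,70.0,9.0
4,150.0,40.0,55.0,18.0
5,180.0,10.0,60.0,13.0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Drop the first column, which only repeats the row index.
df = df.iloc[:, 1:]

# Look at the first rows of the DataFrame.
print(df.head())
```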

Since we are modeling with Simple Linear Regression, we need one independent variable. For this, we choose the “TV” variable.

We choose Sales as the dependent variable.

We define the slreg object to be able to set up the Simple Linear Regression model. Then we build the model by fitting the slreg object.
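
A minimal sketch of the variable selection and the fit with scikit-learn is shown below. The data are synthetic stand-ins generated with NumPy, not the real advertising data, and the name model for the fitted object is my own choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the advertising data (made-up numbers).
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=50)
sales = 7.0 + 0.05 * tv + rng.normal(0, 1.5, size=50)
df = pd.DataFrame({"TV": tv, "Sales": sales})

X = df[["TV"]]   # double brackets keep X two-dimensional, as scikit-learn expects
y = df["Sales"]  # one-dimensional target

slreg = LinearRegression()
model = slreg.fit(X, y)  # fit() returns the fitted estimator itself
```

Note that X is selected with double brackets. A single pair would give a one-dimensional Series, which scikit-learn does not accept as a feature matrix.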

We built the model. The parameters β0 and β1 are important in Simple Linear Regression. We find the intercept β0 as follows.

We find the β1 parameter as follows.
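
In scikit-learn, the fitted parameters are exposed as the intercept_ and coef_ attributes. The sketch below again uses synthetic stand-in data, generated with an intercept of 7.0 and a slope of 0.05 (both made up), so the estimates should land near those values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the advertising data (made-up numbers).
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=50)
sales = 7.0 + 0.05 * tv + rng.normal(0, 1.5, size=50)
df = pd.DataFrame({"TV": tv, "Sales": sales})

model = LinearRegression().fit(df[["TV"]], df["Sales"])

beta_0 = model.intercept_  # intercept of the fitted line
beta_1 = model.coef_[0]    # slope for the single feature, TV

print(f"beta_0 = {beta_0:.3f}, beta_1 = {beta_1:.4f}")
```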

Now we need to calculate the error of the model, which tells us how well the line fits the data. With the predict method, we predict Sales for the same X values the model was fitted on and save the result as y_pred. If you want, you can also look at the first 5 predictions.

Finally, let’s calculate the mean squared error.
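
The prediction and error steps can be sketched together as follows, once more on synthetic stand-in data rather than the real data set. The root of the MSE is taken with np.sqrt, and the names mse and rmse are my own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the advertising data (made-up numbers).
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=50)
sales = 7.0 + 0.05 * tv + rng.normal(0, 1.5, size=50)
df = pd.DataFrame({"TV": tv, "Sales": sales})

X = df[["TV"]]
y = df["Sales"]
model = LinearRegression().fit(X, y)

# Predict on the same rows the model was fitted on.
y_pred = model.predict(X)
print(y_pred[:5])  # first 5 predictions

mse = mean_squared_error(y, y_pred)  # mean of the squared residuals
rmse = np.sqrt(mse)                  # same units as Sales
print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```

Because the predictions are made on the rows used for fitting, this is a training error. An error measured on held-out data would usually be somewhat higher.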

The data set we have is quite simple and well suited to understanding the subject, so our error value was also very low. Keep in mind that this error is measured on the same data the model was fitted on.

Dataset

https://www.kaggle.com/ashydv/advertising-dataset

Finally

In this blog post, we first examined what Simple Linear Regression is. Then we talked about its assumptions and looked at the model mathematically. Finally, we set up a Simple Linear Regression model in Python and calculated its error value.
