Multiple Linear Regression Fundamentals and Modeling in Python

Kerem Kargın
8 min read · Mar 29, 2021


In this blog post, first, I’ll try to explain the basics of Multiple Linear Regression. Then, I’ll build the model using a dataset with Python. Finally, I’ll evaluate the model by calculating its mean squared error. Let’s get started step by step.

Resource: https://medium.com/analytics-vidhya/new-aspects-to-consider-while-moving-from-simple-linear-regression-to-multiple-linear-regression-dad06b3449ff

What is Multiple Linear Regression?

The main purpose of Multiple Linear Regression is to find the linear function that expresses the relationship between the dependent and independent variables. A Multiple Linear Regression model has one dependent variable and more than one independent variable. Another source defines it as follows:

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know:

- How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).

- The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

So how is it different from Simple Linear Regression? I can explain the difference as follows. When we build predictive models for real-life problems, a single independent variable often cannot predict the outcome very well. Estimating with several independent variables is usually more accurate.

For example, suppose you want to estimate the selling price of a car. The dependent variable is the selling price. Imagine you have some information about the car to help you estimate that price:

  • Mileage (kilometers driven),
  • Engine power,
  • Year of production,
  • Damage rate

In this case, you have 4 independent variables. With these 4 independent variables, you can predict the sales price of the car much more accurately.

So we are actually still looking for a linear relationship, just as in Simple Linear Regression, but with more independent variables.

Multiple Linear Regression is used very widely around the world, and it comes with some assumptions.

Assumptions of Multiple Linear Regression

Multiple Linear Regression has assumptions similar to those of Simple Linear Regression. These are:

  • The errors are normally distributed.
  • The errors are independent of each other; there is no autocorrelation between them.
  • The variance of the error term is constant across observations (homoscedasticity).
  • There is no relationship between the independent variables and the error terms.
  • There is no multicollinearity problem among the independent variables.

Let’s try to understand the math of Multiple Linear Regression now.

Multiple Linear Regression Model

Multiple Linear Regression Formula:

y = β0 + β1X1 + β2X2 + … + βnXn + ϵ

  • y → The predicted value of the dependent variable.
  • β0 → The intercept, a parameter estimated from the data set; it is the point where the regression line crosses the Y-axis, i.e. the value of y when all independent variables are zero.
  • β1X1 → The regression coefficient (β1) of the first independent variable (X1), i.e. the average effect on the predicted y of a one-unit increase in that variable, holding the others fixed.
  • βnXn → The regression coefficient of the last independent variable.
  • ϵ → The error term.

Modeling with Python

Now let’s build a Multiple Linear Regression model on a sample data set. Then we’ll calculate the square root of the model’s Mean Squared Error; this will give us the model error.

First, I import the pandas library. Then I read the Advertising dataset into a DataFrame. The first column in this dataset is a redundant index column, so I don’t include it in the DataFrame. I review the first 5 observations with df.head()
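A minimal sketch of this step (the file name is an assumption; the dataset is the classic Advertising data with TV, Radio, Newspaper, and Sales columns):

```python
import pandas as pd

# Read the Advertising data; the first column is just a row index,
# so we skip it when building the DataFrame.
df = pd.read_csv("Advertising.csv").iloc[:, 1:]

# Inspect the first 5 observations.
df.head()
```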

Then I put the independent variables into a DataFrame called X. I did this with the drop method: I save everything except Sales in one DataFrame, and I save the dependent variable Sales as y in a separate DataFrame. With these operations, we separate the dependent and independent variables from each other.

I review the first 5 observations of the X and y DataFrames.
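Roughly, the separation looks like this:

```python
# Independent variables: everything except Sales.
X = df.drop("Sales", axis=1)

# Dependent variable: Sales.
y = df[["Sales"]]

# Inspect the first 5 observations of each.
X.head()
y.head()
```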

Building a model with statsmodels

After separating the dependent and independent variables, we will first build the Multiple Linear Regression model with statsmodels. This is a somewhat low-level approach. I import the statsmodels library to build the model, then I create the lm model object with the OLS method. When we build a model with statsmodels, we obtain a model we can learn a lot more about.

  • OLS → Ordinary Least Squares

Aside from OLS, there are also two other estimation methods, WLS and GLS.

  • WLS → Weighted Least Squares
  • GLS → Generalized Least Squares

For more information about statsmodels, you can visit its website:

https://www.statsmodels.org/stable/index.html
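A minimal sketch of the statsmodels version (note that sm.OLS does not add an intercept on its own, so we add a constant column first):

```python
import statsmodels.api as sm

# OLS requires the intercept to be added explicitly.
X_sm = sm.add_constant(X)

# Create the model object and fit it.
lm = sm.OLS(y, X_sm)
model = lm.fit()

# Full summary of the fitted model.
print(model.summary())
```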

When we call model.summary(), we can access all the summary information about the model we have built. The values that matter most to us are:

  • R-squared → inflates as the number of variables increases.
  • Adj. R-squared → adjusts for the number of variables, preventing this inflation.
  • Method → the estimation method used in the Multiple Linear Regression model (here, Least Squares).
  • coef → the estimated coefficients of the independent variables.
  • P>|t| → tells us whether a coefficient is statistically significant. If it is less than 0.05, the coefficient is significant.

The other statistics in the summary also carry important information, but I won’t go into more detail here, since this post focuses on Multiple Linear Regression. If you wish, you can research them yourself.

Building a model with scikit-learn

Now we come to the part that matters more for our purposes: we will build the Multiple Linear Regression model with the scikit-learn library.

First, I import LinearRegression from scikit-learn. Then I create the lm model object with LinearRegression, and finally we fit the model with the lm object.
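In code, this step looks roughly like this:

```python
from sklearn.linear_model import LinearRegression

# Create the model object and fit it to the data.
lm = LinearRegression()
model = lm.fit(X, y)
```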

We use intercept_ to see the constant coefficient of the model.

We use coef_ to see the coefficients for the model’s independent variables.
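For example:

```python
# Constant term (intercept) of the fitted model.
print(model.intercept_)

# Coefficients of TV, Radio and Newspaper, in column order.
print(model.coef_)
```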

We interpret the coefficients as follows. For example, we found the value 0.04576465 for TV. Holding the other variables fixed, a one-unit increase in TV expenditure will increase the dependent variable (i.e. Sales) by an average of 0.04576465 units.

Now let’s move on to the predicting part with the model we have established.

Model Prediction

First of all, I put the new data into an array so that the model can make a prediction. Then I transpose this array, because each column must hold the values of one independent variable; the values should not all sit in a single column.

As a result, all we have to do to predict is to give the new data as an argument into the predict function.
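A minimal sketch of this prediction step (the transpose turns the three values into one row with one column per variable):

```python
import numpy as np

# New advertising budgets: 30 for TV, 10 for Radio, 45 for Newspaper.
new_data = np.array([[30], [10], [45]]).T  # shape (1, 3)

model.predict(new_data)
```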

With the Multiple Linear Regression model we built, we estimated sales of 6.15 units for an advertising spend of 30 units on TV, 10 units on Radio, and 45 units on Newspaper.

We measure the success of the model with model.score(). We calculate it from the dependent and independent variables as follows.
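For LinearRegression, score returns the R² of the model on the given data:

```python
# R² of the model on the training data.
model.score(X, y)
```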

Everything looks fine for the model so far, but we don’t yet know how much error it makes. Now let’s calculate the mean squared error between the actual sales values in the dataset and the sales values we predict. We will use the mean_squared_error function for this.

The mean_squared_error function takes the actual y values as its first argument and the predicted y values as its second.

We calculate the square root of the mean square error as follows.
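Putting both together:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# MSE between the actual and the predicted Sales values.
mse = mean_squared_error(y, model.predict(X))

# RMSE is the square root of the MSE.
rmse = np.sqrt(mse)
print(mse, rmse)
```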

MSE: Simply put, the mean squared error tells you how close a regression curve is to a set of points. The MSE, which measures the performance of a machine learning predictor, is always positive, and predictors with an MSE close to zero perform better.

RMSE: A quadratic metric that is frequently used to measure the magnitude of the error between a model’s predicted values and the actual values. RMSE is the standard deviation of the prediction errors (residuals). The residuals measure how far the data points are from the regression line, and RMSE measures how spread out these residuals are; in other words, it tells you how concentrated the data is around the line of best fit. RMSE ranges from 0 to ∞ and is negatively oriented: predictors with lower values perform better, and an RMSE of zero means the model made no errors. RMSE penalizes large errors more heavily, so it may be better suited to some situations, and it avoids using absolute values, which is undesirable in many mathematical calculations.

Model Tuning

We perform the tuning process to guard the machine learning model against overfitting and high variance.

What is Model Tuning?

Tuning is usually a trial-and-error process by which you change some hyperparameters (for example, the number of trees in a tree-based algorithm or the value of alpha in a linear algorithm), run the algorithm on the data again, then compare its performance on your validation set in order to determine which set of hyperparameters results in the most accurate model.

For Model Tuning, we first split the data set into train and test sets. We do this with train_test_split. The test_size argument of train_test_split determines what percentage of the data set becomes the test set. random_state controls how the data set is split; if we do not set a value, each run of the model works with a different split of the data.

Again, as we know, we set up the model on the train data set using the lm model object.

Then we calculate the Mean Squared Error separately for Train and Test data.
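A minimal sketch of these three steps (the 80/20 split ratio and the seed are assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Split the data; random_state fixes the split across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Fit the model on the training set only, reusing the lm object.
model = lm.fit(X_train, y_train)

# Mean Squared Error on the train and test sets.
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(train_mse, test_mse)
```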

k-fold Cross-Validation

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

I calculate the MSE and RMSE values by performing 10-fold Cross-Validation on the model I have built. In this case, we compute 10 different errors. Since we set cv=10, the train set is divided into 10 parts. Each time, the model is fit on 9 of the parts and evaluated on the remaining one, and this process is repeated 10 times with a different held-out part each time. Finally, we obtain a single test error by averaging these 10 errors.
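One way to express this with scikit-learn (cross_val_score reports negative MSE, so we flip the sign before averaging):

```python
import numpy as np
from sklearn.model_selection import cross_val_score

# 10 folds on the train set; each fold's MSE comes back negated.
neg_mse_scores = cross_val_score(
    model, X_train, y_train, cv=10, scoring="neg_mean_squared_error"
)

# Average the 10 errors into a single test error, then take the root.
cv_mse = -neg_mse_scores.mean()
cv_rmse = np.sqrt(cv_mse)
print(cv_mse, cv_rmse)
```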

Dataset: https://www.kaggle.com/ashydv/advertising-dataset

Finally

First, we examined what Multiple Linear Regression is. Then we discussed its assumptions and looked at the mathematics of the model. Next, we built a Multiple Linear Regression model in Python and calculated its error. Finally, we tuned the model and computed cross-validated error values using k-fold Cross-Validation.
