
Machine Learning with simple linear regression using Jupyter Notebook and Python


7 min read

In this blog, we look at how to build a linear regression model using Python in a Jupyter Notebook.

You can download all the code from our GitHub repo here.

In my view, understanding the concept of linear regression is the key to developing a linear regression model.


So what is linear regression?

Simple linear regression is a statistical technique used to generate a model that shows the relationship between two variables.


You can think of these as the predictor variable and the outcome variable. The predictor variable is known as the independent variable, and the outcome variable as the dependent variable.


In this blog, we aim to generate a machine learning model based on linear regression, under the assumption that a linear relationship exists between the independent variable and the dependent variable.


The goal of simple linear regression is to find the best-fitting line that represents this linear relationship.


In simple linear regression, we are looking for a line of the form:


Y = β0 + β1X + ε


Where:


Y represents the dependent variable (output variable).

X represents the independent variable (input variable).

β0 is the y-intercept of the line, which represents the predicted value of Y when X is equal to zero.

β1 is the slope of the line, which represents the change in Y for a one-unit change in X.

ε represents the error term, which captures the variability in the data that is not explained by the linear relationship between X and Y.


We use a method called least squares to find the best fitted line.

The values of β0 and β1 are estimated by minimizing the sum of the squared differences between the observed Y values and the predicted Y values on the line.
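The closed-form least squares estimates can be sketched directly with NumPy. This is an illustrative example on synthetic data; the sample size and the true coefficients (β0 = 10, β1 = 0.05) are made up for the demo:

```python
# Closed-form least squares estimates of β0 and β1, sketched on synthetic data
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 50)               # synthetic independent variable
y = 10 + 0.05 * x + rng.normal(0, 1, 50)  # true β0 = 10, β1 = 0.05, plus noise

# β1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²,  β0 = ȳ - β1·x̄
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
```

Because minimizing the squared differences recovers the underlying line when the noise is modest, the estimates land close to the true values used to generate the data.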


Once the line is fitted to the data, we use the simple linear regression model to make predictions on new data points by substituting the values of X into the equation.


The quality of the fit and the predictive power of the model are determined by measuring the coefficient of determination (R-squared) and the standard error.
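Both quantities can be computed by hand from the residuals. A minimal sketch on synthetic data follows; np.polyfit stands in for the fitting step, and the data-generating coefficients are assumptions for the demo:

```python
# R-squared and residual standard error for a fitted line, on synthetic data
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 100, 50)
y = 10 + 0.05 * x + rng.normal(0, 0.5, 50)

beta1, beta0 = np.polyfit(x, y, 1)   # degree-1 least squares fit: slope, intercept
y_hat = beta0 + beta1 * x

ss_res = np.sum((y - y_hat) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2) # total sum of squares
r_squared = 1 - ss_res / ss_tot
residual_std_error = np.sqrt(ss_res / (len(x) - 2))  # n - 2 degrees of freedom
```

An R² near 1 means most of the variance in y is explained by the line; the residual standard error estimates the typical size of the error term ε.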


Steps in simple linear regression model generation


Step One : Read and know your data

Step Two : Visualize the data

Step Three: Carry out actual model generation using simple linear regression

Step Four: Residual analysis

Step Five: Predict the output variable (Y value) based on new data (X value)


Step One: Read and know your data


Data scientists spend most of their time on this step.

Ensuring the quality and reliability of the data is crucial for machine learning. It may involve validating data integrity, checking for duplicate records, verifying data consistency, and addressing any data quality issues. Careful inspection and validation may be time-consuming, particularly for large datasets.


I often find that real-world datasets contain missing values, outliers, inconsistent formatting, and many other issues. Before training a machine learning model, you should preprocess and clean the data; these tasks usually involve handling missing values, removing outliers, and resolving inconsistencies. The data cleaning process can be time-consuming, especially if the dataset is complex, large, or poorly structured.


Machine learning models often require data to be preprocessed or transformed into a suitable format.
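The common cleaning checks described above can be sketched with pandas. The tiny frame and its column names here are illustrative stand-ins, not the post's actual dataset:

```python
# Sketch of common cleaning checks with pandas, on a small illustrative frame
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ads":   [10.0, 12.0, np.nan, 12.0],
    "sales": [100,  120,  90,     120],
})

print(df.isnull().sum())   # count missing values per column
df = df.drop_duplicates()  # remove duplicate records
df = df.dropna()           # drop rows with missing values
```

After these two calls the duplicate row and the row with the missing value are gone, leaving a clean frame ready for modelling.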


Reading data

During this step, we import the Python libraries that support data manipulation.


With these libraries, we import and read a company's data. The sample company data can be downloaded here.


Our sample data consists of a company's monthly sales and marketing expenditure.


# import libraries for data import
import pandas as pd
import numpy as np




# read csv file using pandas
Marketing_Sales_Data = pd.read_csv("MarketingExpenses_SalesRevenue.csv")


The data read from the CSV is displayed as follows.
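A quick way to preview the loaded data is pandas' head(). Since the CSV is not bundled here, the sketch below uses a tiny stand-in frame with assumed column names; in the post's notebook the equivalent call is Marketing_Sales_Data.head():

```python
# Preview the first rows of a DataFrame with head(); sample_data is a stand-in
import pandas as pd

sample_data = pd.DataFrame({
    "Socialmedia_ads": [120.0, 150.0, 90.0],
    "Sales_revenue":   [16.8, 18.5, 15.1],
})
print(sample_data.head())  # in the notebook: Marketing_Sales_Data.head()
```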


Knowing Data

Understanding the characteristics and relationships within the data is an essential step in machine learning.


To know more about the data, we look at its shape, info, and description. The following code is used:

# Know the data shape, info and data descriptions
Marketing_Sales_Data.shape
Marketing_Sales_Data.info()
Marketing_Sales_Data.describe()


The dataset has a shape of (129, 5), which is 129 rows and 5 columns.


The dataset info shows each column's data type and non-null count. If any null values existed, we would need to do data manipulation; luckily, this dataset has no null values.




Any major differences in data values can be spotted from the dataset's describe output.



Here we did not notice any sudden spikes in data values, which leads us to assume the values in each column of the dataset are consistent.


Step Two: Visualize Data

To visualize the data, we use two libraries: matplotlib and seaborn.

We will see which columns have a high correlation with sales revenue by generating a separate pairplot for each column (Socialmedia_ads, Printmedia_ads, Google_ads, Promotions).


We need to eliminate the independent variables (influencing values) that have no correlation or a low correlation with the dependent variable (outcome value).


If you get a "module not found" error, you should install both the matplotlib and seaborn Python modules in your virtual environment.

Installation can be done from the Anaconda prompt or a terminal:

conda install seaborn
conda install matplotlib


import matplotlib.pyplot as plt
import seaborn as sns
sns.pairplot(Marketing_Sales_Data, x_vars=['Socialmedia_ads','Printmedia_ads','Google_ads','Promotions'], y_vars=['Sales_revenue'], kind='scatter', aspect=1, height=3)  # 'height' replaces the deprecated 'size' argument
plt.show()


To understand data patterns and outliers, it is always good to generate a scatter plot between two numeric variables.

The pairplot from the code shows the scatter plot for each of the 4 variables with respect to Sales_revenue.



Since we cannot decide from the scatter charts above whether a linear relationship exists, we generate a heatmap using the seaborn library:

sns.heatmap(Marketing_Sales_Data.corr(), cmap='YlGnBu', annot=True)
plt.show()


The generated heatmap is as follows.


We see that social media ads have the highest correlation with sales revenue: a correlation of 0.9, which is close to 1, is high.


So, we are most interested in social media ads as the feature variable for our linear regression model.


The linear equation is

Y = β0 + β1X + ε


So, we apply Y = β0 + β1 * Social media ads + ε.


Here, β1 is the model coefficient, also referred to as the slope.

The next step is the actual linear regression model generation by finding β0 and β1.


Step Three: Carry out actual model generation using simple linear regression


Model generation is done in 4 stages


  1. Creating X and Y
  2. Generating Training set and Test set
  3. Training the model
  4. Evaluate generated model


Creating X and Y

X denotes the independent (influencing) variable and Y denotes the dependent variable.

X= Social media ad

Y= Sales revenue


# Create the X and Y values
X=Marketing_Sales_Data['Socialmedia_ads']
Y=Marketing_Sales_Data['Sales_revenue']


Generating Training set and Test set

We split the dataset into a training set and a test set. We build the model on the training set and evaluate it on the test set.


We split the dataset into the training set and test set in a 0.7:0.3 ratio, which is 70% allocated to training data and 30% allocated to test data.


The split is done using the train_test_split method from the sklearn.model_selection library.


Make sure the scikit-learn module is installed in your environment. Otherwise, install it from the Anaconda prompt or terminal:

pip install scikit-learn


Once the scikit-learn module is installed:


from sklearn.model_selection import train_test_split
X_train, x_test, Y_train, y_test = train_test_split(X, Y, train_size=0.7, test_size=0.3, random_state=100)
X_train
Y_train

We get X_train as:


Y_train as follows



Building and training model

To build a simple linear regression model, two packages are used:

  1. statsmodels
  2. sklearn


Building model

The statsmodels package is used to build the model by importing the statsmodels.api library.

If statsmodels is not installed in the virtual environment, add it from the Anaconda prompt:


conda install -c conda-forge statsmodels


By default, the statsmodels library generates a line that passes through the origin. But it is important to note that simple linear regression has an intercept, per the equation Y = β0 + β1X + ε.


Above, β0 is the intercept value, which must be added explicitly:

# third step - build and generate model
import statsmodels.api as sm
X_train_with_const=sm.add_constant(X_train)
X_train_with_const


By using the Ordinary Least Squares (OLS) method in statsmodels, we can generate the best fit line and obtain the values of β0 and β1.


bestFitLine=sm.OLS(Y_train,X_train_with_const).fit()


Once we’ve added the constant, we can fit the regression line using the OLS (Ordinary Least Squares) method in statsmodels. After that, we can see the β0 (intercept) and β1 (coefficient, or slope) parameters.


Here β1 is 0.056132 and β0 is 10.090400.



We are also interested in the summary of the best fitted regression line.


We should check whether the model is suitable for prediction. The statistics we are mainly concerned with to determine whether the model is viable are:


  1. Coefficient and its p-value(significance)
  2. R-squared value
  3. F-statistic and its significance



R-squared value

The R-squared value is 0.821, which means 82.1% of the variance in sales revenue can be explained by social media ad spending, which is good to know for the company.


F-statistic and its significance

Prob (F-statistic) is 7.25e-37, a very low p-value, which shows the statistical significance of the generated linear regression model.


Coefficient and P-value

The coefficient is 0.0561 and its p-value is 0.000, which is very low and indicates that the coefficient is statistically significant. That means social media ads have a statistically significant relationship with sales revenue.


Sales Revenue = 10.0904 + (0.0561)* Social media ad



The fit is significant, so we can go ahead and generate the scatter plot of sales revenue with respect to social media ads using the intercept and coefficient:


plt.scatter(X_train, Y_train)
plt.plot(X_train, 10.0904 + 0.0561 * X_train, 'r')
plt.show()



We have successfully built the model with the training dataset.


Step Four: Residual analysis



Simple linear regression has to follow a number of assumptions, and one of them is that the errors are normally distributed.

We find the error terms (residuals) by predicting the Y values (sales revenue) from the X values (social media ads).


Error = Actual 'y' value - Predicted 'y' value


To predict the 'y' values, use the predict method of the model with the training dataset.

Then calculate the residuals as the difference between Y_train and y_train_prediction.


y_train_prediction= bestFitLine.predict(X_train_with_const)
y_train_prediction


We need to check whether the residuals appear to be normally distributed, by plotting a histogram of the residuals.

# check normal distribution in residuals using a histogram
residual = Y_train - y_train_prediction
residualFigure = plt.figure()
sns.histplot(residual, bins=15, kde=True)  # histplot replaces the deprecated distplot
plt.title("Error terms", fontsize=20)
plt.xlabel("Y_train - y_train_prediction", fontsize=20)
plt.show()


Histogram of residual


The histogram shows the residuals are normally distributed: the data is symmetrical on both sides around a mean of 0, decreasing away from the mean on the left, increasing toward the mean on the right, with a bell-shaped histogram.
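A histogram is an eyeball check; for a more formal check, one option is the Shapiro-Wilk normality test from scipy.stats (scipy is an extra dependency not used elsewhere in this post). The sketch below runs it on a stand-in residual sample:

```python
# Shapiro-Wilk normality check, sketched on a stand-in residual sample
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
residual = rng.normal(0, 1, 100)  # stand-in for the model residuals

stat, p_value = stats.shapiro(residual)
# a large p-value (e.g. > 0.05) means we cannot reject normality
```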


We then check whether any patterns exist in the residuals.


# check whether any specific pattern exists in the residuals
plt.scatter(X_train, residual)
plt.show()



The scatter plot does not show any patterns.


Since the residuals do not follow any pattern and are normally distributed, we are good to go with linear regression model evaluation using the test data.


Step Five: Evaluating the model using test data

In this step, we predict the sales revenue with the test data. Just as we added a constant to the training dataset, here we add a constant to the test dataset and predict the y values using the predict method in statsmodels.


x_test_add_constant = sm.add_constant(x_test)
y_test_prediction = bestFitLine.predict(x_test_add_constant)
y_test_prediction


Now we find the R² of the predicted y-values, which can be done with the r2_score function from the sklearn.metrics package.


# find R square
from sklearn.metrics import r2_score
# Checking the R-squared value
r_squared_value = r2_score(y_test, y_test_prediction)
r_squared_value

We get an R² value of 0.8061808171303232.


Earlier, we got an R² value of 0.821 for our training set, which is very close to the R² value for the test data. The difference is only 0.821 - 0.806 = 0.015.

Since there is only about a 1.5% difference between the R² values of the training data and test data, the model is stable.

So we can conclude that the generated model can predict unseen data.


The best fit line for the test data is seen below.


In my experience, data science blogs like this can be time-consuming and lengthy due to the technical nature of the subject matter. However, they are undoubtedly worth the investment of time and effort, as they offer practical insights.


Happy coding.


