# Machine Learning with simple linear regression using Jupyter Notebook and Python


In this blog, we look at how to build a linear regression model using Python in Jupyter Notebook.

You can download all of the code from our GitHub repo here.

In my view, understanding the concept of linear regression is the key to developing a linear regression model.

**So what is linear regression?**

*Simple linear regression is a statistical technique for building a model that describes the relationship between two variables.*

You can think of these variables as a predictor variable and an outcome variable. The predictor variable is known as the **independent variable** and the outcome variable is known as the **dependent variable**.

In this blog, we aim to build a machine learning model using linear regression, on the assumption that a linear relationship exists between the independent variable and the dependent variable.

The goal of simple linear regression is to find the best-fitting line that represents this linear relationship.

In simple linear regression, we are looking for a line of the form:

**Y = β0 + β1X + ε**

Where:

Y represents the dependent variable (output variable).

X represents the independent variable (input variable).

β0 is the y-intercept of the line, which represents the predicted value of Y when X is equal to zero.

β1 is the slope of the line, which represents the change in Y for a one-unit change in X.

ε represents the error term, which captures the variability in the data that is not explained by the linear relationship between X and Y.

In linear algebra, we use a method called least squares to find the best-fitting line.

The values of β0 and β1 are estimated by minimizing the sum of the squared differences between the observed Y values and the Y values that the line predicts for the corresponding X values.
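As a minimal sketch of the least-squares idea, the slope and intercept of a single-predictor model have closed-form estimates: β1 is the sum of (x - mean of x)(y - mean of y) divided by the sum of (x - mean of x)², and β0 is the mean of y minus β1 times the mean of x. The function name `fit_simple_ols` and the data points are assumptions made up for illustration; they are not the sales dataset used later.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (beta0, beta1) that minimize the sum of squared residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x
    beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: forces the line through the point (x_mean, y_mean)
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

# Made-up points that lie exactly on y = 2 + 3x
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 + 3.0 * x_demo
beta0, beta1 = fit_simple_ols(x_demo, y_demo)
print(beta0, beta1)
```

Because the demo points have no noise, the recovered intercept and slope should match the values used to generate them.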

Once the line is fitted to the data, we use the simple linear regression model to make predictions on new data points by substituting the values of X into the equation.

The quality of the fit and the predictive power of the model are assessed using the coefficient of determination (R-squared) and the standard error.
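To make these two measures concrete, here is a small sketch that computes R-squared and the residual standard error by hand. The helper names and the four observed/predicted values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def residual_standard_error(y_true, y_pred, n_params=2):
    """Typical size of a residual; n_params=2 covers intercept and slope."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    return np.sqrt(ss_res / (len(y_true) - n_params))

# Made-up observed values and predictions
y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
r2_demo = r_squared(y_obs, y_hat)
rse_demo = residual_standard_error(y_obs, y_hat)
print(r2_demo, rse_demo)
```

An R-squared close to 1 and a small residual standard error both point to predictions that sit close to the observed values.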

**Steps in simple linear regression model generation**

**Step One**: Read and know your data

**Step Two**: Visualize the data

**Step Three**: Carry out actual model generation using simple linear regression

**Step Four**: Residual analysis

**Step Five**: Predict the output variable (Y value) based on new data (X value)

**Step One: Read and know your data**

Data scientists spend most of their time on this step.

Ensuring the quality and reliability of the data is crucial for machine learning. It may involve validating data integrity, checking for duplicate records, verifying data consistency, and addressing any data quality issues. Careful inspection and validation may be time-consuming, particularly for large datasets.

I often find that real-world datasets contain missing values, outliers, inconsistent formatting, and many other issues. Before you start training a machine learning model, you should preprocess and clean the data. These tasks usually involve handling missing values, removing outliers, and resolving inconsistencies. The data cleaning process can be time-consuming, especially if the dataset is complex, large, or poorly structured.
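The cleaning tasks described above can be sketched with pandas as follows. The column names and values are hypothetical toy data, and the 1.5 × IQR rule is only one common choice for flagging outliers; a real dataset may call for different handling.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a duplicate row, a missing value and an outlier
raw = pd.DataFrame({
    "ad_spend": [100.0, 120.0, 120.0, np.nan, 110.0, 5000.0],
    "revenue":  [15.0, 17.0, 17.0, 16.0, 16.5, 18.0],
})

cleaned = raw.drop_duplicates()   # remove exact duplicate rows
cleaned = cleaned.dropna()        # drop rows that contain missing values

# Keep only rows whose ad_spend lies within 1.5 * IQR of the quartiles
q1, q3 = cleaned["ad_spend"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = cleaned["ad_spend"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = cleaned[in_range].reset_index(drop=True)
print(cleaned)
```

Dropping rows is the simplest option; imputing missing values or capping outliers are alternatives when the dataset is too small to lose rows.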

Machine learning models often require data to be preprocessed or transformed into a suitable format.

**Reading data**

During this step, we import the Python libraries that support data manipulation.

With these libraries, we import and read the company data. The sample company data can be downloaded here.

Our sample data consists of the company's monthly sales and marketing expenditure.

    # import libraries for data import
    import pandas as pd
    import numpy as np

    # read csv file using pandas
    Marketing_Sales_Data=pd.read_csv("MarketingExpenses_SalesRevenue.cs

The data read from CSV is displayed as

**Knowing Data**

Understanding the characteristics and relationships within the data is an essential step in machine learning.

To learn more about the data, we look at its shape, info, and description.

To do that, we use the following code:

    # Know the data shape, info and data descriptions
    Marketing_Sales_Data.shape
    Marketing_Sales_Data.info()
    Marketing_Sales_Data.describe()

The dataset has a shape of (129, 5), which means 129 rows and 5 columns.

The dataset info shows each column's data type and non-null count. If any null values existed, we would need to handle them. Luckily, this dataset does not contain any null values.

The dataset description reveals any major differences in the data values.

Here we did not notice any sudden spikes in the data values, which suggests that the values in each column are consistent.

**Step Two: Visualize Data**

To visualize the data, we use two libraries: matplotlib and seaborn.

We will see which columns have a high correlation with sales revenue by generating a pairplot with a separate scatter plot for each column (Socialmedia_ads, Printmedia_ads, Google_ads, Promotions).

We need to eliminate the independent variables (influencing values) that have no correlation or only a low correlation with the dependent variable (outcome value).

If you get a "module not found" error, install both the matplotlib and seaborn Python modules in your virtual environment.

Installation can be done from the Anaconda prompt or from a terminal:

    conda install seaborn
    conda install matplotlib

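    import matplotlib.pyplot as plt
    import seaborn as sns

    # one scatter plot per marketing channel against sales revenue
    # (`height` replaces the older `size` argument in current seaborn releases)
    sns.pairplot(Marketing_Sales_Data,
                 x_vars=['Socialmedia_ads','Printmedia_ads','Google_ads','Promotions'],
                 y_vars=['Sales_revenue'],
                 kind='scatter', aspect=1, height=3)
    plt.show()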

To understand data patterns and outliers, it is always good to generate a scatter plot between two numeric variables.

The pairplot from the code above shows a scatter plot of each of the 4 variables against 'Sales_revenue'.

Since we cannot decide from the scatter plots alone whether a linear relationship exists, we generate a heatmap of the correlations using the seaborn library.

    sns.heatmap(Marketing_Sales_Data.corr(), cmap='YlGnBu', annot=True)
    plt.show()

The generated heatmap is as follows.

We see that social media ads have the highest correlation with sales revenue. The correlation is 0.9, which is close to 1 and therefore high.

So, we are most interested in social media ads as the feature variable for our linear regression model.
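The same selection can be made numerically instead of reading it off the heatmap. The sketch below ranks candidate features by the absolute value of their correlation with the target. It uses synthetic data with hypothetical column names (`strong_feature`, `weak_feature`, `target`), so the numbers are not those of the sales dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
strong = rng.normal(size=n)
weak = rng.normal(size=n)
# By construction, the target depends on `strong` only
target = 2.0 * strong + 0.1 * rng.normal(size=n)

toy = pd.DataFrame({
    "strong_feature": strong,
    "weak_feature": weak,
    "target": target,
})

# Correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()["target"].drop("target").abs().sort_values(ascending=False)
)
best_feature = corr_with_target.index[0]
print(corr_with_target)
print(best_feature)
```

On the real data, the equivalent would be to apply the same ranking to the `Sales_revenue` column of the correlation matrix.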

The linear equation is

Y = β0 + β1X + ε

So, we apply Y = β0 + β1 * Social media ad + ε

Here, β1 is the model coefficient, also referred to as the slope.

The next step is the actual linear regression model generation, which finds β0 and β1.

**Step Three: Carry out actual model generation using simple linear regression**

Model generation is done in 4 stages

- Creating X and Y
- Generating Training set and Test set
- Training the model
- Evaluate generated model

**Creating X and Y**

X denotes the independent (influencing) variable and Y denotes the dependent variable.

X= Social media ad

Y= Sales revenue

    # Create the X and Y values
    X=Marketing_Sales_Data['Socialmedia_ads']
    Y=Marketing_Sales_Data['Sales_revenue']

**Generating Training set and Test set**

We split the dataset into a training set and a test set. We build the model on the training set and evaluate it on the test set.

We split the dataset in a 0.7:0.3 ratio, so 70% of the rows are allocated to training data and 30% to test data, matching the train_size and test_size values passed in the code.
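To show what such a split does, here is a minimal sketch that shuffles row indices and cuts them at the 0.7 training fraction passed to train_test_split in this post. It is an illustration written with NumPy, not scikit-learn's implementation, so it will not reproduce the exact rows scikit-learn selects. The helper name `split_train_test` and the data values are assumptions; only the row count of 129 comes from the dataset.

```python
import numpy as np

def split_train_test(x, y, train_fraction=0.7, seed=100):
    """Shuffle the row indices, then cut them into train and test parts."""
    x = np.asarray(x)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(x))
    n_train = int(train_fraction * len(x))
    train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

x_all = np.arange(129, dtype=float)   # 129 rows, like the dataset
y_all = 10.0 + 0.05 * x_all           # made-up target values
x_tr, x_te, y_tr, y_te = split_train_test(x_all, y_all)
print(len(x_tr), len(x_te))
```

Fixing the seed plays the same role as `random_state`: the same rows land in the training set every time the code is run.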

The split is done using the train_test_split function from the sklearn.model_selection module.

Make sure the scikit-learn module is installed in your environment. Otherwise, install it from the Anaconda terminal:

    pip install scikit-learn

Once the scikit-learn module is installed, split the data:

    from sklearn.model_selection import train_test_split

    X_train,x_test,Y_train,y_test= train_test_split(X,Y,train_size=.7,test_size=.3, random_state=100)
    X_train
    Y_train

We get X_train as:

and Y_train as follows:

**Building and training model**

To build a simple linear regression model, two packages can be used:

- statsmodels
- sklearn

**Building model**

The statsmodels package is used to build the model by importing the statsmodels.api module.

If statsmodels is not installed in your virtual environment, add it from the Anaconda prompt:

    conda install -c conda-forge statsmodels

By default, statsmodels fits a line that passes through the origin. But it is important to note that simple linear regression has an intercept, based on the equation Y = β0 + β1X + ε.

In the equation above, β0 is the intercept, which must be added explicitly as a constant:

    # third step - build and generate model
    import statsmodels.api as sm

    X_train_with_const=sm.add_constant(X_train)
    X_train_with_const
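To see why the constant matters, the sketch below fits the same made-up data twice with NumPy's least-squares solver: once with only the X column, and once with an extra column of ones. It illustrates the idea behind adding a constant and is not the statsmodels code path; the data values are assumptions.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 10.0 + 0.5 * x   # made-up data with a non-zero intercept

# Without a constant column, the fitted line is forced through the origin
slope_only, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

# With a column of ones, the first coefficient is the intercept (beta0)
design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
beta0, beta1 = coef

print(slope_only[0])   # slope distorted by the missing intercept
print(beta0, beta1)    # intercept and slope of the generating line
```

The through-origin fit has to absorb the intercept into the slope, which is why its slope comes out far from 0.5.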

Once we've added the constant, we can fit the regression line using the Ordinary Least Squares (OLS) method in statsmodels.

    bestFitLine=sm.OLS(Y_train,X_train_with_const).fit()

After fitting, we can read the β0 (intercept) and β1 (coefficient or slope) parameters from the model.

Here β1 is 0.056132 and β0 is 10.090400.

We are also interested in the summary of the best-fitted regression line.

To decide whether the model is suitable for prediction, the statistics we are mainly concerned with are:

- Coefficient and its p-value(significance)
- R-squared value
- F-statistic and its significance

**R-squared value**

The R-squared value is 0.821, which means that 82.1% of the variance in sales revenue can be explained by social media ad spending. This is good to know for the company.

**F-statistic and its significance**

Prob (F-statistic) is 7.25e-37, a very low p-value, which shows that the generated linear regression model is statistically significant.

**Coefficient and P-value**

The coefficient is 0.0561 and its p-value is 0.000, which is very low and indicates that the coefficient is statistically significant. That means social media ad spending has a statistically significant relationship with sales revenue.

Sales Revenue = 10.0904 + (0.0561)* Social media ad
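As a quick sketch, the fitted equation can be wrapped in a small function to predict sales revenue for a given social media ad spend. The function name `predict_sales_revenue` and the input value of 100 are hypothetical; the intercept and slope are the ones reported above, and the result is in whatever units the dataset uses.

```python
def predict_sales_revenue(social_media_ad_spend):
    """Apply the fitted line: 10.0904 + 0.0561 * social media ad spend."""
    intercept = 10.0904   # beta0 from the fitted model
    slope = 0.0561        # beta1 from the fitted model
    return intercept + slope * social_media_ad_spend

# 100 is an arbitrary illustrative input, in the dataset's own units
example_prediction = predict_sales_revenue(100)
print(round(example_prediction, 4))
```

Predictions are most trustworthy for ad spend values inside the range seen in the training data.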

The fit is significant, so we can go ahead and generate the scatter plot of sales revenue against social media ads, with the fitted line drawn from the intercept and coefficient.

    plt.scatter(X_train,Y_train)
    plt.plot(X_train,10.0904 + 0.0561*X_train ,'r')
    plt.show()

We have successfully built the model with the training dataset.

**Step Four: Residual analysis**


Simple linear regression relies on a number of assumptions, one of which is that the errors are normally distributed.

We find the error terms (residuals) by predicting the Y value (sales revenue) from the X value (social media ads).

Error = Actual 'y' value - predicted 'y' value

To predict the 'y' values, use the model's predict method on the training dataset.

Then calculate the residual as the difference between 'Y_train' and 'y_train_prediction'.

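    y_train_prediction= bestFitLine.predict(X_train_with_const)
    # residual = actual value - predicted value (used in the plots that follow)
    residual = Y_train - y_train_prediction
    y_train_prediction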

We need to check whether the residuals appear to be normally distributed, which we do by plotting a histogram of the residuals.

    # check normal distribution in residual using histogram
    # (histplot with kde=True replaces the deprecated sns.distplot)
    residualFigure=plt.figure()
    sns.histplot(residual, bins=15, kde=True)
    plt.title("Error terms", fontsize=20)
    plt.xlabel("Y_train-y_train_prediction",fontsize= 20)
    plt.show()

Histogram of residual

The histogram shows that the residuals are approximately normally distributed, as they are symmetrical on both sides of the mean. The mean is 0, and the frequencies fall away on both the left and right sides of the mean, giving a bell-shaped histogram.

Next, we check whether any patterns exist in the residuals.

    # check whether any specific pattern exists in the residuals
    plt.scatter(X_train,residual)
    plt.show()

The scatter plot does not show any patterns.

Since the residuals do not follow any pattern and are normally distributed, we are good to go with evaluating the linear regression model using the test data.
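A histogram is a visual check, and a simple numeric companion is the sample skewness of the residuals, which is close to 0 for a symmetric, bell-shaped distribution. The sketch below is an illustration on synthetic error terms, not on the model's residuals; the helper name `sample_skewness` and both arrays are assumptions.

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment: near 0 for symmetric data."""
    v = np.asarray(values, dtype=float)
    centered = v - v.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

rng = np.random.default_rng(42)
symmetric_errors = rng.normal(loc=0.0, scale=1.0, size=1000)   # bell-shaped
skewed_errors = rng.exponential(scale=1.0, size=1000)          # long right tail

print(sample_skewness(symmetric_errors))
print(sample_skewness(skewed_errors))
```

Applied to the `residual` series from this post, a value far from 0 would be a hint to look more closely at the normality assumption.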

**Step Five: Evaluating the model using test data**

In this step, we predict the sales revenue for the test data. Just as we added a constant to the training dataset, we add a constant to the test dataset, and then predict the y values using the predict method in statsmodels.

    x_test_add_constant=sm.add_constant(x_test)
    y_test_prediction=bestFitLine.predict(x_test_add_constant)
    y_test_prediction

Now we find the R² of the predicted y-values, which can be done with the r2_score function from the sklearn.metrics module.

    # find R square
    from sklearn.metrics import r2_score

    # Checking the R-squared value
    r_squared_value = r2_score(y_test, y_test_prediction)
    r_squared_value

We get an R² value of 0.8061808171303232.

Earlier, we got an R² value of 0.821 for our training set, which is very close to the R² value for the test data. The difference is only 0.821 - 0.806 = 0.015.

Since the R² values for the training and test data differ by less than 2 percentage points, the model is stable.
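The comparison can be written out as a short calculation using the two R² values reported in this post. The variable names are hypothetical.

```python
r2_train = 0.821                 # R-squared on the training set (OLS summary)
r2_test = 0.8061808171303232     # R-squared on the test set (r2_score)

# A small gap suggests the model is not overfitting the training data
r2_gap = r2_train - r2_test
print(round(r2_gap, 4))
```

A much larger gap between training and test R² would suggest that the model had fitted noise in the training data.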

So we can conclude that the generated model generalizes well to unseen data.

The best-fit line for the test data is shown below.

In my experience, data science blogs like this can often be time-consuming and lengthy due to the technical nature of the subject matter. However, they are undoubtedly worth the investment of time and effort, as they offer practical insights.

Happy coding.
