Machine Learning with simple linear regression using Jupyter Notebook and Python
In this blog, we look at how to build a linear regression model using Python programming in a Jupyter Notebook.
You can download all the code from our GitHub repo here.
In my view, understanding the concept of linear regression is the key to developing a linear regression model.
So what is linear regression?
Simple linear regression is a statistical technique used to generate a model that shows the relationship between two variables.
You can think of these variables as a predictor variable and an outcome variable. The predictor variable is known as the independent variable, and the outcome variable as the dependent variable.
In this blog, we aim to generate a machine learning model based on linear regression, under the assumption that a linear relationship exists between the independent variable and the dependent variable.
The goal of simple linear regression is to find the best-fitting line that represents this linear relationship.
In simple linear regression, we are looking for a line of the form:
Y = β0 + β1X + ε
Y represents the dependent variable (output variable).
X represents the independent variable (input variable).
β0 is the y-intercept of the line, which represents the predicted value of Y when X is equal to zero.
β1 is the slope of the line, which represents the change in Y for a one-unit change in X.
ε represents the error term, which captures the variability in the data that is not explained by the linear relationship between X and Y.
In linear algebra, we use a method called least squares to find the best-fitted line.
The values of β0 and β1 are estimated by minimizing the sum of the squared differences between the observed Y values and the predicted Y values on the line.
Once the line is fitted to the data, we use the simple linear regression model to make predictions on new data points by substituting the values of X into the equation.
The quality of the fit and the predictive power of the model are assessed by measuring the coefficient of determination (R-squared) and the standard error.
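The least squares estimates have a simple closed form: β1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)², and β0 = ȳ − β1·x̄. Here is a minimal sketch with NumPy, using small made-up numbers (not the blog's dataset):

```python
import numpy as np

# Tiny illustrative sample (made-up values, not the blog's dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Closed-form least squares estimates:
# beta1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# beta0 = mean(y) - beta1 * mean(x)
beta0 = y.mean() - beta1 * x.mean()

print(beta0, beta1)  # intercept and slope of the best-fitting line
```

Libraries such as statsmodels (used later in this post) compute the same estimates, along with diagnostics such as R-squared.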
Steps in simple linear regression model generation
Step One: Read and know your data
Step Two : Visualize the data
Step Three: Carry out actual model generation using simple linear regression
Step Four: Residual analysis
Step Five: Predict output variable(Y value) based on new data(X value)
Step One: Read and know your data
Data scientists spend most of their time on this step.
Ensuring the quality and reliability of the data is crucial for machine learning. This may involve validating data integrity, checking for duplicate records, verifying data consistency, and addressing any data quality issues. Careful inspection and validation can be time-consuming, particularly for large datasets.
I often find that real-world datasets contain missing values, outliers, inconsistent formatting, and many other issues. Before you start training a machine learning model, you should preprocess and clean the data; these tasks usually involve handling missing values, removing outliers, and resolving inconsistencies. The data cleaning process can be time-consuming, especially if the dataset is complex, large, or poorly structured.
Machine learning models often require data to be preprocessed or transformed into a suitable format.
During this step, we import python libraries that support data manipulation.
With the data import libraries, we import and read a company dataset. The sample company data can be downloaded here.
Our sample data consists of the company's monthly sales and marketing expenditure.
# import libraries for data import
import pandas as pd
import numpy as np

# read csv file using pandas
Marketing_Sales_Data = pd.read_csv("MarketingExpenses_SalesRevenue.csv")
The data read from the CSV is displayed as:
Understanding the characteristics and relationships within the data is an essential step in machine learning
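One quick way to start is to preview a few rows with `head()`. The snippet below uses a hypothetical mini-frame with the same column names that appear later in this post; the numbers are invented purely for illustration:

```python
import pandas as pd

# Hypothetical rows mimicking the dataset's structure (values are invented)
sample = pd.DataFrame({
    "Socialmedia_ads": [230.1, 44.5, 17.2],
    "Printmedia_ads": [37.8, 39.3, 45.9],
    "Google_ads": [69.2, 45.1, 69.3],
    "Promotions": [12.0, 8.5, 10.1],
    "Sales_revenue": [22.1, 10.4, 9.3],
})

# head() shows the first rows, a quick sanity check after reading a CSV
print(sample.head())
```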
To know more about the data, we are interested in its shape, info, and description. The following code is used:
# Know the data shape, info and data descriptions
Marketing_Sales_Data.shape
Marketing_Sales_Data.info()
Marketing_Sales_Data.describe()
The dataset has a shape of (129, 5), which means 129 rows and 5 columns.
The dataset info shows each column's data type and non-null count. If any null values existed, we would need to handle them, but luckily this dataset does not contain any null values.
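A common way to verify this directly is `isnull().sum()`. Here is a sketch on a toy frame with deliberate gaps (invented values, not the blog's data):

```python
import pandas as pd

# Toy frame with deliberate gaps (invented values, not the blog's data)
df = pd.DataFrame({
    "Socialmedia_ads": [230.1, None, 17.2],
    "Sales_revenue": [22.1, 10.4, None],
})

# Count missing values per column; non-zero counts would call for cleaning
null_counts = df.isnull().sum()
print(null_counts)
```

On the blog's dataset, `Marketing_Sales_Data.isnull().sum()` would report zero for every column.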
Any major differences in data values can be spotted from the dataset description.
Here we did not notice any sudden spikes in the values, which leads to the assumption that the column values are consistent across the dataset.
Step Two: Visualize Data
To visualize the data, we use two libraries: matplotlib and seaborn.
We will see which columns have a high correlation with sales revenue by generating a separate pairplot for each column (Socialmedia_ads, Printmedia_ads, Google_ads, Promotions).
We need to eliminate the independent variables (influencing values) that have no or only low correlation with the dependent variable (outcome value).
If you get a "module not found" error, you should install both the matplotlib and seaborn Python modules in your virtual environment.
Installation can be done in the Anaconda prompt or from a terminal:
conda install seaborn
conda install matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(Marketing_Sales_Data, x_vars=['Socialmedia_ads','Printmedia_ads','Google_ads','Promotions'], y_vars=['Sales_revenue'], kind='scatter', aspect=1, height=3)
plt.show()
To understand data patterns and outliers, it is always good to generate a scatter plot between two numeric variables.
The pairplot from the code shows a scatter plot of each of the 4 variables against 'Sales_revenue'.
Since we cannot tell from the scatter charts above whether a linear relationship exists, we generate a heatmap using the seaborn library:
sns.heatmap(Marketing_Sales_Data.corr(), cmap='YlGnBu', annot=True)
plt.show()
The generated heatmap is as follows:
We see that social media ads have the highest correlation with sales revenue: the correlation is 0.9, which is close to 1 and therefore high.
So we are most interested in social media ads as the feature variable for our linear regression model.
The linear equation is
Y = β0 + β1X + ε
So we apply: Y = β0 + β1 * Social media ads + ε
Here, β1 is the model coefficient, also referred to as the slope.
The next step is the actual linear regression model generation: finding β0 and β1.
Step Three: Carry out actual model generation using simple linear regression
Model generation is done in 4 stages
- Creating X and Y
- Generating Training set and Test set
- Training the model
- Evaluating the generated model
Creating X and Y
X denotes the independent (influencing) variable and Y denotes the dependent variable:
X= Social media ad
Y= Sales revenue
# Create the X and Y values
X = Marketing_Sales_Data['Socialmedia_ads']
Y = Marketing_Sales_Data['Sales_revenue']
Generating Training set and Test set
We split the dataset into a training set and a test set. We build the model on the training set and evaluate it on the test set.
We split the dataset in a 0.7:0.3 ratio, meaning 70% is allocated to training data and 30% to test data.
We split the dataset using the train_test_split method from the sklearn.model_selection library.
Make sure the scikit-learn module is installed in your environment; otherwise, install it from the Anaconda terminal:
pip install scikit-learn
Once the scikit-learn module is installed:
from sklearn.model_selection import train_test_split

X_train, x_test, Y_train, y_test = train_test_split(X, Y, train_size=.7, test_size=.3, random_state=100)
X_train
Y_train
We got X_train as :
Y_train as follows
Building and training model
To build and evaluate a simple linear regression model, two packages are used: statsmodels and scikit-learn.
The statsmodels package is used to build the model, by importing the statsmodels.api library.
If statsmodels is not installed in your virtual environment, add it from the Anaconda prompt:
conda install -c conda-forge statsmodels
By default, the statsmodels OLS fit generates a line that passes through the origin. But it is important to note that simple linear regression has an intercept value, based on the equation Y = β0 + β1X + ε.
In the above, β0 is the intercept value, which must be added as a constant term.
# third step - build and generate model
import statsmodels.api as sm

X_train_with_const = sm.add_constant(X_train)
X_train_with_const
Using the Ordinary Least Squares (OLS) method in statsmodels, we can generate the best-fit line.
Once we've added the constant, we fit the regression line using the OLS method present in statsmodels. After that, we can see the β0 (intercept) and β1 (coefficient, or slope) parameters.
Here β1 is 0.056132 and β0 is 10.090400.
We are also interested in the summary of the best-fitted regression line.
We should check whether the model is suitable for prediction. The statistics we are mainly concerned with to determine whether the model is viable are:
- Coefficient and its p-value(significance)
- R-squared value
- F-statistic and its significance
The R-squared value is 0.821, meaning 82.1% of the variance in sales revenue can be explained by social media ad spending, which is good news for the company.
F-statistic and its significance
Prob (F-statistic) is 7.25e-37, a very low p-value, which shows the statistical significance of the generated linear regression model.
Coefficient and P-value
The coefficient is 0.0561 and its p-value is 0.000, which is very low and indicates that the coefficient is statistically significant. That means social media ads are statistically significantly related to sales revenue.
Sales Revenue = 10.0904 + (0.0561)* Social media ad
The fit is significant, so we can go ahead and generate the scatter plot of sales revenue against social media ads using the intercept and coefficient:
plt.scatter(X_train, Y_train)
plt.plot(X_train, 10.0904 + 0.0561 * X_train, 'r')
plt.show()
We have successfully built the model with the training dataset.
Step Four: Residual analysis
Simple linear regression has to satisfy a number of assumptions, and one of them is that the errors are normally distributed.
We find the error terms (residuals) by predicting the Y values (sales revenue) from the X values (social media ads):
Error = actual 'y' value - predicted 'y' value
To predict the 'y' values, use the predict method of the model on the training dataset.
Then calculate the residuals as the difference between 'Y_train' and 'y_train_prediction'.
We need to check whether the residuals appear to be normally distributed, which we do by plotting a histogram of the residuals:
# check normal distribution of the residuals using a histogram
residualFigure = plt.figure()
sns.histplot(residual, bins=15, kde=True)
plt.title("Error terms", fontsize=20)
plt.xlabel("Y_train - y_train_prediction", fontsize=20)
plt.show()
Histogram of the residuals:
The histogram shows that the residuals are normally distributed: the data is symmetrical on both sides of the mean, the mean is 0, and the counts fall away on either side of the mean in a bell-shaped curve.
Next, we check whether any patterns exist in the residuals.
# check whether any specific pattern exists in the residuals
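One way to sketch this check is a scatter plot of residuals against predicted values; synthetic stand-ins for `residual` and `y_train_prediction` are used here so the snippet runs on its own:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this sketch runs anywhere
import matplotlib.pyplot as plt

# Synthetic stand-ins for the post's predicted values and residuals
rng = np.random.default_rng(1)
y_train_prediction = rng.uniform(10, 40, size=90)
residual = rng.normal(0, 2, size=90)

# Residuals vs. fitted values: a patternless cloud around zero supports
# the linearity and constant-variance assumptions
plt.scatter(y_train_prediction, residual)
plt.axhline(0, color="red")
plt.xlabel("y_train_prediction")
plt.ylabel("residual")
plt.show()
```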
The scatter plot does not show any patterns.
Since the residuals do not follow any pattern and are normally distributed, we are good to go with linear regression model evaluation using the test data.
Step Five: Evaluating the model using test data
In this step we predict the sales revenue from the test data. Just as we added a constant to the training dataset, here we add a constant to the test dataset and predict the y values using the predict method in statsmodels.
Now we find the R² of the predicted y values, which can be done with the help of the r2_score function from the sklearn.metrics package.
# find R squared
from sklearn.metrics import r2_score

# Checking the R-squared value
r_squared_value = r2_score(y_test, y_test_prediction)
r_squared_value
We get R² value as 0.8061808171303232
Earlier, we got an R² value of 0.821 for our training set, which is very close to the R² value for the test data; the difference is only 0.821 - 0.806 = 0.015.
Since there is only about a 1.5% difference between the R² values of the training data and the test data, the model is stable.
So we can conclude that the generated model can predict unseen data.
The best-fit line for the test data is seen below.
In my experience, data science blogs like this can be time-consuming and lengthy due to the technical nature of the subject matter. However, they are undoubtedly worth the investment of time and effort, as they offer practical insights.