
How split train data and test data in machine learning


In machine learning, a model's performance on unseen future data is estimated by building the model from a training dataset.

A machine learning algorithm needs to be fed with data. The training data is fed into the algorithm, which is how the algorithm is trained.


So, where do we get this training dataset from?


The training data comes from the original dataset.


The training dataset acts as input to the algorithm, which learns from it and generates the model.


We also have another dataset, called the test dataset, which acts as input to the model after the model has been generated. The test data is used to test the model and check that it works as expected. The test dataset also comes from the same original dataset.




We have only a single dataset, so to create the training dataset and the test dataset we need to split it.


One library that enables us to split a dataset is scikit-learn. The train_test_split function from its model_selection module provides a convenient way to split a dataset into training and testing subsets.


Let's take a closer look at how scikit-learn splits the data:


Import libraries

from sklearn.model_selection import train_test_split
import pandas as pd


Make sure scikit-learn is installed in your virtual environment.


If scikit-learn is not listed in the virtual environment, install it from the Anaconda Prompt:

conda install scikit-learn
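

If you use pip instead of conda, the equivalent command is:

pip install scikit-learn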


Read the CSV file containing the dataset


df = pd.read_csv("MarketingExpenses_SalesRevenue.csv")


Read the independent and dependent variables into the X and Y variables, respectively.


X = df['Socialmedia_ads']
Y = df['Sales_revenue']
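

As a quick sanity check (a minimal sketch; the column names are taken from the assignment above), we can peek at the data before splitting:

print(df.head())         # first five rows of the dataset
print(X.shape, Y.shape)  # X and Y are pandas Series, so each shape is (n,)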


We typically split the data into a training set and a testing set.

The syntax for splitting the dataset is:


X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=42)



'X' represents the feature data (input variables).

'y' represents the target variable (output variable).

train_size is the fraction of the data allocated to the training set.

test_size is the fraction of the data allocated to the testing set.


X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.75, test_size=0.25, random_state=100)


The above code generates four datasets:


  1. X_train is the feature data for the training set.
  2. X_test is the feature data for the testing set.
  3. y_train is the target variable data for the training set.
  4. y_test is the target variable data for the testing set.


For example, we can inspect X_train after the split (here it is a pandas Series, since X came from a single column):
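
print(X_train.head())  # a random 75% sample of the Socialmedia_ads column
print(X_train.shape)   # roughly 0.75 * n rows
print(X_test.shape)    # roughly 0.25 * n rows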

'X' is generally a NumPy array, but Python lists are also allowed. X should be 2-dimensional; if it is a 1-dimensional NumPy array, it has to be converted by reshaping the data. Reshaping is done with .reshape(-1, 1).
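

A minimal sketch of that conversion, assuming X was read from the single column above:

X_2d = X.to_numpy()         # pandas Series to 1-D NumPy array, shape (n,)
X_2d = X_2d.reshape(-1, 1)  # now 2-D, shape (n, 1): one column, n rows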


'y' consists of the vector of target values (outputs), which is usually a 1-dimensional NumPy array, although lists and 2-dimensional arrays are also allowed.



The allocation of data is given in decimal/float format: for example, 0.2 denotes 20%. If an integer value such as 20 is passed instead, it means 20 samples.

train_size is optional; if it is not given, the fraction of the data remaining after test_size is allocated to the training set.
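
A short sketch of these variants (the exact row counts depend on the dataset size):

# 20% of the rows go to the test set, the remaining 80% to training
train_test_split(X, Y, test_size=0.2, random_state=42)

# an integer means a sample count: exactly 20 rows go to the test set
train_test_split(X, Y, test_size=20, random_state=42)

# train_size omitted: everything not in the test set is used for training
train_test_split(X, Y, test_size=0.25, random_state=42)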


The 'random_state' is an optional parameter that seeds the random number generator, ensuring reproducibility of the split. The random number generator decides which observations go into the training set and which go into the test set.
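
For instance, two calls with the same seed produce identical splits (the variable names here are only for illustration):

first_train, first_test = train_test_split(X, train_size=0.75, random_state=100)
second_train, second_test = train_test_split(X, train_size=0.75, random_state=100)
print(first_train.equals(second_train))  # True: same rows, same order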


Randomized Splitting:

By default, the train_test_split function shuffles the data before splitting.

Shuffling randomizes the row order before the split is made, so the result does not depend on how the data happens to be ordered; this helps keep the distribution of classes or patterns similar between the training and testing sets.
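
If the data must keep its original order (for example, time-ordered observations), shuffling can be turned off; a sketch using the shuffle parameter:

# shuffle=False keeps the original row order:
# the test set is simply the last 25% of the rows
# (random_state has no effect when shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, shuffle=False)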



