How to split training data and test data in machine learning
In machine learning, a model's performance on unseen future data is estimated by generating the model from a training dataset.
A machine learning algorithm needs to be fed with data. Training data is fed into the algorithm, which helps in training it.
So where does this training dataset come from?
The training data comes from the original dataset. It acts as input to the algorithm, which learns from it and generates the model.
We also have another dataset, called the test dataset, which acts as input to the model after the model has been generated. The test data is used to check that the model works as expected. The test dataset also comes from the same original dataset.
Since we have only a single dataset, we need to split it to create the training dataset and the test dataset.
One library that enables us to split a dataset is scikit-learn. The train_test_split function from its model_selection module provides a convenient way to split a dataset into training and testing subsets.
We will conduct an in-depth analysis of how scikit-learn splits the data:
from sklearn.model_selection import train_test_split
import pandas as pd
Make sure scikit-learn is installed in the virtual environment.
If scikit-learn is not listed in the virtual environment, install it from the Anaconda Prompt:
conda install scikit-learn
Read the CSV file containing the dataset, then read the independent and dependent variables into the X and y variables respectively.
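A minimal sketch of this step, assuming a hypothetical CSV with feature columns and a column named "target" (here an in-memory string stands in for the file so the example is self-contained):

```python
import pandas as pd
from io import StringIO

# A tiny stand-in for a CSV file; column names are hypothetical
csv_data = StringIO("feature1,feature2,target\n1,4,0\n2,5,1\n3,6,0\n")

df = pd.read_csv(csv_data)        # in practice: pd.read_csv("data.csv")
X = df[["feature1", "feature2"]]  # independent variables (features)
y = df["target"]                  # dependent variable (target)
print(X.shape, y.shape)           # (3, 2) (3,)
```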
We typically split the data into a training set and a testing set.
The syntax for splitting a dataset is:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8, test_size=0.2, random_state=42)
'X' represents feature data (input variables).
'y' represents the target variable (output variable).
train_size is the fraction of the data allocated to the training set.
test_size is the fraction of the data allocated to the testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=100)
The above code generates four datasets:
- X_train is feature data for the training set.
- X_test is feature data for the testing set.
- y_train is target variable data for the training set.
- y_test is target variable data for the testing set.
For example, X_train contains the feature rows selected for the training set.
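The split above can be sketched end to end with toy data (the arrays here are illustrative stand-ins for the CSV-based X and y):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with 2 features each, and 20 target values
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=100
)

print(X_train.shape)  # (15, 2) -> 75% of the 20 samples
print(X_test.shape)   # (5, 2)  -> 25% of the 20 samples
```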
'X' is generally a NumPy array, but Python lists are also allowed. X should be 2-dimensional; if it is a 1-dimensional NumPy array, it must be converted by reshaping the data. Reshaping is done with .reshape(-1, 1).
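A short sketch of that reshape, turning a 1-dimensional array of a single feature into the 2-dimensional column that scikit-learn expects:

```python
import numpy as np

# A 1-dimensional array holding a single feature
x = np.array([1, 2, 3, 4])
print(x.shape)        # (4,)

# -1 tells NumPy to infer the row count; 1 forces a single column
X = x.reshape(-1, 1)
print(X.shape)        # (4, 1)
```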
'y' consists of the vector of target values (outputs), which is usually a 1-dimensional NumPy array, although lists and 2-dimensional arrays are also allowed.
The allocation of data is given in decimal/float format; for example, 0.2 denotes 20%. If an integer value such as 20 is passed instead, it denotes 20 samples.
train_size is optional; if it is not given, the remaining fraction of the data (the complement of test_size) is allocated to the training set.
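The float-versus-integer distinction can be sketched as follows (the 50-sample array is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 samples
y = np.arange(50)

# Float: a fraction of the dataset
_, X_test_frac, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
print(len(X_test_frac))  # 10 -> 20% of 50 samples

# Integer: an absolute number of samples
_, X_test_abs, _, _ = train_test_split(X, y, test_size=5, random_state=0)
print(len(X_test_abs))   # 5 -> exactly 5 samples
```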
The 'random_state' is an optional parameter that seeds the random number generator, ensuring reproducibility of the split. The random number generator is what randomly selects which observations go into the training set and which go into the test set.
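Reproducibility can be verified directly: splitting the same data twice with the same random_state yields identical subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The same seed always produces the same split
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
print(np.array_equal(X_test_a, X_test_b))  # True -> identical splits
```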
By default, the train_test_split function shuffles the data before splitting (shuffle=True). Shuffling randomizes which observations land in each subset; to additionally ensure the distribution of classes is similar between the training and testing sets, the separate stratify parameter can be passed (e.g. stratify=y).
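A sketch of stratification with an imbalanced toy target, showing the class ratio preserved in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# An imbalanced target: 8 zeros and 4 ones (a 2:1 ratio)
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)

# stratify=y keeps the 2:1 class ratio in both subsets
_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(np.bincount(y_train))  # [6 3] -> still 2:1
print(np.bincount(y_test))   # [2 1] -> still 2:1
```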