How to Divide Training and Testing Sets in Machine Learning

Sean Lee
5 min read · Mar 2, 2021


I tried to organize machine learning concepts as simply as possible. First of all, what is machine learning? It means teaching a machine, and you can think of it in two ways depending on how you want to teach it.

  • Supervised Learning
  • Unsupervised Learning

Let’s take a quick look at the differences.

Supervised Learning

Supervised learning is possible when a label (correct answer) is predetermined in the data. If you give the program an input value, the machine learns to predict the corresponding output value.

For example, imagine playing music to a child who has never heard music before and telling them, “This is rock, this is EDM, this is jazz.” If you teach them this way, eventually they can hear a new song and say, “This is jazz!” Of course, this may not be a perfect example. There is no fixed answer for classifying music genres. Anyway…

Supervised learning can be further subdivided into Regression and Classification.

  • Regression: Predicting continuous values, i.e. numbers, or ‘how much’
  • Classification: Predicting discrete values, that is, ‘what’

In other words, the music genre prediction mentioned above is a ‘classification’ task. Classification is used very widely, especially in the field of image processing.

Unsupervised Learning

Unsupervised learning learns the inherent structure or patterns of data for which no label (correct answer) has been determined. Clustering is a typical unsupervised learning task.

  • Clustering: Grouping data by finding patterns and structure in it (see the sketch after this list)
  • Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as “people who buy X also tend to buy Y.”
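
To make clustering concrete, here is a minimal sketch using scikit-learn’s KMeans on synthetic data (the make_blobs dataset and the choice of three clusters are assumptions made purely for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Synthetic, unlabeled data with three natural groups (illustrative assumption)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
# KMeans never sees labels; it only looks for structure in the inputs
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print(cluster_ids[:10])  # a group index per point, not a "correct answer"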

Note that people who are not familiar with machine learning often cannot clearly distinguish between classification and clustering. Classification predicts a known label (the correct answer), while clustering groups similar items together without any labels. They are completely different tasks.

We can make predictions or classifications through machine learning. However, it is important to ask how accurate those predictions or classifications are. If I build a classification model, how do I know how right or wrong its predictions will be?

Fortunately, in supervised learning the data is already labeled (comes with correct answers), so you can use those labels to test the accuracy of the algorithm.

When dividing data to verify the effectiveness of a machine learning model, you usually need to understand the following three concepts.

  • Training Set
  • Validation Set
  • Testing Set

Among them, the validation set and the test set are actually similar concepts.

Training Set And Validation Set

The training set is literally the data the algorithm will learn from.

Once the model is trained on the training set, its prediction/classification accuracy can be calculated on a validation set. You actually know the label (the correct answer) for every example in the validation set, but you pretend not to: the examples are fed into the model as if they were new data. This works because they were excluded during training. By comparing the model’s predicted/classified values with the answers you actually hold, you can finally measure its accuracy.
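
Here is a minimal sketch of that workflow (the iris dataset and the logistic regression model are just placeholders; any labeled dataset and classifier work the same way):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
x, y = load_iris(return_X_y=True)  # placeholder labeled data
# Hold out a validation set the model never sees during training
x_train, x_val, y_train, y_val = train_test_split(x, y, train_size=0.8, random_state=0)
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)  # learn from the training set only
y_pred = model.predict(x_val)  # feed the validation set in as if it were new data
print(accuracy_score(y_val, y_pred))  # compare predictions with the answers we held back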

Of course, accuracy is not a universal indicator. When judging the effectiveness of a machine learning algorithm, there are also metrics such as precision, recall, and the F1 score, which is the harmonic mean of precision and recall. I’ll post about these indicators later if I have the chance. Let’s move on.
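
That said, if you want those metrics right away, scikit-learn computes them directly. A quick sketch, reusing the hypothetical y_val and y_pred arrays from the sketch above:

from sklearn.metrics import precision_score, recall_score, f1_score
# With multi-class labels an averaging strategy must be chosen;
# "macro" (the unweighted mean over classes) is just one common option
print(precision_score(y_val, y_pred, average="macro"))
print(recall_score(y_val, y_pred, average="macro"))
print(f1_score(y_val, y_pred, average="macro"))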

How To Split The Data

It is not easy to decide what ratio to use when dividing the data into a training set and a testing set. If the training set is too small, the algorithm may not have enough data to learn effectively. On the other hand, if the testing set is too small, the accuracy, precision, recall, and F1 scores calculated from it will vary greatly from sample to sample, so they may be hard to trust.

In general, it is recommended to use 80% of the total data for training and 20% for testing.

  • How to split the data in R:
library(tidymodels)
dat_split <- initial_split(dat, prop = 0.8, strata = target)
train <- training(dat_split)
test <- testing(dat_split)
  • How to split the data in Python:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2, random_state=0)
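
One side note: the R example stratifies on the target, and train_test_split can do the same through its stratify argument, which keeps the class proportions similar in both splits. An optional variant, assuming y holds the labels:

# Stratified variant: class proportions in y are preserved in both splits
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2, random_state=0, stratify=y)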

N-Fold Cross-Validation

If there is not enough data, a single 80/20 split still leaves a lot of variance in the performance estimate. One solution is N-fold cross-validation: split the data into N folds, run the whole train-and-validate process N times with a different fold held out each time, and average the accuracy across all runs.

For example, with 10-fold cross-validation, the data is split into ten parts, and each part takes one turn as the validation set while the other nine are used for training.

Naturally, the average accuracy obtained this way better represents the typical performance of the model.

And since doing this by hand is cumbersome, both R and Python provide functions for it.

  • R:
library(rsample)
folds <- vfold_cv(train, v = 10, strata = target)
  • Python:
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True, random_state=0)
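
If all you need is the averaged score, scikit-learn can also run the whole loop for you. A minimal sketch with a placeholder model and dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
x, y = load_iris(return_X_y=True)  # placeholder data
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, x, y, cv=10)  # train/validate 10 times, a different fold held out each time
print(scores.mean())  # average the 10 accuracies for a more stable estimate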

Model Performance Improvement And Testing Set

Fine-tuning a model’s parameters can improve its performance. For example, in the K-Nearest Neighbors algorithm, you can see how the accuracy changes as you increase or decrease K.
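
As a sketch of that tuning loop, assuming the x_train/x_val split from the earlier example:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Try several values of K and keep the one that scores best on the validation set
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    print(k, accuracy_score(y_val, knn.predict(x_val)))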

If the model’s performance is reasonably satisfactory, you can bring in the test set. This is real, unseen data: it may be a value you construct because you are curious how it will be predicted/classified, newly acquired data, or data deliberately set aside before the model was built. In some ways it is similar to a validation set, except that it is never used when building or tuning the model.

Finally, by comparing the model’s predicted/classified values with the actual values in this test set, you can calculate the accuracy, precision, recall, and F1 score, and judge whether the model is good enough.
