Importance of Data Normalization

Sean Lee
5 min read · Mar 10, 2021
Photo by Markus Spiske

Why do we need Data Normalization?

Machine learning algorithms find patterns in data by comparing the features of the data points. Algorithms such as distance-based algorithms and gradient-descent-based algorithms expect the features to be on comparable scales. When the scales of the features differ widely, this becomes a problem.

For example, consider a dataset with information about houses. It might include features such as the number of rooms and how long ago the house was built. Let's say we try to predict which house is the most suitable with a machine learning algorithm. When the algorithm compares data points, the comparison is completely dominated by the feature with the larger scale, in this case how long ago the house was built.

Let’s look at the chart below.

We expect machine learning algorithms to recognize how big the difference is between a one-room house and a 20-room house. (Isn’t it common sense?)

However, as the figure shows, if two houses were built at the same time, their data points end up very close together. Features like the number of rooms barely matter.

So, the goal of normalization is to ensure that every feature is reflected on the same scale, and therefore carries the same importance.

If the above data is normalized with Min-Max Normalization, it appears as follows.
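As a minimal sketch of what this could look like in code (the data frame and its values are hypothetical, pandas is assumed, and the Min-Max formula itself is explained in the next section):

import pandas as pd

# Hypothetical housing data: a small-scale feature (rooms) next to a large-scale one (age in years)
df = pd.DataFrame({
    "rooms": [1, 3, 20, 2],
    "age_years": [95, 90, 92, 5],
})

# Min-Max Normalization: rescale every column to the range [0, 1]
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)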

Two ways to normalize data

There are other normalization methods, but the following two are the most widely used.

  • Min-Max Normalization
  • Z-Score Normalization

Each has its pros and cons, so you need to understand exactly how they work in order to decide when and how to normalize.

Let’s find out one by one.

1. Min-Max Normalization

Min-max normalization is the most common way to normalize data. For every feature, the minimum value is converted to 0, the maximum value is converted to 1, and every other value is converted to a number between 0 and 1: X' = (X - min) / (max - min).

For example, if the minimum value of a feature is 20 and the maximum value is 40, then 30 is exactly in the middle, so it is converted to (30 - 20) / (40 - 20) = 0.5.

To min-max normalize a feature X, you can use the following code in R or Python.

  • R:
normalize <- function(x){
  min_max <- (x - min(x)) / (max(x) - min(x))
  return(min_max)
}
# normalize() is vectorized, so it can be called on a single column directly;
# to normalize every column of a data frame df, use lapply:
as.data.frame(lapply(df, normalize))
  • Python:
def min_max_normalize(x):
    # Compute the minimum and maximum once, instead of on every iteration
    minimum, maximum = min(x), max(x)
    normalized = []
    for value in x:
        # Rescale each value into the range [0, 1]
        normalized_num = (value - minimum) / (maximum - minimum)
        normalized.append(normalized_num)
    return normalized
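
As a quick sanity check of the Python function above, using the example from the previous paragraph (20 is the minimum, 40 is the maximum, and 30 sits exactly in the middle):

print(min_max_normalize([20, 30, 40]))  # [0.0, 0.5, 1.0]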

However, Min-Max Normalization has a fatal drawback: it is affected far too much by outliers.

For example, what if there are 100 values, of which 99 are between 0 and 40 and the remaining one is 100? Then all 99 values are converted to values between 0 and 0.4.
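
Here is a small, hypothetical illustration of that effect, reusing the min_max_normalize function above:

import random

# 99 values between 0 and 40, plus a single outlier at 100
values = [random.uniform(0, 40) for _ in range(99)] + [100]
normalized = min_max_normalize(values)

# The outlier maps to 1.0, while the other 99 values are squeezed into [0, 0.4]
print(max(n for n in normalized if n < 1.0))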

The figure below shows the result.

Looking at the figure above, normalization is effectively applied on the y-axis, but there are still problems on the x-axis. If you compare the points of the data in this state, the influence of the y-axis is bound to be dominant.

To compensate for these shortcomings, Z-Score Normalization should be considered.

2. Z-Score Normalization

Z-Score Normalization is a data normalization strategy that avoids outlier problems.

If a feature value equals the mean, it is normalized to 0; values below the mean become negative, and values above the mean become positive. How large those negative and positive numbers are is determined by the standard deviation of the feature: Z = (X - mean) / (standard deviation). So, if the standard deviation of the feature is large (the values are widely spread), the normalized values end up closer to zero.

  • R:
normalize <- function(x){
  z_score <- (x - mean(x)) / sd(x)
  return(z_score)
}
# normalize() is vectorized, so it can be called on a single column directly;
# to normalize every column of a data frame df, use lapply:
as.data.frame(lapply(df, normalize))
  • Python:
import numpy as np

def z_score_normalize(x):
    # Compute the mean and standard deviation once, instead of on every iteration
    mean, std = np.mean(x), np.std(x)
    normalized = []
    for value in x:
        # Center each value on the mean and scale by the standard deviation
        normalized_num = (value - mean) / std
        normalized.append(normalized_num)
    return normalized

Take a look at the picture below. This is the same data we used for Min-Max Normalization, but this time normalized with Z-Score Normalization.

Although the data still looks skewed, almost all of the points now fall between -2 and 2 on both the x-axis and the y-axis, so they are expressed on a roughly similar scale. It is still not exactly the same scale, as there is one point above 5 on the x-axis, but the problem we saw with Min-Max Normalization has been solved.
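
For a rough numerical feel, here is a small check with the z_score_normalize function above on the same kind of outlier-heavy, hypothetical data:

import random

# 99 values between 0 and 40, plus a single outlier at 100
values = [random.uniform(0, 40) for _ in range(99)] + [100]
z_scores = z_score_normalize(values)

# Most values land roughly between -2 and 2; the outlier sits several
# standard deviations above the mean instead of crushing everything else toward 0
print(min(z_scores), max(z_scores))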

Summary

Data normalization is an essential concept in machine learning. Even with very good data, if normalization is skipped, features with larger scales can completely dominate the others, which is like throwing away almost all of the information in the remaining features! So let's use the following two methods appropriately:

  • Min-Max Normalization: puts all features on exactly the same scale (0 to 1), but does not handle outliers well.
  • Z-Score Normalization: handles outliers well, but does not produce data that is on exactly the same scale.
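
In practice, you rarely have to write these functions by hand: scikit-learn, for example, ships both strategies. A minimal sketch, assuming scikit-learn is installed and using a tiny hypothetical feature matrix:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two hypothetical features on very different scales: rooms and age in years
X = np.array([[1.0, 95.0], [3.0, 90.0], [20.0, 92.0], [2.0, 5.0]])

print(MinMaxScaler().fit_transform(X))    # Min-Max Normalization
print(StandardScaler().fit_transform(X))  # Z-Score Normalization (standardization)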


Sean Lee

Data Scientist | Artificial Intelligence Builder | Front End Developer