Importance of Data Normalization

Photo by Markus Spiske

Why do we need Data Normalization?

Machine learning algorithms find patterns in data by comparing the features of data points. Many of them, such as distance-based algorithms and gradient-descent-based algorithms, expect the features to be on similar scales. When the scales of the features differ severely, the features with the largest values can drown out the rest.
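To see why scale matters, here is a minimal sketch with two made-up features, age in years and income in dollars (the names and values are assumptions for illustration only). The Euclidean distance between the two people is driven almost entirely by income, simply because its numbers are larger:

```python
import math

# Hypothetical data points: (age in years, income in dollars)
person_a = (25, 50_000)
person_b = (60, 51_000)


def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


distance = euclidean(person_a, person_b)

# Share of the squared distance contributed by each feature
age_share = (person_a[0] - person_b[0]) ** 2 / distance ** 2
income_share = (person_a[1] - person_b[1]) ** 2 / distance ** 2

print(f"distance: {distance:.1f}")
print(f"age share: {age_share:.4f}, income share: {income_share:.4f}")
```

A 35-year age gap barely registers next to a 1,000-dollar income gap, even though the age difference is arguably the more meaningful one.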

Two ways to normalize data

There are other normalization methods besides these, but the following two are the most widely used.

  • Min-Max Normalization

  • Z-Score Normalization

1. Min-Max Normalization

Min-max normalization is one of the most common ways to normalize data. For each feature, the minimum value is mapped to 0, the maximum value to 1, and every other value to a number between 0 and 1: x' = (x - min(x)) / (max(x) - min(x)).

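In R:

normalize <- function(x) {
  # Rescale x so that its minimum maps to 0 and its maximum maps to 1
  (x - min(x)) / (max(x) - min(x))
}

# Normalize a single column; the result is a numeric vector
name_scaled <- normalize(df$name)
# Normalize every column of an all-numeric data frame
df_scaled <- as.data.frame(lapply(df, normalize))

In Python:

def min_max_normalize(x):
    # Compute the bounds once instead of on every iteration
    lo, hi = min(x), max(x)
    normalized = []

    for value in x:
        normalized_num = (value - lo) / (hi - lo)
        normalized.append(normalized_num)

    return normalized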
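The same idea can be written without an explicit loop. The following is a minimal sketch using NumPy; the function name `min_max_normalize_np` and the choice to map a constant feature to zeros are assumptions made here, not part of the original code:

```python
import numpy as np


def min_max_normalize_np(x):
    """Vectorized min-max scaling of a 1-D sequence of numbers."""
    x = np.asarray(x, dtype=float)
    value_range = x.max() - x.min()
    if value_range == 0:
        # Assumption: a constant feature is mapped to 0 to avoid dividing by zero
        return np.zeros_like(x)
    return (x - x.min()) / value_range


ages = [20, 30, 40, 60]  # made-up values
scaled = min_max_normalize_np(ages)
print(scaled)  # 20 -> 0.0, 30 -> 0.25, 40 -> 0.5, 60 -> 1.0
```

The guard matters in practice: a feature whose values are all identical has a range of zero, and the plain formula would divide by zero.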

2. Z-Score Normalization

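Z-score normalization rescales each value by subtracting the feature's mean and dividing by its standard deviation, so the result has a mean of 0 and a standard deviation of 1. Because the output is not pinned to a fixed range by the single smallest and largest values, it tends to cope with outliers better than min-max normalization.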

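In R:

normalize <- function(x) {
  # Center on the mean and scale by the sample standard deviation
  (x - mean(x)) / sd(x)
}

# Normalize a single column; the result is a numeric vector
name_scaled <- normalize(df$name)
# Normalize every column of an all-numeric data frame
df_scaled <- as.data.frame(lapply(df, normalize))

In Python:

import numpy as np

def z_score_normalize(x):
    # np.std defaults to the population standard deviation (ddof=0),
    # whereas R's sd() uses the sample version (n - 1)
    mean, std = np.mean(x), np.std(x)
    normalized = []

    for value in x:
        normalized_num = (value - mean) / std
        normalized.append(normalized_num)

    return normalized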
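As a quick check of what z-score normalization produces, here is a minimal vectorized sketch. The function name `z_score_normalize_np` and its `ddof` parameter are assumptions introduced for illustration; `ddof=0` gives the population standard deviation (NumPy's default), and `ddof=1` gives the sample standard deviation, which is what R's `sd()` computes:

```python
import numpy as np


def z_score_normalize_np(x, ddof=0):
    """Vectorized z-score scaling; ddof is passed through to the std computation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=ddof)


values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data: mean 5, population std 2

z = z_score_normalize_np(values)
print(z)                  # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
print(z.mean(), z.std())  # approximately 0 and 1

# With ddof=1 the values are divided by a slightly larger number
z_sample = z_score_normalize_np(values, ddof=1)
print(z_sample)
```

The two conventions give slightly different numbers on small samples, which is worth keeping in mind when comparing R and Python results.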

Summary

Data normalization is an essential step in machine learning. Even with very good data, if you skip normalization, features with large numeric ranges can completely dominate the others, which is like throwing away almost all of the information in the remaining features. Keep the trade-offs of the following methods in mind:

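  • Min-Max Normalization: guarantees that every feature ends up on exactly the same scale (0 to 1), but does not handle outliers well.

  • Z-Score Normalization: handles outliers well, but does not produce data that is normalized to exactly the same scale.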
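The difference in output scale can be seen in a small sketch. The data below is made up, with 100 playing the role of an outlier; this only illustrates the range each method produces and is not a general benchmark of robustness:

```python
import numpy as np

# Made-up feature with one outlier
data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max: the outlier defines the top of the [0, 1] range
min_max = (data - data.min()) / (data.max() - data.min())

# Z-score: mean 0 and standard deviation 1, with no fixed bounds
z_score = (data - data.mean()) / data.std()

print("min-max:", np.round(min_max, 3))
print("z-score:", np.round(z_score, 3))
print("min-max range:", min_max.min(), "to", min_max.max())
print("z-score range:", z_score.min(), "to", z_score.max())
```

Min-max output always spans exactly 0 to 1, with the ordinary values crowded near 0 because of the outlier. The z-score output has no fixed bounds, so two features normalized this way share a mean and standard deviation but not necessarily the same minimum and maximum.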
