Feature Scaling
Why do we scale the data in Machine Learning or Data Science?

What is Feature Scaling?
Feature Scaling is the process of bringing all of our features to the same or very similar ranges of values or distribution. — Machine Learning Engineering by Andriy Burkov URL
Why do we need Feature Scaling?
Most Machine Learning algorithms show significantly better results when the features are transformed into the same or a very similar range, i.e. a fixed scale. To understand the importance of feature scaling, we are going to use the diabetes dataset (loaded via scikit-learn's load_diabetes) from the Source. Here we consider the problem of estimating a quantitative measure of diabetes disease progression one year after baseline using the ten baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

Dataset Description

- age: age in years
- sex: gender
- bmi: body mass index
- bp: average blood pressure
- s1: tc, total serum cholesterol
- s2: ldl, low-density lipoproteins
- s3: hdl, high-density lipoproteins
- s4: tch, total cholesterol / HDL
- s5: ltg, possibly log of serum triglycerides level
- s6: glu, blood sugar level
- target: a quantitative measure of diabetes disease progression one year after baseline
Import Necessary Packages
# Import Necessary Packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
Load the Dataset
# Load the Dataset
X, targets = load_diabetes(
    return_X_y=True,  # Return input features and target separately
    as_frame=True,    # Return input features as a pandas DataFrame (target as a Series)
    scaled=False      # Return the original feature values, NOT the pre-scaled version
)
print(f"The input features are of type {type(X)}")
print(f"The target is of type {type(targets)}")
The input features are of type <class 'pandas.core.frame.DataFrame'>
The target is of type <class 'pandas.core.series.Series'>
# Check a Sample from the Dataset
X.head(10)
 | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 59.0 | 2.0 | 32.1 | 101.0 | 157.0 | 93.2 | 38.0 | 4.00 | 4.8598 | 87.0 |
1 | 48.0 | 1.0 | 21.6 | 87.0 | 183.0 | 103.2 | 70.0 | 3.00 | 3.8918 | 69.0 |
2 | 72.0 | 2.0 | 30.5 | 93.0 | 156.0 | 93.6 | 41.0 | 4.00 | 4.6728 | 85.0 |
3 | 24.0 | 1.0 | 25.3 | 84.0 | 198.0 | 131.4 | 40.0 | 5.00 | 4.8903 | 89.0 |
4 | 50.0 | 1.0 | 23.0 | 101.0 | 192.0 | 125.4 | 52.0 | 4.00 | 4.2905 | 80.0 |
5 | 23.0 | 1.0 | 22.6 | 89.0 | 139.0 | 64.8 | 61.0 | 2.00 | 4.1897 | 68.0 |
6 | 36.0 | 2.0 | 22.0 | 90.0 | 160.0 | 99.6 | 50.0 | 3.00 | 3.9512 | 82.0 |
7 | 66.0 | 2.0 | 26.2 | 114.0 | 255.0 | 185.0 | 56.0 | 4.55 | 4.2485 | 92.0 |
8 | 60.0 | 2.0 | 32.1 | 83.0 | 179.0 | 119.4 | 42.0 | 4.00 | 4.4773 | 94.0 |
9 | 29.0 | 1.0 | 30.0 | 85.0 | 180.0 | 93.4 | 43.0 | 4.00 | 5.3845 | 88.0 |
# Check Descriptive Statistics of the Dataset
X.describe().T
 | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
age | 442.0 | 48.518100 | 13.109028 | 19.0000 | 38.2500 | 50.00000 | 59.0000 | 79.000 |
sex | 442.0 | 1.468326 | 0.499561 | 1.0000 | 1.0000 | 1.00000 | 2.0000 | 2.000 |
bmi | 442.0 | 26.375792 | 4.418122 | 18.0000 | 23.2000 | 25.70000 | 29.2750 | 42.200 |
bp | 442.0 | 94.647014 | 13.831283 | 62.0000 | 84.0000 | 93.00000 | 105.0000 | 133.000 |
s1 | 442.0 | 189.140271 | 34.608052 | 97.0000 | 164.2500 | 186.00000 | 209.7500 | 301.000 |
s2 | 442.0 | 115.439140 | 30.413081 | 41.6000 | 96.0500 | 113.00000 | 134.5000 | 242.400 |
s3 | 442.0 | 49.788462 | 12.934202 | 22.0000 | 40.2500 | 48.00000 | 57.7500 | 99.000 |
s4 | 442.0 | 4.070249 | 1.290450 | 2.0000 | 3.0000 | 4.00000 | 5.0000 | 9.090 |
s5 | 442.0 | 4.641411 | 0.522391 | 3.2581 | 4.2767 | 4.62005 | 4.9972 | 6.107 |
s6 | 442.0 | 91.260181 | 11.496335 | 58.0000 | 83.2500 | 91.00000 | 98.0000 | 124.000 |
list(X['age'][0:10])
[59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0]
A few observations from the descriptive statistics:

- The age feature is in the range [19, 79], indicating patients ranging from 19 to 79 years old.
- The bmi feature is in the range [18, 42], indicating patients with a body mass index of 18 to 42.
- The s1 feature is in the range [97, 301], indicating patients with a total serum cholesterol of 97 to 301.
- …
- As we can see, every feature has a different range.
- When we use these features to build a Machine Learning model, the learning algorithm won't know that the values 19-79 and 97-301 represent two different things (age and s1, total serum cholesterol). It will end up treating them both simply as numbers.
- As the numbers for total serum cholesterol, i.e. 97-301, are much bigger than the numbers representing age, the learning algorithm might end up giving more importance to total serum cholesterol over age, regardless of which variable is actually more helpful in generating predictions. The distance sketch below makes this concrete.
- To avoid this issue, we prefer to transform the features into the same or a very similar range, i.e. a fixed scale.
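As a minimal sketch of this effect (using the first two rows of the unscaled X loaded above), we can look at how much each feature contributes to a plain Euclidean distance between two patients. The rows and feature names come from our dataset; the distance computation itself is only an illustration:

# Sketch: per-feature squared differences between the first two patients
p1, p2 = X.iloc[0], X.iloc[1]
squared_diff = (p1 - p2) ** 2
print(squared_diff.sort_values(ascending=False))  # large-valued features (s3, s1, s6) dominate
print(f"Unscaled Euclidean distance: {np.sqrt(squared_diff.sum()):.2f}")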
Different Types of Feature Scaling
Normalization (Min-Max Scaling) and Standardization (Standard Scaling) are two of the most widely used methods for feature scaling. Normalization transforms each feature to the range [0, 1]. Standardization, on the other hand, scales each input variable by subtracting the mean and dividing by the standard deviation, resulting in a distribution with (almost!) a mean of zero and a standard deviation of one.
Normalization
Let's consider a sample from our dataset with $age = \{59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0\}$. We can use the following equation to normalize the data. After Normalization, our sample is transformed to $age = \{0.73, 0.51, 1.0, 0.02, 0.55, 0.0, 0.27, 0.88, 0.76, 0.12\}$.
$$s' = \frac{s - \min(S)}{\max(S) - \min(S)}$$
age_sample = list(X['age'][:10])
normalized_age = [((age - min(age_sample))/(max(age_sample) - min(age_sample))) for age in age_sample]
normalized_age = [round(age, 2) for age in normalized_age]
print(f"First 10 Age Values before Normalization: {age_sample}")
print(f"First 10 Age Values after Normalization: {normalized_age}")
First 10 Age Values before Normalization: [59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0]
First 10 Age Values after Normalization: [0.73, 0.51, 1.0, 0.02, 0.55, 0.0, 0.27, 0.88, 0.76, 0.12]
In this example, we can see that the data point with 72 years of age is scaled to 1.0, as 72 is the maximum of those 10 age samples from our dataset. Similarly, the data point with 23 years of age is scaled to 0.0, as 23 is the minimum of those 10 samples.
The key point to observe here is that the scaling operation is based on the minimum and maximum of those 10 samples, and not of all the samples in the dataset.
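In practice we would not hand-roll this; scikit-learn's MinMaxScaler (imported above) performs the same computation. A minimal sketch, assuming age_sample from the cell above is still in scope (the scaler expects a 2-D array, hence the reshape):

# Sketch: the same min-max scaling using scikit-learn's MinMaxScaler
age_2d = np.array(age_sample).reshape(-1, 1)  # MinMaxScaler expects a 2-D array
sklearn_normalized = MinMaxScaler().fit_transform(age_2d)
print([round(float(v), 2) for v in sklearn_normalized.ravel()])  # should match the values above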
Instead of the range $[0, 1]$, if we want to transform the data into some arbitrary range $[a, b]$, we can use the following equation to normalize the data.
$$s' = a + \frac{\big(s - \min(S)\big)\big(b - a\big)}{\max(S) - \min(S)}$$
For example, we can transform $age = \{59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0\}$ into the range $[-1, 1]$ to get the scaled values $age = \{0.47, 0.02, 1.0, -0.96, 0.1, -1.0, -0.47, 0.76, 0.51, -0.76\}$.
a, b = -1, 1
age_sample = list(X['age'][:10])
normalized_age = []
for age in age_sample:
    numerator = (age - min(age_sample)) * (b - a)
    denominator = max(age_sample) - min(age_sample)
    normalized_age.append(a + (numerator / denominator))
normalized_age = [round(age, 2) for age in normalized_age]
print(f"First 10 Age Values before Normalization: {age_sample}")
print(f"First 10 Age Values after Normalization: {normalized_age}")
First 10 Age Values before Normalization: [59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0]
First 10 Age Values after Normalization: [0.47, 0.02, 1.0, -0.96, 0.1, -1.0, -0.47, 0.76, 0.51, -0.76]
- In this example, we can see that the data point with 72 years of age is scaled to 1.0, as 72 is the maximum of those 10 age samples, and the data point with 23 years of age is scaled to -1.0, as 23 is the minimum.
- Again, the key point to observe is that the scaling operation is based on the minimum and maximum of those 10 samples, and not of all the samples in the dataset.
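MinMaxScaler supports arbitrary target ranges directly through its feature_range parameter, so the loop above is again something we would not write by hand. A minimal sketch:

# Sketch: scaling into [-1, 1] via MinMaxScaler's feature_range parameter
age_2d = np.array(age_sample).reshape(-1, 1)
scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(age_2d)
print([round(float(v), 2) for v in scaled.ravel()])  # should match the values above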
Standardization
Let's consider a sample from our dataset with $age = \{59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0\}$. We can use the following equation to standardize the data. After Standardization, our sample is transformed to $age = \{0.73, 0.08, 1.5, -1.34, 0.2, -1.4, -0.63, 1.14, 0.79, -1.05\}$.
$$s' = \frac{s - mean(S)}{std(S)}$$
age_sample = list(X['age'][:10])
standardize_age = [((age - np.average(age_sample))/np.std(age_sample)) for age in age_sample]
standardize_age = [round(age, 2) for age in standardize_age]
print(f"First 10 Age Values before Standardization: {age_sample}")
print(f"First 10 Age Values after Standardization: {standardize_age}")
First 10 Age Values before Standardization: [59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0]
First 10 Age Values after Standardization: [0.73, 0.08, 1.5, -1.34, 0.2, -1.4, -0.63, 1.14, 0.79, -1.05]
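The scikit-learn equivalent is StandardScaler; like np.std, it uses the population (ddof=0) standard deviation by default, so the results should match the manual computation above. A minimal sketch:

# Sketch: the same standardization using scikit-learn's StandardScaler
age_2d = np.array(age_sample).reshape(-1, 1)
standardized = StandardScaler().fit_transform(age_2d)
print([round(float(v), 2) for v in standardized.ravel()])  # should match the values above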
Robust Scaling
- Standardization scales the data such that the mean of the scaled values becomes zero and their standard deviation becomes one. In other words, it gives the data the same mean and standard deviation as the standard normal distribution (note that it does not change the shape of the distribution).
- It uses the mean and standard deviation of the original data to perform the scaling, and both of these statistics are very sensitive to outliers.

"Outliers are the values on the edge of the distribution that may have a low probability of occurrence, yet are overrepresented for some reason. Outliers can skew a probability distribution and make data scaling using standardization difficult as the calculated mean and standard deviation will be skewed by the presence of the outliers." — Jason Brownlee, Machine Learning Mastery URL

- The Median, i.e. the 50th percentile, is less sensitive to outliers, and similarly the Inter-Quartile Range, i.e. IQR = (75th percentile - 25th percentile), is less sensitive to outliers.
- Robust Scaling uses the Median and the IQR to scale the data.
$$s' = \frac{s - median(S)}{IQR(S)}$$
Let's consider a sample from our dataset with $age = \{59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0, 8.0, 10.0, 5.0\}$. We can use the above equation to scale the data. After scaling, our sample is transformed to $age = \{0.64, 0.33, 1.0, -0.33, 0.39, -0.36, 0.0, 0.83, 0.67, -0.19, -0.78, -0.72, -0.86\}$.
The key point to observe here is that we have purposefully added three outlier samples (8.0, 10.0, and 5.0) to the age feature.
age_sample = list(X['age'][:10])
age_sample.extend([8.0, 10.0, 5.0])  # purposefully add three outliers
IQR = np.subtract(*np.percentile(age_sample, [75, 25]))
robust_scaled_age = [((age - np.median(age_sample)) / IQR) for age in age_sample]
robust_scaled_age = [round(age, 2) for age in robust_scaled_age]
print(f"Age Values before Robust Scaling: {age_sample}")
print(f"Age Values after Robust Scaling: {robust_scaled_age}")
Age Values before Robust Scaling: [59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0, 8.0, 10.0, 5.0]
Age Values after Robust Scaling: [0.64, 0.33, 1.0, -0.33, 0.39, -0.36, 0.0, 0.83, 0.67, -0.19, -0.78, -0.72, -0.86]
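The scikit-learn equivalent is RobustScaler, whose defaults (centering on the median, scaling by the 25th-75th percentile range) match the equation above. A minimal sketch:

# Sketch: the same robust scaling using scikit-learn's RobustScaler
age_2d = np.array(age_sample).reshape(-1, 1)
robust_scaled = RobustScaler().fit_transform(age_2d)
print([round(float(v), 2) for v in robust_scaled.ravel()])  # should match the values above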
How to Choose Scaling Type?
Even though there are no fixed rules for selecting a particular scaler, broadly the selection depends on two factors: the presence of Outliers and our Understanding of the Features.
Understanding of Features
For some features, the Min and Max values observed in the dataset might not correspond to the actual possible Min and Max values of the feature. From a statistical perspective, the Min and Max of a sample do not always guarantee a good estimate of the Min and Max of the population. In such cases, Standardization or Robust Scaling would be a better choice than Normalization.

For example, in our dataset the minimum age is 19 years. We intend to use this dataset to build a model that predicts a quantitative measure of diabetes disease progression one year after baseline. If we use Normalization for scaling, we are implicitly assuming that we will always receive patients aged 19 years or older. If in the future we receive a patient younger than 19 years, the scaled age value for that patient will be a negative number, which doesn't align with the original idea of scaling age into the range [0, 1]. This can negatively impact the predictions of the model, as the model has never seen a negative age value during training.

Similarly, the maximum age in our dataset is 79 years. If we receive a patient older than 79 years, the scaled age value for that patient will be greater than 1, which again doesn't align with the original idea of scaling age into the range [0, 1]. The short sketch below illustrates both cases.
in range [0, 1].On other end, there could be features where it is easy to estimate
Min
andMax
of population just from the sample. For example any form of customer star ratings is usually represented in the range of [0 - 5] stars. Here there is no scope of receiving a rating less than 0 or more than 5. In this case it is easy to estimateMin
andMax
of population and could be based on our understanding of the feature. In such cases, we can useNormalization
for scaling. Digital Images are another such example, where we can use Normalization to scale the data.
Outliers
- Descriptive statistics such as Min, Max, Mean, and Standard Deviation are very sensitive to outliers and can change significantly with even a small number of outliers in the data. On the other end, descriptive statistics such as the Median and the Inter-Quartile Range are much less sensitive to outliers.
- The Robust Scaler uses the Median and the Inter-Quartile Range to scale the data, which is why it is less sensitive to outliers.
- So if the input features contain a significant number of outliers, it is better to use the Robust Scaler, as the comparison sketch below shows.
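A minimal comparison sketch, reusing the outlier-augmented age_sample from the Robust Scaling section:

# Sketch: standardization vs. robust scaling on the outlier-augmented ages
sample_2d = np.array(age_sample).reshape(-1, 1)  # still includes 8.0, 10.0, 5.0
std_scaled = StandardScaler().fit_transform(sample_2d).ravel()
rob_scaled = RobustScaler().fit_transform(sample_2d).ravel()
print("Standardized: ", [round(float(v), 2) for v in std_scaled])
print("Robust scaled:", [round(float(v), 2) for v in rob_scaled])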
Impact of Scaling
In this section, we compare the different types of scalers on our dataset.
def plot_scaling_comparison(data, scaled_data, column, title):
    # Box plots (top row) and histograms with KDE (bottom row) for the
    # original data (left column) and the scaled data (right column)
    fig, axs = plt.subplots(
        nrows=2,
        ncols=2,
        figsize=(8, 8),
        gridspec_kw={"height_ratios": (.20, .80)},
        dpi=100,
        constrained_layout=False
    )
    fig.suptitle(title)
    # Original data
    bplot = sns.boxplot(data=data, x=column, ax=axs[0][0])
    hplot = sns.histplot(data=data, x=column, ax=axs[1][0], kde=True, bins='sqrt')
    # Mark the mean (green) and median (red) with dashed vertical lines
    hplot.vlines(x=[np.mean(data[column]), np.median(data[column])],
                 ymin=hplot.get_ylim()[0], ymax=hplot.get_ylim()[1],
                 ls='--', colors=['tab:green', 'tab:red'], lw=2)
    # Scaled data
    bplot = sns.boxplot(data=scaled_data, x=column, ax=axs[0][1])
    hplot = sns.histplot(data=scaled_data, x=column, ax=axs[1][1], kde=True, bins='sqrt')
    hplot.vlines(x=[np.mean(scaled_data[column]), np.median(scaled_data[column])],
                 ymin=hplot.get_ylim()[0], ymax=hplot.get_ylim()[1],
                 ls='--', colors=['tab:green', 'tab:red'], lw=2)
    # Cosmetic clean-up
    axs[0][0].set(xlabel='')
    axs[0][0].set_facecolor('white')
    axs[1][0].set_facecolor('white')
    axs[0][1].set(xlabel='')
    axs[0][1].set_facecolor('white')
    axs[1][1].set_facecolor('white')
normalization_scaler = MinMaxScaler()
normalized_X = pd.DataFrame(normalization_scaler.fit_transform(X), columns=X.columns)
standard_scaler = StandardScaler()
standardized_X = pd.DataFrame(standard_scaler.fit_transform(X), columns=X.columns)
robust_scaler = RobustScaler()
robust_scaled_X = pd.DataFrame(robust_scaler.fit_transform(X), columns=X.columns)
normalized_X.describe().T
 | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
age | 442.0 | 0.491968 | 0.218484 | 0.0 | 0.320833 | 0.516667 | 0.666667 | 1.0 |
sex | 442.0 | 0.468326 | 0.499561 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 1.0 |
bmi | 442.0 | 0.346107 | 0.182567 | 0.0 | 0.214876 | 0.318182 | 0.465909 | 1.0 |
bp | 442.0 | 0.459817 | 0.194807 | 0.0 | 0.309859 | 0.436620 | 0.605634 | 1.0 |
s1 | 442.0 | 0.451668 | 0.169647 | 0.0 | 0.329657 | 0.436275 | 0.552696 | 1.0 |
s2 | 442.0 | 0.367725 | 0.151460 | 0.0 | 0.271165 | 0.355578 | 0.462649 | 1.0 |
s3 | 442.0 | 0.360889 | 0.167977 | 0.0 | 0.237013 | 0.337662 | 0.464286 | 1.0 |
s4 | 442.0 | 0.291996 | 0.182010 | 0.0 | 0.141044 | 0.282087 | 0.423131 | 1.0 |
s5 | 442.0 | 0.485560 | 0.183366 | 0.0 | 0.357542 | 0.478062 | 0.610446 | 1.0 |
s6 | 442.0 | 0.503942 | 0.174187 | 0.0 | 0.382576 | 0.500000 | 0.606061 | 1.0 |
standardized_X.describe().T
 | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
age | 442.0 | 8.037814e-18 | 1.001133 | -2.254290 | -0.784172 | 0.113172 | 0.800500 | 2.327895 |
sex | 442.0 | 1.607563e-16 | 1.001133 | -0.938537 | -0.938537 | -0.938537 | 1.065488 | 1.065488 |
bmi | 442.0 | 1.004727e-16 | 1.001133 | -1.897929 | -0.719625 | -0.153132 | 0.656952 | 3.585718 |
bp | 442.0 | 1.060991e-15 | 1.001133 | -2.363050 | -0.770650 | -0.119214 | 0.749368 | 2.776058 |
s1 | 442.0 | -2.893613e-16 | 1.001133 | -2.665411 | -0.720020 | -0.090841 | 0.596193 | 3.235851 |
s2 | 442.0 | -1.245861e-16 | 1.001133 | -2.430626 | -0.638249 | -0.080291 | 0.627442 | 4.179278 |
s3 | 442.0 | -1.326239e-16 | 1.001133 | -2.150883 | -0.738296 | -0.138431 | 0.616239 | 3.809072 |
s4 | 442.0 | -1.446806e-16 | 1.001133 | -1.606102 | -0.830301 | -0.054499 | 0.721302 | 3.894331 |
s5 | 442.0 | 2.250588e-16 | 1.001133 | -2.651040 | -0.698949 | -0.040937 | 0.681851 | 2.808722 |
s6 | 442.0 | 2.371155e-16 | 1.001133 | -2.896390 | -0.697549 | -0.022657 | 0.586922 | 2.851075 |
robust_scaled_X.describe().T
 | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
age | 442.0 | -0.071417 | 0.631760 | -1.493976 | -0.566265 | 0.0 | 0.433735 | 1.397590 |
sex | 442.0 | 0.468326 | 0.499561 | 0.000000 | 0.000000 | 0.0 | 1.000000 | 1.000000 |
bmi | 442.0 | 0.111241 | 0.727263 | -1.267490 | -0.411523 | 0.0 | 0.588477 | 2.716049 |
bp | 442.0 | 0.078429 | 0.658633 | -1.476190 | -0.428571 | 0.0 | 0.571429 | 1.904762 |
s1 | 442.0 | 0.069017 | 0.760617 | -1.956044 | -0.478022 | 0.0 | 0.521978 | 2.527473 |
s2 | 442.0 | 0.063437 | 0.790977 | -1.856957 | -0.440832 | 0.0 | 0.559168 | 3.365410 |
s3 | 442.0 | 0.102198 | 0.739097 | -1.485714 | -0.442857 | 0.0 | 0.557143 | 2.914286 |
s4 | 442.0 | 0.035124 | 0.645225 | -1.000000 | -0.500000 | 0.0 | 0.500000 | 2.545000 |
s5 | 442.0 | 0.029647 | 0.725039 | -1.890285 | -0.476544 | 0.0 | 0.523456 | 2.063775 |
s6 | 442.0 | 0.017639 | 0.779413 | -2.237288 | -0.525424 | 0.0 | 0.474576 | 2.237288 |
column = 'bmi'
plot_scaling_comparison(X, normalized_X, column=column, title="Original Data - Normalized Data")
plot_scaling_comparison(X, standardized_X, column=column, title="Original Data - Standardized Data")
plot_scaling_comparison(X, robust_scaled_X, column=column, title="Original Data - Robust Scaled Data")