Feature Scaling

Why do we scale the data in Machine Learning or Data Science?

Image credit: Unsplash


What is Feature Scaling?

Feature Scaling is the process of bringing all of our features to the same or very similar ranges of values or distribution. — Machine Learning Engineering by Andriy Burkov

Why do we need Feature Scaling?

  • Most machine learning algorithms show significantly better results when the features are transformed into the same or a very similar range, i.e. a fixed scale.

  • To understand the importance of feature scaling, we are going to use the diabetes dataset from scikit-learn.

  • Here we consider the problem of estimating a quantitative measure of diabetes disease progression one year after baseline using ten baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

  • Dataset Description

    • age: age in years
    • sex: gender
    • bmi: body mass index
    • bp: average blood pressure
    • s1: tc, total serum cholesterol
    • s2: ldl, low-density lipoproteins
    • s3: hdl, high-density lipoproteins
    • s4: tch, total cholesterol / HDL
    • s5: ltg, possibly log of serum triglycerides level
    • s6: glu, blood sugar level
    • target: a quantitative measure of diabetes disease progression one year after baseline

Import Necessary Packages

# Import Necessary Packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

Load and Explore the Dataset

# Load the Dataset
X, targets = load_diabetes(
    return_X_y=True, # Return Input Features and Target
    as_frame=True, # Return Input Features and Target as Pandas Dataframe
    scaled=False # Return Input Features and Target is NOT Scaled
)
print(f"The input features are of type {type(X)}")
print(f"The target is of type {type(targets)}")
The input features are of type <class 'pandas.core.frame.DataFrame'>
The target is of type <class 'pandas.core.series.Series'>
# Check a Sample from the Dataset
X.head(10)

    age  sex   bmi     bp     s1     s2    s3    s4      s5    s6
0  59.0  2.0  32.1  101.0  157.0   93.2  38.0  4.00  4.8598  87.0
1  48.0  1.0  21.6   87.0  183.0  103.2  70.0  3.00  3.8918  69.0
2  72.0  2.0  30.5   93.0  156.0   93.6  41.0  4.00  4.6728  85.0
3  24.0  1.0  25.3   84.0  198.0  131.4  40.0  5.00  4.8903  89.0
4  50.0  1.0  23.0  101.0  192.0  125.4  52.0  4.00  4.2905  80.0
5  23.0  1.0  22.6   89.0  139.0   64.8  61.0  2.00  4.1897  68.0
6  36.0  2.0  22.0   90.0  160.0   99.6  50.0  3.00  3.9512  82.0
7  66.0  2.0  26.2  114.0  255.0  185.0  56.0  4.55  4.2485  92.0
8  60.0  2.0  32.1   83.0  179.0  119.4  42.0  4.00  4.4773  94.0
9  29.0  1.0  30.0   85.0  180.0   93.4  43.0  4.00  5.3845  88.0
# Check Descriptive Statistics of the Dataset
X.describe().T

     count        mean        std      min       25%        50%       75%      max
age  442.0   48.518100  13.109028  19.0000   38.2500   50.00000   59.0000   79.000
sex  442.0    1.468326   0.499561   1.0000    1.0000    1.00000    2.0000    2.000
bmi  442.0   26.375792   4.418122  18.0000   23.2000   25.70000   29.2750   42.200
bp   442.0   94.647014  13.831283  62.0000   84.0000   93.00000  105.0000  133.000
s1   442.0  189.140271  34.608052  97.0000  164.2500  186.00000  209.7500  301.000
s2   442.0  115.439140  30.413081  41.6000   96.0500  113.00000  134.5000  242.400
s3   442.0   49.788462  12.934202  22.0000   40.2500   48.00000   57.7500   99.000
s4   442.0    4.070249   1.290450   2.0000    3.0000    4.00000    5.0000    9.090
s5   442.0    4.641411   0.522391   3.2581    4.2767    4.62005    4.9972    6.107
s6   442.0   91.260181  11.496335  58.0000   83.2500   91.00000   98.0000  124.000
list(X['age'][0:10])
[59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0]
  • A few observations from the descriptive statistics
    • The age feature is in the range [19, 79], indicating patients ranging from 19 to 79 years old.
    • The bmi feature is in the range [18, 42], indicating patients with a body mass index of 18 to 42.
    • The s1 feature is in the range [97, 301], indicating patients with total serum cholesterol of 97 to 301.
  • As we can see, every feature has a different range.
  • When we use these features to build a Machine Learning model, the learning algorithm won’t know that the values 19-79 and 97-301 represent two different things (age and s1, total serum cholesterol). It will end up treating them both simply as numbers.
  • As the numbers for total serum cholesterol, i.e. 97-301, are much bigger than the numbers representing age, the learning algorithm might end up giving more importance to total serum cholesterol over age, regardless of which variable is actually more helpful in generating predictions (see the short sketch below).
  • To avoid such an issue we prefer to transform the features into the same or a very similar range, i.e. a fixed scale.
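
As a minimal illustration (this sketch is an addition, not part of the original analysis), consider a distance-based learner such as k-nearest neighbours: squared feature differences add up, so the feature measured on the larger scale (s1, roughly 97-301) tends to dominate the one on the smaller scale (age, 19-79) in the raw Euclidean distance.

# A minimal sketch: contribution of age vs. s1 to the raw squared Euclidean distance
# between two patients from the sample shown above (rows 5 and 7)
patient_a, patient_b = X.iloc[5], X.iloc[7]
contribution = (patient_a[['age', 's1']] - patient_b[['age', 's1']]) ** 2
print(contribution)  # the s1 term is several times larger than the age term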

Different Types of Feature Scaling

  • Normalization (Min-Max Scaling) and Standardization (Standard Scaling) are two of the most widely used methods for feature scaling.
  • Normalization transforms each feature to the range [0, 1]. On the other hand, standardization scales each input variable by subtracting the mean and dividing by the standard deviation, resulting in a distribution with (approximately) zero mean and unit standard deviation.

Normalization

Let’s consider a sample from our dataset with $age = \{59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0\}$. We can use the following equation to Normalize the data. After Normalization, our sample is transformed to $age = \{0.73, 0.51, 1.0, 0.02, 0.55, 0.0, 0.27, 0.88, 0.76, 0.12\}$.

$$s' = \frac{s - \min(S)}{\max(S) - \min(S)}$$

age_sample = list(X['age'][:10])
normalized_age = [((age - min(age_sample))/(max(age_sample) - min(age_sample))) for age in age_sample]
normalized_age = [round(age, 2) for age in normalized_age]
print(f"First 10 Age Values before Normalization: {age_sample}")
print(f"First 10 Age Values after Normalization: {normalized_age}")
First 10 Age Values before Normalization: [59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0]
First 10 Age Values after Normalization: [0.73, 0.51, 1.0, 0.02, 0.55, 0.0, 0.27, 0.88, 0.76, 0.12]
  • In this example, we can see that the data point with 72 years of age is scaled to 1.0, as 72 is the maximum of those 10 age samples from our dataset. Similarly, the data point with 23 years of age is scaled to 0.0, as 23 is the minimum of those 10 age samples.

  • The key point to observe here is that the scaling operation is executed based on the minimum and maximum of those 10 samples, not of all the samples in the dataset.

  • Instead of the range $[0, 1]$, if we want to transform to some arbitrary range $[a, b]$, we can use the following equation to Normalize the data.

$$s' = a + \frac{\big(s - \min(S)\big) \big(b - a\big)}{\max(S) - \min(S)}$$

For example, we can transform $age = \{59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0\}$ into the range $[-1, 1]$ to get the scaled sample $age = \{0.47, 0.02, 1.0, -0.96, 0.1, -1.0, -0.47, 0.76, 0.51, -0.76\}$.

a, b = -1, 1
age_sample = list(X['age'][:10])
normalized_age = []
for age in age_sample:
    numerator = (age - min(age_sample))*(b - a)
    denominator = max(age_sample) - min(age_sample)
    normalized_age.append(a + (numerator/denominator))
normalized_age = [round(a, 2) for a in normalized_age]
print(f"First 10 Age Values before Normalization: {age_sample}")
print(f"First 10 Age Values after Normalization: {normalized_age}")
First 10 Age Values before Normalization: [59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0]
First 10 Age Values after Normalization: [0.47, 0.02, 1.0, -0.96, 0.1, -1.0, -0.47, 0.76, 0.51, -0.76]
  • In this example, we can see that the data point with 72 years of age is scaled to 1.0, as 72 is the maximum of those 10 age samples from our dataset. Similarly, the data point with 23 years of age is scaled to -1.0, as 23 is the minimum of those 10 age samples.
  • The key point to observe here is that the scaling operation is executed based on the minimum and maximum of those 10 samples, not of all the samples in the dataset (see the sketch below).
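
As a sketch of that point (an added illustration, not part of the original notebook), we can normalize the same 10 ages with the minimum and maximum of the entire age column (19 and 79, per the descriptive statistics above) and see that the scaled values change:

age_sample = list(X['age'][:10])
full_min, full_max = X['age'].min(), X['age'].max()  # 19.0 and 79.0 for this dataset
normalized_full = [round((age - full_min)/(full_max - full_min), 2) for age in age_sample]
print(f"First 10 Age Values normalized with dataset-wide min/max: {normalized_full}")
# 72 no longer maps exactly to 1.0 and 23 no longer maps to 0.0,
# because neither is the dataset-wide minimum or maximum.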

Standardization

Let’s consider a sample from our dataset with $age = \{59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0\}$. We can use the following equation to Standardize the data. After Standardization, our sample is transformed to $age = \{0.73, 0.08, 1.5, -1.34, 0.2, -1.4, -0.63, 1.14, 0.79, -1.05\}$.

$$s' = \frac{s - \text{mean}(S)}{\text{std}(S)}$$

age_sample = list(X['age'][:10])
standardize_age = [((age - np.average(age_sample))/np.std(age_sample)) for age in age_sample]
standardize_age = [round(age, 2) for age in standardize_age]
print(f"First 10 Age Values before Standardization: {age_sample}")
print(f"First 10 Age Values after Standardization: {standardize_age}")
First 10 Age Values before Standardization: [59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0]
First 10 Age Values after Standardization: [0.73, 0.08, 1.5, -1.34, 0.2, -1.4, -0.63, 1.14, 0.79, -1.05]
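
As a quick sanity check (an added sketch), the standardized sample should have approximately zero mean and unit standard deviation:

# Quick check: mean and standard deviation of the standardized sample
print(f"Mean after Standardization: {round(np.mean(standardize_age), 2)}")
print(f"Std after Standardization: {round(np.std(standardize_age), 2)}")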

Robust Scaling

  • Standardization scales the data such that the mean of the scaled values is zero and the standard deviation of the scaled values is one. It shifts and rescales the data, but it does not change the shape of the distribution.
  • It uses the mean and standard deviation of the original data to perform the scaling, and both of these statistics are very sensitive to outliers (a short numeric comparison follows the Robust Scaling example below).

Outliers are values on the edge of the distribution that may have a low probability of occurrence, yet are overrepresented for some reason. Outliers can skew a probability distribution and make data scaling using standardization difficult as the calculated mean and standard deviation will be skewed by the presence of the outliers. — Jason Brownlee, Machine Learning Mastery

  • The Median, i.e. the 50th percentile, is less sensitive to outliers, and similarly the Inter-Quartile Range (IQR), i.e. IQR = (75th percentile - 25th percentile), is also less sensitive to outliers.
  • Robust Scaling uses the Median and the IQR to scale the data.

$$s' = \frac{s - \text{median}(S)}{\text{IQR}(S)}$$

Let’s consider a sample from our dataset, with three extra values appended, giving $age = \{59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0, 8.0, 10.0, 5.0\}$. We can use the above equation to scale the data. After scaling, our sample is transformed to $age = \{0.64, 0.33, 1.0, -0.33, 0.39, -0.36, 0.0, 0.83, 0.67, -0.19, -0.78, -0.72, -0.86\}$.

The key point to observe here is that we have purposely added three outlier samples (8.0, 10.0, and 5.0) to the age feature.

age_sample = list(X['age'][:10])
age_sample.extend([8.0, 10.0, 5.0])
IQR = np.subtract(*np.percentile(age_sample, [75, 25]))
robust_scaled_age = [((age - np.median(age_sample))/IQR) for age in age_sample]
robust_scaled_age = [round(age, 2) for age in robust_scaled_age]
print(f"First 10 Age Values before Robust Scaling: {age_sample}")
print(f"First 10 Age Values after Robust Scaling: {robust_scaled_age}")
First 10 Age Values before Robust Scaling: [59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0, 8.0, 10.0, 5.0]
First 10 Age Values after Robust Scaling: [0.64, 0.33, 1.0, -0.33, 0.39, -0.36, 0.0, 0.83, 0.67, -0.19, -0.78, -0.72, -0.86]
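
To see numerically why the median and IQR are preferred here, consider a small added sketch with a single hypothetical data-entry error (an age of 990.0, not a value from the dataset): the mean and standard deviation shift dramatically, while the median and IQR barely move.

# A minimal sketch with a hypothetical outlier (age = 990.0, a made-up data-entry error)
clean_sample = list(X['age'][:10])
corrupted_sample = clean_sample + [990.0]
for name, sample in [("clean", clean_sample), ("with outlier", corrupted_sample)]:
    iqr = np.subtract(*np.percentile(sample, [75, 25]))
    print(f"{name:>12}: mean={np.mean(sample):7.2f} std={np.std(sample):7.2f} "
          f"median={np.median(sample):6.2f} IQR={iqr:6.2f}")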

How to Choose Scaling Type?

Even though there are no fixed rules for selecting a particular scaler, broadly the selection depends on outliers and on our understanding of the features.

The selection of a feature scaling method depends on a couple of factors:

  1. Understanding of Features

    • There are some features for which the Min and Max values in the dataset might not correspond to the actual possible Min and Max values of the feature. From a statistical perspective, the Min and Max of a sample don’t always guarantee a good estimate of the Min and Max of the population. In such cases, Standardization or Robust Scaling would be a better choice than Normalization.

    • For example, in our dataset, the minimum age is 19 years. We intend to use this dataset to build a model which can predict a quantitative measure of diabetes disease progression one year after baseline. If we use Normalization for scaling, we are assuming that we will always receive patients aged 19 years or older. In the future, if we receive a patient who is younger than 19 years, the scaled age value for that patient will be a negative number, which doesn’t align with the original idea of scaling age into the range [0, 1] (see the sketch after this list).

    • This can negatively impact the predictions of the model, as the model has never seen a data sample with a negative age value during the training process.

    • Similarly, the maximum age in our dataset is 79 years. If we receive a patient who is older than 79 years, the scaled age value for that patient will be greater than 1, which again doesn’t align with the original idea of scaling age into the range [0, 1].

    • On the other end, there are features where it is easy to estimate the Min and Max of the population just from the sample. For example, customer star ratings are usually represented in the range of [0 - 5] stars; there is no possibility of receiving a rating less than 0 or more than 5. Here the Min and Max of the population can be estimated directly from our understanding of the feature, and Normalization is a good choice for scaling. Digital images, whose pixel values lie in a known range such as [0, 255], are another such example where Normalization works well.

  2. Outliers

    • Usually, descriptive statistics such as the Min, Max, Mean, and Standard Deviation are very sensitive to outliers and can change significantly with even a small number of outliers in the data. On the other end, descriptive statistics such as the Median and the Inter-Quartile Range are much less sensitive to outliers.
    • The Robust Scaler uses the Median and the Inter-Quartile Range to scale the data, and because of this it is less sensitive to outliers.
    • So if the input features have a significant number of outliers, it is usually better to use the Robust Scaler.
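
To make the first point concrete, here is a minimal added sketch (the ages 15 and 85 are hypothetical future patients, not records from the dataset): a MinMaxScaler fitted on the training ages maps anything outside the observed range [19, 79] to values outside [0, 1].

# A minimal sketch: MinMaxScaler fitted on the training ages (min 19, max 79)
age_scaler = MinMaxScaler()
age_scaler.fit(X[['age']])
new_patients = pd.DataFrame({'age': [15.0, 85.0]})  # hypothetical ages never seen in training
print(age_scaler.transform(new_patients))  # first value is below 0, second is above 1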

Impact of Scaling

In this section we compare the different types of scalers on our dataset.

def plot_scaling_comparison(data, scaled_data, column, title):
    # Compare a feature before and after scaling: boxplots on top, histograms with a
    # KDE below; the dashed green line marks the mean and the dashed red line the median.
    fig, axs = plt.subplots(
        nrows=2,
        ncols=2,
        figsize=(8, 8),
        gridspec_kw={"height_ratios": (.20, .80)},
        dpi=100,
        constrained_layout=False
    )
    fig.suptitle(title)

    # Left column: original data
    sns.boxplot(data=data, x=column, ax=axs[0][0])
    hplot = sns.histplot(data=data, x=column, ax=axs[1][0], kde=True, bins='sqrt')
    hplot.vlines(x=[np.mean(data[column]), np.median(data[column])],
                 ymin=hplot.get_ylim()[0], ymax=hplot.get_ylim()[1],
                 ls='--', colors=['tab:green', 'tab:red'], lw=2)

    # Right column: scaled data
    sns.boxplot(data=scaled_data, x=column, ax=axs[0][1])
    hplot = sns.histplot(data=scaled_data, x=column, ax=axs[1][1], kde=True, bins='sqrt')
    hplot.vlines(x=[np.mean(scaled_data[column]), np.median(scaled_data[column])],
                 ymin=hplot.get_ylim()[0], ymax=hplot.get_ylim()[1],
                 ls='--', colors=['tab:green', 'tab:red'], lw=2)

    # Cosmetic clean-up
    axs[0][0].set(xlabel='')
    axs[0][0].set_facecolor('white')
    axs[1][0].set_facecolor('white')
    axs[0][1].set(xlabel='')
    axs[0][1].set_facecolor('white')
    axs[1][1].set_facecolor('white')
   
normalization_scaler = MinMaxScaler()
normalized_X = pd.DataFrame(normalization_scaler.fit_transform(X), columns=X.columns)

standard_scaler = StandardScaler()
standardized_X = pd.DataFrame(standard_scaler.fit_transform(X), columns=X.columns)

robust_scaler = RobustScaler()
robust_scaled_X = pd.DataFrame(robust_scaler.fit_transform(X), columns=X.columns)
normalized_X.describe().T

     count      mean       std  min       25%       50%       75%  max
age  442.0  0.491968  0.218484  0.0  0.320833  0.516667  0.666667  1.0
sex  442.0  0.468326  0.499561  0.0  0.000000  0.000000  1.000000  1.0
bmi  442.0  0.346107  0.182567  0.0  0.214876  0.318182  0.465909  1.0
bp   442.0  0.459817  0.194807  0.0  0.309859  0.436620  0.605634  1.0
s1   442.0  0.451668  0.169647  0.0  0.329657  0.436275  0.552696  1.0
s2   442.0  0.367725  0.151460  0.0  0.271165  0.355578  0.462649  1.0
s3   442.0  0.360889  0.167977  0.0  0.237013  0.337662  0.464286  1.0
s4   442.0  0.291996  0.182010  0.0  0.141044  0.282087  0.423131  1.0
s5   442.0  0.485560  0.183366  0.0  0.357542  0.478062  0.610446  1.0
s6   442.0  0.503942  0.174187  0.0  0.382576  0.500000  0.606061  1.0
standardized_X.describe().T

     count          mean       std       min       25%       50%       75%       max
age  442.0  8.037814e-18  1.001133 -2.254290 -0.784172  0.113172  0.800500  2.327895
sex  442.0  1.607563e-16  1.001133 -0.938537 -0.938537 -0.938537  1.065488  1.065488
bmi  442.0  1.004727e-16  1.001133 -1.897929 -0.719625 -0.153132  0.656952  3.585718
bp   442.0  1.060991e-15  1.001133 -2.363050 -0.770650 -0.119214  0.749368  2.776058
s1   442.0 -2.893613e-16  1.001133 -2.665411 -0.720020 -0.090841  0.596193  3.235851
s2   442.0 -1.245861e-16  1.001133 -2.430626 -0.638249 -0.080291  0.627442  4.179278
s3   442.0 -1.326239e-16  1.001133 -2.150883 -0.738296 -0.138431  0.616239  3.809072
s4   442.0 -1.446806e-16  1.001133 -1.606102 -0.830301 -0.054499  0.721302  3.894331
s5   442.0  2.250588e-16  1.001133 -2.651040 -0.698949 -0.040937  0.681851  2.808722
s6   442.0  2.371155e-16  1.001133 -2.896390 -0.697549 -0.022657  0.586922  2.851075
robust_scaled_X.describe().T

     count      mean       std       min       25%  50%       75%       max
age  442.0 -0.071417  0.631760 -1.493976 -0.566265  0.0  0.433735  1.397590
sex  442.0  0.468326  0.499561  0.000000  0.000000  0.0  1.000000  1.000000
bmi  442.0  0.111241  0.727263 -1.267490 -0.411523  0.0  0.588477  2.716049
bp   442.0  0.078429  0.658633 -1.476190 -0.428571  0.0  0.571429  1.904762
s1   442.0  0.069017  0.760617 -1.956044 -0.478022  0.0  0.521978  2.527473
s2   442.0  0.063437  0.790977 -1.856957 -0.440832  0.0  0.559168  3.365410
s3   442.0  0.102198  0.739097 -1.485714 -0.442857  0.0  0.557143  2.914286
s4   442.0  0.035124  0.645225 -1.000000 -0.500000  0.0  0.500000  2.545000
s5   442.0  0.029647  0.725039 -1.890285 -0.476544  0.0  0.523456  2.063775
s6   442.0  0.017639  0.779413 -2.237288 -0.525424  0.0  0.474576  2.237288
column='bmi'
plot_scaling_comparison(X, normalized_X, column=column, title="Original Data - Normalized Data")

[Figure: Original Data - Normalized Data (boxplot and histogram of bmi before and after scaling)]

plot_scaling_comparison(X, standardized_X, column=column,
                        title="Original Data - Standardized Data")

[Figure: Original Data - Standardized Data (boxplot and histogram of bmi before and after scaling)]

plot_scaling_comparison(X, robust_scaled_X, column=column,
                        title="Original Data - Robust Scaled Data")

[Figure: Original Data - Robust Scaled Data (boxplot and histogram of bmi before and after scaling)]

Viral Thakar, Machine Learning Engineer