# Feature Scaling

Why do we scale the data in Machine Learning or Data Science?

## What is Feature Scaling?

> Feature Scaling is the process of bringing all of our features to the same or very similar ranges of values or distribution.
>
> — Machine Learning Engineering by Andriy Burkov

## Why do we need Feature Scaling?

Most Machine Learning algorithms show `significantly better results` when the features are transformed into the same or a very similar range, `i.e. a fixed scale`.

To understand the importance of feature scaling, we are going to use the `diabetes` dataset from `sklearn.datasets`. Here we consider the problem of estimating a quantitative measure of diabetes disease progression one year after baseline using the ten baseline variables: `age`, `sex`, `body mass index`, `average blood pressure`, and six `blood serum measurements`.

**Dataset Description**

- `age`: age in years
- `sex`: gender
- `bmi`: body mass index
- `bp`: average blood pressure
- `s1`: tc, total serum cholesterol
- `s2`: ldl, low-density lipoproteins
- `s3`: hdl, high-density lipoproteins
- `s4`: tch, total cholesterol / HDL
- `s5`: ltg, possibly log of serum triglycerides level
- `s6`: glu, blood sugar level
- `target`: a quantitative measure of diabetes disease progression one year after baseline

### Import Necessary Packages

```
# Import Necessary Packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
```

### Load the Dataset

```
# Load the Dataset
X, targets = load_diabetes(
    return_X_y=True,  # Return Input Features and Target
    as_frame=True,    # Return Input Features and Target as Pandas DataFrame
    scaled=False      # Do NOT return the pre-scaled version of the data
)
print(f"The input features are of type {type(X)}")
print(f"The target is of type {type(targets)}")
```

```
The input features are of type <class 'pandas.core.frame.DataFrame'>
The target is of type <class 'pandas.core.series.Series'>
```

```
# Check a Sample from the Dataset
X.head(10)
```

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59.0 | 2.0 | 32.1 | 101.0 | 157.0 | 93.2 | 38.0 | 4.00 | 4.8598 | 87.0 |
| 1 | 48.0 | 1.0 | 21.6 | 87.0 | 183.0 | 103.2 | 70.0 | 3.00 | 3.8918 | 69.0 |
| 2 | 72.0 | 2.0 | 30.5 | 93.0 | 156.0 | 93.6 | 41.0 | 4.00 | 4.6728 | 85.0 |
| 3 | 24.0 | 1.0 | 25.3 | 84.0 | 198.0 | 131.4 | 40.0 | 5.00 | 4.8903 | 89.0 |
| 4 | 50.0 | 1.0 | 23.0 | 101.0 | 192.0 | 125.4 | 52.0 | 4.00 | 4.2905 | 80.0 |
| 5 | 23.0 | 1.0 | 22.6 | 89.0 | 139.0 | 64.8 | 61.0 | 2.00 | 4.1897 | 68.0 |
| 6 | 36.0 | 2.0 | 22.0 | 90.0 | 160.0 | 99.6 | 50.0 | 3.00 | 3.9512 | 82.0 |
| 7 | 66.0 | 2.0 | 26.2 | 114.0 | 255.0 | 185.0 | 56.0 | 4.55 | 4.2485 | 92.0 |
| 8 | 60.0 | 2.0 | 32.1 | 83.0 | 179.0 | 119.4 | 42.0 | 4.00 | 4.4773 | 94.0 |
| 9 | 29.0 | 1.0 | 30.0 | 85.0 | 180.0 | 93.4 | 43.0 | 4.00 | 5.3845 | 88.0 |

```
# Check Descriptive Statistics of the Dataset
X.describe().T
```

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 442.0 | 48.518100 | 13.109028 | 19.0000 | 38.2500 | 50.00000 | 59.0000 | 79.000 |
| sex | 442.0 | 1.468326 | 0.499561 | 1.0000 | 1.0000 | 1.00000 | 2.0000 | 2.000 |
| bmi | 442.0 | 26.375792 | 4.418122 | 18.0000 | 23.2000 | 25.70000 | 29.2750 | 42.200 |
| bp | 442.0 | 94.647014 | 13.831283 | 62.0000 | 84.0000 | 93.00000 | 105.0000 | 133.000 |
| s1 | 442.0 | 189.140271 | 34.608052 | 97.0000 | 164.2500 | 186.00000 | 209.7500 | 301.000 |
| s2 | 442.0 | 115.439140 | 30.413081 | 41.6000 | 96.0500 | 113.00000 | 134.5000 | 242.400 |
| s3 | 442.0 | 49.788462 | 12.934202 | 22.0000 | 40.2500 | 48.00000 | 57.7500 | 99.000 |
| s4 | 442.0 | 4.070249 | 1.290450 | 2.0000 | 3.0000 | 4.00000 | 5.0000 | 9.090 |
| s5 | 442.0 | 4.641411 | 0.522391 | 3.2581 | 4.2767 | 4.62005 | 4.9972 | 6.107 |
| s6 | 442.0 | 91.260181 | 11.496335 | 58.0000 | 83.2500 | 91.00000 | 98.0000 | 124.000 |

```
# Check the First 10 Age Values
list(X['age'][0:10])
```

```
[59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0]
```

A few observations from the descriptive statistics:

- The `age` feature is in the range [19, 79], indicating patients ranging from 19 to 79 years old.
- The `bmi` feature is in the range [18, 42], indicating patients with a body mass index of 18 to 42.
- The `s1` feature is in the range [97, 301], indicating patients with total serum cholesterol of 97 to 301.
- …

As we can see, every feature has a different range.

- When we use these features to build a Machine Learning model, the learning algorithm won't differentiate that the values 19-79 and 97-301 represent two different things, `age` and `s1 (total serum cholesterol)`. It will end up treating them both as numbers.
- As the numbers for total serum cholesterol, i.e. 97-301, are much bigger than the numbers representing age, the learning algorithm might end up giving more importance to total serum cholesterol over age, `regardless of which variable is actually more helpful` in generating predictions.

**To avoid such an issue we prefer to transform the features into the same or very similar range, i.e. a fixed scale.**

## Different Types of Feature Scaling

`Normalization (Min-Max Scaling)` and `Standardization (Standard Scaling)` are two of the most widely used methods for feature scaling.

Normalization transforms each feature to the range [0, 1]. Standardization, on the other hand, scales each input variable by subtracting the mean and dividing by the standard deviation, resulting in a distribution with (almost!) a mean of zero and a standard deviation of one.

### Normalization

Let's consider a sample from our dataset with $age = \{59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0\}$. We can use the following equation to normalize the data. After normalization, our sample is transformed to $age = \{0.73, 0.51, 1.0, 0.02, 0.55, 0.0, 0.27, 0.88, 0.76, 0.12\}$.

$$s' = \frac{s - \min(S)}{\max(S) - \min(S)}$$

```
age_sample = list(X['age'][:10])
normalized_age = [((age - min(age_sample))/(max(age_sample) - min(age_sample))) for age in age_sample]
normalized_age = [round(age, 2) for age in normalized_age]
print(f"First 10 Age Values before Normalization: {age_sample}")
print(f"First 10 Age Values after Normalization: {normalized_age}")
```

```
First 10 Age Values before Normalization: [59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0]
First 10 Age Values after Normalization: [0.73, 0.51, 1.0, 0.02, 0.55, 0.0, 0.27, 0.88, 0.76, 0.12]
```

In this example, we can see that the data point with 72 years of age is scaled to 1.0, as 72 is the maximum of those 10 age samples from our dataset. Similarly, the data point with 23 years of age is scaled to 0.0, as 23 is the minimum of those 10 samples.

The key point to observe here is that the scaling operation is based on the minimum and maximum of those 10 samples, not of all the samples in the dataset.
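The manual computation above can be cross-checked with scikit-learn's `MinMaxScaler`; a small sketch (note that the scaler expects a 2-D array, hence the `reshape`):

```
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The same 10 age values, reshaped to a (10, 1) column as sklearn expects
age_sample = np.array([59.0, 48.0, 72.0, 24.0, 50.0,
                       23.0, 36.0, 66.0, 60.0, 29.0]).reshape(-1, 1)

scaler = MinMaxScaler()  # default feature_range is (0, 1)
normalized_age = scaler.fit_transform(age_sample).ravel().round(2)
print(normalized_age.tolist())
# [0.73, 0.51, 1.0, 0.02, 0.55, 0.0, 0.27, 0.88, 0.76, 0.12]
```

Like the list comprehension, the scaler here only sees those 10 samples, so it fits to their minimum and maximum.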

Instead of the range $[0, 1]$, if we want to transform the data to some arbitrary range $[a, b]$, we can use the following equation to normalize the data.

$$s' = a + \frac{\big(s - \min(S)\big) \big(b - a\big)}{\max(S) - \min(S)}$$

For example, we can transform $age = \{59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0\}$ into the range $[-1, 1]$ to get the scaled sample $age = \{0.47, 0.02, 1.0, -0.96, 0.1, -1.0, -0.47, 0.76, 0.51, -0.76\}$.

```
a, b = -1, 1
age_sample = list(X['age'][:10])
normalized_age = []
for age in age_sample:
    numerator = (age - min(age_sample)) * (b - a)
    denominator = max(age_sample) - min(age_sample)
    normalized_age.append(a + (numerator / denominator))
normalized_age = [round(age, 2) for age in normalized_age]
print(f"First 10 Age Values before Normalization: {age_sample}")
print(f"First 10 Age Values after Normalization: {normalized_age}")
```

```
First 10 Age Values before Normalization: [59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0]
First 10 Age Values after Normalization: [0.47, 0.02, 1.0, -0.96, 0.1, -1.0, -0.47, 0.76, 0.51, -0.76]
```

- In this example, the data point with 72 years of age is scaled to 1.0, as 72 is the maximum of those 10 age samples, while the data point with 23 years of age is scaled to -1.0, as 23 is the minimum.
- Again, the key point to observe is that the scaling operation is based on the minimum and maximum of those 10 samples, not of all the samples in the dataset.
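Scikit-learn supports arbitrary target ranges directly through the `feature_range` parameter of `MinMaxScaler`, so the loop above can be cross-checked with a short sketch:

```
import numpy as np
from sklearn.preprocessing import MinMaxScaler

age_sample = np.array([59.0, 48.0, 72.0, 24.0, 50.0,
                       23.0, 36.0, 66.0, 60.0, 29.0]).reshape(-1, 1)

# feature_range=(a, b) applies s' = a + (s - min)(b - a) / (max - min)
scaler = MinMaxScaler(feature_range=(-1, 1))
normalized_age = scaler.fit_transform(age_sample).ravel().round(2)
print(normalized_age.tolist())
# [0.47, 0.02, 1.0, -0.96, 0.1, -1.0, -0.47, 0.76, 0.51, -0.76]
```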

### Standardization

Let's consider a sample from our dataset with $age = \{59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0\}$. We can use the following equation to standardize the data. After standardization, our sample is transformed to $age = \{0.73, 0.08, 1.5, -1.34, 0.2, -1.4, -0.63, 1.14, 0.79, -1.05\}$.

$$s' = \frac{s - \text{mean}(S)}{\text{std}(S)}$$

```
age_sample = list(X['age'][:10])
standardize_age = [((age - np.average(age_sample))/np.std(age_sample)) for age in age_sample]
standardize_age = [round(age, 2) for age in standardize_age]
print(f"First 10 Age Values before Standardization: {age_sample}")
print(f"First 10 Age Values after Standardization: {standardize_age}")
```

```
First 10 Age Values before Standardization: [59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0]
First 10 Age Values after Standardization: [0.73, 0.08, 1.5, -1.34, 0.2, -1.4, -0.63, 1.14, 0.79, -1.05]
```
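The same result can be cross-checked with scikit-learn's `StandardScaler`. Note that, like `np.std`, it divides by the population standard deviation (`ddof=0`), so the numbers match; a sketch:

```
import numpy as np
from sklearn.preprocessing import StandardScaler

age_sample = np.array([59.0, 48.0, 72.0, 24.0, 50.0,
                       23.0, 36.0, 66.0, 60.0, 29.0]).reshape(-1, 1)

scaler = StandardScaler()  # subtract the mean, divide by the (population) std
standardized_age = scaler.fit_transform(age_sample).ravel().round(2)
print(standardized_age.tolist())
# [0.73, 0.08, 1.5, -1.34, 0.2, -1.4, -0.63, 1.14, 0.79, -1.05]
```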

### Robust Scaling

- Standardization scales the data such that the mean of the values after scaling becomes zero and the standard deviation becomes one. This way it transforms the data to (almost!) follow the `standard normal distribution`.
- It uses the `mean` and `standard deviation` of the original data to perform scaling. However, the mean and standard deviation are very sensitive to `outliers`.

> Outliers are the values on the edge of the distribution that may have a low probability of occurrence, yet are overrepresented for some reason. Outliers can skew a probability distribution and make data scaling using standardization difficult as the calculated mean and standard deviation will be skewed by the presence of the outliers.
>
> — Jason Brownlee, Machine Learning Mastery

- The `Median`, i.e. the `50th Percentile`, is less sensitive to outliers, and similarly the `Inter-Quartile Range (IQR)`, i.e. `IQR = (75th Percentile - 25th Percentile)`, is also less sensitive to outliers.
- Robust Scaling uses the Median and IQR to scale the data.

$$s' = \frac{s - \text{median}(S)}{\text{IQR}(S)}$$

Let's consider a sample from our dataset with $age = \{59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0, 8.0, 10.0, 5.0\}$. We can use the above equation to scale the data. After scaling, our sample is transformed to $age = \{0.64, 0.33, 1.0, -0.33, 0.39, -0.36, 0.0, 0.83, 0.67, -0.19, -0.78, -0.72, -0.86\}$.

Here the key point to observe is that we have purposefully added three outlier samples (8.0, 10.0, and 5.0) to the `age` feature.

```
# Take the first 10 age values and add three outliers
age_sample = list(X['age'][:10])
age_sample.extend([8.0, 10.0, 5.0])
IQR = np.subtract(*np.percentile(age_sample, [75, 25]))
robust_scaled_age = [((age - np.median(age_sample))/IQR) for age in age_sample]
robust_scaled_age = [round(age, 2) for age in robust_scaled_age]
print(f"Age Values before Robust Scaling: {age_sample}")
print(f"Age Values after Robust Scaling: {robust_scaled_age}")
```

```
Age Values before Robust Scaling: [59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0, 8.0, 10.0, 5.0]
Age Values after Robust Scaling: [0.64, 0.33, 1.0, -0.33, 0.39, -0.36, 0.0, 0.83, 0.67, -0.19, -0.78, -0.72, -0.86]
```
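Scikit-learn's `RobustScaler` implements exactly this median/IQR scaling, so the manual result can be cross-checked with a sketch:

```
import numpy as np
from sklearn.preprocessing import RobustScaler

# First 10 age values plus the three injected outliers
age_sample = np.array([59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0,
                       66.0, 60.0, 29.0, 8.0, 10.0, 5.0]).reshape(-1, 1)

scaler = RobustScaler()  # subtract the median, divide by the IQR
robust_scaled_age = scaler.fit_transform(age_sample).ravel().round(2)
print(robust_scaled_age.tolist())
# [0.64, 0.33, 1.0, -0.33, 0.39, -0.36, 0.0, 0.83, 0.67, -0.19, -0.78, -0.72, -0.86]
```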

## How to Choose Scaling Type?

Even though there are no fixed rules for selecting a particular scaler, broadly the selection depends on a couple of factors: `Outliers` and our `Understanding of the Features`.

**Understanding of the Features**

There are some features where the `Min` and `Max` values observed in the dataset might not correspond to the actual possible `Min` and `Max` values of the feature. From a statistical perspective, the `Min` and `Max` of a sample don't always guarantee a good estimate of the `Min` and `Max` of the population. In such cases, `Standardization` or `Robust Scaling` would be a better choice than `Normalization`.

For example, in our dataset, the minimum `age` is 19 years. We intend to use this dataset to build a model which can predict a quantitative measure of diabetes disease progression one year after baseline. If we use `Normalization` for scaling, we are assuming that we will always receive patients aged 19 years or older. If in the future we receive a patient who is younger than 19 years, the scaled `age` value for that patient will be a negative number, which doesn't align with the original idea of scaling `age` into the range [0, 1]. This can negatively impact the predictions of the model, as the model has never seen a data sample with a negative age value during training.

Similarly, the maximum `age` in our dataset is 79 years. If we receive a patient who is older than 79 years, the scaled `age` value for that patient will be greater than 1, which again doesn't align with the original idea of scaling `age` into the range [0, 1].

On the other end, there are features where it is easy to estimate the `Min` and `Max` of the population just from the sample. For example, customer star ratings are usually represented in the range of [0, 5] stars: there is no scope for receiving a rating less than 0 or more than 5. In such cases, where the population `Min` and `Max` are known from our understanding of the feature, we can use `Normalization` for scaling. Digital images are another such example, where we can use Normalization to scale the data.

**Outliers**

- Usually, descriptive statistics such as `Min`, `Max`, `Mean`, and `Standard Deviation` are very sensitive to outliers and can change significantly with even a small presence of outliers in the data. On the other end, descriptive statistics such as the `Median` and `Inter-Quartile Range` are less sensitive to outliers.
- The Robust Scaler uses the `Median` and `Inter-Quartile Range` to scale the data, which makes it less sensitive to outliers.
- So if the input features have a significantly higher number of outliers, it is better to use the Robust Scaler.
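To see this sensitivity directly, a minimal sketch (with one hypothetical, clearly erroneous value added to the 10 age samples): the mean shifts drastically while the median barely moves.

```
import numpy as np

ages = [59.0, 48.0, 72.0, 24.0, 50.0, 23.0, 36.0, 66.0, 60.0, 29.0]
with_outlier = ages + [500.0]  # one hypothetical extreme entry

# Mean jumps from 46.7 to 87.9; median only moves from 49.0 to 50.0
print(round(float(np.mean(ages)), 1), round(float(np.mean(with_outlier)), 1))
print(float(np.median(ages)), float(np.median(with_outlier)))
```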

## Impact of Scaling

In this section, we will compare the different types of scalers on our dataset.

```
def plot_scaling_comparison(data, scaled_data, column, title):
    fig, axs = plt.subplots(
        nrows=2,
        ncols=2,
        figsize=(8, 8),
        gridspec_kw={"height_ratios": (.20, .80)},
        dpi=100,
        constrained_layout=False
    )
    fig.suptitle(title)
    # Left column: original data (box plot on top, histogram below)
    sns.boxplot(data=data, x=column, ax=axs[0][0])
    hplot = sns.histplot(data=data, x=column, ax=axs[1][0], kde=True, bins='sqrt')
    hplot.vlines(x=[np.mean(data[column]), np.median(data[column])],
                 ymin=hplot.get_ylim()[0], ymax=hplot.get_ylim()[1],
                 ls='--', colors=['tab:green', 'tab:red'], lw=2)
    # Right column: scaled data
    sns.boxplot(data=scaled_data, x=column, ax=axs[0][1])
    hplot = sns.histplot(data=scaled_data, x=column, ax=axs[1][1], kde=True, bins='sqrt')
    hplot.vlines(x=[np.mean(scaled_data[column]), np.median(scaled_data[column])],
                 ymin=hplot.get_ylim()[0], ymax=hplot.get_ylim()[1],
                 ls='--', colors=['tab:green', 'tab:red'], lw=2)
    axs[0][0].set(xlabel='')
    axs[0][1].set(xlabel='')
    for ax in (axs[0][0], axs[1][0], axs[0][1], axs[1][1]):
        ax.set_facecolor('white')
```

```
normalization_scaler = MinMaxScaler()
normalized_X = pd.DataFrame(normalization_scaler.fit_transform(X), columns=X.columns)

standard_scaler = StandardScaler()
standardized_X = pd.DataFrame(standard_scaler.fit_transform(X), columns=X.columns)

robust_scaler = RobustScaler()
robust_scaled_X = pd.DataFrame(robust_scaler.fit_transform(X), columns=X.columns)
```

```
normalized_X.describe().T
```

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 442.0 | 0.491968 | 0.218484 | 0.0 | 0.320833 | 0.516667 | 0.666667 | 1.0 |
| sex | 442.0 | 0.468326 | 0.499561 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 1.0 |
| bmi | 442.0 | 0.346107 | 0.182567 | 0.0 | 0.214876 | 0.318182 | 0.465909 | 1.0 |
| bp | 442.0 | 0.459817 | 0.194807 | 0.0 | 0.309859 | 0.436620 | 0.605634 | 1.0 |
| s1 | 442.0 | 0.451668 | 0.169647 | 0.0 | 0.329657 | 0.436275 | 0.552696 | 1.0 |
| s2 | 442.0 | 0.367725 | 0.151460 | 0.0 | 0.271165 | 0.355578 | 0.462649 | 1.0 |
| s3 | 442.0 | 0.360889 | 0.167977 | 0.0 | 0.237013 | 0.337662 | 0.464286 | 1.0 |
| s4 | 442.0 | 0.291996 | 0.182010 | 0.0 | 0.141044 | 0.282087 | 0.423131 | 1.0 |
| s5 | 442.0 | 0.485560 | 0.183366 | 0.0 | 0.357542 | 0.478062 | 0.610446 | 1.0 |
| s6 | 442.0 | 0.503942 | 0.174187 | 0.0 | 0.382576 | 0.500000 | 0.606061 | 1.0 |

```
standardized_X.describe().T
```

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 442.0 | 8.037814e-18 | 1.001133 | -2.254290 | -0.784172 | 0.113172 | 0.800500 | 2.327895 |
| sex | 442.0 | 1.607563e-16 | 1.001133 | -0.938537 | -0.938537 | -0.938537 | 1.065488 | 1.065488 |
| bmi | 442.0 | 1.004727e-16 | 1.001133 | -1.897929 | -0.719625 | -0.153132 | 0.656952 | 3.585718 |
| bp | 442.0 | 1.060991e-15 | 1.001133 | -2.363050 | -0.770650 | -0.119214 | 0.749368 | 2.776058 |
| s1 | 442.0 | -2.893613e-16 | 1.001133 | -2.665411 | -0.720020 | -0.090841 | 0.596193 | 3.235851 |
| s2 | 442.0 | -1.245861e-16 | 1.001133 | -2.430626 | -0.638249 | -0.080291 | 0.627442 | 4.179278 |
| s3 | 442.0 | -1.326239e-16 | 1.001133 | -2.150883 | -0.738296 | -0.138431 | 0.616239 | 3.809072 |
| s4 | 442.0 | -1.446806e-16 | 1.001133 | -1.606102 | -0.830301 | -0.054499 | 0.721302 | 3.894331 |
| s5 | 442.0 | 2.250588e-16 | 1.001133 | -2.651040 | -0.698949 | -0.040937 | 0.681851 | 2.808722 |
| s6 | 442.0 | 2.371155e-16 | 1.001133 | -2.896390 | -0.697549 | -0.022657 | 0.586922 | 2.851075 |

```
robust_scaled_X.describe().T
```

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 442.0 | -0.071417 | 0.631760 | -1.493976 | -0.566265 | 0.0 | 0.433735 | 1.397590 |
| sex | 442.0 | 0.468326 | 0.499561 | 0.000000 | 0.000000 | 0.0 | 1.000000 | 1.000000 |
| bmi | 442.0 | 0.111241 | 0.727263 | -1.267490 | -0.411523 | 0.0 | 0.588477 | 2.716049 |
| bp | 442.0 | 0.078429 | 0.658633 | -1.476190 | -0.428571 | 0.0 | 0.571429 | 1.904762 |
| s1 | 442.0 | 0.069017 | 0.760617 | -1.956044 | -0.478022 | 0.0 | 0.521978 | 2.527473 |
| s2 | 442.0 | 0.063437 | 0.790977 | -1.856957 | -0.440832 | 0.0 | 0.559168 | 3.365410 |
| s3 | 442.0 | 0.102198 | 0.739097 | -1.485714 | -0.442857 | 0.0 | 0.557143 | 2.914286 |
| s4 | 442.0 | 0.035124 | 0.645225 | -1.000000 | -0.500000 | 0.0 | 0.500000 | 2.545000 |
| s5 | 442.0 | 0.029647 | 0.725039 | -1.890285 | -0.476544 | 0.0 | 0.523456 | 2.063775 |
| s6 | 442.0 | 0.017639 | 0.779413 | -2.237288 | -0.525424 | 0.0 | 0.474576 | 2.237288 |

```
column='bmi'
```

```
plot_scaling_comparison(X, normalized_X, column=column, title="Original Data - Normalized Data")
```

```
plot_scaling_comparison(X, standardized_X, column=column, title="Original Data - Standardized Data")
```

```
plot_scaling_comparison(X, robust_scaled_X, column=column, title="Original Data - Robust Scaled Data")
```