Rescaling Data
One issue with classification algorithms is that some of them are sensitive to how close data points are in their parameter space. For example, annual CEO salaries may range from $300 thousand to $30 million, but there isn’t much difference between a CEO making $29 million and one making $30 million. By contrast, CEO tenures will often fall between 1 and 20 years. I should be able to use both characteristics together to classify CEOs. However, if I’m basing my classification on how “close” two data points are in parameter space, differences of millions of dollars in salary will dominate differences of individual years in tenure.
When we use a classification algorithm that relies on distances between points, we need to make sure that the distances along each dimension are on comparable scales.
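To make the problem concrete, here is a minimal sketch using two made-up CEOs. The salary and tenure values, and the assumed ranges ($300 thousand to $30 million, 1 to 20 years), are illustrative numbers taken from the example above, not real data.

```python
import math

# Hypothetical CEOs as (annual salary in dollars, tenure in years).
ceo_a = (29_000_000, 2)
ceo_b = (30_000_000, 20)  # similar salary, very different tenure

# Raw Euclidean distance: the salary axis swamps the tenure axis.
raw_distance = math.dist(ceo_a, ceo_b)


def rescale(value, low, high):
    """Map value from the interval [low, high] onto [0, 1]."""
    return (value - low) / (high - low)


def rescale_ceo(ceo):
    """Rescale both features using the assumed ranges from the text."""
    salary, tenure = ceo
    return (rescale(salary, 300_000, 30_000_000), rescale(tenure, 1, 20))


# After rescaling, the 18-year tenure gap is what drives the distance.
scaled_distance = math.dist(rescale_ceo(ceo_a), rescale_ceo(ceo_b))

print(raw_distance, scaled_distance)
```

On the raw values, the 18-year difference in tenure contributes almost nothing to the distance. After rescaling, it accounts for nearly all of it.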
Simple Rescaling
The simplest rescaling one can do is to take a range of data and map it onto a zero-to-one scale. Take for example the following data:
In [1]: ages = [44.9, 35.1, 28.2, 19.4, 28.9, 33.5, 22.0, 21.7, 30.9, 27.9]
In [2]: heights = [70.4, 61.7, 75.3, 66.8, 66.9, 61.3, 61.7, 74.4, 76.5, 60.7]
In [3]: import pandas as pd
In [4]: data_df = pd.DataFrame({"ages": ages, "heights": heights})
The ages span a range of 25.5 years
In [5]: print(data_df.ages.max() - data_df.ages.min())
25.5
while the heights span 15.8 inches
In [6]: print(data_df.heights.max() - data_df.heights.min())
15.8
These two columns are clearly not on the same scale. We can put them on the same scale by mapping each column’s minimum to zero and its maximum to one. The procedure is as follows:
- Subtract from every item in a column the minimum of that column
In [7]: tmp_ages = data_df.ages - data_df.ages.min()
- Divide the resulting values by their maximum, which equals the range of the original column.
In [8]: scaled_ages = tmp_ages / tmp_ages.max()
In [9]: print(scaled_ages.min(), scaled_ages.max())
0.0 1.0
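The two steps above can be folded into a single helper. The sketch below is one possible way to write it in plain Python. The name `min_max_scale` is made up for this example, and the handling of a column whose values are all identical is an assumption, since the text does not cover that case.

```python
def min_max_scale(values):
    """Rescale a sequence of numbers onto the range [0, 1]."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # Every value is identical, so there is no range to stretch.
        # Returning zeros is one arbitrary choice (an assumption here).
        return [0.0 for _ in values]
    # Step 1: subtract the minimum. Step 2: divide by the range.
    return [(v - low) / span for v in values]


ages = [44.9, 35.1, 28.2, 19.4, 28.9, 33.5, 22.0, 21.7, 30.9, 27.9]
scaled = min_max_scale(ages)
print(min(scaled), max(scaled))
```

Without the guard for a zero range, a constant column would raise a `ZeroDivisionError` here (or produce `NaN` values in pandas).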
Because we always want to avoid changing our source data, let’s make new columns for these rescaled values.
In [10]: tmp_heights = data_df.heights - data_df.heights.min()
In [11]: scaled_heights = tmp_heights / tmp_heights.max()
In [12]: data_df["scaled_ages"] = scaled_ages
In [13]: data_df["scaled_heights"] = scaled_heights
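With more than a couple of columns, repeating these steps by hand gets tedious. As a sketch, the same rescaling can be applied to every numeric column at once, relying on pandas aligning the per-column minimum and range by column label. The names `col_min`, `col_range`, `scaled_df`, and `combined` are introduced here only for illustration.

```python
import pandas as pd

data_df = pd.DataFrame({
    "ages": [44.9, 35.1, 28.2, 19.4, 28.9, 33.5, 22.0, 21.7, 30.9, 27.9],
    "heights": [70.4, 61.7, 75.3, 66.8, 66.9, 61.3, 61.7, 74.4, 76.5, 60.7],
})

# Column-wise minimum and range, each a Series indexed by column name.
col_min = data_df.min()
col_range = data_df.max() - col_min

# Subtract and divide column by column, then label the results "scaled_*".
scaled_df = ((data_df - col_min) / col_range).add_prefix("scaled_")

# Join onto a copy of the original so the source columns stay untouched.
combined = data_df.join(scaled_df)
print(combined.columns.tolist())
```

This leaves `data_df` itself unchanged, in keeping with the goal of not modifying the source data.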
Let’s visualize the data to check that our scaling hasn’t changed the shape of its distribution.
In [14]: import matplotlib.pyplot as plt
In [15]: plt.figure()
Out[15]: <matplotlib.figure.Figure at 0x1076d9da0>
In [16]: plt.subplot(1, 2, 1)