Supervised Classification

Consider an average e-mail inbox. This inbox receives any number of messages from various sources many times per day. Some of those sources aren’t folks that you want to hear from, so you manually filter them away. Most emails you get with “Special Offer Guarantee!” in the body can go straight to spam. Some are marked “important!”, typically ones from emails ending in “@mycompany.com”. Others still are just regular emails that you may look into at some point in the future.

If you’re fortunate, your e-mail client will start to learn from what you’ve been doing with your messages and try to emulate that activity. It’ll look at all the things you’ve labeled as “spam” or “trash” and infer what constitutes a message to be labeled as such from the patterns within. Similarly, “important” messages will likely have certain similarities that would identify them ahead of time, without you having to mark it yousrelf.

If your email client has any sort of message-filtering built in it’s making use of a Supervised Classification algorithm, using data that you DO know about to infer labels for data that you DON’T know about.

Common supervised classification algorithms include:

  • Regression algorithms
  • Support Vector Machines
  • K-Nearest Neighbors
  • Gaussian Processes
  • Neural Networks
  • Naive Bayes
  • Decision Trees

We’ll talk about the ones in bold here. You’re encouraged to explore the rest on your own.

The Look of Labeled Data

What constitutes a label, or a class, is entirely dependent upon the question being asked. Consider this data about survivors of the Titanic:

In [1]: import pandas as pd

In [2]: data = pd.read_csv("../downloads/titanic_data.csv")

In [3]: print(data)
   PassengerId  Survived  Pclass                                                Name     Sex   Age  SibSp Parch            Ticket     Fare Cabin Embarked
0            1         0       3                             Braund, Mr. Owen Harris    male  22.0      1     0         A/5 21171   7.2500   NaN        S
1            2         1       1   Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1     0          PC 17599  71.2833   C85        C
2            3         1       3                              Heikkinen, Miss. Laina  female  26.0      0     0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1     0            113803  53.1000  C123        S
4            5         0       3                            Allen, Mr. William Henry    male  35.0      0     0            373450   8.0500   NaN        S
5            6         0       3                                    Moran, Mr. James    male   NaN      0     0            330877   8.4583   NaN        Q
6            7         0       1                             McCarthy, Mr. Timothy J    male  54.0      0     0             17463  51.8625   E46        S
7            8         0       3                      Palsson, Master. Gosta Leonard    male   2.0      3     1            349909  21.0750   NaN        S
8            9         1       3   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0     2            347742  11.1333   NaN        S
9           10         1       2                 Nasser, Mrs. Nicholas (Adele Achem)  female  14.0      1     0            237736  30.0708   NaN        C
...

If my question were “Who lived and who died on the Titanic?” then clearly the class would be “Survived”, and I would write an algorithm that most efficiently predicted survival. However, if my question had been, “What’s the best determinant of where a passenger embarked from?” my classes would be entirely different and would pull from the “Embarked” column.

When writing your own machine learning algorithms, you can tailor them to interpret along the classes given in your data. This means your classes can be “1, 2, 3” or “A, B, C” or “fruit, vegetable, meat”, or whatever. However, if you were to use the algorithms present in Python’s scikit-learn package for Data Science and Machine Learning, you’d have to translate the classes to numerical information.