The Decision Tree Classifier

A decision tree classifier is an algorithm that uses a sequence of splits in parameter space to classify data. Training data is used to construct the tree, and any new data the tree is applied to is classified according to the divisions the training data established. Each division is made on one characteristic at a time, so classification follows a human-understandable sequence of boolean decisions (e.g. class A people are shorter than 5 foot 3, and class B people are taller).

[Figure: a visualization of a decision tree]
[Figure: a visualization of the classification regions]
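
To make that "sequence of boolean decisions" concrete, the short example below trains scikit-learn's DecisionTreeClassifier on a toy height/weight dataset and prints the learned rules. It is used here purely as an illustration; it is not necessarily the classifier documented on this page.

    # Illustration only: scikit-learn's DecisionTreeClassifier stands in for a
    # generic decision tree; it is not necessarily the classifier described here.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Toy data: each row is (height in inches, weight in pounds).
    X = [[60, 120], [62, 130], [61, 115],   # class "A" (shorter people)
         [66, 150], [68, 160], [70, 175]]   # class "B" (taller people)
    y = ["A", "A", "A", "B", "B", "B"]

    clf = DecisionTreeClassifier(max_depth=2)
    clf.fit(X, y)

    # Prints the tree as a readable chain of boolean decisions on height/weight.
    print(export_text(clf, feature_names=["height", "weight"]))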

The Algorithm

At each node of the tree, the following steps are applied:

  1. There is data \(D\), with each row having characteristics (\(x_1\), \(x_2\), ..., \(x_N\)) and label \(y\)
  2. Within one characteristic \(x_j\), \(D\) will be divided along a decision boundary \(t\). The data will be split into \(D_{left}\) and \(D_{right}\).
\[D_{left} = D[\text{where } x_j \le t]\]
\[D_{right} = D[\text{where } x_j > t]\]
  3. The best decision boundary will be where the following function is at a minimum:
\[G(D) = \frac{n_{left}}{N_D} \cdot H(D_{left}) + \frac{n_{right}}{N_D} \cdot H(D_{right})\]

where...

\[H(D) = p(1) \cdot (1 - p(1)) + p(2) \cdot (1 - p(2))\]

is the measure of impurity (i.e. whether the region is dominated by a single class or contains a mix of classes), and

\[p(k) = \frac{n_D(\text{class} = k)}{N_D}\]

is the fraction of the data in the region belonging to class \(k\).

  4. Split the data at the best available decision boundary, and for each portion (left and right) return to step 1.
  5. Iterate on steps 1-4 until you hit your minimum “leaf” size, your maximum tree depth, or \(N_D = 1\). (A code sketch of this procedure follows the list.)
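
The procedure above can be sketched in a few dozen lines of Python. The data layout (a list of (features, label) rows), the function names, and the stopping defaults are assumptions made for illustration, not a description of any particular implementation:

    # Illustrative sketch of steps 1-5. The data layout (a list of
    # (features, label) pairs) and all names here are assumptions.

    def gini(rows, classes=(1, 2)):
        """H(D): impurity of a region, per the two-class formula above."""
        n = len(rows)
        if n == 0:
            return 0.0
        impurity = 0.0
        for k in classes:
            p_k = sum(1 for _, label in rows if label == k) / n
            impurity += p_k * (1 - p_k)
        return impurity

    def best_split(rows, num_features):
        """Steps 2-3: find the (characteristic j, threshold t) minimizing G(D)."""
        best, best_score = None, float("inf")
        n = len(rows)
        for j in range(num_features):
            for features, _ in rows:          # candidate thresholds taken from the data
                t = features[j]
                left = [r for r in rows if r[0][j] <= t]
                right = [r for r in rows if r[0][j] > t]
                if not left or not right:
                    continue
                score = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
                if score < best_score:
                    best, best_score = (j, t, left, right), score
        return best

    def build_tree(rows, num_features, depth=0, max_depth=5, min_leaf_size=1):
        """Steps 4-5: split recursively until a stopping criterion is hit."""
        labels = [label for _, label in rows]
        split = best_split(rows, num_features)
        if split is None or depth >= max_depth or len(rows) <= min_leaf_size:
            return {"leaf": max(set(labels), key=labels.count)}   # majority-class leaf
        j, t, left, right = split
        return {
            "feature": j,
            "threshold": t,
            "left": build_tree(left, num_features, depth + 1, max_depth, min_leaf_size),
            "right": build_tree(right, num_features, depth + 1, max_depth, min_leaf_size),
        }

Each call to build_tree performs one pass through steps 1-3 on a region of the data; the recursive calls on the left and right halves are step 4, and the depth and leaf-size checks are step 5.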

Advantages

  • Requires very little data preparation/manipulation
  • Prediction on test data is \(O(\log n)\)
  • Can work on numerical and categorical data

Disadvantages

  • Does not generalize well and is very prone to overfitting, especially as leaf sizes decrease.
  • Cannot handle data with missing values

Attributes

  • max_depth: the maximum number of decisions any branch can take
  • min_leaf_size: the minimum number of data points a node may contain; once a node reaches this size, splitting stops (see the sketch below)
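
As a rough illustration (the helper name is an assumption, not part of the documented interface), the two attributes act as stopping criteria while the tree is being built:

    # Illustrative only: how max_depth and min_leaf_size might terminate the
    # recursion from step 5 of the algorithm. Names are assumptions.
    def should_stop(data, depth, max_depth, min_leaf_size):
        return (
            depth >= max_depth              # this branch has already taken max_depth decisions
            or len(data) <= min_leaf_size   # the node is at (or below) the minimum leaf size
            or len(data) == 1               # N_D = 1: nothing left to split
        )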

Operations

  • clf.fit(training_data): construct the tree and build up the decision chains. Returns nothing.
  • clf.predict(test_data) -> test_data + classification: predict the classification of new data given the decision chains already built. Returns the data given, as well as the labels output by the tree.
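
As an illustrative sketch of the predict operation (using the tree dictionaries produced by the build_tree sketch earlier on this page; the names and structures here are assumptions, not the documented implementation), prediction simply walks the decision chain for each row:

    # Illustrative sketch: walk the decision chains for each row and return the
    # data given together with the label output by the tree.
    def predict(tree, rows):
        results = []
        for features in rows:
            node = tree
            while "leaf" not in node:                  # follow the boolean decisions
                if features[node["feature"]] <= node["threshold"]:
                    node = node["left"]
                else:
                    node = node["right"]
            results.append((features, node["leaf"]))   # (row, predicted label)
        return results

Because each row follows a single root-to-leaf path, the prediction cost grows with tree depth rather than with the size of the training data, which is the \(O(\log n)\) behaviour listed under Advantages.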