Decision Trees & Random Forests Experiment

Objectives

Predict survival of Titanic passengers.

Data Analysis

Selection of Features

  • Features must be discrete -> discretize continuous variables; a pandas version of the binning below is sketched after the cell.
In [3]:
temperatures = [-40.0, -20.5, -15.13, 0.00, 15.0, 32.0, 66.0, 98.6, 212.0]
temps_discrete = {'frozen': [-40.0, -20.5, -15.13, 0.00], 'cold': [15.0, 32.0], 'nice': [66.0, 98.6], 'hot': [212.0]}
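The same binning can be done with pandas.cut; a minimal sketch, where the bin edges and labels are just one reasonable choice that mirrors the dict above:

import pandas as pd

temperatures = [-40.0, -20.5, -15.13, 0.00, 15.0, 32.0, 66.0, 98.6, 212.0]

# Edges chosen so each interval reproduces one key of temps_discrete above.
bins = [float('-inf'), 0.0, 32.0, 98.6, float('inf')]
labels = ['frozen', 'cold', 'nice', 'hot']

temps_binned = pd.cut(temperatures, bins=bins, labels=labels)
print(list(zip(temperatures, temps_binned)))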

Experiment Heuristics (Design)

Evaluation

Representation

Decision Tree (ID3)

  • Entropy:

H(p) = -p * log2(p) if p != 0 else 0

  • Information Gain:

gain(S, F) = H(S) - sum(len(S[f]) / len(S) * H(S[f]) for f in F)
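A minimal runnable sketch of both formulas, representing a subset as a (positives, negatives) count pair; the helper names are my own:

from math import log2

def H(p):
    # Entropy contribution of one probability term; 0 * log2(0) is taken as 0.
    return -p * log2(p) if p != 0 else 0.0

def entropy(pos, neg):
    # Entropy of a sample with `pos` positive and `neg` negative examples.
    total = pos + neg
    return H(pos / total) + H(neg / total)

def gain(S, splits):
    # Information gain of partitioning sample S = (pos, neg) into `splits`,
    # one (pos, neg) pair per feature value.
    total = S[0] + S[1]
    return entropy(*S) - sum((p + n) / total * entropy(p, n) for p, n in splits)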

Weather can be rainy or sunny.
S is a sample of days, labeled + if I wore a hoodie and - if not.
S = [9+, 5-]
S[rainy] = [6+, 2-]
S[sunny] = [3+, 3-]

\(Gain(S, \text{weather}) = \left(H(\tfrac{9}{14}) + H(\tfrac{5}{14})\right) - \tfrac{8}{14}\left(H(\tfrac{6}{8}) + H(\tfrac{2}{8})\right) - \tfrac{6}{14} \cdot 1 \approx 0.048\)

What about temperature: cold [6+, 1-], hot [3+, 4-]?

gain(S, temperature) ≈ 0.152
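Reusing the entropy/gain sketch above, both worked numbers check out:

S = (9, 5)                       # 9 hoodie days, 5 non-hoodie days
weather = [(6, 2), (3, 3)]       # rainy, sunny
temperature = [(6, 1), (3, 4)]   # cold, hot

print(round(gain(S, weather), 3))      # 0.048
print(round(gain(S, temperature), 3))  # 0.152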

Which feature belongs closer to the root of the tree? ID3 greedily splits on the feature with the higher information gain, so temperature (0.152) would be chosen before weather (0.048).

Bagging

Bootstrap aggregating: each tree is trained on a sample of the training set drawn uniformly at random with replacement, so individual trees see slightly different data and their errors decorrelate.
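A minimal sketch of one bootstrap draw with numpy (the generator seed and function name are my own choices):

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sample(data):
    # Draw len(data) rows uniformly at random, with replacement.
    idx = rng.integers(0, len(data), size=len(data))
    return data[idx]

Each tree in the forest is fit on its own bootstrap sample; RandomForestClassifier does this internally.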

Tuning

Experiment

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
# Assumes train_data is a cleaned numeric array with the Survived label in column 0.
forest = forest.fit(train_data[0::, 1::], train_data[0::, 0])
output = forest.predict(test_data).astype(int)

In [6]:
import pandas

# 1. open test.csv & clean
# 2. predict on test data
# 3. convert predictions to dataframe
df_result = pandas.DataFrame(results[:, 0:2], columns=['PassengerId', 'Survived'])
# 4. dump csv
df_result.to_csv('titanic.csv', index=False)

Conclusions

Recommendation