Decision Trees & Random Forests Experiment¶
Objectives¶
Predict survival of Titanic passengers.
Data Analysis¶
Selection of Features¶
- Must be discrete → discretize continuous variables
In [3]:
# Example: discretize a continuous variable (temperatures, °F) into named bins
temperatures = [-40.0, -20.5, -15.13, 0.00, 15.0, 32.0, 66.0, 98.6, 212.0]
temps_discrete = {'frozen': [-40.0, -20.5, -15.13, 0.00],
                  'cold': [15.0, 32.0],
                  'nice': [66.0, 98.6],
                  'hot': [212.0]}
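The grouping above is done by hand; as a sketch, pandas.cut can produce the same bins automatically. The bin edges below are assumptions chosen to reproduce the manual grouping, and the cell reuses temperatures from above.
In [ ]:
import pandas as pd

# hypothetical bin edges that reproduce the manual grouping above
bins = [float('-inf'), 0.0, 32.0, 100.0, float('inf')]
labels = ['frozen', 'cold', 'nice', 'hot']

# right-inclusive intervals by default, so 0.0 -> 'frozen', 32.0 -> 'cold'
temps_binned = pd.cut(temperatures, bins=bins, labels=labels)
print(list(zip(temperatures, temps_binned)))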
Experiment Heuristics (Design)¶
Evaluation¶
Representation¶
Decision Tree (ID3)¶
- Entropy:
\(H(p) = -p \log_2(p)\) if \(p \neq 0\), else \(0\)
- Information Gain:
\(Gain(S, F) = H(S) - \sum_{f \in F} \frac{|S_f|}{|S|} \, H(S_f)\), where \(S_f\) is the subset of \(S\) with value \(f\) for feature \(F\).
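Before the worked example, a minimal sketch of both formulas in Python; the helper names entropy and info_gain are mine, and class labels are assumed to arrive as [positive, negative] counts, matching the notation below.
In [ ]:
from math import log2

def H(p):
    # entropy contribution of a single proportion; defined as 0 when p == 0
    return -p * log2(p) if p != 0 else 0.0

def entropy(counts):
    # entropy of a set from its class counts, e.g. [9, 5] for [9+, 5-]
    total = sum(counts)
    return sum(H(c / total) for c in counts)

def info_gain(parent_counts, subsets):
    # Gain(S, F) = H(S) - sum over f of |S_f|/|S| * H(S_f)
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted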
Weather can be rainy or sunny.
S is a sample of days, labeled + on days I wore a hoodie and - otherwise.
S = [9+, 5-]
S[rainy] = [6+, 2-]
S[sunny] = [3+, 3-]
\(Gain(S, weather) = \left(H(\tfrac{9}{14}) + H(\tfrac{5}{14})\right) - \tfrac{8}{14}\left(H(\tfrac{6}{8}) + H(\tfrac{2}{8})\right) - \tfrac{6}{14} \cdot 1 = 0.048\)
(the sunny subset [3+, 3-] has entropy exactly 1)
What about temperature, with cold [6+, 1-] and hot [3+, 4-]?
\(Gain(S, temperature) \approx 0.152\)
Which is closer to the root of the problem? Temperature: it has the higher gain, so ID3 places it nearer the root of the tree.
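Checking both gains with the sketch above:
In [ ]:
# weather: rainy [6+, 2-], sunny [3+, 3-]
print(round(info_gain([9, 5], [[6, 2], [3, 3]]), 3))  # 0.048
# temperature: cold [6+, 1-], hot [3+, 4-]
print(round(info_gain([9, 5], [[6, 1], [3, 4]]), 3))  # 0.152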
Bagging¶
Each tree is trained on a bootstrap sample of the training set: uniformly random sampling with replacement, as sketched below.
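A minimal sketch of one bootstrap sample with NumPy; the toy data and the fixed seed are assumptions for illustration.
In [ ]:
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for reproducibility
data = np.arange(10)            # toy stand-in for training-set rows

# uniformly random, with replacement: some rows repeat, others are left out
# (on average about 1/e ~ 37% of rows miss any given sample)
bootstrap = rng.choice(data, size=len(data), replace=True)
print(bootstrap)
Each tree in the forest grows on its own bootstrap sample, which is what decorrelates the trees.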
Tuning¶
Experiment¶
from sklearn.ensemble import RandomForestClassifier

# train a forest of 100 trees: column 0 of train_data is the label (Survived),
# the remaining columns are the features
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data[:, 1:], train_data[:, 0])
output = forest.predict(test_data).astype(int)
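For context, one way train_data could have been assembled from train.csv; the feature list and the crude missing-value fill are assumptions, not the original preprocessing.
In [ ]:
import pandas as pd

df = pd.read_csv('train.csv')
# assumed cleaning: encode Sex numerically, fill missing ages with the median
df['Sex'] = (df['Sex'] == 'male').astype(int)
df['Age'] = df['Age'].fillna(df['Age'].median())
# Survived first, so that train_data[:, 0] is the label
train_data = df[['Survived', 'Pclass', 'Sex', 'Age', 'Fare']].values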
In [6]:
import pandas

# 1. open test.csv & clean
# 2. predict on the test data
# 3. convert predictions to a DataFrame
#    (results is assumed to hold PassengerId alongside the predicted values)
df_result = pandas.DataFrame(results[:, 0:2], columns=['PassengerId', 'Survived'])
# 4. dump to csv
df_result.to_csv('titanic.csv', index=False)