Decision Trees & Random Forests Experiment¶
Objectives¶
Predict survival of Titanic passengers.
Data Analysis¶
Selection of Features¶
- Must be discrete → discretize continuous variables
In [3]:
# Example: discretize a continuous variable (temperatures, °F) into named bins
temperatures = [-40.0, -20.5, -15.13, 0.00, 15.0, 32.0, 66.0, 98.6, 212.0]
temps_discrete = {'frozen': [-40.0, -20.5, -15.13, 0.00],
                  'cold': [15.0, 32.0],
                  'nice': [66.0, 98.6],
                  'hot': [212.0]}
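The grouping above is done by hand; as a sketch, pandas.cut can produce the same bins automatically. The bin edges below are assumptions chosen to reproduce the manual grouping, and the cell reuses temperatures from above.
In [ ]:
import pandas as pd

# hypothetical bin edges that reproduce the manual grouping above
bins = [float('-inf'), 0.0, 32.0, 100.0, float('inf')]
labels = ['frozen', 'cold', 'nice', 'hot']

# right-inclusive intervals by default, so 0.0 -> 'frozen', 32.0 -> 'cold'
temps_binned = pd.cut(temperatures, bins=bins, labels=labels)
print(list(zip(temperatures, temps_binned)))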
Experiment Heuristics (Design)¶
Evaluation¶
Representation¶
Decision Tree (ID3)¶
- Entropy:
\(H(p) = -p \log_2(p)\) if \(p \neq 0\), else \(0\)
- Information Gain:
\(Gain(S, F) = H(S) - \sum_{f \in F} \frac{|S_f|}{|S|} \, H(S_f)\), where \(S_f\) is the subset of \(S\) with value \(f\) for feature \(F\).
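Before the worked example, a minimal sketch of both formulas in Python; the helper names entropy and info_gain are mine, and class labels are assumed to arrive as [positive, negative] counts, matching the notation below.
In [ ]:
from math import log2

def H(p):
    # entropy contribution of a single proportion; defined as 0 when p == 0
    return -p * log2(p) if p != 0 else 0.0

def entropy(counts):
    # entropy of a set from its class counts, e.g. [9, 5] for [9+, 5-]
    total = sum(counts)
    return sum(H(c / total) for c in counts)

def info_gain(parent_counts, subsets):
    # Gain(S, F) = H(S) - sum over f of |S_f|/|S| * H(S_f)
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted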
Weather can be rainy or sunny.
S is a sample of days, labeled + on days I wore a hoodie and - otherwise.
S = [9+, 5-]
S[rainy] = [6+, 2-]
S[sunny] = [3+, 3-]
\(Gain(S, weather) = \left(H(\tfrac{9}{14}) + H(\tfrac{5}{14})\right) - \tfrac{8}{14}\left(H(\tfrac{6}{8}) + H(\tfrac{2}{8})\right) - \tfrac{6}{14} \cdot 1 = 0.048\)
(the sunny subset [3+, 3-] has entropy exactly 1)
What about temperature, with cold [6+, 1-] and hot [3+, 4-]?
\(Gain(S, temperature) \approx 0.152\)
Which is closer to the root of the problem? Temperature: it has the higher gain, so ID3 places it nearer the root of the tree.
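Checking both gains with the sketch above:
In [ ]:
# weather: rainy [6+, 2-], sunny [3+, 3-]
print(round(info_gain([9, 5], [[6, 2], [3, 3]]), 3))  # 0.048
# temperature: cold [6+, 1-], hot [3+, 4-]
print(round(info_gain([9, 5], [[6, 1], [3, 4]]), 3))  # 0.152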
Bagging¶
Each tree is trained on a bootstrap sample of the training set: uniformly random sampling with replacement, as sketched below.
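A minimal sketch of one bootstrap sample with NumPy; the toy data and the fixed seed are assumptions for illustration.
In [ ]:
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for reproducibility
data = np.arange(10)            # toy stand-in for training-set rows

# uniformly random, with replacement: some rows repeat, others are left out
# (on average about 1/e ~ 37% of rows miss any given sample)
bootstrap = rng.choice(data, size=len(data), replace=True)
print(bootstrap)
Each tree in the forest grows on its own bootstrap sample, which is what decorrelates the trees.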
Tuning¶
Experiment¶
from sklearn.ensemble import RandomForestClassifier

# train a forest of 100 trees: column 0 of train_data is the label (Survived),
# the remaining columns are the features
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data[:, 1:], train_data[:, 0])
output = forest.predict(test_data).astype(int)
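For context, one way train_data could have been assembled from train.csv; the feature list and the crude missing-value fill are assumptions, not the original preprocessing.
In [ ]:
import pandas as pd

df = pd.read_csv('train.csv')
# assumed cleaning: encode Sex numerically, fill missing ages with the median
df['Sex'] = (df['Sex'] == 'male').astype(int)
df['Age'] = df['Age'].fillna(df['Age'].median())
# Survived first, so that train_data[:, 0] is the label
train_data = df[['Survived', 'Pclass', 'Sex', 'Age', 'Fare']].values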
In [6]:
import pandas

# 1. open test.csv & clean
# 2. predict on the test data
# 3. convert predictions to a DataFrame
#    (results is assumed to hold PassengerId alongside the predicted values)
df_result = pandas.DataFrame(results[:, 0:2], columns=['PassengerId', 'Survived'])
# 4. dump to csv
df_result.to_csv('titanic.csv', index=False)