K-Means Experiment¶
Objectives¶
Predict survival of Titanic passangers.
Which features most accurately predict the outcome?
Data Analysis¶
In [31]:
import pandas
import numpy
from sklearn.cross_validation import train_test_split
from sklearn.cluster import KMeans
from pprint import pprint
MY_TITANIC_TRAIN = '/media/removable/data/train_titanic.csv'
MY_TITANIC_TEST = '/media/removable/data/test_titanic.csv'
titanic_dataframe = pandas.read_csv(MY_TITANIC_TRAIN, header=0)
- fix missing values
In [32]:
titanic_dataframe = titanic_dataframe.dropna()
- statistics & shape
Selection of Features¶
- Must have a mean –> Remove categorical data
In [33]:
titanic_dataframe.drop(['Name', 'Ticket', 'Cabin', 'Embarked', 'Sex'], axis=1, inplace=True)
#print('length: {0} '.format(len(titanic_dataframe)))
#print(titanic_dataframe.head(5))
length: 183
PassengerId Survived Pclass Age SibSp Parch Fare
1 2 1 1 38 1 0 71.2833
3 4 1 1 35 1 0 53.1000
6 7 0 1 54 0 0 51.8625
10 11 1 3 4 1 1 16.7000
11 12 1 1 58 0 0 26.5500
- discrete vs. continuous
In [34]:
print(2.2 * 3.0 == 6.6)
print(3.3 * 2.0 == 6.6)
False
True
Oh look –floats bite.
Experiment Heueristics (Design)¶
Evaluation¶
Confusion Matrix: https://en.wikipedia.org/wiki/Confusion_matrix
Confusion Matrix Clarification: http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
Mean F Score: https://www.kaggle.com/wiki/MeanFScore
- \(F_1 = 2 * \frac{precision * recall}{ precision + recall}\)
- \(precision = \frac{tp}{tp+fp}\)
- \(recall = \frac{tp}{tp+fn}\)
Representation¶
K-means
Distance = Euclidean (yes I mispelled this in KNN.ipynb)
Data: 60% Train, 10% Validation, 30% Test
In [35]:
train, test = train_test_split(titanic_dataframe, test_size = 0.2)
y = train['Survived']
X = train[2:]
Optimization¶
- vary numerical features used
- vary K
- vary initialization
Experiment¶
In [36]:
k = 2
kmeans = KMeans(n_clusters=k)
results = kmeans.fit_predict(X.values, y.values)
print(results)
[0 0 1 1 0 1 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 1 1 1
0 0 0 1 0 1 1 1 0 0 0 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 0 0 0 1 1 0 0
1 0 0 1 1 0 0 0 1 1 1 0 0 1 1 1 1 0 0 1 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 1
1 1 0 0 0 1 0 1 1 0 1 0 1 1 1 0 0 0 1 0 0 0 0 1 0 0 1 1 1 0 1 0 1]
- Prepare and upload to Kaggle
In [38]:
#1. open test.csv & clean
#2. predict on test data
#3. convert predictions to datframe
'''df_result = pandas.DataFrame(results[:,0:2], columns=['PassengerId', 'Survived'])'''
#4. dump csv
'''df_result.to_csv('titanic.csv', index=False) '''
Out[38]:
"df_result.to_csv('titanic.csv', index=False) "