Overview¶

Install¶

Conda is like virtualenv for data science.

It installs the packages we need for our foray into machine learning and statistics, and it builds our environments.

To help prevent network slowdowns during lectures on Monday, please install your root environment beforehand.

Here is a workflow (from a data scientist): http://stiglerdiet.com/blog/2015/Nov/24/my-python-environment-workflow-with-conda/

Here is the documentation for conda: http://conda.pydata.org/docs/

Here is the cheatsheet: http://conda.pydata.org/docs/_downloads/conda-cheatsheet.pdf

Download the Anaconda installer for your OS from https://www.continuum.io/downloads
bash [AnacondaFile].sh
cd /path/to/anaconda/bin; source activate root; cd
```
conda list -e > [MY_SPEC_FILE].txt
```
Remove the lines that contain conda or anaconda from [MY_SPEC_FILE].txt

conda create --name [ENV_NAME] --file [MY_SPEC_FILE].txt
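
The "remove lines" step above can be done by hand in an editor. As an alternative, here is a minimal Python sketch of the same filtering. The function names and the sample lines are hypothetical, made up for illustration, and the sketch assumes the spec file has one `name=version=build` entry per line. Comment lines are kept on purpose so the platform header survives.

```python
def filter_spec_lines(lines):
    """Drop package lines that mention conda (this also covers anaconda)."""
    kept = []
    for line in lines:
        # Keep comment lines (starting with '#') so the platform header survives.
        if line.startswith("#") or "conda" not in line.lower():
            kept.append(line)
    return kept


def filter_spec_file(src_path, dst_path):
    """Read a spec file, filter it, and write the result to a new file."""
    with open(src_path) as src:
        lines = src.read().splitlines()
    with open(dst_path, "w") as dst:
        dst.write("\n".join(filter_spec_lines(lines)) + "\n")


# Hypothetical sample lines, shaped like `conda list -e` output.
sample = [
    "# platform: linux-64",
    "anaconda=2.5.0=np110py35_0",
    "conda=4.0.5=py35_0",
    "conda-env=2.4.5=py35_0",
    "numpy=1.10.4=py35_0",
    "pandas=0.17.1=np110py35_0",
]
filtered = filter_spec_lines(sample)
print(filtered)
```

To apply it to a real file, call `filter_spec_file` with the spec file path and an output path, then pass the output file to `conda create --file`.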

Welcome¶

Kaggle: https://www.kaggle.com/c/titanic
Pods
Create your team
Set up your team repo and associate

Structure¶

Data Analysis¶

Data Structure¶

Trees

Algorithms¶

KNN / K-Means (Vector Quantization)
Decision Trees
Linear/Logistic Regression
Multivariate Regression

Tools¶

numpy, scipy
scikit-learn
jupyter notebook
matplotlib, seaborn
pandas
pytest-ipynb
sql

Lab¶

pods (recitation)
pairs (alternate daily)

Experiment¶

1. Hypothesis | Aim | Objectives¶

Predict survival of Titanic passengers.
Which features most accurately predict the outcome?
Which machine learning algorithms are most accurate?

2. Data Analysis¶

cleaning¶

nulls/unknowns
aggregate fields
noise
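
A minimal sketch of two of the cleaning steps above, using only the standard library. The rows are hypothetical, invented for illustration. The column names `Age`, `SibSp`, and `Parch` are assumed to match the Kaggle Titanic data, while `AgeMissing` and `FamilySize` are made-up names for derived fields. Filling nulls with the median is one possible choice, not a recommendation, and noise handling is not covered here.

```python
from statistics import median

# Hypothetical rows; None marks a missing (null/unknown) value.
passengers = [
    {"Age": 22.0, "SibSp": 1, "Parch": 0},
    {"Age": None, "SibSp": 0, "Parch": 0},
    {"Age": 38.0, "SibSp": 1, "Parch": 2},
    {"Age": 4.0, "SibSp": 3, "Parch": 1},
]


def clean(rows):
    """Fill missing ages with the median and add an aggregate family-size field."""
    known_ages = [r["Age"] for r in rows if r["Age"] is not None]
    fill_age = median(known_ages)
    cleaned = []
    for r in rows:
        row = dict(r)  # copy so the input rows are left unchanged
        row["AgeMissing"] = r["Age"] is None  # remember which values were filled
        if r["Age"] is None:
            row["Age"] = fill_age
        # Aggregate field: siblings/spouses + parents/children + the passenger.
        row["FamilySize"] = r["SibSp"] + r["Parch"] + 1
        cleaned.append(row)
    return cleaned


cleaned = clean(passengers)
print(cleaned[1])
```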

Pre-processing¶

Formatting, Sampling

codebook¶

Princeton Codebook: http://dss.princeton.edu/online_help/analysis/codebook.htm
CDC: http://www.cdc.gov/hiv/pdf/library_software_answr_codebook.pdf
McGill Medicine: http://www.medicine.mcgill.ca/epidemiology/joseph/pbelisle/CodebookCookbook.html
Kaggle Titanic: https://www.kaggle.com/c/titanic/data

statistics & shape¶

Nick

3. Selection of Features¶

input X¶

{independent(causality)|predictor(correlated)|explanatory(statistically dependent)|Feature}

Class, Sex, Age, Siblings, ParCh, SibSp, Embarked, Cabin

output Y¶

{dependent|predicted|response|Outcome}

Survived

factors and indicators¶

{categorical feature} and {dummy variables}

Class, Sex, Cabin, Embarked
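
To make the factor/indicator idea concrete, here is a small sketch that turns one categorical feature into dummy (indicator) variables. The helper name `one_hot` and the sample values are hypothetical. In practice a library routine would probably be used instead of hand-rolled code.

```python
def one_hot(values, prefix):
    """Encode each categorical value as a dict of 0/1 indicator columns."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{c}": int(v == c) for c in categories}
        for v in values
    ]


# Hypothetical sample of the Embarked feature.
embarked = ["S", "C", "S", "Q"]
dummies = one_hot(embarked, "Embarked")
for value, row in zip(embarked, dummies):
    print(value, row)
```

Each row has exactly one indicator set to 1. For linear models, one indicator column is often dropped, since it can be inferred from the others.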

4. Experiment Heuristics (Design)¶

Evaluation¶

Titanic: https://www.kaggle.com/c/titanic/details/evaluation
ROC: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
RMSE: https://www.kaggle.com/wiki/RootMeanSquaredError
Log Loss: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/details/evaluation
Mean F Score: https://www.kaggle.com/wiki/MeanFScore
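
For reference, the metrics linked above can be sketched in plain Python. These are textbook definitions for binary labels (0/1), written for illustration. They are not Kaggle's scoring code, and the function names and sample labels are assumptions. The F score here is the F1 of the positive class only, whereas the Kaggle page describes an averaged variant.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels and predictions.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(accuracy(y_true, y_pred), rmse(y_true, y_pred), f1_score(y_true, y_pred))
print(log_loss(y_true, [0.5, 0.5, 0.5, 0.5]))
```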

Representation¶

Data: 60% Train, 10% Validation, 30% Test
Algorithms
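
The 60/10/30 split above can be sketched as follows. The function name and the fixed seed are assumptions for illustration. A simple shuffle-and-slice is shown, with no stratification by outcome.

```python
import random


def split_60_10_30(items, seed=0):
    """Shuffle items and slice them into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n = len(shuffled)
    n_train = n * 60 // 100  # integer arithmetic avoids float rounding surprises
    n_val = n * 10 // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder, roughly 30%
    return train, validation, test


# Hypothetical example: 100 row indices.
train, validation, test = split_60_10_30(range(100))
print(len(train), len(validation), len(test))
```

The validation set is meant for tuning choices between models. The test set is held back until the end so the final score is measured on data the model has not seen.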

Optimization¶

off-the-rack
consignment
thrift-store

5. Experiment¶

\(Learning = Representation + Evaluation + Optimization\)¶

~Pedro Domingos https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

6. Conclusions¶

A well measured experiment...

7. Recommendation¶

Success comes from failures too...