Overview¶
Install¶
Conda is like virtualenv for data science.
It will install all the packages necessary for our foray into machine learning & stats and build our environments.
To help us avoid network slowdowns during lectures on Monday, please install your root environment ahead of time.
Here is a work-flow (from a data scientist): http://stiglerdiet.com/blog/2015/Nov/24/my-python-environment-workflow-with-conda/
Here is the documentation for conda: http://conda.pydata.org/docs/
Here is the cheatsheet: http://conda.pydata.org/docs/_downloads/conda-cheatsheet.pdf
Download the installer for your OS from https://www.continuum.io/downloads
bash [AnacondaFile].sh
cd /path/to/anaconda/bin; source activate root;cd
conda list -e > [MY_SPEC_FILE].txt
Remove the lines containing conda and anaconda from [MY_SPEC_FILE].txt
conda create --name [ENV_NAME] --file [MY_SPEC_FILE].txt
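The manual step of removing the conda and anaconda lines from the spec file could also be scripted. The following is a minimal sketch, assuming the spec file uses the `name=version=build` lines that `conda list -e` writes; the function name and the sample lines (including their version strings) are made up for illustration.

```python
def strip_conda_lines(lines):
    """Drop spec lines whose package name contains 'conda'.

    This also matches 'anaconda', 'conda-env' and similar names.
    Comment lines (starting with '#') are kept unchanged.
    """
    kept = []
    for line in lines:
        if not line.startswith("#"):
            # Package name is everything before the first '='.
            name = line.split("=", 1)[0].strip().lower()
            if "conda" in name:
                continue
        kept.append(line)
    return kept


# Hypothetical spec-file contents, not taken from a real environment.
sample = [
    "# platform: linux-64",
    "anaconda=1.0=py_0",
    "conda=1.0=py_0",
    "conda-env=1.0=py_0",
    "numpy=1.0=py_0",
    "pandas=1.0=py_0",
]
kept = strip_conda_lines(sample)
print("\n".join(kept))
```

To apply it to a real file, read the lines of [MY_SPEC_FILE].txt, pass them through `strip_conda_lines`, and write the result back before running `conda create`.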
Welcome¶
- Kaggle: https://www.kaggle.com/c/titanic
- Pods
- Create your team
- Set up your team repo and associate
Structure¶
Data Analysis¶
Data Structure¶
- Trees
Algorithms¶
- KNN / K-Means (Vector Quantization)
- Decision Trees
- Linear/Logistic Regression
- Multivariate Regression
Tools¶
- numpy, scipy
- scikit-learn
- jupyter notebook
- matplotlib, seaborn
- pandas
- pytest-ipynb
- sql
Lab¶
- pods (recitation)
- pairs (alternate daily)
Experiment¶
1. Hypothesis | Aim | Objectives¶
- Predict survival of Titanic passengers.
- Which features most accurately predict the outcome?
- Which machine learning algorithms are most accurate?
2. Data Analysis¶
cleaning¶
- nulls/unknowns
- aggregate fields
- noise
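As a rough illustration of the first two cleaning steps, the sketch below fills missing ages with the median and combines two fields into one. It uses plain Python so it runs without extra packages; the records are invented, `Age`, `SibSp` and `Parch` follow the Kaggle Titanic column names, and `AgeMissing` and `FamilySize` are hypothetical names chosen here. Median imputation is only one possible choice, and noise handling is not covered.

```python
from statistics import median

# Invented passenger records; None marks a missing value.
passengers = [
    {"Age": 22.0, "SibSp": 1, "Parch": 0},
    {"Age": None, "SibSp": 0, "Parch": 0},
    {"Age": 38.0, "SibSp": 1, "Parch": 2},
    {"Age": 4.0, "SibSp": 0, "Parch": 1},
]


def clean(rows):
    """Return cleaned copies of the rows; the input is left untouched."""
    known_ages = [r["Age"] for r in rows if r["Age"] is not None]
    fill = median(known_ages)
    cleaned = []
    for r in rows:
        row = dict(r)
        # Nulls/unknowns: remember which ages were missing, then fill them.
        row["AgeMissing"] = r["Age"] is None
        if row["Age"] is None:
            row["Age"] = fill
        # Aggregate fields: siblings/spouses + parents/children + self.
        row["FamilySize"] = row["SibSp"] + row["Parch"] + 1
        cleaned.append(row)
    return cleaned


cleaned = clean(passengers)
for row in cleaned:
    print(row)
```

Keeping the `AgeMissing` flag preserves the fact that a value was imputed, which can itself be a useful feature.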
Pre-processing¶
- Formatting, Sampling
codebook¶
- Princeton Codebook: http://dss.princeton.edu/online_help/analysis/codebook.htm
- CDC: http://www.cdc.gov/hiv/pdf/library_software_answr_codebook.pdf
- McGill Medicine: http://www.medicine.mcgill.ca/epidemiology/joseph/pbelisle/CodebookCookbook.html
- Kaggle Titanic: https://www.kaggle.com/c/titanic/data
statistics & shape¶
Nick
3. Selection of Features¶
input X¶
{independent(causality)|predictor(correlated)|explanatory(statistically dependent)|Feature}
- Class, Sex, Age, Siblings, ParCh, SibSp, Embarked, Cabin
4. Experiment Heuristics (Design)¶
Evaluation¶
- Titanic: https://www.kaggle.com/c/titanic/details/evaluation
- ROC: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
- RMSE: https://www.kaggle.com/wiki/RootMeanSquaredError
- Log Loss: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/details/evaluation
- Mean F Score: https://www.kaggle.com/wiki/MeanFScore
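To make the metrics above concrete, here is a small sketch of four of them in plain Python. The Titanic competition scores submissions by accuracy (the fraction of passengers predicted correctly). The F score shown is the ordinary binary F1 over a single set of predictions, which is an assumption made here and may differ from how a given competition averages it. The labels and predictions are invented.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between two numeric sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)


def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall for class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented labels: 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(accuracy(y_true, y_pred))
print(f1_score(y_true, y_pred))
```

ROC analysis is left out because it needs predicted scores at many thresholds rather than a single set of hard predictions.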
Representation¶
- Data: 60% Train, 10% Validation, 30% Test
- Algorithms
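The 60% / 10% / 30% split listed above can be sketched as follows. This is one simple way to do it, assuming a single random shuffle with a fixed seed is acceptable; the function name and the seed value are illustrative, and integer arithmetic is used so the sizes are exact.

```python
import random


def split_60_10_30(rows, seed=0):
    """Shuffle a copy of rows and cut it into train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    n = len(shuffled)
    n_train = n * 6 // 10
    n_val = n // 10
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder, about 30%
    return train, validation, test


# Stand-in row ids; a real run would pass the passenger records.
train, validation, test = split_60_10_30(range(100))
print(len(train), len(validation), len(test))
```

The validation set is for comparing algorithms and settings, so that the test set is only touched once at the end.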
Optimization¶
- off-the-rack
- consignment
- thrift-store
5. Experiment¶
\(Learning = Representation + Evaluation + Optimization\)¶
~Pedro Domingos https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
6. Conclusions¶
- A well measured experiment...
7. Recommendation¶
- Success comes from failures too...