Working with Imperfect Data
The messiest data I’ve encountered has been user-entered. The next best thing is publicly available data, so for this assignment we’re going to find some.
Tasks
Pick a major US city and download some data with at least 200 rows in it. Most major cities have open data portals where you can download simple “.csv” files. For example:
Describe your data set and explain why you chose it in your README.md file, along with a link to where you obtained it.
Clean and normalize your data however you wish in a Jupyter Notebook.
Document every step you take to pre-process the data before asking questions about it.
Pose at least five numerical questions about the data and answer them within your Jupyter Notebook.
Produce at least one figure relevant to the questions you’ve asked.
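As a rough illustration of what "clean and normalize" can look like, here is a minimal sketch using only the Python standard library. The sample rows, the column names (Request ID, Category, Days Open), and the set of "missing" markers are all invented for the example; your own data will need its own rules, and in a Notebook you may well prefer a data-frame library instead.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded city CSV; the column
# names and values are invented for illustration only.
RAW = """Request ID, Category ,Days Open
1001,Pothole,3
1002, pothole ,
1003,GRAFFITI,12
1003,GRAFFITI,12
1004,Streetlight,n/a
"""

# Assumed markers for "no value"; real data sets use their own conventions.
MISSING = {"", "n/a", "na", "null"}


def clean_rows(text):
    """Normalize headers, trim text, map missing markers to None, drop exact duplicates."""
    reader = csv.reader(io.StringIO(text))
    # "Days Open" -> "days_open", " Category " -> "category"
    header = [h.strip().lower().replace(" ", "_") for h in next(reader)]
    seen = set()
    cleaned = []
    for raw in reader:
        values = tuple(v.strip() for v in raw)
        if values in seen:  # skip rows that repeat an earlier row exactly
            continue
        seen.add(values)
        row = dict(zip(header, values))
        row["category"] = row["category"].lower()
        days = row["days_open"]
        row["days_open"] = None if days.lower() in MISSING else int(days)
        cleaned.append(row)
    return cleaned


rows = clean_rows(RAW)

# One numerical question: what is the mean of "days open" where it is known?
known = [r["days_open"] for r in rows if r["days_open"] is not None]
mean_days_open = sum(known) / len(known)
print(len(rows), "rows;", "mean days open:", mean_days_open)
```

Each rule in `clean_rows` (header normalization, trimming, missing-value handling, de-duplication) is the kind of pre-processing step the assignment asks you to document before posing questions.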
Submitting Your Work
Create an imperfect-data branch in your machine-learning repository.
When you’re done posing and answering your own questions about your data, reset your Notebook and run all of the cells from top to bottom.
When it’s done, your first code cell should be numbered [1] and every subsequent code cell should be numbered [N + 1], with no gaps.
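If you want to double-check the numbering before pushing, the following is a small sketch that reads the Notebook's JSON and confirms the code cells are numbered 1, 2, 3, and so on. It assumes the standard version 4 Notebook layout (a top-level "cells" list whose code cells carry an "execution_count"); the in-memory example below is a made-up stand-in for a real file.

```python
import json


def execution_counts(notebook):
    """Return the execution_count of every code cell, in document order."""
    return [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]


def ran_top_to_bottom(notebook):
    """True when code cells are numbered 1, 2, 3, ... with no gaps."""
    counts = execution_counts(notebook)
    return counts == list(range(1, len(counts) + 1))


# Tiny in-memory stand-in for a notebook file (hypothetical content).
example = json.loads("""
{"cells": [
  {"cell_type": "markdown", "source": ["# Title"]},
  {"cell_type": "code", "execution_count": 1, "source": ["x = 1"], "outputs": []},
  {"cell_type": "code", "execution_count": 2, "source": ["x + 1"], "outputs": []}
]}
""")
print(ran_top_to_bottom(example))
```

To check a real Notebook, load it with `json.load` from your own .ipynb file and pass the result to `ran_top_to_bottom`. Note that an empty code cell is typically left unnumbered, so this sketch would report False for a Notebook that contains one.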
When your Notebook is clean of unnecessary comments and is well organized from top to bottom, push your Notebook and only your Notebook (not the .ipynb_checkpoints directory or your data) to GitHub and open a pull request to your master branch.
Submit the link to that pull request to Canvas.
When you’re done, merge your imperfect-data branch into master.