Data Munging
Real-world data is rarely clean or well structured. A data scientist must know how and when to clean their data, and what it even means for data to be clean. We’ll get into that later when we talk about validating your data and accounting for bad rows.
Pandas
To even begin to clean your data well, it helps to have it well organized.
Pandas is a package that’s great for keeping your data organized: it combines fast, NumPy-like math with Excel-like access to your data in visible columns and rows.
It accomplishes this through a new data structure called the DataFrame.
DataFrames are fairly easy to build from a standard Python dict.
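As a quick taste of the NumPy-like math mentioned above, here is a minimal sketch. The names and numbers are hypothetical and chosen only for illustration; the point is that arithmetic and aggregations apply to whole columns at once, with no explicit loop.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
accounts = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Assets": [1000, 2500, 400],
    "Debts": [200, 2500, 900],
})

# Arithmetic on whole columns is applied element by element, as with NumPy arrays.
accounts["Net Worth"] = accounts["Assets"] - accounts["Debts"]

# Aggregations such as mean() operate on an entire column at once.
average_net_worth = accounts["Net Worth"].mean()

print(accounts)
print(average_net_worth)
```

The new "Net Worth" column lines up row by row with the columns it was computed from, which is the spreadsheet-like part of the bargain.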
Pandas is not part of the Python standard library, so it was included in the data_science_requirements.txt file that was downloaded yesterday.
If you want to install it yourself in other environments, just run pip install pandas.
In [1]: import pandas as pd
In [2]: from datetime import datetime
In [3]: fmt = "%b %d, %Y"
In [4]: finances = {
   ...:     "Name": ["Pablo", "Marcel", "Lisa", "Joanne"],
   ...:     "Assets": [120000, 80000, 110000, 230000],
   ...:     "Debts": [90000, 80000, 30000, 50000],
   ...:     "Updated": [
   ...:         datetime.strptime("Jun 10, 2011", fmt),
   ...:         datetime.strptime("Dec 30, 2005", fmt),
   ...:         datetime.strptime("May 4, 2000", fmt),
   ...:         datetime.strptime("Feb 16, 2007", fmt),
   ...:     ],
   ...:     "Total Rating": [3.5, 2.5, 4.0, 5.0]
   ...: }
   ...:
In [5]: finances_df = pd.DataFrame(finances)
In [6]: print(finances_df)
   Assets  Debts    Name  Total Rating    Updated
0  120000  90000   Pablo           3.5 2011-06-10
1   80000  80000  Marcel           2.5 2005-12-30
2  110000  30000    Lisa           4.0 2000-05-04
3  230000  50000  Joanne           5.0 2007-02-16
When creating a DataFrame from a dictionary, Pandas takes the keys from the dictionary and turns them into column names.
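The key-to-column mapping can be checked directly. The sketch below uses a small hypothetical dictionary (the names and values are assumptions made for illustration) and inspects the resulting column and row labels. The order in which columns appear can depend on the pandas version, so the last step shows one way to pin it with the constructor's `columns` argument.

```python
import pandas as pd

# Hypothetical two-column dictionary; each key maps to a list of equal length.
scores = {
    "Player": ["Kim", "Lee"],
    "Points": [12, 7],
}

scores_df = pd.DataFrame(scores)

# The dictionary keys become the column labels.
column_names = list(scores_df.columns)

# With no index supplied, rows are labeled 0, 1, 2, ...
row_labels = list(scores_df.index)

# The optional `columns` argument selects and orders the columns explicitly.
reordered_df = pd.DataFrame(scores, columns=["Points", "Player"])

print(column_names)
print(row_labels)
print(list(reordered_df.columns))
```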
You can then access the columns in a dict-like fashion with bracket notation, or in JavaScript-like dot notation.
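One caveat is worth sketching: dot notation only works when the column name is a valid Python identifier, so a name containing a space has to go through brackets. The DataFrame below is a hypothetical, cut-down stand-in built just to show the difference.

```python
import pandas as pd

# Hypothetical data; one column name contains a space on purpose.
ratings = pd.DataFrame({
    "Debts": [90000, 80000],
    "Total Rating": [3.5, 2.5],
})

# Both notations return the same column when the name is a valid identifier.
same_column = ratings["Debts"].equals(ratings.Debts)

# Names with spaces are reachable only through brackets;
# `ratings.Total Rating` would be a syntax error.
total_rating = ratings["Total Rating"]

# Looking up a name that is not a column through dot notation raises AttributeError.
try:
    ratings.Total_Rating
    dot_lookup_failed = False
except AttributeError:
    dot_lookup_failed = True

print(same_column, dot_lookup_failed)
```

Bracket notation is the safer default in scripts, since it works for every column name and cannot be confused with a DataFrame method or attribute.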
In [7]: print(finances_df["Total Rating"])
0    3.5
1    2.5
2    4.0
3    5.0
Name: Total Rating, dtype: float64
In [8]: print(finances_df.Debts)