Introduction to PandasΒΆ
Date: April 4th, 2016
Class: Code 401: Software Development with Python
Pandas is a great tool for reading, storing, and analyzing data. Let’s walk through it.
In [1]:
import pandas as pd
It’s simple to read data into pandas if you know what the format of your dataset looks like. If it’s comma-separated (csv), then you can use the pandas read_csv module with its default settings. The name of the data file is “titanic_data.csv”. Change the path to the data depending on where it’s saved on your machine.
In [2]:
infile = "../downloads/titanic_data.csv"
your_data = pd.read_csv(infile)
Just like in the command line, you can use “head” to look at the first few rows of data.
In [3]:
your_data.head()
Out[3]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Pandas preserves the column names in the data file to go along with
every column. read_csv can also load files delimited in other ways by
setting the sep
parameter. For tab-separated data, it’s
sep='\t'
. For pipes, it’s sep='|'
. And so on.
Each row in this data set corresponds to a passenger, and each column contains details for that passenger (Sex, Age, etc.). The object that the data is read into is called a DataFrame. The column furthest to the left is the index of each row in this DataFrame.
One of the great things about DataFrames is that you can return data from the DataFrame by the name of the column. So, if I wanted to look specifically at the genders of all the passengers I can access it like I would in a dictionary.
In [4]:
your_data["Sex"][:10]
Out[4]:
0 male
1 female
2 female
3 female
4 male
5 male
6 male
7 male
8 female
9 female
Name: Sex, dtype: object
You can do the same thing using dot notation. The difference is that the dictionary notation allows for column names with spaces and dashes. Dot notation is best for simple column names, and in the case of this data set works here.
In [5]:
your_data.Age[:10]
Out[5]:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 NaN
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64
You can retrieve multiple columns at once using something similar to dictionary notation, though instead of providing keys, you provide a list of keys.
In [6]:
your_data[["Sex", "Age"]][:10]
Out[6]:
Sex | Age | |
---|---|---|
0 | male | 22.0 |
1 | female | 38.0 |
2 | female | 26.0 |
3 | female | 35.0 |
4 | male | 35.0 |
5 | male | NaN |
6 | male | 54.0 |
7 | male | 2.0 |
8 | female | 27.0 |
9 | female | 14.0 |
As you’ve seen thus far, you can slice DataFrames like you can slice lists. You can also slice the DataFrame using boolean criteria. For example, we can look at data for just male passengers on the Titanic.
In [7]:
your_data[your_data.Sex == "male"][:10]
Out[7]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S |
16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S |
20 | 21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35.0 | 0 | 0 | 239865 | 26.0000 | NaN | S |
Notice how only the indices that you’re interested remain. Similarly, you can filter on numeric criteria. Let’s get all passengers older than 22
In [8]:
your_data[your_data.Age > 22][:10]
Out[8]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S |
15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0000 | NaN | S |
18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.0 | 1 | 0 | 345763 | 18.0000 | NaN | S |
I can combine multiple criteria together with boolean operators
In [9]:
your_data[(your_data.Age > 22) & (your_data.Sex == "male")][:10]
Out[9]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S |
20 | 21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35.0 | 0 | 0 | 239865 | 26.0000 | NaN | S |
21 | 22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34.0 | 0 | 0 | 248698 | 13.0000 | D56 | S |
23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5000 | A6 | S |
30 | 31 | 0 | 1 | Uruchurtu, Don. Manuel E | male | 40.0 | 0 | 0 | PC 17601 | 27.7208 | NaN | C |
33 | 34 | 0 | 2 | Wheadon, Mr. Edward H | male | 66.0 | 0 | 0 | C.A. 24579 | 10.5000 | NaN | S |
34 | 35 | 0 | 1 | Meyer, Mr. Edgar Joseph | male | 28.0 | 1 | 0 | PC 17604 | 82.1708 | NaN | C |
35 | 36 | 0 | 1 | Holverson, Mr. Alexander Oskar | male | 42.0 | 1 | 0 | 113789 | 52.0000 | NaN | S |
Awesome. Note that any sliced or filtered DataFrame is also a DataFrame. So all of the same methods that were with the parent are now with the sliced/filtered child.
Now that we can access the data that we want, we can do stuff to it. Let’s find the total Fare spent on tickets for the Titanic.
In [10]:
import numpy as np
np.sum(your_data.Fare)
Out[10]:
28693.949299999967
What’s the Sex of the passenger that paid the most?
In [11]:
your_data[your_data.Age == np.max(your_data.Age)].Sex
Out[11]:
630 male
Name: Sex, dtype: object
There’s plenty more that a DataFrame can do. Dig into the documentation like so (this isn’t Pandas-specific)
In [12]:
? your_data
If you haven’t run the above cell yet, it pops up a window at the bottom of your screen containing all the documentation that you can also access from the .doc method. Dive in, and Slack me your questions.