Bring It Into Python¶
Now that we have some standards applied to our data we can read them all into
Python!
For this we will be working in our Jupyter Notebook (docs), even if these notes look like
they’re from the command line.
Our workhorse here is going to be the Pandas package (docs).
Pandas is great for data analysis, providing analysts with fast aggregate statistics and a unique data container called the DataFrame.
The Pandas read_excel function will allow us to read Excel files into these DataFrames with very little configuration.
Before we dive into using the function, let’s look at some of the documentation.
In [1]: import pandas as pd
In [2]: pd.read_excel ?
Signature: pd.read_excel(io, sheetname=0, header=0, skiprows=None, skip_footer=0, index_col=None, names=None, parse_cols=None, parse_dates=False, date_parser=None, na_values=None, thousands=None, convert_float=True, has_index_names=None, converters=None, engine=None, squeeze=False, **kwds)
Parameters
----------
io : string, path object (pathlib.Path or py._path.local.LocalPath),
file-like object, pandas ExcelFile, or xlrd workbook.
The string could be a URL. Valid URL schemes include http, ftp, s3,
and file. For file URLs, a host is expected. For instance, a local
file could be file://localhost/path/to/workbook.xlsx
sheetname : string, int, mixed list of strings/ints, or None, default 0
Strings are used for sheet names, Integers are used in zero-indexed
sheet positions.
Lists of strings/integers are used to request multiple sheets.
Specify None to get all sheets.
str|int -> DataFrame is returned.
list|None -> Dict of DataFrames is returned, with keys representing
sheets.
Available Cases
* Defaults to 0 -> 1st sheet as a DataFrame
* 1 -> 2nd sheet as a DataFrame
* "Sheet1" -> 1st sheet as a DataFrame
* [0,1,"Sheet5"] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames
* None -> All sheets as a dictionary of DataFrames
Looks like we can read in a spreadsheet fairly simply. Our go-to will be something along the lines of
the_data = pd.read_excel("path/to/file", sheetname="Sheet We Want", skiprows=4)
the_data will then be a Pandas DataFrame, with column names dictated by the spreadsheet.
This will be a little messy due to some merged rows.
Let’s look:
In [3]: the_data = pd.read_excel("data/AppendixAState2008.xls", sheetname="State", skiprows=4)
In [4]: print(the_data.columns)
Index([ 'Unnamed: 0', 7, 8,
9, 10, 11,
12, '7.1', '8.1',
'9.1', '10.1', '11.1',
'12.1', '7.2', '8.2',
'9.2', '10.2', '11.2',
'12.2', 'Number', 'Rate',
'Start of Grade 10', 'Start of Grade 11', 'Start of Grade 12',
'End of Grade 12', 'Unnamed: 25', 'On-time Grads',
'Size of Cohort****', 'Late Grads', 'Total Grads',
'Unnamed: 30', 'Unnamed: 31', 'Unnamed: 32',
'Unnamed: 33', 'Annual Dropout'],
dtype='object')
In [5]: the_data = pd.read_excel("data/AppendixAState2012.xlsx", sheetname=0, skiprows=4)
In [6]: print(the_data.columns)
Index(['Student Category', 'Beginning Grade 9 Students', 'Transferred In',
'Year 1', 'Year 2', 'Year 3', 'Year 4', 'Transferred Out',
'Adjusted Cohort', 'Graduates', 'Continuing',
'Adjusted 4-Year Cohort Graduation Rate', 'Cohort dropout rate',
'Continuing Rate'],
dtype='object')
We’ll clean all this up in a bit.
Harvesting Data into Python¶
There’s a big break in structure between 2010 and 2011. Prior to 2011, all our data is in a sheet called “State”. From 2011 onward, all our data is in the first sheet, at index 0.
Let’s write a function for reading data no matter what, and place the logic for differentiating between the years within that function.
In [7]: def read_sheet(filename):
            year = int(filename.split('.x')[0][-4:])
            if year < 2011:
                data = pd.read_excel(filename, sheetname="State", skiprows=4)
            else:
                data = pd.read_excel(filename, sheetname=0, skiprows=4)
            return data
filename here comes in as a string, which we can split on the extension.
The first item produced by that split is everything before the extension, and in every case the last four characters of that piece are the year of the data.
We turn that into an integer and we have the year!
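Here’s a quick sketch of that parsing, using one of our actual filenames (the intermediate variable is just for illustration):

filename = "data/AppendixAState2008.xls"
stem = filename.split('.x')[0]   # "data/AppendixAState2008"
year = int(stem[-4:])            # 2008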
Another large break is that prior to 2011 there were dropout statistics for grades 7 through 12, whereas from 2011 on only grades 9–12 are tracked. The pre-2011 data gets read in with column names “7.1”, “8.1”, ..., “12.1”, which are opaque, so let’s modify both sets of data to one common standard.
Data In Your Own Image¶
Changing Column Names¶
Pandas DataFrames have a rename()
method that allows us to change column and row names
as we see fit.
In [8]: the_data.rename ?
Signature: the_data.rename(index=None, columns=None, **kwargs)
Docstring:
Alter axes input function or functions. Function / dict values must be
unique (1-to-1). Labels not contained in a dict / Series will be left
as-is. Alternatively, change ``Series.name`` with a scalar
value (Series only).
Parameters
----------
index, columns : scalar, list-like, dict-like or function, optional
Scalar or list-like will alter the ``Series.name`` attribute,
and raise on DataFrame or Panel.
dict-like or functions are transformations to apply to
that axis' values
copy : boolean, default True
Also copy underlying data
inplace : boolean, default False
Whether to return a new DataFrame. If True then value of copy is
ignored.
...
What we must do is provide it with a dictionary, where the keys are the names of the columns we want to change and the values are the names that we want to change them to.
We could just include the logic for handling this in our existing function. However, even though we’re doing analysis work we still want to be good programmers. Functions should do one thing and one thing only. Let’s create a function that will make this change for us.
If the year is earlier than 2011, we’ll target “9.1”, ..., “12.1”. If after, we’ll go for “Year 1”, ..., “Year 4”. We’ll change both sets to “Dropouts Grade 9”, ..., “Dropouts Grade 12”
In [9]: def alter_columns(dataframe):
            columns_dict = {
                "9.1": "Dropouts Grade 9",
                "10.1": "Dropouts Grade 10",
                "11.1": "Dropouts Grade 11",
                "12.1": "Dropouts Grade 12",
                "Year 1": "Dropouts Grade 9",
                "Year 2": "Dropouts Grade 10",
                "Year 3": "Dropouts Grade 11",
                "Year 4": "Dropouts Grade 12"
            }
            return dataframe.rename(columns=columns_dict)
What’s great is that a column will only be renamed if it’s present in the DataFrame; otherwise, that key is simply ignored. This means we can pile as many column renames as we want into one dict. We can use this to handle the other column changes we’ll need to make.
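As a quick sanity check of that behavior on a hypothetical toy frame:

toy = pd.DataFrame({"A": [1], "B": [2]})
toy.rename(columns={"A": "Alpha", "Z": "Zeta"})   # only "A" is renamed; "Z" isn't a column, so it's skipped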
No matter the year, the fact that we’re looking at different demographics is always masked by the column name we get on read.
Whether it’s “Unnamed: 0” or “Student Category”, they’re both terrible. Let’s call it “Demographic”!
Prior to 2011 we’ve got “Total Grads”, but after we’ve got “Graduates”, which is pretty nondescript.
We have one issue in 2009 where the column is called “Unnamed: 29” due to issues with merged rows.
Let’s change them all to “Total Graduates”.
Finally, we’re going to need the total sizes of each cohort.
Before 2011 that column is called “Size of Cohort****”, and from 2011 on it’s called “Adjusted Cohort”.
We have another merged-row anomaly in 2009, where the column is named “Unnamed: 27”.
Let’s standardize these as “Cohort Size”.
Let’s put all of these together in our columns_dict and have one uniform naming scheme for the columns that we care about.
In [10]: def alter_columns(dataframe):
             columns_dict = {
                 "Unnamed: 0": "Demographic",
                 "Student Category": "Demographic",
                 "9.1": "Dropouts Grade 9",
                 "10.1": "Dropouts Grade 10",
                 "11.1": "Dropouts Grade 11",
                 "12.1": "Dropouts Grade 12",
                 "Year 1": "Dropouts Grade 9",
                 "Year 2": "Dropouts Grade 10",
                 "Year 3": "Dropouts Grade 11",
                 "Year 4": "Dropouts Grade 12",
                 "Total Grads": "Total Graduates",
                 "Graduates": "Total Graduates",
                 "Unnamed: 29": "Total Graduates",
                 "Size of Cohort****": "Cohort Size",
                 "Adjusted Cohort": "Cohort Size",
                 "Unnamed: 27": "Cohort Size"
             }
             return dataframe.rename(columns=columns_dict)
Now just add it to our “read_sheet” function so that on read, the columns get altered.
In [11]: def read_sheet(filename):
             year = int(filename.split('.x')[0][-4:])
             if year < 2011:
                 data = pd.read_excel(filename, sheetname="State", skiprows=4)
             else:
                 data = pd.read_excel(filename, sheetname=0, skiprows=4)
             data = alter_columns(data)
             return data
Changing Indices¶
Pandas lets us access data not only by column, but by row!
It helps, however, to have rows with meaningful names with which to do this.
In Excel, every row below the header row started out with a demographic name.
When we read the sheet into Python, Pandas created an auto-incremented integer index for us.
We can replace that completely using the set_index method.
In [12]: data.set_index?
Signature: data.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
Docstring:
Set the DataFrame index (row labels) using one or more existing
columns. By default yields a new object.
Parameters
----------
keys : column label or list of column labels / arrays
...
In [9]: print(data) # before the change
Demographic Beginning Grade 9 Students Transferred In ...
0 All Students 79977 12382 ...
1 American Indian 1278 211 ...
2 Asian 5772 1053 ...
3 Pacific Islanders 698 229 ...
4 Black 3622 919 ...
5 Hispanic 14461 2760 ...
6 White 49501 6563 ...
7 Two or More Races 4644 645 ...
8 Special Ed 9870 1093 ...
9 Bilingual 3873 1574 ...
10 Low Income 41430 5772 ...
11 Title I Migrant 2150 417 ...
12 504 4242 231 ...
13 Homeless 3877 1069 ...
14 Foster Care 705 126 ...
15 Female 39142 6070 ...
16 Male 40835 6312 ...
In [13]: data = data.set_index("Demographic")
In [14]: print(data)
Beginning Grade 9 Students Transferred In ...
Demographic
All Students 79977 12382 ...
American Indian 1278 211 ...
Asian 5772 1053 ...
Pacific Islanders 698 229 ...
Black 3622 919 ...
Hispanic 14461 2760 ...
White 49501 6563 ...
Two or More Races 4644 645 ...
Special Ed 9870 1093 ...
Bilingual 3873 1574 ...
Low Income 41430 5772 ...
Title I Migrant 2150 417 ...
504 4242 231 ...
Homeless 3877 1069 ...
Foster Care 705 126 ...
Female 39142 6070 ...
Male 40835 6312 ...
Because this is literally a one-line change, we can just slap it onto the bottom of our function.
In [15]: def read_sheet(filename):
             year = int(filename.split('.x')[0][-4:])
             if year < 2011:
                 data = pd.read_excel(filename, sheetname="State", skiprows=4)
             else:
                 data = pd.read_excel(filename, sheetname=0, skiprows=4)
             data = alter_columns(data).set_index("Demographic")
             return data
Changing Row Names¶
We do all of this renaming of columns and reassigning of indices for a reason. Pandas lets us reference data by those names! Behold!
In [16]: print(data[["Total Graduates", "Dropouts Grade 12"]])
Total Graduates Dropouts Grade 12
Demographic
All Students 60552 5786
American Indian 760 176
Asian/Pacific Islanders 5438 338
Asian 5027 259
Pacific Islanders 411 79
Black 2487 400
Hispanic 8415 1317
White 40661 3256
Two or More Races 2790 297
Special Ed 4914 966
Bilingual 2428 554
Low Income 24338 4176
Title I Migrant 1370 228
504 2127 220
Female 31286 2397
Male 29266 3389
In [17]: print(data.loc["Female"])
Beginning Grade 9 Students 38928.000000
Transferred In 7060.000000
Dropouts Grade 9 356.000000
Dropouts Grade 10 625.000000
Dropouts Grade 11 1188.000000
Dropouts Grade 12 2397.000000
Transferred Out 7226.000000
Cohort Size 38762.000000
Total Graduates 31286.000000
Continuing 2910.000000
Adjusted 4-Year Cohort Graduation Rate 80.713070
Cohort dropout rate 11.779578
Continuing Rate 7.507353
Name: Female, dtype: float64
Ain’t it cool? Unfortunately, as we’ve seen in the spreadsheets, each year uses slightly different names for the same demographics. As we did with the columns, let’s write a function to set up ONE standard.
In [18]: def alter_rows(dataframe):
             rows_dict = {
                 "Asian/Pac Islander": "Asian/Pacific Islander",
                 "Asian/Pacific Islanders": "Asian/Pacific Islander",
                 "Asian Pac Islander": "Asian/Pacific Islander",
                 "Pac Islander": "Pacific Islander",
                 "Pacific Islanders": "Pacific Islander",
                 "American Indians": "First Nations",
                 "American Indian": "First Nations",
                 "Amer.Indian": "First Nations",
             }
             return dataframe.rename(index=rows_dict)
Let’s stick that into the bottom of our read_sheet
function as well.
In [19]: def read_sheet(filename):
             year = int(filename.split('.x')[0][-4:])
             if year < 2011:
                 data = pd.read_excel(filename, sheetname="State", skiprows=4)
             else:
                 data = pd.read_excel(filename, sheetname=0, skiprows=4)
             data = alter_columns(data).set_index("Demographic")
             data = alter_rows(data)
             return data
Reducing the Data¶
One must always keep the mission in mind when diving into data. What are we doing here? We’re aggregating educational data so that we can understand how graduation and dropout rates have changed over time. We also want to calculate this rate ourselves, so let’s only retain the data we care about.
What this means for us is that we should isolate the data we care about by naming exactly the rows and columns we want. In the previous step we decided that the demographics we’re interested in were these:
- All Students
- American Indian
- Asian
- Pacific Islanders
- Black
- Hispanic
- White
And the statistics that we’re interested in are:
- Cohort Size
- Total Graduates
- Total Dropouts (haven’t made this yet, composed of Dropouts Grades 9-12)
So we can write a function that reduces every data set passed in down to only these rows and columns.
In [20]: def reduce_data(dataframe):
             wanted_rows = ["All Students", "First Nations",
                            "Asian/Pacific Islander", "Asian",
                            "Pacific Islander", "Black", "Hispanic",
                            "White"]
             wanted_columns = ["Dropouts Grade 9", "Dropouts Grade 10",
                               "Dropouts Grade 11", "Dropouts Grade 12",
                               "Cohort Size", "Total Graduates"]
             return dataframe[wanted_columns].loc[wanted_rows]
Generating Estimates from Data¶
Estimating Dropouts¶
We’ll need to make two estimations on our data sets. One of these is just a direct summation, and the other requires a more significant assumption.
The first will be our Total Dropouts.
For this we’ll need to add together four columns (Dropouts Grade 9, 10, 11, and 12) and then use that summation to create an entirely new column.
If we just look at the four columns of dropouts we see the following:
In [21]: print(data[["Dropouts Grade 9", "Dropouts Grade 10", "Dropouts Grade 11", "Dropouts Grade 12"]])
Dropouts Grade 9 Dropouts Grade 10 \
Demographic
All Students 767 1363
First Nations 31 65
Asian/Pacific Islander 60 100
Asian 51 86
Pacific Islander 9 14
Black 63 89
Hispanic 216 299
White 361 763
Dropouts Grade 11 Dropouts Grade 12
Demographic
All Students 2730 5786
First Nations 87 176
Asian/Pacific Islander 130 338
Asian 100 259
Pacific Islander 30 79
Black 166 400
Hispanic 662 1317
White 1580 3256
We want to do a summation across each row.
Simple, right?
If we try to use Python’s built-in sum function on these four columns, it’ll throw a TypeError.
Instead, we’ll use the sum() method that’s built into the DataFrame.
It’s written to operate across the DataFrame it’s been called on.
In case you didn’t realize it yet, subsets of DataFrames are themselves DataFrames!
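As an aside, here’s a quick sketch of why the built-in sum falls over: iterating over a DataFrame yields its column labels (strings), which can’t be added to a number.

sum(data[["Dropouts Grade 9", "Dropouts Grade 10", "Dropouts Grade 11", "Dropouts Grade 12"]])
# TypeError: unsupported operand type(s) for +: 'int' and 'str'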
In [22]: data[["Dropouts Grade 9", "Dropouts Grade 10", "Dropouts Grade 11", "Dropouts Grade 12"]].sum()
Dropouts Grade 9 2325
Dropouts Grade 10 4142
Dropouts Grade 11 8215
Dropouts Grade 12 17397
dtype: int64
Why didn’t this give us what we wanted?! Pandas assumes first that if you’re doing a sum, you want to total up each column separately. However, we want a summation along each row, so that every demographic ends up with its own statistic.
To do that, we must specify an argument in the sum() method called axis.
By default, axis is set to 0, which sums down each column and gives one total per column.
If you think about your DataFrame as an actual table, a table has two axes.
The other axis, axis=1, sums along each row and gives one total per row.
If your data structure had more dimensions, you would be able to sum across even more axes.
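To make the two directions concrete, here’s a tiny hypothetical frame:

toy = pd.DataFrame({"a": [1, 2], "b": [10, 20]}, index=["x", "y"])
toy.sum(axis=0)   # a: 3,  b: 30  -- one total per column
toy.sum(axis=1)   # x: 11, y: 22  -- one total per row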
In [23]: data[["Dropouts Grade 9", "Dropouts Grade 10", "Dropouts Grade 11", "Dropouts Grade 12"]].sum(axis=1)
Demographic
All Students 10646
First Nations 359
Asian/Pacific Islander 628
Asian 496
Pacific Islander 132
Black 718
Hispanic 2494
White 5960
dtype: int64
We’ve got the totals we wanted, now let’s add them to our DataFrame.
In [24]: data["Total Dropouts"] = data[["Dropouts Grade 9", "Dropouts Grade 10", "Dropouts Grade 11", "Dropouts Grade 12"]].sum(axis=1)
In [25]: data.columns
Index(['Dropouts Grade 9', 'Dropouts Grade 10', 'Dropouts Grade 11',
'Dropouts Grade 12', 'Cohort Size', 'Total Graduates',
'Total Dropouts'],
dtype='object')
And just like that we’ve created a new column on our DataFrame, complete with the information we want housed within.
In fact, now that we have that summation we probably don’t need the other 4 columns of dropouts.
Let’s...drop those out!
We can do that with the .drop()
method in Pandas.
In [26]: data.drop(["Dropouts Grade 9", "Dropouts Grade 10", "Dropouts Grade 11", "Dropouts Grade 12"], axis=1)
Cohort Size Total Graduates Total Dropouts
Demographic
All Students 78458 60552 10646
First Nations 1339 760 359
Asian/Pacific Islander 6591 5438 628
Asian 5954 5027 496
Pacific Islander 637 411 132
Black 3706 2487 718
Hispanic 12624 8415 2494
White 50598 40661 5960
New Row 6591 5438 628
We choose axis=1 because we’re dropping the same columns across all rows.
Note that this operation does not mutate the DataFrame.
If we wanted to mutate the DataFrame, we’d run that same method with the flag inplace=True.
We’re not doing that here, so we’ll just reassign our variable name data to the DataFrame output by .drop().
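A small sketch of both options (using an illustrative dropout_columns list, not part of our pipeline):

dropout_columns = ["Dropouts Grade 9", "Dropouts Grade 10",
                   "Dropouts Grade 11", "Dropouts Grade 12"]

# reassign to keep the result of the non-mutating drop
data = data.drop(dropout_columns, axis=1)

# or, equivalently, mutate the DataFrame in place
# data.drop(dropout_columns, axis=1, inplace=True)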
Let’s roll all this into a function and then add that function to read_sheet.
In [27]: def add_total_dropouts(dataframe):
             all_dropouts = ["Dropouts Grade 9", "Dropouts Grade 10", "Dropouts Grade 11", "Dropouts Grade 12"]
             dataframe["Total Dropouts"] = dataframe[all_dropouts].sum(axis=1)
             dataframe = dataframe.drop(all_dropouts, axis=1)
             return dataframe
In [28]: def read_sheet(filename):
             year = int(filename.split('.x')[0][-4:])
             if year < 2011:
                 data = pd.read_excel(filename, sheetname="State", skiprows=4)
             else:
                 data = pd.read_excel(filename, sheetname=0, skiprows=4)
             data = alter_columns(data).set_index("Demographic")
             data = alter_rows(data)
             data = reduce_data(data)
             data = add_total_dropouts(data)
             return data
A Snag in New Data¶
One thing we’ll find when reading in the data from 2014 and 2015 is that the superintendent of schools became more racially sensitive and stopped lumping Asians and Pacific Islanders together as one demographic.
Great for respect and recognition, but bad for us.
When we try our read_sheet
function on that spreadsheet, we find the following:
In [29]: data = read_sheet("../data/AppendixAState2014.xlsx")
In [30]: print(data.loc["Asian/Pacific Islander"])
Cohort Size NaN
Total Graduates NaN
Total Dropouts NaN
Name: Asian/Pacific Islander, dtype: float64
We told Pandas that we definitely wanted a row named “Asian/Pacific Islander”, but the DataFrame didn’t yet have it. Pandas assumed we wanted a row with that name, and simply filled that row with...nothing.
We need that aggregate group in order to decouple Asians from Pacific Islanders in 2007 in our next section. So we need to add that data in ourselves.
Just like how we can create new columns out of existing columns, we can create new rows out of existing rows. Because we’re going with named rows, we can create a new row that follows our standard and is simply the sum of two rows. Let’s test this on a data set that actually does have the data we want, just to make sure we’re not BSing ourselves.
In [31]: data = read_sheet("../data/AppendixAState2012.xlsx")
In [32]: data.loc["New Row"] = data.loc["Asian"] + data.loc["Pacific Islander"]
In [33]: print(data.loc[["Asian", "Pacific Islander", "Asian/Pacific Islander", "New Row"]])
Cohort Size Total Graduates Total Dropouts
Demographic
Asian 5954 5027 496
Pacific Islander 637 411 132
Asian/Pacific Islander 6591 5438 628
New Row 6591 5438 628
Beautiful.
Let’s get a little hacky and turn this into a function, then include that function in our read_sheet
function.
In [34]: def hacky_row_add(dataframe):
             dataframe.loc["Asian/Pacific Islander"] = dataframe.loc["Asian"] + dataframe.loc["Pacific Islander"]
             return dataframe
In [35]: def read_sheet(filename):
             year = int(filename.split('.x')[0][-4:])
             if year < 2011:
                 data = pd.read_excel(filename, sheetname="State", skiprows=4)
             else:
                 data = pd.read_excel(filename, sheetname=0, skiprows=4)
             data = alter_columns(data).set_index("Demographic")
             data = alter_rows(data)
             data = reduce_data(data)
             data = add_total_dropouts(data)
             if year >= 2014:
                 data = hacky_row_add(data)
             return data
A Breakpoint¶
We’ve now been able to create a nice pipeline for reading in data and organizing it for analysis. We’ve set up standards across all of our DataFrames. Now let’s do a little manipulation and data aggregation.