Bring It Into Python

Now that we have some standards applied to our data we can read them all into Python! For this we will be working in our Jupyter Notebook (docs), even if these notes look like they’re from the command line. Our workhorse here is going to be the Pandas package (docs). Pandas is great for data analysis, providing analysts with fast aggregate statistics and a unique data container called the DataFrame. The Pandas read_excel function will allow us to...read Excel files into these DataFrames with very little configuration.
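
If the DataFrame is new to you, here is a minimal sketch of the container (toy numbers, borrowed from outputs later in these notes):

import pandas as pd

# A toy DataFrame built from a dict: keys become column names, and each
# row gets an auto-incremented integer index by default.
toy = pd.DataFrame({
    "Cohort Size": [78458, 1339],
    "Total Graduates": [60552, 760],
})
print(toy)
#    Cohort Size  Total Graduates
# 0        78458            60552
# 1         1339              760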

Before we dive into using the function, let’s look at some of the documentation.

In [1]: import pandas as pd

In [2]: pd.read_excel?
Signature: pd.read_excel(io, sheetname=0, header=0, skiprows=None, skip_footer=0, index_col=None, names=None, parse_cols=None, parse_dates=False, date_parser=None, na_values=None, thousands=None, convert_float=True, has_index_names=None, converters=None, engine=None, squeeze=False, **kwds)

Parameters
----------
io : string, path object (pathlib.Path or py._path.local.LocalPath),
    file-like object, pandas ExcelFile, or xlrd workbook.
    The string could be a URL. Valid URL schemes include http, ftp, s3,
    and file. For file URLs, a host is expected. For instance, a local
    file could be file://localhost/path/to/workbook.xlsx

sheetname : string, int, mixed list of strings/ints, or None, default 0

    Strings are used for sheet names, Integers are used in zero-indexed
    sheet positions.

    Lists of strings/integers are used to request multiple sheets.

    Specify None to get all sheets.

    str|int -> DataFrame is returned.
    list|None -> Dict of DataFrames is returned, with keys representing
    sheets.

    Available Cases

    * Defaults to 0 -> 1st sheet as a DataFrame
    * 1 -> 2nd sheet as a DataFrame
    * "Sheet1" -> 1st sheet as a DataFrame
    * [0,1,"Sheet5"] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames
    * None -> All sheets as a dictionary of DataFrames

Looks like we can read a spreadsheet in a fairly simple way. Our go-to will be something along the lines of

the_data = pd.read_excel("path/to/file", sheetname="Sheet We Want", skiprows=4)

the_data will then be a Pandas DataFrame, with column names dictated by the spreadsheet. This will be a little messy due to some merged rows. Let’s look:

In [3]: the_data = pd.read_excel("data/AppendixAState2008.xls", sheetname="State", skiprows=4)

In [4]: print(the_data.columns)
Index([        'Unnamed: 0',                    7,                    8,
                          9,                   10,                   11,
                         12,                '7.1',                '8.1',
                      '9.1',               '10.1',               '11.1',
                     '12.1',                '7.2',                '8.2',
                      '9.2',               '10.2',               '11.2',
                     '12.2',             'Number',               'Rate',
        'Start of Grade 10',  'Start of Grade 11',  'Start of Grade 12',
          'End of Grade 12',        'Unnamed: 25',      'On-time Grads',
       'Size of Cohort****',         'Late Grads',        'Total Grads',
              'Unnamed: 30',        'Unnamed: 31',        'Unnamed: 32',
              'Unnamed: 33',     'Annual Dropout'],
      dtype='object')

In [5]: the_data = pd.read_excel("data/AppendixAState2012.xlsx", sheetname=0, skiprows=4)

In [6]: print(the_data.columns)
Index(['Student Category', 'Beginning Grade 9 Students', 'Transferred In',
       'Year 1', 'Year 2', 'Year 3', 'Year 4', 'Transferred Out',
       'Adjusted Cohort', 'Graduates', 'Continuing',
       'Adjusted 4-Year Cohort Graduation Rate', 'Cohort dropout rate',
       'Continuing Rate'],
      dtype='object')

We’ll clean all this up in a bit.

Harvesting Data into Python

There’s a big break in structure between 2010 and 2011. Prior to 2011, all our data is in a sheet called “State”. From 2011 on, all our data is in the first sheet, at index 0.

Let’s write a function for reading data no matter what, and place the logic for differentiating between the years within that function.

In [7]: def read_sheet(filename):
            year = int(filename.split('.x')[0][-4:])
            if year < 2011:
                data = pd.read_excel(filename, sheetname="State", skiprows=4)

            else:
                data = pd.read_excel(filename, sheetname=0, skiprows=4)

            return data

filename here comes in as a string, which we can split on the extension; splitting on '.x' catches both .xls and .xlsx. The first item produced by that split is everything before the extension, and its last four characters are always the year of the data. We turn that into an integer and thus we have the year!
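
To make that concrete, here’s the slicing on one of our filenames (a quick sketch you can run in the notebook):

filename = "data/AppendixAState2008.xls"
filename.split('.x')               # ['data/AppendixAState2008', 'ls']
filename.split('.x')[0][-4:]       # '2008'
int(filename.split('.x')[0][-4:])  # 2008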

Another large break is that prior to 2011 there were dropout statistics for grades 7 through 12, whereas from 2011 on only grades 9-12 are tracked. The pre-2011 data gets read in with column names “7.1”, “8.1”, ..., “12.1”, which are opaque, so let’s bring both sets of data to one standard.

Data In Your Own Image

Changing Column Names

Pandas DataFrames have a rename() method that allows us to change column and row names as we see fit.

In [8]: the_data.rename?
Signature: the_data.rename(index=None, columns=None, **kwargs)
Docstring:
Alter axes input function or functions. Function / dict values must be
unique (1-to-1). Labels not contained in a dict / Series will be left
as-is. Alternatively, change ``Series.name`` with a scalar
value (Series only).

Parameters
----------
index, columns : scalar, list-like, dict-like or function, optional
    Scalar or list-like will alter the ``Series.name`` attribute,
    and raise on DataFrame or Panel.
    dict-like or functions are transformations to apply to
    that axis' values
copy : boolean, default True
    Also copy underlying data
inplace : boolean, default False
    Whether to return a new DataFrame. If True then value of copy is
    ignored.
...

What we must do is provide it with a dictionary, where the keys are the names of the columns we want to change and the values are the names that we want to change them to.

We could just include the logic for handling this in our existing function. However, even though we’re doing analysis work we still want to be good programmers. Functions should do one thing and one thing only. Let’s create a function that will make this change for us.

If the year is earlier than 2011, we’ll target “9.1”, ..., “12.1”. If after, we’ll go for “Year 1”, ..., “Year 4”. We’ll change both sets to “Dropouts Grade 9”, ..., “Dropouts Grade 12”.

In [9]: def alter_columns(dataframe):

            columns_dict = {
                "9.1": "Dropouts Grade 9",
                "10.1": "Dropouts Grade 10",
                "11.1": "Dropouts Grade 11",
                "12.1": "Dropouts Grade 12",
                "Year 1": "Dropouts Grade 9",
                "Year 2": "Dropouts Grade 10",
                "Year 3": "Dropouts Grade 11",
                "Year 4": "Dropouts Grade 12"
            }

            return dataframe.rename(columns=columns_dict)

What’s great is that a column will only be renamed if it’s present in the DataFrame; otherwise, that key is simply ignored. This means that we can pile as many column renames as we want into one dict. We can use this to handle the other column changes we’ll need to make.
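
A quick sketch to see that forgiving behavior in action (toy column names):

toy = pd.DataFrame(columns=["Year 1", "Rate"])
toy = toy.rename(columns={"Year 1": "Dropouts Grade 9",
                          "9.1": "Dropouts Grade 9"})
print(toy.columns)
# Index(['Dropouts Grade 9', 'Rate'], dtype='object')
# The absent "9.1" key was silently ignored -- no error raised.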

No matter the year, the column that actually holds our demographics is masked by the name it gets on read. Whether it’s Unnamed: 0 or Student Category, they’re both terrible. Let’s call it Demographic!

Prior to 2011 we’ve got Total Grads, but after that we’ve got Graduates, which is pretty nondescript. We also have one issue in 2009, where the column comes in as Unnamed: 29 thanks to merged rows. Let’s change them all to Total Graduates.

Finally, we’re going to need the total sizes of each cohort. Before 2011 that column is called Size of Cohort**** and then gets called Adjusted Cohort after 2011. We have another merged-row anomaly in 2009, where the column is named Unnamed: 27. Let’s standardize these as Cohort Size.

Let’s put all of these together in our columns_dict and have one uniform naming scheme for the columns that we care about.

In [10]: def alter_columns(dataframe):

            columns_dict = {
                "Unnamed: 0": "Demographic",
                "Student Category": "Demographic",
                "9.1": "Dropouts Grade 9",
                "10.1": "Dropouts Grade 10",
                "11.1": "Dropouts Grade 11",
                "12.1": "Dropouts Grade 12"
                "Year 1": "Dropouts Grade 9",
                "Year 2": "Dropouts Grade 10",
                "Year 3": "Dropouts Grade 11",
                "Year 4": "Dropouts Grade 12",
                "Total Grads": "Total Graduates",
                "Graduates": "Total Graduates",
                "Unnamed 29": "Total Graduates",
                "Size of Cohort****": "Cohort Size",
                "Adjusted Cohort": "Cohort Size",
                "Unnamed: 27": "Cohort Size"
            }

            return dataframe.rename(columns=columns_dict)
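
Applied to the 2012 frame we read in earlier, a quick sanity check might look like this:

the_data = alter_columns(the_data)
print(the_data.columns)
# The renamed columns now include 'Demographic', 'Dropouts Grade 9'
# through 'Dropouts Grade 12', 'Cohort Size', and 'Total Graduates';
# everything else comes through untouched.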

Now just add it to our read_sheet function, so that the columns get altered on read.

In [11]: def read_sheet(filename):
            year = int(filename.split('.x')[0][-4:])
            if year < 2011:
                data = pd.read_excel(filename, sheetname="State", skiprows=4)

            else:
                data = pd.read_excel(filename, sheetname=0, skiprows=4)

            data = alter_columns(data)
            return data

Changing Indices

Pandas lets us access data not only by column, but by row! It helps, however, to have rows with meaningful names. In Excel, every row below the header row started out with a demographic name. When we read the sheet into Python, Pandas created an auto-incremented integer index for us. We can replace that index with one of our own columns using the set_index method.

In [12]: data.set_index?
Signature: data.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
Docstring:
Set the DataFrame index (row labels) using one or more existing
columns. By default yields a new object.

Parameters
----------
keys : column label or list of column labels / arrays
...

In [9]: print(data) # before the change
          Demographic  Beginning Grade 9 Students  Transferred In   ...
0        All Students                       79977           12382   ...
1     American Indian                        1278             211   ...
2               Asian                        5772            1053   ...
3   Pacific Islanders                         698             229   ...
4               Black                        3622             919   ...
5            Hispanic                       14461            2760   ...
6               White                       49501            6563   ...
7   Two or More Races                        4644             645   ...
8          Special Ed                        9870            1093   ...
9           Bilingual                        3873            1574   ...
10         Low Income                       41430            5772   ...
11    Title I Migrant                        2150             417   ...
12                504                        4242             231   ...
13           Homeless                        3877            1069   ...
14        Foster Care                         705             126   ...
15             Female                       39142            6070   ...
16               Male                       40835            6312   ...

In [13]: data = data.set_index("Demographic")

In [14]: print(data)
                   Beginning Grade 9 Students  Transferred In   ...
Demographic
All Students                            79977           12382   ...
American Indian                          1278             211   ...
Asian                                    5772            1053   ...
Pacific Islanders                         698             229   ...
Black                                    3622             919   ...
Hispanic                                14461            2760   ...
White                                   49501            6563   ...
Two or More Races                        4644             645   ...
Special Ed                               9870            1093   ...
Bilingual                                3873            1574   ...
Low Income                              41430            5772   ...
Title I Migrant                          2150             417   ...
504                                      4242             231   ...
Homeless                                 3877            1069   ...
Foster Care                               705             126   ...
Female                                  39142            6070   ...
Male                                    40835            6312   ...

Because this is literally a one-line change, we can just slap it onto the bottom of our function.

In [15]: def read_sheet(filename):
            year = int(filename.split('.x')[0][-4:])
            if year < 2011:
                data = pd.read_excel(filename, sheetname="State", skiprows=4)

            else:
                data = pd.read_excel(filename, sheetname=0, skiprows=4)

            data = alter_columns(data).set_index("Demographic")
            return data

Changing Row Names

We do all of this renaming of columns and reassigning of indices for a reason. Pandas lets us reference data by those names! Behold!

In [16]: print(data[["Total Graduates", "Dropouts Grade 12"]])
                         Total Graduates  Dropouts Grade 12
Demographic
All Students                       60552               5786
American Indian                      760                176
Asian/Pacific Islanders             5438                338
Asian                               5027                259
Pacific Islanders                    411                 79
Black                               2487                400
Hispanic                            8415               1317
White                              40661               3256
Two or More Races                   2790                297
Special Ed                          4914                966
Bilingual                           2428                554
Low Income                         24338               4176
Title I Migrant                     1370                228
504                                 2127                220
Female                             31286               2397
Male                               29266               3389

In [17]: print(data.loc["Female"])
Beginning Grade 9 Students                38928.000000
Transferred In                             7060.000000
Dropouts Grade 9                            356.000000
Dropouts Grade 10                           625.000000
Dropouts Grade 11                          1188.000000
Dropouts Grade 12                          2397.000000
Transferred Out                            7226.000000
Cohort Size                               38762.000000
Total Graduates                           31286.000000
Continuing                                 2910.000000
Adjusted 4-Year Cohort Graduation Rate       80.713070
Cohort dropout rate                          11.779578
Continuing Rate                               7.507353
Name: Female, dtype: float64

Ain’t it cool? Unfortunately, as we’ve seen in the spreadsheets, each year has slightly different names for the same demographics. As we did with the columns, let’s write a function to set up ONE standard.

In [18]: def alter_rows(dataframe):
            rows_dict = {
                "Asian/Pac Islander": "Asian/Pacific Islander",
                "Asian/Pacific Islanders": "Asian/Pacific Islander",
                "Asian Pac Islander": "Asian/Pacific Islander",
                "Pac Islander": "Pacific Islander",
                "Pacific Islanders": "Pacific Islander",
                "American Indians": "First Nations",
                "American Indian": "First Nations",
                "Amer.Indian": "First Nations",
            }

            return dataframe.rename(index=rows_dict)

Let’s stick that into the bottom of our read_sheet function as well.

In [19]: def read_sheet(filename):
            year = int(filename.split('.x')[0][-4:])
            if year < 2011:
                data = pd.read_excel(filename, sheetname="State", skiprows=4)

            else:
                data = pd.read_excel(filename, sheetname=0, skiprows=4)

            data = alter_columns(data).set_index("Demographic")
            data = alter_rows(data)
            return data

Reducing the Data

One must always keep the mission in mind when diving into data. What are we doing here? We’re aggregating educational data so that we can understand how graduation and dropout rates have changed over time. We also want to calculate this rate ourselves, so let’s only retain the data we care about.

What this means for us is that we should isolate the data we want by name. In the previous step we decided that the demographics we’re interested in were these:

  • All Students
  • American Indian
  • Asian
  • Pacific Islanders
  • Black
  • Hispanic
  • White

And the statistics that we’re interested in are:

  • Cohort Size
  • Total Graduates
  • Total Dropouts (haven’t made this yet, composed of Dropouts Grades 9-12)

So we can write a function that reduces every data set passed in down to only these rows and columns, using the standardized names we set up earlier (plus the combined Asian/Pacific Islander group, which we’ll need shortly).

In [20]: def reduce_data(dataframe):
            wanted_rows = ["All Students", "First Nations",
                           "Asian/Pacific Islander", "Asian",
                           "Pacific Islander", "Black", "Hispanic",
                           "White"]
            wanted_columns = ["Dropouts Grade 9", "Dropouts Grade 10",
                              "Dropouts Grade 11", "Dropouts Grade 12",
                              "Cohort Size", "Total Graduates"]

            return dataframe[wanted_columns].loc[wanted_rows]
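
As a quick sanity check (assuming data holds one of our year frames after the renames above):

reduced = reduce_data(data)
print(reduced.shape)
# (8, 6) -- 8 demographic rows by 6 statistic columns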

Generating Estimates from Data

Estimating Dropouts

We’ll need to make two estimates from our data sets. One of these is just a direct summation; the other requires a more significant assumption.

The first will be our Total Dropouts. For this we’ll need to add together four columns (Dropouts Grade 9, 10, 11, and 12) and then use that summation to create an entirely new column.

If we just look at the four columns of dropouts we see the following:

In [21]: print(data[["Dropouts Grade 9", "Dropouts Grade 10", "Dropouts Grade 11", "Dropouts Grade 12"]])
                        Dropouts Grade 9  Dropouts Grade 10  \
Demographic
All Students                         767               1363
First Nations                         31                 65
Asian/Pacific Islander                60                100
Asian                                 51                 86
Pacific Islander                       9                 14
Black                                 63                 89
Hispanic                             216                299
White                                361                763

                        Dropouts Grade 11  Dropouts Grade 12
Demographic
All Students                         2730               5786
First Nations                          87                176
Asian/Pacific Islander                130                338
Asian                                 100                259
Pacific Islander                       30                 79
Black                                 166                400
Hispanic                              662               1317
White                                1580               3256

We want to do a summation across rows. Simple, right? If we try to use Python’s built-in sum function on these four columns, it’ll throw a TypeError. Instead, we’ll use the sum() method that’s built into the DataFrame, which is written to operate across the DataFrame it’s been called on. In case you didn’t realize it yet, subsets of DataFrames are themselves DataFrames!
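
To see why the built-in fails: iterating over a DataFrame yields its column names, so sum() starts at 0 and immediately tries to add an integer to a string. A sketch:

dropout_columns = ["Dropouts Grade 9", "Dropouts Grade 10",
                   "Dropouts Grade 11", "Dropouts Grade 12"]
sum(data[dropout_columns])
# TypeError: unsupported operand type(s) for +: 'int' and 'str'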

In [22]: data[["Dropouts Grade 9", "Dropouts Grade 10", "Dropouts Grade 11", "Dropouts Grade 12"]].sum()
Dropouts Grade 9      2325
Dropouts Grade 10     4142
Dropouts Grade 11     8215
Dropouts Grade 12    17397
dtype: int64

Why didn’t this give us what we wanted?! By default, Pandas assumes that if you’re doing a sum, you want one total per column, collapsing all the rows. We, however, want a summation across rows, so that every demographic ends up with its own statistic.

To do that, we must pass the sum() method an argument called axis. By default, axis is set to 0, which sums down each column. If you think about your DataFrame as an actual table, a table has two axes; the other one, axis=1, sums across each row instead. (Higher-dimensional containers, like NumPy arrays, have even more axes you could sum across.)
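
A toy illustration of the two axes:

toy = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
print(toy.sum())        # default axis=0: one total per column -> a 3, b 30
print(toy.sum(axis=1))  # axis=1: one total per row -> 0 11, 1 22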

In [23]: data[["Dropouts Grade 9", "Dropouts Grade 10", "Dropouts Grade 11", "Dropouts Grade 12"]].sum(axis=1)
Demographic
All Students              10646
First Nations               359
Asian/Pacific Islander      628
Asian                       496
Pacific Islander            132
Black                       718
Hispanic                   2494
White                      5960
dtype: int64

We’ve got the totals we wanted, now let’s add them to our DataFrame.

In [24]: data["Total Dropouts"] = data[["Dropouts Grade 9", "Dropouts Grade 10", "Dropouts Grade 11", "Dropouts Grade 12"]].sum(axis=1)

In [25]: data.columns
Index(['Dropouts Grade 9', 'Dropouts Grade 10', 'Dropouts Grade 11',
   'Dropouts Grade 12', 'Cohort Size', 'Total Graduates',
   'Total Dropouts'],
  dtype='object')

And just like that we’ve created a new column on our DataFrame, complete with the information we want housed within.

In fact, now that we have that summation we probably don’t need the other four columns of dropouts. Let’s...drop those out! We can do that with the .drop() method in Pandas.

In [26]: data.drop(["Dropouts Grade 9", "Dropouts Grade 10", "Dropouts Grade 11", "Dropouts Grade 12"], axis=1)
                        Cohort Size  Total Graduates  Total Dropouts
Demographic
All Students                  78458            60552           10646
First Nations                  1339              760             359
Asian/Pacific Islander         6591             5438             628
Asian                          5954             5027             496
Pacific Islander                637              411             132
Black                          3706             2487             718
Hispanic                      12624             8415            2494
White                         50598            40661            5960

We pass axis=1 because this time we’re dropping columns, not rows.

Note that this operation does not mutate the DataFrame. If we wanted to mutate it, we’d call the same method with inplace=True. We’re not doing that here, so we’ll just reassign our variable name data to the DataFrame output by .drop(), as in the quick sketch below. Then let’s roll all this into a function and add that function to read_sheet.
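
Here are the two patterns side by side (a sketch; columns_to_drop is just an illustrative name):

columns_to_drop = ["Dropouts Grade 9", "Dropouts Grade 10",
                   "Dropouts Grade 11", "Dropouts Grade 12"]
data = data.drop(columns_to_drop, axis=1)           # reassign the result...
# data.drop(columns_to_drop, axis=1, inplace=True)  # ...or mutate in place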

In [27]: def add_total_dropouts(dataframe):
            all_dropouts = ["Dropouts Grade 9", "Dropouts Grade 10", "Dropouts Grade 11", "Dropouts Grade 12"]
            dataframe["Total Dropouts"] = dataframe[all_dropouts].sum(axis=1)
            dataframe = dataframe.drop(all_dropouts, axis=1)
            return dataframe

In [28]: def read_sheet(filename):
            year = int(filename.split('.x')[0][-4:])
            if year < 2011:
                data = pd.read_excel(filename, sheetname="State", skiprows=4)

            else:
                data = pd.read_excel(filename, sheetname=0, skiprows=4)

            data = alter_columns(data).set_index("Demographic")
            data = alter_rows(data)
            data = reduce_data(data)
            data = add_total_dropouts(data)

            return data

A Snag in New Data

One thing we’ll find when reading in the data from 2014 and 2015 is that the superintendent of schools became more racially sensitive and stopped lumping Asians and Pacific Islanders together as one demographic. Great for respect and recognition, but bad for us. When we try our read_sheet function on one of those spreadsheets, we find the following:

In [29]: data = read_sheet("../data/AppendixAState2014.xlsx")

In [30]: print(data.loc["Asian/Pacific Islander"])
Cohort Size         NaN
Total Graduates     NaN
Total Dropouts      NaN
Name: Asian/Pacific Islander, dtype: float64

We told Pandas that we definitely wanted a row named “Asian/Pacific Islander”, but the DataFrame didn’t have one. Pandas assumed we wanted a row with that name anyway, and simply filled it with...nothing (NaN in every column). Note that this silent filling is the behavior of the Pandas version used here; newer versions raise a KeyError when you ask .loc for labels that don’t exist.

We’ll need that aggregate group in our next section, in order to decouple Asians from Pacific Islanders back in 2007. So we need to add that data in ourselves.

Just as we can create new columns out of existing columns, we can create new rows out of existing rows. Because we’re working with named rows, we can create a new row that follows our standard and is simply the sum of two existing rows. Let’s test this on a data set that actually does have the data we want, just to make sure we’re not BSing ourselves.

In [31]: data = read_sheet("../data/AppendixAState2012.xlsx")

In [32]: data.loc["New Row"] = data.loc["Asian"] + data.loc["Pacific Islander"]

In [33]: print(data.loc[["Asian", "Pacific Islander", "Asian/Pacific Islander", "New Row"]])
                        Cohort Size     Total Graduates  Total Dropouts
Demographic
Asian                          5954                5027             496
Pacific Islander                637                 411             132
Asian/Pacific Islander         6591                5438             628
New Row                        6591                5438             628

Beautiful. Let’s get a little hacky and turn this into a function, then include that function in our read_sheet function.

In [34]: def hacky_row_add(dataframe):
            dataframe.loc["Asian/Pacific Islander"] = dataframe.loc["Asian"] + dataframe.loc["Pacific Islander"]
            return dataframe
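
If you’d rather not key the fix off the year, an alternative sketch (not what we use below) rebuilds the aggregate only when reduce_data left it full of NaNs:

def hacky_row_add(dataframe):
    # Rebuild the combined group only when the spreadsheet didn't supply
    # it, i.e. reduce_data produced an all-NaN row for it.
    if dataframe.loc["Asian/Pacific Islander"].isnull().all():
        dataframe.loc["Asian/Pacific Islander"] = (
            dataframe.loc["Asian"] + dataframe.loc["Pacific Islander"]
        )
    return dataframe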

In [35]: def read_sheet(filename):
            year = int(filename.split('.x')[0][-4:])
            if year < 2011:
                data = pd.read_excel(filename, sheetname="State", skiprows=4)

            else:
                data = pd.read_excel(filename, sheetname=0, skiprows=4)

            data = alter_columns(data).set_index("Demographic")
            data = alter_rows(data)
            data = reduce_data(data)
            data = add_total_dropouts(data)

            if year >= 2014:
                data = hacky_row_add(data)

            return data
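
To see the whole pipeline end to end, a usage sketch (assuming the files sit in a data/ directory and follow the AppendixAState naming we’ve been using):

from glob import glob

# Read every year's sheet into a dict keyed by year.
all_years = {}
for filename in sorted(glob("data/AppendixAState*.xls*")):
    year = int(filename.split('.x')[0][-4:])
    all_years[year] = read_sheet(filename)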

A Breakpoint

We’ve now been able to create a nice pipeline for reading in data and organizing it for analysis. We’ve set up standards across all of our DataFrames. Now let’s do a little manipulation and data aggregation.

Next Up: Manipulating and Aggregating the Data