Manipulating and Aggregating the Data

Now that we have our data organized, let’s deal with an inconsistency. Then, we’ll retrieve exactly what we need to move toward visualization.

Estimating Asian/Pacific Islander Ratios

We still have to deal with the unfortunate issue of having Asians and Pacific Islanders merged as one population in 2007. They’re two entirely different groups comprising vastly different cultures and should be treated as such.

Since the data we’ve got is our Golden Truth, we can’t do much in the way of trying to find out how those populations should properly be separated in the data. What we can do is make some assumptions based on the other data we’ve got.

Chances are that even though the Asian and Pacific Islander populations will change between 2007 and 2015, their relative proportions won’t shift dramatically. So if we know the ratio of Asian to Pacific Islander in every year, we should be able to take the average of that ratio and apply it to 2007 to get reasonably accurate statistics.

Before we write any code to address this I want to emphasize something. What is about to follow is an APPROXIMATION, and ALL approximations are wrong by default, without exception. A given approximation may be good. It may be very close to the truth, but it is NOT the truth and should never be mistaken for it. Approximations are simply our best guesses given the data we have.
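
To make the plan itself concrete (this is not the fix yet, just the idea), here’s a toy sketch of the averaging approach. Every number in it is made up purely for illustration; the real ratios get computed a few cells down, and we’ll end up adjusting the plan once we see them.

    from statistics import mean

    # Hypothetical Asian share of the combined Asian/Pacific Islander group,
    # one made-up value per year.
    asian_share_by_year = [0.96, 0.95, 0.94, 0.94]
    average_asian_share = mean(asian_share_by_year)

    # Apply the averaged share to a made-up combined 2007 count to split it
    # into estimated Asian and Pacific Islander counts.
    combined_2007_count = 1000
    estimated_asian_2007 = combined_2007_count * average_asian_share
    estimated_pacific_2007 = combined_2007_count * (1 - average_asian_share)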

Let’s first get filenames for all of our years except 2007.

In [36]: import os

In [37]: all_sheets = ["../data/" + filename for filename in os.listdir("../data") if "State" in filename and "2007" not in filename]

The os (docs) package allows us to interact with our file system in various ways. Here we use it to list all the items within a directory, but we can use it to change directories, get the path to the current directory, and a host of other things.
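
If you want to poke at os yourself, a few of the calls mentioned above look like this. This is just an illustrative snippet, not part of our pipeline:

    import os

    os.getcwd()        # the path to the current working directory
    os.listdir(".")    # a list of everything inside the current directory
    os.chdir("..")     # actually changes the working directory, up one level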

We’ve also used a list comprehension (docs) to build up a list of file paths that we want. With that in hand, we can build up a list of DataFrames like so:

In [38]: all_years = [read_sheet(sheetname) for sheetname in all_sheets]

For each year, and for each column of interest, we want the fraction of “Asian” to “Asian/Pacific Islander” and the fraction of “Pacific Islander” to “Asian/Pacific Islander”. We can do this like so:

In [39]: dropouts_ratios = {"Asian": [], "Pacific Islander": []}

In [40]: graduates_ratios = {"Asian": [], "Pacific Islander": []}

In [41]: cohort_ratios = {"Asian": [], "Pacific Islander": []}

In [42]: for year_data in all_years:
            drops_asian = year_data["Total Dropouts"].loc["Asian"]
            drops_pacis = year_data["Total Dropouts"].loc["Pacific Islander"]
            drops_all = year_data["Total Dropouts"].loc["Asian/Pacific Islander"]

            dropouts_ratios["Asian"].append(drops_asian / drops_all)
            dropouts_ratios["Pacific Islander"].append(drops_pacis / drops_all)

            grads_asian = year_data["Total Graduates"].loc["Asian"]
            grads_pacis = year_data["Total Graduates"].loc["Pacific Islander"]
            grads_all = year_data["Total Graduates"].loc["Asian/Pacific Islander"]

            graduates_ratios["Asian"].append(grads_asian / grads_all)
            graduates_ratios["Pacific Islander"].append(grads_pacis / grads_all)

            cohort_asian = year_data["Cohort Size"].loc["Asian"]
            cohort_pacis = year_data["Cohort Size"].loc["Pacific Islander"]
            cohort_all = year_data["Cohort Size"].loc["Asian/Pacific Islander"]

            cohort_ratios["Asian"].append(cohort_asian / cohort_all)
            cohort_ratios["Pacific Islander"].append(cohort_pacis / cohort_all)

Let’s look at the hell we’ve wrought:

In [43]: from pprint import pprint

In [44]: pprint(graduates_ratios)
 {'Asian': [0.96374841399311217,
             0.95651435366060811,
             0.94446331521739135,
             0.94470774091627174,
             0.92442074292019127,
             0.93056841325680428,
             0.92392463562033411,
             0.91526874358096544],
  'Pacific Islander': [0.036251586006887798,
                        0.043485646339391879,
                        0.055536684782608696,
                        0.055292259083728278,
                        0.075579257079808754,
                        0.069431586743195703,
                        0.076075364379665833,
                        0.084731256419034576]}

Okay, not bad here. There’s some shift in the Asian and Pacific Islander fractions between 2008 (index 0) and 2015 (index 7), but not much overall. Additionally, what shift there is moves largely in one direction.

In [45]: pprint(cohort_ratios)
{'Asian': [0.95106611744419001,
            0.94407645184257671,
            0.92116881561549535,
            0.93170731707317078,
            0.903353057199211,
            0.9085245403434129,
            0.90061460050966868,
            0.89183255269320838],
 'Pacific Islander': [0.047896229887313159,
                       0.054530840385484912,
                       0.077183952612619816,
                       0.068292682926829273,
                       0.096646942800788949,
                       0.091475459656587141,
                       0.099385399490331289,
                       0.10816744730679156]}

A similar shift is seen here: cohorts gain a somewhat larger share of Pacific Islanders over time. However, the difference between 2008 and 2015 isn’t terribly large, and again the movement is almost entirely in one direction.

In [46]: pprint(dropouts_ratios)
{'Asian': [0.90183486238532107,
            0.86358635863586364,
            0.77459954233409611,
            0.87183308494783907,
            0.78980891719745228,
            0.78494623655913975,
            0.74180327868852458,
            0.70459518599562365],
 'Pacific Islander': [0.098165137614678905,
                       0.13641364136413642,
                       0.22540045766590389,
                       0.12816691505216096,
                       0.21019108280254778,
                       0.21505376344086022,
                       0.25819672131147542,
                       0.29540481400437635]}

Well shit, this could be problematic. Between 2008 and 2015 the Asian share of the combined Asian/Pacific Islander dropout population falls by about 20 percentage points (from roughly 0.90 to 0.70), while the Pacific Islander share rises by the same amount. The trend is still mostly in one direction for both, but that big of a difference is non-trivial. We need to change our strategy a bit.

If we had more time we could do some fancy stuff like use Scikit-Learn’s Linear Regression module (docs) to estimate what the fraction of Asians was in 2007, but time is a luxury. Instead we’ll assume that the year closest to 2007 best represents 2007. In our case, that’s 2008. So we’ll take the ratios in 2008 as our approximations and apply them to 2007’s data.
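
For the curious, here is roughly what that fancier route might look like: fit a straight line to the yearly Asian dropout fraction and extrapolate back one year. This is only a sketch of the idea we’re skipping; it assumes scikit-learn and NumPy are available, reuses the dropouts_ratios dict built above, and its output would still be just another approximation.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # The years we actually have separate Asian and Pacific Islander counts for.
    known_years = np.arange(2008, 2016).reshape(-1, 1)   # shape (8, 1)
    asian_fraction = np.array(dropouts_ratios["Asian"])  # one fraction per year

    # Fit fraction-versus-year and extrapolate back to 2007.
    model = LinearRegression().fit(known_years, asian_fraction)
    asian_fraction_2007 = model.predict(np.array([[2007]]))[0]

With that aside out of the way, back to the simpler plan: read in the 2007 sheet and scale its combined numbers by 2008’s ratios.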

In [47]: data_2007 = read_sheet("../data/AppendixAState2007.xls")

In [48]: data_2007.loc["Asian", "Cohort Size"] = data_2007.loc["Asian/Pacific Islander", "Cohort Size"] * cohort_ratios["Asian"][0]

In [49]: data_2007.loc["Asian", "Total Graduates"] = data_2007.loc["Asian/Pacific Islander", "Total Graduates"] * graduates_ratios["Asian"][0]

In [50]: data_2007.loc["Asian", "Total Dropouts"] = data_2007.loc["Asian/Pacific Islander", "Total Dropouts"] * dropouts_ratios["Asian"][0]

In [48]: data_2007.loc["Pacific Islander", "Cohort Size"] = data_2007.loc["Asian/Pacific Islander", "Cohort Size"] * cohort_ratios["Pacific Islander"][0]

In [49]: data_2007.loc["Pacific Islander", "Total Graduates"] = data_2007.loc["Asian/Pacific Islander", "Total Graduates"] * graduates_ratios["Pacific Islander"][0]

In [50]: data_2007.loc["Pacific Islander", "Total Dropouts"] = data_2007.loc["Asian/Pacific Islander", "Total Dropouts"] * dropouts_ratios["Pacific Islander"][0]

In [51]: print(data_2007)

Finally, let’s prepend the 2007 data to our list of DataFrames, keeping everything in chronological order.

In [52]: all_years = [data_2007] + all_years

Aggregating Data

Now we’ve got all our ducks in a row. Let’s actually aggregate the data the way we want it.

Dropout and Graduation Rates

We want to define the Dropout and Graduation rates fairly similarly. These will be:

  • the total number of dropouts divided by the total number of students in the cohort, x 100
  • the total number of graduates divided by the total number of students in the cohort, x 100

We’re multiplying by 100 since the simple division gives a fraction, and we want a percentage.

For each year let’s add these columns. This is easiest if we just create a function to assign these new columns and run every DataFrame in our list of DataFrames through that function.

In [53]: def add_dropout_graduation_rates(dataframe):
            dataframe["Dropout Rate"] = dataframe["Total Dropouts"] / dataframe["Cohort Size"] * 100
            dataframe["Graduation Rate"] = dataframe["Total Graduates"] / dataframe["Cohort Size"] * 100
            return dataframe

In [54]: all_years = [add_dropout_graduation_rates(data) for data in all_years]

As always, check on your data to make sure the new columns were added correctly.

In [55]: all_years[0]
                         Cohort Size  Total Graduates  Total Dropouts  \
Demographic
All Students            81242.004323     62938.000000    18044.000000
First Nations            2341.299565      1306.000000     1022.000000
Asian/Pacific Islander   6698.049340      5710.000000      994.000000
Asian                    6370.287780      5503.003444      896.423853
Pacific Islander          320.811311       206.996556       97.576147
Black                    4060.327081      2753.000000     1448.000000
Hispanic                 8265.266250      5698.000000     2984.000000
White                   58909.999243     47067.000000    11292.000000

                        Dropout Rate  Graduation Rate
Demographic
All Students               22.210186        77.469778
First Nations              43.650971        55.780987
Asian/Pacific Islander     14.840142        85.248700
Asian                      14.071952        86.385476
Pacific Islander           30.415432        64.522836
Black                      35.662152        67.802419
Hispanic                   36.102890        68.939098
White                      19.168223        79.896453

All of our rates are between 0 and 100 and otherwise look good!

Reorganize Data by Years

What we want next is to turn our list of DataFrames into two DataFrames:

  • Dropout Rates, organized by year (columns) and demographic (rows).
  • Graduation Rates, again organized by year and demographic.

How do we get there? Our data is already organized into a list of DataFrames in chronological order. We can use that to our advantage. Our procedure will be as follows:

  1. Create new dictionaries for all dropout rates and all graduation rates across all years.
In [56]: years = list(range(2007, 2016))

In [57]: dropout_rates = {}

In [58]: graduation_rates = {}
  2. For each DataFrame in our list of DataFrames, populate each dict with the corresponding data for that year.
In [59]: for idx, item in enumerate(years):
            dropout_rates[item] = all_years[idx]["Dropout Rate"]
            graduation_rates[item] = all_years[idx]["Graduation Rate"]
  3. Turn the dropout and graduation rates dicts into DataFrames that we can use later.
In [60]: dropout_df = pd.DataFrame(dropout_rates)

In [61]: graduate_df = pd.DataFrame(graduation_rates)
  4. Gaze at our handiwork and prep for visualization.
In [62]: print(dropout_df)
                             2007       2008       2009       2010       2011   ...
Demographic
All Students            22.210186  22.667685  20.565407  18.577791  13.933354   ...
First Nations           43.650971  43.461666  39.043391  35.232383  28.697001   ...
Asian/Pacific Islander  14.840142  16.851076  13.599304  13.114671   9.626973   ...
Asian                   14.071952  15.978792  12.439855  11.027965   9.008315   ...
Pacific Islander        30.415432  34.536918  34.019843  38.298800  18.067227   ...
Black                   35.662152  39.093349  32.911129  29.588363  20.431138   ...
Hispanic                36.102890  37.280614  33.304071  29.815636  21.085859   ...
White                   19.168223  19.088472  17.468412  15.889889  12.006935   ...

Clean Up and Next Steps

Now our statistics have all been calculated. However, even with all of our data aggregated and making sense, no one wants to look at a table unless they need to access the raw data. It’s time to visualize it and make it easily digestible. But first, let’s clean up the mess we’ve made.

Clear out all of the code/cells that we don’t need to move forward toward visualization. This includes but is not limited to cells for:

  • looking at documentation
  • experimentation
  • printing data to the screen
  • anything not directly involved in our data manipulation pipeline
  • any notes written in cells that we don’t want in a final report

We can delete cells by selecting the one we want to delete, going to “Edit” in the Jupyter menu and clicking on “Delete Cells”. We can also do it with a keystroke: click in the margin to the left of the cell (next to its In [ ]: label) to enter command mode, then hit the “D” key twice. Be careful though! You can only undo your most recent cell deletion! So make sure that what you want to destroy is truly what you want to get rid of.

Finally, move all your package imports to the top. I like to have them all in the first cell. If you declared any global variables, move those to the top as well, into a second cell.
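
Just to make that concrete, the first two cells of the cleaned-up notebook might end up looking something like this (the exact contents depend on what survives your cleanup):

    # Cell 1: every import the remaining code needs.
    import os

    import pandas as pd

    # Cell 2: module-level values used throughout the notebook.
    years = list(range(2007, 2016))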

When you’re sure that your code is clean and your data preparation pipeline is well-ordered and as you want it, go to the “Kernel” dropdown in the Jupyter menu and hit “Restart & Run All”. This will restart the kernel, wiping out every value and import from this session of Jupyter, but leave all the code and text in the cells. It will then re-run everything from top to bottom, renumbering cells along the way.

Next Up: Visualize the Results