Manipulating and Aggregating the Data
Now that we have our data organized, let’s deal with an inconsistency. Then, we’ll retrieve exactly what we need to move toward visualization.
Estimating Asian/Pacific Islander Ratios
We still have to deal with the unfortunate issue of Asians and Pacific Islanders being merged into one population in 2007. They’re two entirely distinct groups with vastly different cultures, and they should be treated as such.
Since the data we’ve got is our Golden Truth, we can’t do much in the way of trying to find out how those populations should properly be separated in the data. What we can do is make some assumptions based on the other data we’ve got.
Chances are that even though the Asian and Pacific Islander populations will change between 2007 and 2015, they won’t change that drastically. So if we know what fraction of the combined Asian/Pacific Islander population each group makes up in every other year, we should be able to average those fractions and apply them to 2007 to get reasonably accurate statistics.
Before we write any code to address this, I want to emphasize something. What is about to follow is an APPROXIMATION, and ALL approximations are wrong by default, without exception. A given approximation may be good. It may be very close to the truth, but it is NOT the truth and should never be mistaken for it. Approximations are simply what’s most likely given the data we have.
Let’s first get filenames for all of our years except 2007.
In [36]: import os
In [37]: all_sheets = ["../data/" + filename for filename in os.listdir("../data") if "State" in filename and "2007" not in filename]
The os (docs) package allows us to interact with our file system in various ways. Here we use it to list all the items within a directory, but we can use it to change directories, get the path to the current directory, and a host of other things.
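For example, here are a few of those other os calls (the paths are just for illustration):

import os

os.getcwd()            # get the path to the current working directory
os.chdir("../data")    # change the current working directory
os.listdir(".")        # list all the items within a directory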
We’ve also used a list comprehension (docs) to build up a list of file paths that we want. With that in hand, we can build up a list of DataFrames like so:
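If list comprehensions are new to you, the one above is equivalent to this plain loop (same result, just more verbose):

all_sheets = []
for filename in os.listdir("../data"):
    if "State" in filename and "2007" not in filename:
        all_sheets.append("../data/" + filename)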
In [38]: all_years = [read_sheet(sheetname) for sheetname in all_sheets]
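A quick sanity check: we matched the files for 2008 through 2015, so we should have eight DataFrames.

len(all_years)  # should be 8, one per year from 2008 through 2015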
For each year, in each column of interest, we should collect the fraction of “Asian” to “Asian/Pacific Islander” and the fraction of “Pacific Islander” to “Asian/Pacific Islander”. We can do this like so:
In [39]: dropouts_ratios = {"Asian": [], "Pacific Islander": []}
In [40]: graduates_ratios = {"Asian": [], "Pacific Islander": []}
In [41]: cohort_ratios = {"Asian": [], "Pacific Islander": []}
In [42]: for year_data in all_years:
             # fraction of the combined dropout count belonging to each group
             drops_asian = year_data["Total Dropouts"].loc["Asian"]
             drops_pacis = year_data["Total Dropouts"].loc["Pacific Islander"]
             drops_all = year_data["Total Dropouts"].loc["Asian/Pacific Islander"]
             dropouts_ratios["Asian"].append(drops_asian / drops_all)
             dropouts_ratios["Pacific Islander"].append(drops_pacis / drops_all)
             # the same fractions for the graduate counts
             grads_asian = year_data["Total Graduates"].loc["Asian"]
             grads_pacis = year_data["Total Graduates"].loc["Pacific Islander"]
             grads_all = year_data["Total Graduates"].loc["Asian/Pacific Islander"]
             graduates_ratios["Asian"].append(grads_asian / grads_all)
             graduates_ratios["Pacific Islander"].append(grads_pacis / grads_all)
             # and for the cohort sizes
             cohort_asian = year_data["Cohort Size"].loc["Asian"]
             cohort_pacis = year_data["Cohort Size"].loc["Pacific Islander"]
             cohort_all = year_data["Cohort Size"].loc["Asian/Pacific Islander"]
             cohort_ratios["Asian"].append(cohort_asian / cohort_all)
             cohort_ratios["Pacific Islander"].append(cohort_pacis / cohort_all)
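As an aside, if the repetition in that loop bothers you, the same bookkeeping can be factored into a small helper. This is just a hypothetical refactor, equivalent to the loop above:

def collect_ratios(frames, column):
    # For each yearly frame, record what fraction of the combined
    # Asian/Pacific Islander figure belongs to each group.
    ratios = {"Asian": [], "Pacific Islander": []}
    for frame in frames:
        combined = frame[column].loc["Asian/Pacific Islander"]
        for group in ratios:
            ratios[group].append(frame[column].loc[group] / combined)
    return ratios

dropouts_ratios = collect_ratios(all_years, "Total Dropouts")
graduates_ratios = collect_ratios(all_years, "Total Graduates")
cohort_ratios = collect_ratios(all_years, "Cohort Size")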
Let’s look at the hell we’ve wrought:
In [43]: from pprint import pprint
In [44]: pprint(graduates_ratios)
{'Asian': [0.96374841399311217,
0.95651435366060811,
0.94446331521739135,
0.94470774091627174,
0.92442074292019127,
0.93056841325680428,
0.92392463562033411,
0.91526874358096544],
'Pacific Islander': [0.036251586006887798,
0.043485646339391879,
0.055536684782608696,
0.055292259083728278,
0.075579257079808754,
0.069431586743195703,
0.076075364379665833,
0.084731256419034576]}
Okay, not bad here. There’s some shift in the fractions for Asians and for Pacific Islanders between 2008 (index 0) and 2015 (index 7), but not much shift overall. Additionally, what shift there is moves largely in one direction.
In [45]: pprint(cohort_ratios)
{'Asian': [0.95106611744419001,
0.94407645184257671,
0.92116881561549535,
0.93170731707317078,
0.903353057199211,
0.9085245403434129,
0.90061460050966868,
0.89183255269320838],
'Pacific Islander': [0.047896229887313159,
0.054530840385484912,
0.077183952612619816,
0.068292682926829273,
0.096646942800788949,
0.091475459656587141,
0.099385399490331289,
0.10816744730679156]}
A similar shift is seen here. Cohorts gain some more Pacific Islanders over time. However, the difference between 2008 and 2015 isn’t terribly large, and again it moves in one direction at almost all times.
In [46]: pprint(dropouts_ratios)
{'Asian': [0.90183486238532107,
0.86358635863586364,
0.77459954233409611,
0.87183308494783907,
0.78980891719745228,
0.78494623655913975,
0.74180327868852458,
0.70459518599562365],
'Pacific Islander': [0.098165137614678905,
0.13641364136413642,
0.22540045766590389,
0.12816691505216096,
0.21019108280254778,
0.21505376344086022,
0.25819672131147542,
0.29540481400437635]}
Well shit, this could be problematic. Between 2008 and 2015 the Asian share of the combined Asian/Pacific Islander dropout population falls by roughly 20 percentage points, while the Pacific Islander share rises by the same amount. The trend is still mostly in one direction for both, but a difference that big is non-trivial. We need to change our strategy a bit.
If we had more time we could do some fancy stuff, like use Scikit-Learn’s Linear Regression module (docs) to extrapolate what the fractions were in 2007, but time is a luxury we don’t have. Instead we’ll assume that the year closest to 2007 best represents 2007. In our case, that’s 2008. So we’ll take the ratios from 2008 as our approximations and apply them to 2007’s data.
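For the curious, here’s a minimal sketch of what that skipped regression approach might look like, shown for the Asian dropout fraction (this assumes scikit-learn is installed; we don’t actually run it here):

import numpy as np
from sklearn.linear_model import LinearRegression

# Fit the Asian dropout fraction against the year...
known_years = np.arange(2008, 2016).reshape(-1, 1)  # one row per sample
asian_fractions = np.array(dropouts_ratios["Asian"])
model = LinearRegression().fit(known_years, asian_fractions)

# ...then extrapolate backward to 2007.
estimated_2007 = model.predict([[2007]])[0]

Back to the simpler plan. One gotcha before we apply the 2008 ratios: below we assign with a single .loc[row, column] call, because chained indexing like data_2007.loc[row][column] = ... can silently modify a temporary copy instead of the DataFrame itself.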
In [47]: data_2007 = read_sheet("../data/AppendixAState2007.xls")
In [48]: data_2007.loc["Asian", "Cohort Size"] = data_2007.loc["Asian/Pacific Islander", "Cohort Size"] * cohort_ratios["Asian"][0]
In [49]: data_2007.loc["Asian", "Total Graduates"] = data_2007.loc["Asian/Pacific Islander", "Total Graduates"] * graduates_ratios["Asian"][0]
In [50]: data_2007.loc["Asian", "Total Dropouts"] = data_2007.loc["Asian/Pacific Islander", "Total Dropouts"] * dropouts_ratios["Asian"][0]
In [51]: data_2007.loc["Pacific Islander", "Cohort Size"] = data_2007.loc["Asian/Pacific Islander", "Cohort Size"] * cohort_ratios["Pacific Islander"][0]
In [52]: data_2007.loc["Pacific Islander", "Total Graduates"] = data_2007.loc["Asian/Pacific Islander", "Total Graduates"] * graduates_ratios["Pacific Islander"][0]
In [53]: data_2007.loc["Pacific Islander", "Total Dropouts"] = data_2007.loc["Asian/Pacific Islander", "Total Dropouts"] * dropouts_ratios["Pacific Islander"][0]
In [54]: print(data_2007)
Finally, let’s add the 2007 data to the front of our list of DataFrames, keeping everything in chronological order.
In [55]: all_years = [data_2007] + all_years
Aggregating Data
Now we’ve got all our ducks in a row. Let’s actually aggregate the data as we like it.
Dropout and Graduation Rates
We want to define the Dropout and Graduation rates fairly similarly. These will be:
- the total number of dropouts divided by the total number of students in the cohort x 100
- the total number of graduates divided by the total number of students in the cohort x 100
We’re multiplying by 100 since the simple division gives a fraction, and we want a percentage.
For each year let’s add these columns. This is easiest if we just create a function to assign these new columns and run every DataFrame in our list of DataFrames through that function.
In [56]: def add_dropout_graduation_rates(dataframe):
             dataframe["Dropout Rate"] = dataframe["Total Dropouts"] / dataframe["Cohort Size"] * 100
             dataframe["Graduation Rate"] = dataframe["Total Graduates"] / dataframe["Cohort Size"] * 100
             return dataframe
In [57]: all_years = [add_dropout_graduation_rates(data) for data in all_years]
As always, check on your data to make sure the changes happened correctly.
In [58]: all_years[0]
Cohort Size Total Graduates Total Dropouts \
Demographic
All Students 81242.004323 62938.000000 18044.000000
First Nations 2341.299565 1306.000000 1022.000000
Asian/Pacific Islander 6698.049340 5710.000000 994.000000
Asian 6370.287780 5503.003444 896.423853
Pacific Islander 320.811311 206.996556 97.576147
Black 4060.327081 2753.000000 1448.000000
Hispanic 8265.266250 5698.000000 2984.000000
White 58909.999243 47067.000000 11292.000000
Dropout Rate Graduation Rate
Demographic
All Students 22.210186 77.469778
First Nations 43.650971 55.780987
Asian/Pacific Islander 14.840142 85.248700
Asian 14.071952 86.385476
Pacific Islander 30.415432 64.522836
Black 35.662152 67.802419
Hispanic 36.102890 68.939098
White 19.168223 79.896453
All of our rates are between 0 and 100 and otherwise look good!
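If you’d rather not just eyeball it, a quick programmatic version of that check might look like this (a sketch; it raises an AssertionError if any rate falls outside 0 to 100):

for data in all_years:
    assert data["Dropout Rate"].between(0, 100).all()
    assert data["Graduation Rate"].between(0, 100).all()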
Reorganize Data by Years
What we want next is to turn our list of DataFrames into two DataFrames:
- Dropout Rates, organized by year (columns) and demographic (rows).
- Graduation Rates, again organized by year and demographic.
How do we get there? Our data is already organized into a list of DataFrames in chronological order. We can use that to our advantage. Our procedure will be as follows:
- Create new dictionaries for all dropout rates and all graduation rates across all years.
In [59]: years = list(range(2007, 2016))
In [60]: dropout_rates = {}
In [61]: graduation_rates = {}
- For each DataFrame in our list of DataFrames, populate each dict with the corresponding data for that year.
In [62]: for idx, year in enumerate(years):
             dropout_rates[year] = all_years[idx]["Dropout Rate"]
             graduation_rates[year] = all_years[idx]["Graduation Rate"]
- Turn the dropout and graduation rate dicts into DataFrames that we can use later.
In [63]: dropout_df = pd.DataFrame(dropout_rates)
In [64]: graduate_df = pd.DataFrame(graduation_rates)
- Gaze at our handiwork and prep for visualization.
In [65]: print(dropout_df)
2007 2008 2009 2010 2011 ...
Demographic
All Students 22.210186 22.667685 20.565407 18.577791 13.933354 ...
First Nations 43.650971 43.461666 39.043391 35.232383 28.697001 ...
Asian/Pacific Islander 14.840142 16.851076 13.599304 13.114671 9.626973 ...
Asian 14.071952 15.978792 12.439855 11.027965 9.008315 ...
Pacific Islander 30.415432 34.536918 34.019843 38.298800 18.067227 ...
Black 35.662152 39.093349 32.911129 29.588363 20.431138 ...
Hispanic 36.102890 37.280614 33.304071 29.815636 21.085859 ...
White 19.168223 19.088472 17.468412 15.889889 12.006935 ...
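As an aside, the loop above can be collapsed into a dict comprehension per frame, if you prefer (equivalent to what we just did):

dropout_df = pd.DataFrame({year: data["Dropout Rate"]
                           for year, data in zip(years, all_years)})
graduate_df = pd.DataFrame({year: data["Graduation Rate"]
                            for year, data in zip(years, all_years)})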
Clean Up and Next Steps
Now our statistics have all been calculated. However, even with all of our data aggregated and making sense, no one wants to look at a table unless they need to access the raw data. It’s time to visualize it and make it easily digestible. But first, let’s clean up the mess we’ve made.
Clear out all of the code/cells that we don’t need to move forward toward visualization. This includes but is not limited to cells for:
- looking at documentation
- experimentation
- printing data to the screen
- anything not directly involved in our data manipulation pipeline
- any notes written in cells that we don’t want in a final report
We can delete cells by selecting the one we want to delete, going to “Edit” in the Jupyter menu, and clicking on “Delete Cells”. We can also do it with a keystroke: click on the numbering next to the cell and hit the “D” key twice. Be careful though! You can only undo your most recent cell deletion! So make sure that what you want to destroy is truly what you want to get rid of.
Finally, move all your package imports to the top. I like to have them all in the first cell. If you declared any global variables, move those to the top as well, into a second cell.
When you’re sure that your code is clean and your data preparation pipeline is well-ordered and as you want it, go to the “Kernel” dropdown in the Jupyter menu and hit “Restart and Run All”. This wipes out all of the state (variables, imports, and so on) from this Jupyter session but leaves all the code and text in the cells. It then re-runs everything from top to bottom, renumbering cells along the way.
Next Up: Visualize the Results