The Data

We’re looking to group the statistics by demographic and visualize how the average graduation and dropout rates change over time.

In many cases, as an analyst, you have to gather your own data. In this case, the data has already been provided to us.

The data is sourced from the State of Washington Office of Superintendent of Public Instruction. What we have is data from 2006 through 2015 showing graduation and dropout rates across the state of Washington. Currently they’re in the form of Excel spreadsheets and can be obtained here.

Inspect the Data

Before we read it into Python and make some actual sense of it, let’s see what it looks like. If we pop open AppendixAState2008.xls, we’ll see 35 columns and 14 rows of data, where each row represents a demographic. Looking at the other files, we see similar setups, though they’re all slightly different: some are off by a row, and some group data together in odd ways. The main differences occur in 2007, 2009, and 2010, where the data for every district is provided row-by-row in the first sheet of the workbook, and the state-wide data is housed in a sheet labeled “State”.

There is no escaping inconsistent data when doing analysis. So, if we want to do the work we came to do, we have to apply some standards, take some notes, and even make some sacrifices.

Apply Standards

We’ll be using the pandas read_excel method to collect the data we need neatly into Python. read_excel pulls in data row-by-row, arranging it into columns based on how it’s distributed in the spreadsheet, and it doesn’t handle merged columns well. It lets us skip some rows and begin its interpretation elsewhere, but not every spreadsheet needs the same rows skipped, and, further, not every spreadsheet has the same setup. We can apply some standards to start, and take note of what we really can’t fix for later.

Let’s start with the number of rows we want to skip, and let’s make it universal: across every spreadsheet, make the column names start at row 5 and work from there, inserting rows where we need to.
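With that convention in place, loading one workbook might look like the sketch below. The filenames, the helper name, and the “State” sheet label are assumptions based on the layouts described above, not a definitive recipe:

```python
import pandas as pd

def load_year(path, statewide_sheet=0):
    """Load one year's state-wide sheet.

    The column names start at row 5 in every standardized workbook,
    so we skip the 4 rows above the header. `statewide_sheet` defaults
    to the first sheet; 2007, 2009, and 2010 need the "State" sheet
    instead, since their first sheet holds district-by-district data.
    """
    return pd.read_excel(path, sheet_name=statewide_sheet, skiprows=4)

# Hypothetical usage (filenames assumed from the pattern above):
# df_2008 = load_year("AppendixAState2008.xls")
# df_2010 = load_year("AppendixAState2010.xls", statewide_sheet="State")
```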

Now let’s look at our rows. In 2007 we’ve got the following denominations:

  • All Students
  • American Indian
  • Asian/Pacific Islanders
  • Black
  • Hispanic
  • White
  • Other
  • Limited English
  • Low Income
  • Special Education
  • Migrant
  • Female
  • Male

2008 adds a separation between “Asians” and “Pacific Islanders”, which continues into 2009.

In 2010 we lose “Other” and pick up “Multiracial or Not Provided”, “504”, “Foster Care”, and “Foster Care - W DSHS Students”.

In 2011, “Multiracial” becomes “Two or More Races”, “Limited English” becomes “Bilingual”, and “Migrant” becomes “Title I Migrant”. We lose both Foster Care categories.

In 2012, “Bilingual” reverts to “Limited English”, and “Foster Care” returns.

In 2013 we pick up “Homeless”. From there on, we stay consistent.

A few things are missing entirely from some years, so let’s decide on what we want to look at specifically. We want as good a racial breakdown as we can get, so let’s retain the following rows:

  • All Students
  • American Indian
  • Asian
  • Pacific Islanders
  • Black
  • Hispanic
  • White
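Once a year’s sheet is loaded, keeping only these rows could look like the sketch below. The column name “Demographic” is an assumption; the label column varies from workbook to workbook, so it’s a parameter:

```python
import pandas as pd

# Demographic rows we retain across every year (labels as they appear 2008+).
KEEP = [
    "All Students",
    "American Indian",
    "Asian",
    "Pacific Islanders",
    "Black",
    "Hispanic",
    "White",
]

def filter_demographics(df, label_col="Demographic"):
    """Keep only the demographic rows we've standardized on.

    `label_col` is a hypothetical column name; each spreadsheet may
    label its demographic column differently.
    """
    return df[df[label_col].isin(KEEP)].reset_index(drop=True)

# Tiny illustrative frame standing in for a loaded spreadsheet:
sample = pd.DataFrame({
    "Demographic": ["All Students", "Homeless", "Asian", "Male"],
    "Graduation Rate": [74.8, 50.1, 82.0, 71.2],
})
print(filter_demographics(sample))
```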

We still have to deal with the lack of separation between Asians and Pacific Islanders in 2007. We will handle this when we move to Python, but let’s outline a strategy here.

Even though the data spans many years, we don’t expect the ratio between these two populations to change significantly over time. Here we can make a key assumption: within a small margin of error, the Asian-to-Pacific-Islander ratio in 2007 is likely very near the average of that ratio across the other years. So if it’s roughly a 70-30 split across all years, we can take the combined number of graduates and apply that ratio. We will, of course, record this grand assumption in our notes.
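Numerically, the split works out as sketched below. The graduate counts here are made-up placeholders for illustration, not the real OSPI figures:

```python
# Hypothetical graduate counts (Asian, Pacific Islander) for years where
# the two groups are reported separately -- made-up numbers, not real data.
separated_years = {
    2008: (7000, 3000),
    2009: (7200, 2800),
    2010: (6900, 3100),
}

# Average Asian share of the combined group across the separated years.
asian_share = sum(a / (a + p) for a, p in separated_years.values()) / len(separated_years)

# Apply that average split to 2007's combined "Asian/Pacific Islanders" count.
combined_2007 = 9800  # placeholder for the combined 2007 figure
asian_2007 = combined_2007 * asian_share
pacific_2007 = combined_2007 * (1 - asian_share)

print(round(asian_2007), round(pacific_2007))  # -> 6893 2907
```

The estimate is only as good as the assumption that the ratio is stable, which is exactly why it goes in the notes.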

Next Up: Bring It Into Python