Visualize the Results

Tables of numbers are great. I for one love numbers. However, most people don’t care to see columns and columns upon rows and rows of data. Even if they do (like me), it’s hard to pick out the big picture ideas that the data can convey.

Enter Visualization

Visualizing data involves, to some degree, creating quantified art. We visualize to distill overall trends in data into something that can be grasped by looking at it. Want to show something increasing over time? Present a line going up from left to right. Want to show variety in data? Maybe a scatterplot with colors representing different species in the data is what you’d like. Make art represent life here.

Python has many visualization libraries that all do effectively the same thing. Our workhorse for today will be Matplotlib (docs).

The Matplotlib API is a bit...nightmarish. In some spots it’s completely nonsensical. Rather than trying to parse everything you read in the documentation, find examples of what matplotlib can do and reproduce them yourself.

What we’re going to be trying to do today is make bar charts of our graduation and dropout rates over time. These will be two separate figures, with each demographic as a different color for each year. We’ll work up to that. Let’s first make A graph. Then ONE bar chart for ONE demographic in our data. Finally, ALL OF IT!

Getting the Feet Wet

We import matplotlib into our notebook at the top with our other imports like so:

In [1]: import matplotlib.pyplot as plt

In [2]: %matplotlib inline

matplotlib is truly a massive library, including functionality for doing things like making maps and 3D graphics. We don’t want all of that. All we need is the pyplot functionality, so we just import that and alias it. We’ll be using a great variety of functions from pyplot so this alias will come in handy.

That next line, %matplotlib inline, is what we call IPython magic. It tells the notebook to render each graph directly below the cell that created it. Without it, Matplotlib typically pops open the graph in a new window and cuts off your ability to interact with your code. That’s quite inconvenient for several reasons:

  • We should be able to iterate on code in this notebook, and not be chained to what we wrote
  • We will want to make more than one figure
  • No one likes a pop-up!

First Graph

Before we get to the data we want, let’s see how matplotlib works. For this let’s make fake data.

In [3]: x = list(range(20))

In [4]: y = [num ** 2 for num in x]

With this fake data we’ll make a line graph. x will be on the horizontal “X-axis”, and y will be on the vertical “Y-axis”. To make a line graph with matplotlib we use plt.plot:

In [6]: plt.plot(x, y)
        plt.show()

And right below our cell a graph is rendered. The range of each axis is populated automatically based on the inputs, and the color of the line is chosen by default.

If I wanted to put a second line on the same graph, I’d use a second plt.plot statement.

In [7]: y2 = [400 - num for num in y]

In [8]: plt.plot(x, y)
        plt.plot(x, y2)
        plt.show()

And now our second line shows up in green (newer matplotlib releases default to orange for the second line). Note that the first two arguments supplied to plt.plot must have the same size. They don’t have to be the same data type, but they do have to be iterables containing numerical info.
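Here’s a minimal sketch of that same-size rule, using made-up data. The names xs and ys_array are hypothetical, and the Agg backend line is only there so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

xs = list(range(5))              # a plain Python list
ys_array = np.array(xs) ** 2     # a NumPy array of the same length

# Different container types are fine as long as the lengths match.
lines = plt.plot(xs, ys_array)

# Mismatched lengths are rejected with a ValueError.
try:
    plt.plot(xs, [1, 2, 3])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

plt.close("all")
```

If mismatch_rejected ends up True, matplotlib refused to guess how five x values line up with three y values, which is what we want.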

Let’s say that we don’t want our second line to be green; we want it to be red. We also want it to be dashed instead of solid.

In [9]: plt.plot(x, y)
        plt.plot(x, y2, linestyle="--", color="red")
        plt.show()

The linestyle property takes one of a few choices for changing the way lines look. The color property takes words corresponding to a limited set of colors, as well as hex values like #FF0000.
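To get a feel for the options, here’s a small sketch that draws four lines with different linestyle strings and a mix of named and hex colors. The particular styles and colors are my own picks for illustration, not an exhaustive list.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

x = list(range(20))

# Each entry pairs a linestyle string with a color given by name or hex value.
styles = [
    ("-", "blue"),       # solid, named color
    ("--", "red"),       # dashed, named color
    ("-.", "#00AA00"),   # dash-dot, hex color
    (":", "#FF9000"),    # dotted, hex color
]

for offset, (linestyle, color) in enumerate(styles):
    # Shift each line upward a little so they don't sit on top of each other.
    y = [num + 5 * offset for num in x]
    plt.plot(x, y, linestyle=linestyle, color=color)

drawn = list(plt.gca().get_lines())
plt.close("all")  # in a notebook, call plt.show() here instead
```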

What’s a graph without labels? Labels give a graph meaning and context. Every graph you generate should have labels, maybe even a title. Let’s put both on ours.

In [10]: plt.plot(x, y)
         plt.plot(x, y2, linestyle="--", color="red")
         plt.title("My First Graph")
         plt.xlabel("Some data")
         plt.ylabel("Some other data")
         plt.show()

Ain’t she a beaut?

I don’t like that the graph stops before it hits the right edge, nor that the lines sit flush against the top and bottom edges of this graph. I want to change the limits along each axis.

In [11]: plt.plot(x, y)
         plt.plot(x, y2, linestyle="--", color="red")
         plt.title("My First Graph")
         plt.xlabel("Some data")
         plt.ylabel("Some other data")
         plt.xlim(0, 19)
         plt.ylim(-10, 410)
         plt.show()

Don’t you love it when the world is your playground?

Let’s move on to bars.

Bar Starz

The Single Data Series

To create a bar chart with matplotlib we use the bar() function. In the matplotlib version used here, the first argument sets the left edge of each bar, while the second argument sets the height of each bar. (Since matplotlib 2.0, bars are centered on the first argument by default.) We’ll just concern ourselves with that for now.

In [11]: plt.bar?
Make a bar plot.

Make a bar plot with rectangles bounded by:

  `left`, `left` + `width`, `bottom`, `bottom` + `height`
        (left, right, bottom and top edges)

Parameters
----------
left : sequence of scalars
    the x coordinates of the left sides of the bars

height : sequence of scalars
    the heights of the bars

width : scalar or array-like, optional
    the width(s) of the bars
    default: 0.8
...

For the data that we’ve aggregated into dropout_df and graduate_df, our left-edge positions will correspond to the year of data we’re looking at, while the height will be the values we’ve calculated for one demographic. So, let’s look at the dropout rates for all students between 2007 and 2015.

In [12]: print(dropout_df.columns) # our years
Int64Index([2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015], dtype='int64')

In [13]: print(dropout_df.loc["All Students"])
2007    22.210186
2008    22.667685
2009    20.565407
2010    18.577791
2011    13.933354
2012    13.569043
2013    12.957700
2014    12.309690
2015    11.865464
Name: All Students, dtype: float64

In [14]: plt.bar(dropout_df.columns, dropout_df.loc["All Students"])
         plt.show()

...yay? Bleh. matplotlib has this annoying habit of thinking for you. When the tick values are large numbers with small differences between them (like our years), it labels the ticks as offsets from a base value written in scientific notation, which is really hard to read. We don’t want that, so we’re going to need to dig a little deeper into the pyplot module.

pyplot has a function called subplot. What this function does is set up a series of smaller canvases inside of your plotting space corresponding to the number of subplots you want. You specify that with three arguments: number of rows, number of columns, and current frame. Calling plt.subplot(1, 2, 1) means that you want to create a plotting space that has one row and two columns, and that the plotting code you write next will produce a graph within the first window (not zero-indexed).

The subplot function also returns an object that acts A LOT like pyplot, but allows for far more control over your plotting space.

In [15]: ax = plt.subplot(1, 1, 1)

In [16]: type(ax)
matplotlib.axes._subplots.AxesSubplot

In fact, this object includes most of the same plotting functions as pyplot. So, along with the added functionality we can use the ax object almost exactly like we were using plt. The one difference is that ax doesn’t have its own show() function. You can still use it without, but you’ll get a lot of output you don’t care to see from matplotlib. So, we’ll tack plt.show() on the end of any chain of functions we use from ax.
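Here’s a rough sketch of the rows/columns/frame idea with two panels side by side, reusing the fake x and y from earlier. The names left_ax and right_ax are mine, and the figure size is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

x = list(range(20))
y = [num ** 2 for num in x]

plt.figure(figsize=(10, 4))

left_ax = plt.subplot(1, 2, 1)    # 1 row, 2 columns, first panel
left_ax.plot(x, y)
left_ax.set_title("Line")         # ax.set_title plays the role of plt.title

right_ax = plt.subplot(1, 2, 2)   # same grid, second panel
right_ax.bar(x, y)
right_ax.set_title("Bars")

n_panels = len(plt.gcf().get_axes())
plt.close("all")  # in a notebook, call plt.show() here instead
```

Each call to plt.subplot hands back its own object, so anything you do to left_ax stays in the left panel.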

Let’s directly change the labels on individual tickmarks on our X-axis to be more visually appealing. We set it up like so:

In [17]: ax = plt.subplot(1, 1, 1)
         ax.bar(dropout_df.columns, dropout_df.loc["All Students"])
         ax.set_xticks(dropout_df.columns)
         ax.set_xticklabels(dropout_df.columns)
         plt.show()

Ok, so that’s a little better. ax.set_xticks will ensure that our tick marks correspond to the numerical data that we want (i.e. years). Meanwhile, ax.set_xticklabels sets the strings that actually get printed to the X-axis for each tick.
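If the split between those two functions feels fuzzy, this sketch separates them using a few made-up values standing in for dropout_df (the years and rates lists are assumptions for illustration only).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

years = [2007, 2008, 2009, 2010]   # stand-in for dropout_df.columns
rates = [22.2, 22.7, 20.6, 18.6]   # made-up bar heights

ax = plt.subplot(1, 1, 1)
ax.bar(years, rates)
ax.set_xticks(years)                                # WHERE the ticks sit
ax.set_xticklabels([str(year) for year in years])   # WHAT text each tick shows

tick_positions = [int(pos) for pos in ax.get_xticks()]
tick_text = [label.get_text() for label in ax.get_xticklabels()]
plt.close("all")  # in a notebook, call plt.show() here instead
```

Because the labels are just strings, you could swap them for something like "'07" without moving the ticks at all.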

It doesn’t look quite right though. As a viewer of this data, I’d expect that the years would be centered underneath the bars they correspond to, but that’s not what we have here. Luckily, ax.bar has an argument align which can be set as either “edge” (the default in the matplotlib version used here, meaning the left edge) or “center.”

In [18]: ax = plt.subplot(1, 1, 1)
         ax.bar(dropout_df.columns, dropout_df.loc["All Students"], align="center")
         ax.set_xticks(dropout_df.columns)
         ax.set_xticklabels(dropout_df.columns)
         plt.show()
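To see what align actually changes, this sketch draws one bar each way and reads back where its left side landed. I’m assuming a matplotlib release that spells the non-centered option "edge"; the variable names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

ax = plt.subplot(1, 1, 1)
width = 0.8

# align="center": x is the middle of the bar, so the left side is x - width / 2
centered = ax.bar([1], [5], width=width, align="center")

# align="edge": x is the left side of the bar itself
edged = ax.bar([1], [5], width=width, align="edge")

centered_left = centered[0].get_x()   # left side of the first rectangle
edged_left = edged[0].get_x()
plt.close("all")
```

With x = 1 and width = 0.8, the centered bar starts at about 0.6 while the edge-aligned bar starts at 1.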

This is good enough for now. To set the labels on the axes themselves we’ll use ax.set_xlabel() and ax.set_ylabel().

In [19]: ax = plt.subplot(1, 1, 1)
         ax.bar(dropout_df.columns, dropout_df.loc["All Students"], align="center")
         ax.set_xticks(dropout_df.columns)
         ax.set_xticklabels(dropout_df.columns)
         ax.set_xlabel("Year")
         ax.set_ylabel("Dropout Rate (%)")
         plt.show()

Adding a Second Data Series

We’ll eventually want to put all (or at least most) of our demographics on one chart. Let’s just try to add a second one right now.

In [20]: ax = plt.subplot(1, 1, 1)
         ax.bar(dropout_df.columns, dropout_df.loc["All Students"], align="center")
         ax.bar(dropout_df.columns, dropout_df.loc["Black"], align="center", color="red")
         ax.set_xticks(dropout_df.columns)
         ax.set_xticklabels(dropout_df.columns)
         ax.set_xlabel("Year")
         ax.set_ylabel("Dropout Rate (%)")
         plt.show()

Crap. matplotlib isn’t intelligent enough to shimmy the next dataset into a different position so that we can view both. We’ve gotta reach in and do that ourselves.

Recall that unless align is set to center, the first argument passed to ax.bar sets the left edge of each bar. Looking back at the documentation for this function, it can also take a third argument called width. We can use these two together to plot two bar series next to each other. The default width is 0.8. Let’s set each series’ width to half of that. Let’s then position one series slightly to the left of the tick marks, with the other having its left edge right AT the tick marks.

In [21]: width = 0.4

In [22]: ax = plt.subplot(1, 1, 1)
         ax.bar(dropout_df.columns - width, dropout_df.loc["All Students"], width=width)
         ax.bar(dropout_df.columns, dropout_df.loc["Black"], width=width, color="red")
         ax.set_xticks(dropout_df.columns)
         ax.set_xticklabels(dropout_df.columns)
         ax.set_xlabel("Year")
         ax.set_ylabel("Dropout Rate (%)")
         plt.show()

Despite the super depressing information portrayed by this graph, this looks a ton better. We can’t really tell what’s being shown by each bar series though, so let’s add a legend to the graph. To make it really work in our favor, we add a label to each call of ax.bar describing what it is. ax.legend will pick up on these labels and render them as we need!

In [23]: ax = plt.subplot(1, 1, 1)
         ax.bar(dropout_df.columns - width, dropout_df.loc["All Students"],
                width=width, label="All Students")
         ax.bar(dropout_df.columns, dropout_df.loc["Black"], width=width,
                color="red", label="Black")

         ax.set_xticks(dropout_df.columns)
         ax.set_xticklabels(dropout_df.columns)
         ax.set_xlabel("Year")
         ax.set_ylabel("Dropout Rate (%)")
         ax.legend()
         plt.show()

matplotlib will do its best to move the legend out of the way of the data, but if you want it displayed in some other way than what it chose, look into the docs about how to use the loc parameter.
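As a quick sketch of the loc parameter, here’s a two-series chart with the legend pinned to the upper right. The data and the "Series A"/"Series B" labels are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

years = [2007, 2008, 2009]
width = 0.4

ax = plt.subplot(1, 1, 1)
ax.bar([year - width for year in years], [22.2, 22.7, 20.6],
       width=width, align="edge", label="Series A")
ax.bar(years, [30.0, 29.0, 27.5],
       width=width, align="edge", color="red", label="Series B")

# loc accepts strings such as "best" (the default), "upper right", "lower left"
legend = ax.legend(loc="upper right")
legend_labels = [text.get_text() for text in legend.get_texts()]
plt.close("all")  # in a notebook, call plt.show() here instead
```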

Woo! Information! It’s getting clearer by the second! Let’s reach for the stars and add a third.

This Is Not Sustainable

To add a third series and still fit it within the bounds we have set, we’re going to have to get even more hacky than before. We want the first series to the left, the second one centered on the tick mark, and the third to the right. We have to change the width of each bar again to accommodate this new player and not overlap data from different years. If we try to do it the same way we did above, it’s going to look weird.

In [24]: width = 0.3

In [25]: ax = plt.subplot(1, 1, 1)
         ax.bar(dropout_df.columns - width, dropout_df.loc["All Students"],
                width=width, label="All Students")
         ax.bar(dropout_df.columns, dropout_df.loc["Black"], width=width,
                color="red", label="Black")
         ax.bar(dropout_df.columns + width, dropout_df.loc["Hispanic"], width=width,
                color="green", label="Hispanic")

         ax.set_xticks(dropout_df.columns)
         ax.set_xticklabels(dropout_df.columns)
         ax.set_xlabel("Year")
         ax.set_ylabel("Dropout Rate (%)")
         ax.legend()
         plt.show()

A rule to keep in mind here is that an even number of bar series gets aligned to the left, while an odd number gets aligned to the center.

In [26]: width = 0.3

In [27]: ax = plt.subplot(1, 1, 1)
         ax.bar(dropout_df.columns - width, dropout_df.loc["All Students"],
                width=width, label="All Students", align="center")
         ax.bar(dropout_df.columns, dropout_df.loc["Black"], width=width,
                color="red", label="Black", align="center")
         ax.bar(dropout_df.columns + width, dropout_df.loc["Hispanic"], width=width,
                color="green", label="Hispanic", align="center")

         ax.set_xticks(dropout_df.columns)
         ax.set_xticklabels(dropout_df.columns)
         ax.set_xlabel("Year")
         ax.set_ylabel("Dropout Rate (%)")
         ax.legend()
         plt.show()

Ok so this works...for now. But what happens when we want to add the rest of our data? Not including the composite Asian/Pacific Islander category, we have 9 separate series of data to shove into this one tiny chart! We shouldn’t have to keep manually recalculating the appropriate width of each bar, then decide whether to center them or align them to the left, while also choosing a new color as if each is a special snowflake. Let’s write a function to do that!

Before we get to that function though, let’s first increase the size of our plot space so that we don’t make a graph for ants. Our old friend plt has a function called figure that will handle that. It has a parameter called figsize which takes an iterable of two numbers, (width, height). Each is meant to be measured in inches.

In [28]: plt.figure(figsize=(14, 4))
         ax = plt.subplot(1, 1, 1)
         ax.bar(dropout_df.columns - width, dropout_df.loc["All Students"],
                width=width, label="All Students", align="center")
         ax.bar(dropout_df.columns, dropout_df.loc["Black"], width=width,
                color="red", label="Black", align="center")
         ax.bar(dropout_df.columns + width, dropout_df.loc["Hispanic"], width=width,
                color="green", label="Hispanic", align="center")

         ax.set_xticks(dropout_df.columns)
         ax.set_xticklabels(dropout_df.columns)
         ax.set_xlabel("Year")
         ax.set_ylabel("Dropout Rate (%)")
         ax.legend()
         plt.show()

Wow that does look so much better. We could scale up the sizes of the numbers on the actual axis labels, but that goes a little deeper than I want for this walkthrough.
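If you do want to go that little bit deeper, here’s one possible sketch. It assumes ax.tick_params with labelsize for the tick numbers and the fontsize keyword on the label setters; the sizes and data are arbitrary picks of mine.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(14, 4))           # (width, height) in inches
ax = plt.subplot(1, 1, 1)
ax.bar([2007, 2008, 2009], [22.2, 22.7, 20.6], align="center")  # made-up data

ax.tick_params(axis="both", labelsize=14)   # numbers printed along each axis
ax.set_xlabel("Year", fontsize=16)          # the axis titles themselves
ax.set_ylabel("Dropout Rate (%)", fontsize=16)

fig_size = tuple(fig.get_size_inches())
plt.close("all")  # in a notebook, call plt.show() here instead
```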

Let’s Get Cleverer...er

Alright, let’s encapsulate all of this plotting code into a function. This function will generate the chart we need with the proper axis labels, title, scaling, and everything else we could want.

def plot_aggregate_data(dataframe,
                        demographics=["All Students", "Black", "Hispanic"],
                        title=None,
                        figsize=(14, 4),
                        yaxis_label=None,
                        ylims=None):

    width = (1.0 - 0.1) / len(demographics) # adjust the width depending
                                            # on how many things we graph

    plt.figure(figsize=figsize)
    ax = plt.subplot(1, 1, 1)
    ax.bar(dataframe.columns - width, dataframe.loc["All Students"],
        width=width, color="blue", label="All Students", align="center")
    ax.bar(dataframe.columns, dataframe.loc["Black"],
        width=width, color="red", label="Black", align="center")
    ax.bar(dataframe.columns + width, dataframe.loc["Hispanic"],
        width=width, color="green", label="Hispanic", align="center")

    ax.set_xticks(dataframe.columns)
    ax.set_xticklabels(dataframe.columns)
    ax.set_xlabel("Year")
    ax.set_ylabel("Dropout Rate (%)")
    ax.legend()
    plt.show()

Let’s decide what colors we want each demographic to be:

# inside the plot_aggregate_data function

colors = {
    "All Students": "#000000",     # black
    "Black": "#FF0000",            # red
    "Hispanic": "#00FF00",         # green
    "First Nations": "#FFFF00",    # yellow
    "Asian": "#0000FF",            # blue
    "Pacific Islander": "#FF9000", # orange
    "White": "#FF00FF",            # purple
}

All those ax.bar calls look effectively the same with some minor changes between each one. Perhaps we can just write the line once and loop over it.

# inside again

plt.figure(figsize=figsize)
ax = plt.subplot(1, 1, 1)

for demo in demographics:
    ax.bar(dataframe.columns, dataframe.loc[demo],
        width=width, color=colors[demo], label=demo, align="center")

ax.set_xticks(dataframe.columns)
ax.set_xticklabels(dataframe.columns)
ax.set_xlabel("Year")
ax.set_ylabel("Dropout Rate (%)")
ax.legend()
plt.show()

We must do something about those offsets from the center though! We decided that when we have an odd number of things to graph, we use align="center". Otherwise we align each bar by its left edge, which was the default in the matplotlib version used here.

# inside yet again

plt.figure(figsize=figsize)
ax = plt.subplot(1, 1, 1)

if len(demographics) % 2:
    for demo in demographics:
        ax.bar(dataframe.columns, dataframe.loc[demo],
            width=width, color=colors[demo], label=demo, align="center")

else:
    for demo in demographics:
        ax.bar(dataframe.columns, dataframe.loc[demo],
            width=width, color=colors[demo], label=demo)

ax.set_xticks(dataframe.columns)
ax.set_xticklabels(dataframe.columns)
ax.set_xlabel("Year")
ax.set_ylabel("Dropout Rate (%)")
ax.legend()
plt.show()

What about where each series starts? That was dependent on the width, right? When there was an odd number of series,

  • For 3 series it was align="center" with center - width, center, and center + width as bar positions
  • For 5 it should be center - (2 * width), center - width, center, center + width, center + (2 * width)
  • For N series it should be center + (idx - N // 2) * width for idx in range(0, N)

(Integer division is the safe choice here: Python 3’s round() rounds halves to the nearest even number, so round(5 / 2) is 2, not 3.)

For an even number of series,

  • For 2 series it was left-edge alignment with left - width and left
  • For 4 it should be left - 2 * width, left - width, left, left + width
  • For N series it should be left + (idx - N / 2) * width for idx in range(0, N)
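The two rules above can be sanity-checked without drawing anything. This is a small sketch with a hypothetical helper, bar_offsets, that returns the offsets (in units of width) plus the alignment to use. It uses integer division rather than round(), because Python 3’s round() rounds halves to the nearest even number (round(2.5) is 2).

```python
def bar_offsets(n_series, width):
    """Return (offsets, align) for n_series bars grouped around one tick."""
    # Odd counts center the middle bar on the tick; even counts put a
    # bar's left edge on the tick.
    align = "center" if n_series % 2 else "edge"
    offsets = [(idx - n_series // 2) * width for idx in range(n_series)]
    return offsets, align


three, align_three = bar_offsets(3, 1)
four, align_four = bar_offsets(4, 1)
five, align_five = bar_offsets(5, 1)

print(three, align_three)  # [-1, 0, 1] center
print(four, align_four)    # [-2, -1, 0, 1] edge
print(five, align_five)    # [-2, -1, 0, 1, 2] center
```

With a width of 1 the offsets read directly as "how many bar widths away from the tick", which matches the bullet lists for 3, 4, and 5 series.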

Let’s work these into our function.

# inside yet again

n_items = len(demographics)

plt.figure(figsize=figsize)
ax = plt.subplot(1, 1, 1)

if n_items % 2:
    for idx, demo in enumerate(demographics):               # changed
        # integer division keeps the middle series on the tick for any odd count
        ax.bar(dataframe.columns + (idx - n_items // 2) * width,
            dataframe.loc[demo], width=width,
            color=colors[demo], label=demo, align="center")

else:
    for idx, demo in enumerate(demographics):               # changed
        # align="edge" spells out left-edge alignment explicitly
        ax.bar(dataframe.columns + (idx - n_items / 2) * width,
            dataframe.loc[demo], width=width,
            color=colors[demo], label=demo, align="edge")

ax.set_xticks(dataframe.columns)
ax.set_xticklabels(dataframe.columns)
ax.set_xlabel("Year")
ax.set_ylabel("Dropout Rate (%)")
ax.legend()
plt.show()

Finally, we turn the y-axis label into a parameter and add a title to our plot. We also add ax.grid() so that when we’re looking at tons of bars we can tell what values they reach, and we handle the case where we want to adjust the vertical axis with ylims.

def plot_aggregate_data(dataframe,
                        demographics=["All Students", "Black", "Hispanic"],
                        title=None,
                        figsize=(14, 4),
                        yaxis_label=None,
                        ylims=None):
    # the final product
    colors = {
        "All Students": "#000000",     # black
        "Black": "#FF0000",            # red
        "Hispanic": "#00FF00",         # green
        "First Nations": "#FFFF00",    # yellow
        "Asian": "#0000FF",            # blue
        "Pacific Islander": "#FF9000", # orange
        "White": "#FF00FF",            # purple
    }

    n_items = len(demographics)                                 # added
    width = (1.0 - 0.1) / n_items                               # changed

    plt.figure(figsize=figsize)
    ax = plt.subplot(1, 1, 1)
    if title:                                                   # added
        ax.set_title(title)                                     # added

    if n_items % 2:
        for idx, demo in enumerate(demographics):               # changed
            # integer division keeps the middle series on the tick for any odd count
            ax.bar(dataframe.columns + (idx - n_items // 2) * width,
                dataframe.loc[demo], width=width,
                color=colors[demo], label=demo, align="center")

    else:
        for idx, demo in enumerate(demographics):               # changed
            # align="edge" spells out left-edge alignment explicitly
            ax.bar(dataframe.columns + (idx - n_items / 2) * width,
                dataframe.loc[demo], width=width,
                color=colors[demo], label=demo, align="edge")

    ax.set_xticks(dataframe.columns)
    ax.set_xticklabels(dataframe.columns)
    ax.set_xlabel("Year")
    ax.set_ylabel(yaxis_label)                                  # changed
    if ylims:                                                   # added
        ax.set_ylim(ylims)
    ax.legend()
    ax.grid()
    plt.show()

This is a beast of a function, but it does the job. And we can use it to look at our data for as many different combinations of demographics as we want. If we want to tweak anything about how we plotted information, we just retool this function a bit.
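As one example of retooling, here’s a sketch of a slimmer variant. To be clear, this is not the function above: plot_grouped_bars, fake_df, and the "Group A" through "Group D" labels are all hypothetical, the numbers are made up, and colors are left to matplotlib’s default cycle. It swaps the odd/even branching for a single offset formula that always uses align="center".

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers shaped like dropout_df: one row per group, one column per year.
fake_df = pd.DataFrame(
    [[20.0, 18.0, 15.0], [25.0, 22.0, 19.0],
     [23.0, 21.0, 17.0], [12.0, 11.0, 10.0]],
    index=["Group A", "Group B", "Group C", "Group D"],
    columns=[2013, 2014, 2015],
)


def plot_grouped_bars(dataframe, groups, title=None, figsize=(14, 4),
                      yaxis_label=None, ylims=None):
    """Hypothetical variant of plot_aggregate_data with one offset formula."""
    n_items = len(groups)
    width = (1.0 - 0.1) / n_items
    years = dataframe.columns.to_numpy(dtype=float)

    plt.figure(figsize=figsize)
    ax = plt.subplot(1, 1, 1)
    for idx, group in enumerate(groups):
        # With align="center", shifting by (idx - (n_items - 1) / 2) widths
        # centers the whole cluster on the tick for odd and even counts alike.
        ax.bar(years + (idx - (n_items - 1) / 2) * width,
               dataframe.loc[group], width=width, label=group, align="center")

    ax.set_xticks(years)
    ax.set_xticklabels([str(year) for year in dataframe.columns])
    ax.set_xlabel("Year")
    if yaxis_label:
        ax.set_ylabel(yaxis_label)
    if title:
        ax.set_title(title)
    if ylims:
        ax.set_ylim(ylims)
    ax.legend()
    ax.grid()
    return ax


ax = plot_grouped_bars(fake_df, ["Group A", "Group B", "Group C", "Group D"],
                       title="Made-up rates", yaxis_label="Rate (%)",
                       ylims=(0, 30))
n_bars = len(ax.patches)   # one rectangle per group per year
plt.close("all")           # in a notebook, call plt.show() here instead
```

Returning ax instead of calling plt.show() inside the function is a design choice on my part: it lets the caller keep tweaking the chart before displaying it.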

Now that we have our visuals, what now?

Next Up: What’s Next?