Visualize the Results
Tables of numbers are great. I for one love numbers. However, most people don’t care to see columns and columns upon rows and rows of data. Even if they do (like me), it’s hard to pick out the big picture ideas that the data can convey.
Enter Visualization
Visualizing data involves, to some degree, creating quantified art. We visualize to distill overall trends in data into something that can be grasped by looking at it. Want to show something increasing over time? Present a line going up from left to right. Want to show variety in data? Maybe a scatterplot with colors representing different species in the data is what you’d like. Make art represent life here.
Python has many visualization libraries that all do effectively the same thing. Our workhorse for today will be Matplotlib (docs).
The Matplotlib API is a bit...nightmarish. In some spots it’s completely nonsensical. Instead of trying to parse everything you read in the documentation, find examples of what matplotlib can do and reproduce them yourself.
What we’re going to be trying to do today is make bar charts of our graduation and dropout rates over time. These will be two separate figures, with each demographic as a different color for each year. We’ll work up to that. Let’s first make A graph. Then ONE bar chart for ONE demographic in our data. Finally, ALL OF IT!
Getting the Feet Wet
We import matplotlib into our notebook at the top with our other imports like so:
In [1]: import matplotlib.pyplot as plt
In [2]: %matplotlib inline
matplotlib is truly a massive library, including functionality for doing things like making maps and 3D graphics. We don’t want all of that. All we need is the pyplot functionality, so we just import that and alias it. We’ll be using a great variety of functions from pyplot so this alias will come in handy.
That next line, %matplotlib inline, is what we call IPython magic. Typically, when we create a graph with Matplotlib, it pops open that graph in a new window and cuts off your ability to interact with your code. That’s quite inconvenient for several reasons:
- We should be able to iterate on code in this notebook, and not be chained to what we wrote
- We will want to make more than one figure
- No one likes a pop-up!
With %matplotlib inline, each graph is instead rendered in the notebook, right below the cell that produced it.
First Graph
Before we get to the data we want, let’s see how matplotlib works. For this let’s make fake data.
In [3]: x = list(range(20))
In [4]: y = [num ** 2 for num in x]
With this fake data we’ll make a line graph. x will be on the horizontal “X-axis”, and y will be on the vertical “Y-axis”. To make a line graph with matplotlib we use plt.plot:
In [6]: plt.plot(x, y)
plt.show()
And right below our cell a graph is rendered. The range of numbers that the graph goes through is automatically populated based on inputs, and the color of the line is chosen by default.
If I wanted to put a second line on the same graph, I’d use a second plt.plot statement.
In [7]: y2 = [400 - num for num in y]
In [8]: plt.plot(x, y)
plt.plot(x, y2)
plt.show()
And now our second line shows up in a different default color (green in the Matplotlib version used here; orange in Matplotlib 2.0 and later).
Note that the first two arguments supplied to plt.plot must have the same size. They don’t have to be the same data type, but they do have to be iterables containing numerical info.
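A quick way to see the size requirement in action is to hand plt.plot two lists of different lengths. This is only a minimal sketch: the Agg backend line is an assumption so that the snippet also runs outside a notebook, and too_short is a made-up name for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

x = list(range(20))                          # 20 values
too_short = [num ** 2 for num in range(19)]  # only 19 values

try:
    plt.plot(x, too_short)
    mismatch_error = None
except ValueError as error:
    # Matplotlib rejects x and y sequences whose lengths differ
    mismatch_error = error
    print("Could not plot:", error)
```

If the two lists had the same length, the call would succeed and mismatch_error would stay None.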
Let’s say that we don’t want our second line to keep its default color; we want it to be red. We also want it to be dashed instead of solid.
In [9]: plt.plot(x, y)
plt.plot(x, y2, linestyle="--", color="red")
plt.show()
The linestyle property takes one of a few choices for changing the way lines look. The color property takes words corresponding to a limited set of colors, as well as hex values like #FF0000.
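To get a feel for those choices, here is a small sketch that draws the same curve four times, each with a different linestyle and color. The Agg backend line and the 50-unit vertical shift are assumptions added for illustration; the four linestyle strings are the standard short codes for solid, dashed, dotted, and dash-dot lines.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

x = list(range(20))
y = [num ** 2 for num in x]

# (linestyle, color) pairs: named colors and hex values both work
styles = [("-", "blue"), ("--", "red"), (":", "green"), ("-.", "#FF9000")]

lines = []
for shift, (linestyle, color) in enumerate(styles):
    # shift each copy upward so the lines do not sit on top of each other
    shifted = [value + 50 * shift for value in y]
    line, = plt.plot(x, shifted, linestyle=linestyle, color=color)
    lines.append(line)
```

In a notebook, finishing the cell with plt.show() renders all four lines on one graph.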
What’s a graph without labels? Labels give a graph meaning and context. Every graph you generate should have labels, maybe even a title. Let’s put both on ours.
In [10]: plt.plot(x, y)
plt.plot(x, y2, linestyle="--", color="red")
plt.title("My First Graph")
plt.xlabel("Some data")
plt.ylabel("Some other data")
plt.show()
Ain’t she a beaut?
I don’t like that the graph stops before it hits the right edge, nor that the lines sit flush against the top and bottom edges of this graph. I want to change the limits along each axis.
In [11]: plt.plot(x, y)
plt.plot(x, y2, linestyle="--", color="red")
plt.title("My First Graph")
plt.xlabel("Some data")
plt.ylabel("Some other data")
plt.xlim(0, 19)
plt.ylim(-10, 410)
plt.show()
Don’t you love it when the world is your playground?
Let’s move on to bars.
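Bar Starz
The Single Data Series
To create a bar chart with matplotlib we use the bar() function. The first argument will set the left edge for each bar, while the second argument sets the height of each bar. We’ll just concern ourselves with that for now. (This is the behavior of the Matplotlib version used here; in Matplotlib 2.0 and later, bars are centered on the first argument by default.)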
In [11]: plt.bar?
Make a bar plot.
Make a bar plot with rectangles bounded by:
`left`, `left` + `width`, `bottom`, `bottom` + `height`
(left, right, bottom and top edges)
Parameters
----------
left : sequence of scalars
the x coordinates of the left sides of the bars
height : sequence of scalars
the heights of the bars
width : scalar or array-like, optional
the width(s) of the bars
default: 0.8
...
For the data that we’ve aggregated into dropout_df and graduate_df, our left-edge positions will correspond to the year of data we’re looking at, while the height will be the values we’ve calculated for one demographic. So, let’s look at the dropout rates for all students between 2007 and 2015.
In [12]: print(dropout_df.columns) # our years
Int64Index([2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015], dtype='int64')
In [13]: print(dropout_df.loc["All Students"])
2007 22.210186
2008 22.667685
2009 20.565407
2010 18.577791
2011 13.933354
2012 13.569043
2013 12.957700
2014 12.309690
2015 11.865464
Name: All Students, dtype: float64
In [14]: plt.bar(dropout_df.columns, dropout_df.loc["All Students"])
plt.show()
...yay? Bleh.
matplotlib has this annoying habit of thinking for you. When it comes to looking at small differences in large numbers, it tries to default to scientific notation in a really strange way. We don’t want that, so we’re going to need to dig a little deeper into the pyplot module.
pyplot has a function called subplot. What this function does is set up a series of smaller canvases inside of your plotting space corresponding to the number of subplots you want. You specify that with three arguments: number of rows, number of columns, and current frame. Calling plt.subplot(1, 2, 1) means that you want to create a plotting space that has one row and two columns, and that the plotting code you write next will produce a graph within the first window (not zero-indexed).
The subplot function also returns an object that acts A LOT like pyplot, but allows for far more control over your plotting space.
In [15]: ax = plt.subplot(1, 1, 1)
In [16]: type(ax)
matplotlib.axes._subplots.AxesSubplot
In fact, this object includes most of the same plotting functions as pyplot. So, along with the added functionality we can use the ax object almost exactly like we were using plt. The one difference is that ax doesn’t have its own show() function. You can still use it without, but you’ll get a lot of output you don’t care to see from matplotlib. So, we’ll tack plt.show() on the end of any chain of functions we use from ax.
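To make the rows, columns, and frame arguments concrete, here is a sketch that reuses the fake data from earlier and puts each line in its own frame of a one-row, two-column grid. The Agg backend line and the names left_ax and right_ax are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

x = list(range(20))
y = [num ** 2 for num in x]
y2 = [400 - num for num in y]

left_ax = plt.subplot(1, 2, 1)   # one row, two columns, first frame
left_ax.plot(x, y)
left_ax.set_title("Rising")

right_ax = plt.subplot(1, 2, 2)  # same grid, second frame
right_ax.plot(x, y2, linestyle="--", color="red")
right_ax.set_title("Falling")
```

Each returned object only draws inside its own frame, which is what gives us finer control than calling plt functions directly.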
Let’s directly change the labels on individual tickmarks on our X-axis to be more visually appealing. We set it up like so:
In [17]: ax = plt.subplot(1, 1, 1)
ax.bar(dropout_df.columns, dropout_df.loc["All Students"])
ax.set_xticks(dropout_df.columns)
ax.set_xticklabels(dropout_df.columns)
plt.show()
Ok, so that’s a little better. ax.set_xticks will ensure that our tick marks correspond to the numerical data that we want (i.e. years). Meanwhile, ax.set_xticklabels sets the strings that actually get printed to the X-axis for each tick.
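The difference between the two calls is easier to see when the labels are not just the numbers themselves. The sketch below assumes we wanted two-digit year labels; the rates are the "All Students" values printed above, rounded to one decimal, and the Agg backend line is an assumption so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

years = list(range(2007, 2016))
# "All Students" dropout rates from the output above, rounded to one decimal
rates = [22.2, 22.7, 20.6, 18.6, 13.9, 13.6, 13.0, 12.3, 11.9]

ax = plt.subplot(1, 1, 1)
ax.bar(years, rates)
ax.set_xticks(years)  # numeric positions of the tick marks
# text drawn at each tick, e.g. '07 for 2007
ax.set_xticklabels(["'" + str(year)[2:] for year in years])

shown = [label.get_text() for label in ax.get_xticklabels()]
print(shown)
```

The tick positions stay numeric (2007 through 2015), while the printed labels can be any strings we like.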
It doesn’t look quite right though. As a viewer of this data, I’d expect that the years would be centered underneath the bars they correspond to, but that’s not what we have here. Luckily, ax.bar has an argument align which can be set as either “edge” (the default here, which lines up the left edge of each bar with its position), or “center.”
In [18]: ax = plt.subplot(1, 1, 1)
ax.bar(dropout_df.columns, dropout_df.loc["All Students"], align="center")
ax.set_xticks(dropout_df.columns)
ax.set_xticklabels(dropout_df.columns)
plt.show()
This is good enough for now. To set the labels on the axes themselves we’ll use ax.set_xlabel() and ax.set_ylabel().
In [19]: ax = plt.subplot(1, 1, 1)
ax.bar(dropout_df.columns, dropout_df.loc["All Students"], align="center")
ax.set_xticks(dropout_df.columns)
ax.set_xticklabels(dropout_df.columns)
ax.set_xlabel("Year")
ax.set_ylabel("Dropout Rate (%)")
plt.show()
Adding a Second Data Series
We’ll eventually want to put all (or at least most) of our demographics on one chart. Let’s just try to add a second one right now.
In [20]: ax = plt.subplot(1, 1, 1)
ax.bar(dropout_df.columns, dropout_df.loc["All Students"], align="center")
ax.bar(dropout_df.columns, dropout_df.loc["Black"], align="center", color="red")
ax.set_xticks(dropout_df.columns)
ax.set_xticklabels(dropout_df.columns)
ax.set_xlabel("Year")
ax.set_ylabel("Dropout Rate (%)")
plt.show()
Crap. matplotlib isn’t intelligent enough to shimmy the next dataset into a different position so that we can view both. We’ve gotta reach in and do that ourselves.
Recall that unless align is set to center, the first argument passed to ax.bar sets the left edge of each bar. Looking back at the documentation for this function, it can also take a third argument called width. We can use these two together to plot two bar series next to each other.
Default width is 0.8. Let’s set each series width to half of that. Let’s then position one series slightly to the left of the tick marks, with the other having its left edge right AT the tick marks.
In [21]: width = 0.4
In [22]: ax = plt.subplot(1, 1, 1)
ax.bar(dropout_df.columns - width, dropout_df.loc["All Students"], width=width)
ax.bar(dropout_df.columns, dropout_df.loc["Black"], width=width, color="red")
ax.set_xticks(dropout_df.columns)
ax.set_xticklabels(dropout_df.columns)
ax.set_xlabel("Year")
ax.set_ylabel("Dropout Rate (%)")
plt.show()
Despite the super depressing information portrayed by this graph, this looks a ton better.
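To check the geometry of that side-by-side layout, here is a sketch with a three-year slice of data. The second series uses made-up values purely for illustration, the Agg backend line is an assumption, and align="edge" is passed explicitly so the left-edge behavior described above holds regardless of Matplotlib version.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

years = [2007, 2008, 2009]
first_series = [22.2, 22.7, 20.6]   # "All Students" rates from above, rounded
second_series = [30.0, 29.0, 27.0]  # made-up values, for illustration only
width = 0.4

ax = plt.subplot(1, 1, 1)
# align="edge" spells out the left-edge positioning this walkthrough relies on
left_bars = ax.bar([year - width for year in years], first_series,
                   width=width, align="edge")
right_bars = ax.bar(years, second_series, width=width, align="edge", color="red")

for left_bar, right_bar, year in zip(left_bars, right_bars, years):
    # the first bar should end where the second begins: at the tick mark
    print(year, left_bar.get_x() + left_bar.get_width(), right_bar.get_x())
```

Each pair of bars meets at the year's tick mark, so the group as a whole is centered on it.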
We can’t really tell what’s being shown by each bar series though, so let’s add a legend to the graph. To make it really work in our favor, we add a label to each call of ax.bar describing what it is. ax.legend will pick up on these labels and render them as we need!
In [23]: ax = plt.subplot(1, 1, 1)
ax.bar(dropout_df.columns - width, dropout_df.loc["All Students"],
width=width, label="All Students")
ax.bar(dropout_df.columns, dropout_df.loc["Black"], width=width,
color="red", label="Black")
ax.set_xticks(dropout_df.columns)
ax.set_xticklabels(dropout_df.columns)
ax.set_xlabel("Year")
ax.set_ylabel("Dropout Rate (%)")
ax.legend()
plt.show()
matplotlib will do its best to move the legend out of the way of the data, but if you want it displayed in some other way than what it chose, look into the docs about how to use the loc parameter.
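As a small sketch of the loc parameter, the snippet below pins the legend to the upper right corner. The bar values and the two series labels are made up for illustration, and the Agg backend line is an assumption; other location strings such as "upper left", "lower right", or "best" work the same way.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

ax = plt.subplot(1, 1, 1)
# made-up values and labels, for illustration only
ax.bar([2007, 2008], [22.2, 22.7], width=0.4, label="First series")
ax.bar([2007.4, 2008.4], [30.0, 29.0], width=0.4, color="red", label="Second series")

legend = ax.legend(loc="upper right")  # place the legend explicitly
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```

The legend entries appear in the order the bar series were drawn.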
Woo! Information! It’s getting clearer by the second! Let’s reach for the stars and add a third.
This Is Not Sustainable
To add a third series and still fit it within the bounds we have set, we’re going to have to get even more hacky than before. We want the first series to the left, the second one centered on the tick mark, and the third to the right. We have to change the width of each bar again to accommodate this new player and not overlap data from different years. If we try to do it the same way we did above, it’s going to look weird.
In [24]: width = 0.3
In [25]: ax = plt.subplot(1, 1, 1)
ax.bar(dropout_df.columns - width, dropout_df.loc["All Students"],
width=width, label="All Students")
ax.bar(dropout_df.columns, dropout_df.loc["Black"], width=width,
color="red", label="Black")
ax.bar(dropout_df.columns + width, dropout_df.loc["Hispanic"], width=width,
color="green", label="Hispanic")
ax.set_xticks(dropout_df.columns)
ax.set_xticklabels(dropout_df.columns)
ax.set_xlabel("Year")
ax.set_ylabel("Dropout Rate (%)")
ax.legend()
plt.show()
A rule to keep in mind here is that an even number of bar series gets aligned to the left, while an odd number gets aligned to the center.
In [26]: width = 0.3
In [27]: ax = plt.subplot(1, 1, 1)
ax.bar(dropout_df.columns - width, dropout_df.loc["All Students"],
width=width, label="All Students", align="center")
ax.bar(dropout_df.columns, dropout_df.loc["Black"], width=width,
color="red", label="Black", align="center")
ax.bar(dropout_df.columns + width, dropout_df.loc["Hispanic"], width=width,
color="green", label="Hispanic", align="center")
ax.set_xticks(dropout_df.columns)
ax.set_xticklabels(dropout_df.columns)
ax.set_xlabel("Year")
ax.set_ylabel("Dropout Rate (%)")
ax.legend()
plt.show()
Ok so this works...for now. But what happens when we want to add the rest of our data? Not including the composite Asian/Pacific Islander category, we have 9 separate series of data to shove into this one tiny chart! We shouldn’t have to keep manually recalculating the appropriate width of each bar, then decide whether to center them or align them to the left, while also choosing a new color as if each is a special snowflake. Let’s write a function to do that!
Before we get to that function though, let’s first increase the size of our plot space so that we don’t make a graph for ants. Our old friend plt has a function called figure that will handle that. It has a parameter called figsize which takes an iterable of two numbers, (width, height). Each is meant to be measured in inches.
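One way to confirm what figsize does is to ask the figure for its size back. This is a minimal sketch; the Agg backend line is an assumption so it runs outside a notebook, and the pixel calculation simply multiplies the size in inches by the figure's dots-per-inch setting.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(14, 4))  # (width, height) in inches

width_in, height_in = fig.get_size_inches()
# size in pixels depends on the figure's resolution (dots per inch)
width_px, height_px = fig.get_size_inches() * fig.dpi

print("inches:", width_in, height_in)
print("pixels:", width_px, height_px)
```

A wide, short figure like this one leaves room for many bars along the X-axis.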
In [28]: plt.figure(figsize=(14, 4))
ax = plt.subplot(1, 1, 1)
ax.bar(dropout_df.columns - width, dropout_df.loc["All Students"],
width=width, label="All Students", align="center")
ax.bar(dropout_df.columns, dropout_df.loc["Black"], width=width,
color="red", label="Black", align="center")
ax.bar(dropout_df.columns + width, dropout_df.loc["Hispanic"], width=width,
color="green", label="Hispanic", align="center")
ax.set_xticks(dropout_df.columns)
ax.set_xticklabels(dropout_df.columns)
ax.set_xlabel("Year")
ax.set_ylabel("Dropout Rate (%)")
ax.legend()
plt.show()
Wow that does look so much better. We could scale up the sizes of the numbers on the actual axis labels, but that goes a little deeper than I want for this walkthrough.
Let’s Get Cleverer...er
Alright, let’s encapsulate all of this plotting code into a function. This function will generate the chart we need with the proper axis labels, title, scaling, and everything else we could want.
def plot_aggregate_data(dataframe,
                        demographics=["All Students", "Black", "Hispanic"],
                        title=None,
                        figsize=(14, 4),
                        yaxis_label=None,
                        ylims=None):
    width = (1.0 - 0.1) / len(demographics) # adjust the width depending
                                            # on how many things we graph
    plt.figure(figsize=figsize)
    ax = plt.subplot(1, 1, 1)
    ax.bar(dataframe.columns - width, dataframe.loc["All Students"],
           width=width, color="blue", label="All Students", align="center")
    ax.bar(dataframe.columns, dataframe.loc["Black"],
           width=width, color="red", label="Black", align="center")
    ax.bar(dataframe.columns + width, dataframe.loc["Hispanic"],
           width=width, color="green", label="Hispanic", align="center")
    ax.set_xticks(dataframe.columns)
    ax.set_xticklabels(dataframe.columns)
    ax.set_xlabel("Year")
    ax.set_ylabel("Dropout Rate (%)")
    ax.legend()
    plt.show()
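The width calculation in the function above reserves a tenth of the one-year spacing as a gap and splits the rest evenly among the bars. Here is a small sketch of that arithmetic on its own, where bar_width and gap are hypothetical names introduced only for this example:

```python
def bar_width(n_series, gap=0.1):
    """Width of one bar when n_series bars share one unit of the x-axis."""
    return (1.0 - gap) / n_series

for n_series in (2, 3, 9):
    width = bar_width(n_series)
    # the bars for one year always add up to 0.9, leaving 0.1 between years
    print(n_series, "series:", round(width, 3), "each,",
          round(width * n_series, 3), "in total")
```

However many series we plot, the group for one year never spills into the next year's group.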
Let’s decide what colors we want each demographic to be.
    # inside the plot_aggregate_data function
    colors = {
        "All Students": "#000000", # black
        "Black": "#FF0000", # red
        "Hispanic": "#00FF00", # green
        "First Nations": "#FFFF00", # yellow
        "Asian": "#0000FF", # blue
        "Pacific Islander": "#FF9000", # orange
        "White": "#FF00FF", # magenta
    }
All those ax.bar calls look effectively the same with some minor changes between each one. Perhaps we can just write the line once and loop over it.
    # inside again
    plt.figure(figsize=figsize)
    ax = plt.subplot(1, 1, 1)
    for demo in demographics:
        ax.bar(dataframe.columns, dataframe.loc[demo],
               width=width, color=colors[demo], label=demo, align="center")
    ax.set_xticks(dataframe.columns)
    ax.set_xticklabels(dataframe.columns)
    ax.set_xlabel("Year")
    ax.set_ylabel("Dropout Rate (%)")
    ax.legend()
    plt.show()
We must do something about those offsets from the center though! We decided that when we have an odd number of things to graph, align="center". Else, align="edge", which lines bars up by their left edges and is the default setting in the Matplotlib version used here.
    # inside yet again
    plt.figure(figsize=figsize)
    ax = plt.subplot(1, 1, 1)
    if len(demographics) % 2:
        for demo in demographics:
            ax.bar(dataframe.columns, dataframe.loc[demo],
                   width=width, color=colors[demo], label=demo, align="center")
    else:
        for demo in demographics:
            ax.bar(dataframe.columns, dataframe.loc[demo],
                   width=width, color=colors[demo], label=demo, align="edge")
    ax.set_xticks(dataframe.columns)
    ax.set_xticklabels(dataframe.columns)
    ax.set_xlabel("Year")
    ax.set_ylabel("Dropout Rate (%)")
    ax.legend()
    plt.show()
What about where each series starts? That was dependent on the width, right? When there was an odd number of series,
- For 3 series it was align="center" with center - width, center, and center + width as bar positions
- For 5 it should be center - (2 * width), center - width, center, center + width, center + (2 * width)
- For N series it should be center + (idx - N // 2) * width for idx in range(0, N)
For an even number of series,
- For 2 series it was align="edge" with left - width and left
- For 4 it should be left - 2 * width, left - width, left, left + width
- For N series it should be left + (idx - N / 2) * width for idx in range(0, N)
Note that N // 2 is integer division. It is tempting to use round(N / 2) instead, but Python 3 rounds halves to the nearest even number, so round(3 / 2) and round(5 / 2) both give 2 and the offsets would come out wrong for some values of N.
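The offset rules above can be sketched as a small helper and checked by hand. bar_offsets is a hypothetical name introduced here, and the helper deliberately uses integer division rather than round(), because Python 3 rounds halves to the nearest even number.

```python
def bar_offsets(n_series, width):
    """Offsets, relative to the tick mark, at which each series is drawn.

    Pair an odd count with align="center" and an even count with
    left-edge alignment, as described in the lists above.
    """
    half = n_series // 2  # integer division: 3 -> 1, 4 -> 2, 5 -> 2
    return [(idx - half) * width for idx in range(n_series)]

for n_series in (2, 3, 4, 5):
    print(n_series, bar_offsets(n_series, 1))

# round() is not a safe substitute for the integer division here:
# both of these print 2, so five series would be shifted off-center
print(round(3 / 2), round(5 / 2))
```

With a width of 1 the offsets read directly as multiples of the bar width, matching the 2, 3, 4, and 5 series cases listed above.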
Let’s fix these into our function.
    # inside yet again
    n_items = len(demographics)
    plt.figure(figsize=figsize)
    ax = plt.subplot(1, 1, 1)
    if len(demographics) % 2:
        for idx, demo in enumerate(demographics): # changed
            # integer division keeps the middle series on the tick mark
            ax.bar(dataframe.columns + (idx - n_items // 2) * width,
                   dataframe.loc[demo], width=width,
                   color=colors[demo], label=demo, align="center")
    else:
        for idx, demo in enumerate(demographics): # changed
            ax.bar(dataframe.columns + (idx - n_items / 2) * width,
                   dataframe.loc[demo], width=width,
                   color=colors[demo], label=demo, align="edge")
    ax.set_xticks(dataframe.columns)
    ax.set_xticklabels(dataframe.columns)
    ax.set_xlabel("Year")
    ax.set_ylabel("Dropout Rate (%)")
    ax.legend()
    plt.show()
Finally, we turn our y-axis label into a parameter and add a title to our plot. We also add ax.grid() so that when we’re looking at tons of bars we know where they reach numerically, and we accommodate adjusting the vertical axis with ylims.
def plot_aggregate_data(dataframe,
                        demographics=["All Students", "Black", "Hispanic"],
                        title=None,
                        figsize=(14, 4),
                        yaxis_label=None,
                        ylims=None):
    # the final product
    colors = {
        "All Students": "#000000", # black
        "Black": "#FF0000", # red
        "Hispanic": "#00FF00", # green
        "First Nations": "#FFFF00", # yellow
        "Asian": "#0000FF", # blue
        "Pacific Islander": "#FF9000", # orange
        "White": "#FF00FF", # magenta
    }
    n_items = len(demographics) # added
    width = (1.0 - 0.1) / n_items # changed
    plt.figure(figsize=figsize)
    ax = plt.subplot(1, 1, 1)
    if title: # added
        ax.set_title(title) # added
    if len(demographics) % 2:
        for idx, demo in enumerate(demographics): # changed
            # integer division keeps the middle series on the tick mark
            ax.bar(dataframe.columns + (idx - n_items // 2) * width,
                   dataframe.loc[demo], width=width,
                   color=colors[demo], label=demo, align="center")
    else:
        for idx, demo in enumerate(demographics): # changed
            ax.bar(dataframe.columns + (idx - n_items / 2) * width,
                   dataframe.loc[demo], width=width,
                   color=colors[demo], label=demo, align="edge")
    ax.set_xticks(dataframe.columns)
    ax.set_xticklabels(dataframe.columns)
    ax.set_xlabel("Year")
    ax.set_ylabel(yaxis_label) # changed
    if ylims: # added
        ax.set_ylim(ylims)
    ax.legend()
    ax.grid()
    plt.show()
This is a beast of a function, but it does the job. And we can use it to look at our data for as many different combinations of demographics as we want. If we want to tweak anything about how we plotted information, we just retool this function a bit.
Now that we have our visuals, what now?
Next Up: What’s Next?