Extracting Data from the Web¶
The internet makes a vast quantity of data available, but not always in the form or combination you want. It can be useful to combine data from different sources to create meaning.
Part 1: Web Scraping¶
Data online comes in many different formats:
- Simple websites with static (or perhaps dynamic) data in HTML
- Web services providing structured data
- Web services providing transformative services (geocoding)
- Web services providing presentation (mapping)
Let’s concentrate for now on that first class of data, HTML.
HTML Data¶
Ideally, HTML would be well-formed and strictly correct in its structure:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<p>A nice clean paragraph</p>
<p>And another nice clean paragraph</p>
</body>
</html>
But in fact, it usually ends up looking more like this:
<html>
<form>
<table>
<td><input name="input1">Row 1 cell 1
<tr><td>Row 2 cell 1
</form>
<td>Row 2 cell 2<br>This</br> sure is a long cell
</body>
</html>
This is the result of one of the fundamental laws of the internet, sometimes called the robustness principle or Postel’s Law:
“Be strict in what you send and tolerant in what you receive”
Cleaning Up the Mess¶
My favorite library for dealing with the mess that HTML can become is BeautifulSoup. So let’s go ahead and create a virtualenv for playing with it a bit (the [souptests] prefix on the shell prompts below shows the active virtualenv). Then, install the correct version of BeautifulSoup (you want 4, not 3):
[souptests]
heffalump:souptests cewing$ pip install beautifulsoup4
Downloading/unpacking beautifulsoup4
Downloading beautifulsoup4-4.3.2.tar.gz (143kB): 143kB downloaded
Running setup.py (path:/Users/cewing/virtualenvs/souptests/build/beautifulsoup4/setup.py) egg_info for package beautifulsoup4
Installing collected packages: beautifulsoup4
Running setup.py install for beautifulsoup4
Successfully installed beautifulsoup4
Cleaning up...
[souptests]
heffalump:souptests cewing$
BeautifulSoup can use the Python standard library’s HTMLParser.
- PRO: batteries included; it’s already there
- CON: it’s not great, especially before Python 2.7.3
BeautifulSoup also supports using other parsers. lxml is better, but it can be much harder to install. For our exercise, let’s use html5lib:
[souptests]
heffalump:souptests cewing$ pip install html5lib
Downloading/unpacking html5lib
Downloading html5lib-0.999.tar.gz (885kB): 885kB downloaded
Running setup.py (path:/Users/cewing/virtualenvs/souptests/build/html5lib/setup.py) egg_info for package html5lib
Downloading/unpacking six (from html5lib)
Downloading six-1.5.2-py2.py3-none-any.whl
Installing collected packages: html5lib, six
Running setup.py install for html5lib
Successfully installed html5lib six
Cleaning up...
[souptests]
heffalump:souptests cewing$
Once installed, html5lib will be picked up and used by BeautifulSoup automatically. More precisely, BeautifulSoup chooses the “best” parser available. You can specify the parser explicitly if you have more than one installed and need to control which one is used.
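If you do want to be explicit, pass the parser name as the second argument to the BeautifulSoup constructor. A minimal sketch (the markup here is just a throwaway example):
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<td>Some<br>messy<td>markup', 'html5lib')  # explicitly request html5lib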
Getting Webpages¶
As with IMAP, FTP and other internet protocols, Python provides tools for using HTTP as a client. They are spread across the urllib and urllib2 packages, and those packages have pretty unintuitive APIs.
The requests library is becoming the de facto standard for this type of work. Let’s install it too:
[souptests]
heffalump:souptests cewing$ pip install requests
Downloading/unpacking requests
Downloading requests-2.2.1-py2.py3-none-any.whl (625kB): 625kB downloaded
Installing collected packages: requests
Successfully installed requests
Cleaning up...
[souptests]
heffalump:souptests cewing$
In requests, each HTTP method is provided by a module-level function:
- GET == requests.get(url, **kwargs)
- POST == requests.post(url, **kwargs)
- ...
Those unspecified kwargs represent other parts of an HTTP request:
- params: a dict of url query parameters (?foo=bar&baz=bim)
- headers: a dict of headers to send with the request
- data: the body of the request, if any (form data for POST goes here)
- ...
The return value from one of these functions is a response, which provides:
- response.status_code: the HTTP status code returned
- response.ok: True if response.status_code is not an error
- response.raise_for_status(): call to raise a Python error if it is
- response.headers: the headers sent from the server
- response.text: the body of the response, decoded to unicode
- response.encoding: the encoding used to decode
- response.content: the original encoded response body as bytes
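Here is a minimal sketch of how those pieces fit together (httpbin.org is just an assumed example endpoint, and the output shown assumes the request succeeds):
>>> import requests
>>> resp = requests.get('http://httpbin.org/get', params={'foo': 'bar'})
>>> resp.status_code
200
>>> resp.ok
True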
You can read more about this library on your own.
I urge you to do so.
An Example: Scraping Blog Posts¶
Let’s use the tools we’ve set up here to play with scraping a simple structure, a list of blog posts.
Begin by firing up a Python interpreter:
[souptests]
heffalump:souptests cewing$ python
Python 2.7.5 (default, Aug 25 2013, 00:04:04)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
Fetching a Page¶
The first step is to fetch the page we’ll be scraping.
I’ve created a shortened url that points to a feed aggregator for open source blog posts. Unfortunately, tinyurl won’t issue a proper redirect response for requests that come from the requests library, so we’ll have to pretend we are a real web browser.
Open the developer tools for your browser and make sure you are viewing the Network tab so you can see the network traffic your browser sends and receives. Load the url http://tinyurl.com/sample-oss-posts in a new tab. Back in the Network tab, click on the request that went to tinyurl. Find the headers for the request and copy the User-Agent header value. Then follow along in your Python interpreter:
>>> import requests
>>> url = 'http://tinyurl.com/sample-oss-posts'
>>> ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36'
>>> headers = {'User-Agent': ua}
>>> resp = requests.get(url, headers=headers)
>>> resp
<Response [200]>
>>> foo = resp.text
>>> len(foo)
601747
>>> resp.encoding
'utf-8'
>>> type(foo)
<type 'unicode'>
Let’s prevent ourselves from having to repeat that step by writing our fetched webpage out to the filesystem:
>>> bytes = resp.content
>>> len(bytes)
602455
>>> with open('blog_list.html', 'w') as outfile:
...     outfile.write(bytes)
...
>>> import os
>>> os.listdir(os.getcwd())
['blog_list.html', ...]
>>>
You should now be able to open the new file in your web browser. Do so.
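If you come back to this in a later session, you can read that saved file back instead of fetching the page again. A quick sketch (assuming blog_list.html is still in your working directory):
>>> with open('blog_list.html') as infile:
...     saved_html = infile.read()
...
>>>
You could then hand saved_html to BeautifulSoup exactly as we do with resp.text below.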
The first step is to identify the smallest container element common to all the things you want to extract. We want to get all the blog posts, so let’s find the container that wraps each one.
What’s the best tool for getting this information from a web page?
Parsing HTML¶
When you look at the HTML from this webpage in your browser’s devtools, it displays as a formatted structure of HTML tags. We can interact with those tags in devtools because they are actually representations of DOM nodes. (DOM stands for Document Object Model.)
In order to work with the page in the same fashion in Python, we need to
parse it into the same kind of structure. That’s what BeautifulSoup
does for us.
>>> from bs4 import BeautifulSoup
>>> parsed = BeautifulSoup(resp.text)
>>> len(parsed)
2
>>>
So parsing the document took us from a length of 601747 characters down to a length of 2. What are those two things?
>>> [type(t) for t in parsed]
[<class 'bs4.element.Doctype'>, <class 'bs4.element.Tag'>]
>>>
Once an HTML page has been parsed by BeautifulSoup, everything becomes a node. The parsed document itself is a node, and nodes are iterable. When you iterate over a node, you get the nodes it contains in the DOM tree.
These nodes can be roughly classed into two types, NavigableString and Tag. The main difference is that Tag nodes can contain text or other nodes, while NavigableString nodes contain only text.
You can interact with these node types in a number of ways. The most common are searching and traversing. Let’s start with the simpler of the two, searching.
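A tiny hand-made example (the markup here is just for illustration) shows the two node types side by side:
>>> from bs4 import BeautifulSoup
>>> demo = BeautifulSoup('<p>Hello, <a href="#">world</a>!</p>', 'html5lib')
>>> [type(node) for node in demo.find('p')]
[<class 'bs4.element.NavigableString'>, <class 'bs4.element.Tag'>,
 <class 'bs4.element.NavigableString'>]
The string pieces come back as NavigableString nodes, while the <a> element comes back as a Tag.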
Searching HTML¶
A Tag in BeautifulSoup has a couple of methods that support searching:
- tag.find(): finds the first node that matches the search specification
- tag.find_all(): finds all nodes that match the search specification
So, how do we build a specification for searching? The call signature for find_all helps a bit:
tag.find_all(name, attrs, recursive, text, limit, **kwargs)
- name is the name of an HTML tag type (‘a’, ‘p’, ‘div’, etc.)
- attrs is a dictionary of key-value pairs where the key is an HTML attribute name and the value is the value you want to match
- recursive controls whether to find descendants (the default) or just children (recursive=False)
- text allows you to find NavigableString nodes instead of Tag nodes
- limit controls the maximum number of results to find
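A couple of quick sketches of arguments we won’t otherwise use in this exercise (run against the parsed document from above; output omitted):
>>> parsed.find_all('a', limit=5)        # at most five matching <a> tags
>>> parsed.find_all(text='Python')       # NavigableString nodes whose text is exactly 'Python'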
The last element kwargs allows you to pass arbitrary keyword arguments.
If the argument you pass is not recognized as one of the other arguments, it will be treated as the name of an HTML attribute to filter on.
Passing id="my-div" would result in a search for any item with the id “my-div”:
<div id="my-div">This is found</div>
<div id="other-div">This would not be</div>
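A quick sketch of such a keyword search against a tiny hand-made document (the markup is just the two divs above):
>>> from bs4 import BeautifulSoup
>>> tiny = BeautifulSoup('<div id="my-div">This is found</div>'
...                      '<div id="other-div">This would not be</div>', 'html5lib')
>>> tiny.find_all(id='my-div')
[<div id="my-div">This is found</div>]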
Note: because class is a keyword in Python, you can’t use it as a keyword argument. Instead, use class_ (for example, class_="button").
Looking at the blog listing, we can see that the container wrapped around each post shares a common CSS class: feedEntry. Let’s grab all of them:
>>> entries = parsed.find_all('div', class_='feedEntry')
>>> len(entries)
70
>>>
Okay. That works.
Let’s see if we can extract a list of the titles of each post.
For this, we want to make sure we find the first anchor tag in each entry, and then extract the text it contains:
>>> e1 = entries[0]
>>> e1.find('a').text
u'\n Dimitri Fontaine: PostgreSQL, Aggregates and Histograms\n '
>>> e1.find('a').find('h2').string
u'Dimitri Fontaine: PostgreSQL, Aggregates and Histograms'
>>> titles = [e.find('a').find('h2').string for e in entries]
>>> len(titles)
70
>>>
We can also find the set of possible sources for our blog posts. The byline is contained in a <p> tag with the CSS class discreet. Let’s gather up all of those and see what we have:
>>> byline = e1.find('p', class_='discreet')
>>> len(list(byline.children))
3
>>> [type(n) for n in list(byline.children)]
[<class 'bs4.element.NavigableString'>, <class 'bs4.element.Tag'>,
<class 'bs4.element.NavigableString'>]
>>> classifier = list(byline.children)[0].strip()
>>> classifier
u'From Planet PostgreSQL.\n \n \n Published on'
>>> all_bylines = [e.find('p', class_='discreet') for e in entries]
>>> len(all_bylines)
70
>>> all_classifiers = [list(b.children)[0].strip() for b in all_bylines]
>>> len(all_classifiers)
70
>>> all_classifiers[0]
u'From Planet PostgreSQL.\n \n \n Published on'
>>> unique_classifiers = set(all_classifiers)
>>> len(unique_classifiers)
30
>>> import pprint
>>> pprint.pprint(unique_classifiers)
set([u'By Will McGugan from Django community aggregator:\n ...
>>>
If we look these over, we find that we have some from Planet Django, some from Planet PostgreSQL, and maybe some others as well (I get one from plope too).
Let’s take one more step, and divide our post titles into categories based on whether they are Django, PostgreSQL or other.
Start by defining a function to get the classifier for an entry:
>>> def get_classifier(entry):
...     byline = entry.find('p', class_='discreet')
...     for classifier in ['django', 'postgresql']:
...         if classifier in byline.text.lower():
...             return classifier
...     return 'other'
...
>>>
Then use that function to find the unique set of classifiers:
>>> classifiers = [get_classifier(e) for e in entries]
>>> len(set(classifiers))
3
>>> set(classifiers)
set(['other', 'postgresql', 'django'])
We can also extract titles for each post with a function:
>>> def get_title(entry):
...     return entry.find('a').find('h2').string.strip()
...
>>> titles = [get_title(e) for e in entries]
>>> len(titles)
70
>>> titles[0]
u'A method for rendering templates with Python'
Put it all together to build a dictionary of categorized post titles:
>>> paired = [(get_classifier(e), get_title(e)) for e in entries]
>>> paired[0]
('django', u'A method for rendering templates with Python')
>>> groups = {}
>>> for cat, title in paired:
...     group = groups.setdefault(cat, [])
...     group.append(title)
...
>>> groups['django']
[u'A method for rendering templates with Python',
u"Don't import (too much) in your django settings",
...]
Neat!
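As an aside, collections.defaultdict from the standard library is another common way to build this kind of grouping; here is a sketch equivalent to the loop above:
>>> from collections import defaultdict
>>> groups = defaultdict(list)
>>> for cat, title in paired:
...     groups[cat].append(title)
...
>>>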
Going Farther¶
Okay, so that’s the basics. For your assignment you’ll take this a step farther and build a list of restaurant health inspection data using the King County government website.