SQL Persistence in Python¶
In this tutorial, you’ll walk through some basic concepts of data persistence
using the most common Python DBAPI2 connector for the PostgreSQL database,
pyscopg2
.
This will build a very basic understanding of working with PostgreSQL in Python. However you may wish to bookmark the postgreSQL documentation and the psycopg2 documentation for future reference.
Begin by activating the virtualenv we created in class:
heffalump:psycopg2 cewing$ workon psycopg2
[psycopg2]
heffalump:psycopg2 cewing$
Interacting With the Database¶
Once all that is in place, we’re ready to interact with our database using
psycopg2
.
Connections and Cursors¶
We’ll begin by getting connected. Connecting to any database consists of providing a specially-formatted string to the connector, called a DSN or Data Source Name.
Each different type of database uses a different format for this string. In
PostgreSQL it is typically a set of key=value
pairs where the keys come
from a defined set of possible keys.
There are a lot of possible keywords, but the ones you are most likely to see and use are:
- dbname: the name of the database in the server you want to connect with.
- host: the hostname on which the server is listening. This can also be a pathname to a socket file if the system is using Unix Domain Socket connections.
- port: the port number on which the server is listening. This can also be a socket file extension if the system is using Unix Domain Socket connections.
- user: The username to use when connecting to the database. Default is the system name of the user who is running the connect command.
- password: The password of the user. This is only used if the system requires password authentication.
We set up our database to allow us to connect directly using ident authorization. So the only parameters we must pass are the dbname and user.
Fire up an interactive Python session and get a connection:
[psycopg2]
heffalump:psycopg2 cewing$ python
Python 2.7.5 (default, Aug 25 2013, 00:04:04)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import psycopg2
>>> conn = psycopg2.connect(dbname="psycotest", user="cewing")
>>> conn
<connection object at 0x7fafc8e005c0; dsn: 'user=cewing dbname=psycotest', closed: 0>
>>>
A connection represents our tie to the database. But to interact with it, we want to use a cursor:
>>> cur = conn.cursor()
>>> cur
<cursor object at 0x10a370718; closed: 0>
>>>
The cursor is a local representation of the state of the database. You can execute statements on it, and see the results of those statements, but until you commit a transaction, the changes are not persisted to the system on disk.
Simple Inserts and Selects¶
Use your cursor to insert a new record into the author
table:
>>> insert = "INSERT INTO author (name) VALUES('Iain M. Banks');"
>>> cur.execute(insert)
>>> cur.rowcount
1
>>>
Notice that we execute
a statement using the cursor. After this is done, we
can interrogate the cursor to find out what happened. In this case, we can
learn that one row was inserted.
NOTE:
Every so often, you will make an error in typing an SQL command. When you try to execute the statement, you’ll be informed of the error. This is nice. It’s important to note, though, that many kinds of errors can result in the current transaction with the database being “aborted”.
When this happens, you’ll see error messages like this:
>>> cur.execute(insert)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
psycopg2.InternalError: current transaction is aborted, commands ignored until end of transaction block
There is nothing to fear here. You simply have to end a transaction block so that you can start interacting with the database again. The safest way is to roll back the transaction, which ensures that nothing since the last commit will be saved:
>>> conn.rollback()
(more about transactions soon)
We can also retrieve from the database the information we just inserted, using
a SELECT
statement:
>>> query = "SELECT * from author;"
>>> cur.execute(query)
>>> cur.fetchall()
[(1, 'Iain M. Banks')]
>>>
You’ll see that our select query found one row in the database. The row is returned as a tuple with as many values as there are columns in the query. We asked for all columns (*) and so we got two.
The order of the values in each tuple is dependent on the query. In this case we asked for all columns so we got them in the database order (id, name).
Parameterized Statements¶
Inserting static data one row at a time is tedious.
We are software engineers. We can do better than that.
In order to repeat a statement a number of times, with different values, we must use parameters.
In DBAPI2 packages, these parameters are specialized forms of placeholders
used in the strings passed to the execute
command. Each database system
uses its own format, but the general idea is the same. You create an SQL
statement with placeholders where you want values to be inserted. Then you call
the ‘execute’ command with two arguments: your parameterized statement, and
a tuple containing as many values as you have parameters.
There is also an executemany
method on a cursor object that supports
passing an iterable of tuples. The SQL statement will be run one time for each
tuple in the iterable:
>>> insert = "INSERT INTO author (name) VALUES(%s);"
>>> authors = [("China Mieville",), ("Frank Herbert",),
... ("J.R.R. Tolkein",), ("Susan Cooper",),
... ("Madeline L'Engle",), ]
>>> cur.executemany(insert, authors)
>>> cur.rowcount
5
>>>
And we can read our inserted values back:
>>> cur.execute(query)
>>> rows = cur.fetchall()
>>> for row in rows:
... print row
...
(1, 'Iain M. Banks')
(2, 'China Mieville')
(3, 'Frank Herbert')
(4, 'J.R.R. Tolkein')
(5, 'Susan Cooper')
(6, "Madeline L'Engle")
>>>
RED LETTER WARNING¶
A SUPER IMPORTANT WARNING THAT YOU MUST PAY ATTENTION TO
The placeholder for psycopg2 is %s
. This placeholder is the same
regardless of the type of data you are passing in as your values.
Do Not Be Fooled into thinking that this means you can use string formatting to build your SQL statements:
# THIS IS BAD:
cur.execute("INSERT INTO author (name) VALUES(%s)" % "Bob Dobbins")
This syntax does not properly escape the values passed in.
This syntax leaves you wide open to SQL Injection Attacks.
Warning
If I ever see you using this syntax I will personally take you out behind the woodshed and tan your hide.
I’m not kidding.
Python provides you with a syntax that is safe from the kinds of attacks that make you front page news. Use it properly:
cur.execute("INSERT INTO author (name) VALUES(%s)", ("Bob Dobbins", ))
Transactions¶
Transactions group operations together, allowing you to verify them before the results hit the database.
In the DBAPI2 specification, data-altering statements require an explicit
commit
unless auto-commit has been enabled.
Thus far, we haven’t actually committed a transaction. If we open a second terminal and fire up the psql shell program, we can see that the data we’ve inserted is not yet in our database:
heffalump:training.python_web cewing$ psql -d psycotest
psql (9.3.2)
Type "help" for help.
psycotest=# \d
List of relations
Schema | Name | Type | Owner
--------+---------------------+----------+--------
public | author | table | cewing
public | author_authorid_seq | sequence | cewing
public | book | table | cewing
public | book_bookid_seq | sequence | cewing
(4 rows)
psycotest=# select * from author;
authorid | name
----------+------
(0 rows)
psycotest=#
In order for the values we’ve inserted to actually be persisted to the filesystem, making them available outside the cursor we have, we must commit a transaction.
We do this using the connection object we first set up:
>>> conn
<connection object at 0x7fafc8e005c0; dsn: 'user=cewing dbname=psycotest', closed: 0>
>>> conn.commit()
>>>
And now, back in psql
, our data is finally on disk:
psycotest=# select * from author;
authorid | name
----------+------------------
1 | Iain M. Banks
2 | China Mieville
3 | Frank Herbert
4 | J.R.R. Tolkein
5 | Susan Cooper
6 | Madeline L'Engle
(6 rows)
Handling Errors with Rollback¶
The largest benefit of having a transactional system like this is that you can fix errors before they make a hash of your actual database.
When you attempt to commit a transaction there are two possible outcomes: success or failure. If the commit succeeds, you can be sure that the changes you’ve made are final and complete.
If the commit fails for some reason, an exception will be raised. You can then tell the connection to roll back the transaction. This will undo all changes since the last transaction commit, leaving your database in a consistent, well-known state.
To help visualize this, let’s set up a quick exercise.
First, at your psql prompt, empty the table you just filled:
psycotest=# delete from author;
DELETE 6
psycotest=# select * from author;
authorid | name
----------+------
(0 rows)
psycotest=#
Next, create a new file in your project directory. Call it populatedb.py
.
Add the following code:
import psycopg2
DB_CONNECTION_PARAMS = {
'dbname': 'psycotest',
'user': 'cewing',
}
AUTHOR_INSERT = "INSERT INTO author (name) VALUES(%s);"
AUTHOR_QUERY = "SELECT * FROM author;"
BOOK_INSERT = """
INSERT INTO book (title, authorid) VALUES(%s, (
SELECT author FROM author WHERE name=%s ));
"""
BOOK_QUERY = "SELECT * FROM book;"
AUTHORS_BOOKS = {
'China Mieville': ["Perdido Street Station", "The Scar", "King Rat"],
'Frank Herbert': ["Dune", "Hellstrom's Hive"],
'J.R.R. Tolkien': ["The Hobbit", "The Silmarillion"],
'Susan Cooper': ["The Dark is Rising", "The Greenwitch"],
'Madeline L\'Engle': ["A Wrinkle in Time", "A Swiftly Tilting Planet"]
}
These module-level constants will let us write a bit less code below. We have a dictionary that represents the parameters we will use to connect to the database, a number of useful SQL statements for inserting and querying data, and a set of data we will use.
You might see an error in the SQL above. Leave it where it is. We will fix it after demonstrating rollback.
Next, add the following helper functions to populatedb.py
:
def show_query_results(conn, query):
with conn.cursor() as cur:
cur.execute(query)
had_rows = False
for row in cur.fetchall():
print row
had_rows = True
if not had_rows:
print "no rows returned"
def show_authors(conn):
query = AUTHOR_QUERY
show_query_results(conn, query)
def show_books(conn):
query = BOOK_QUERY
show_query_results(conn, query)
def populate_db(conn):
with conn.cursor() as cur:
authors = ([author] for author in AUTHORS_BOOKS.keys())
cur.executemany(AUTHOR_INSERT, authors)
params = ([book, author] for author in AUTHORS_BOOKS
for book in AUTHORS_BOOKS[author])
cur.executemany(BOOK_INSERT, params)
The show_query_results
function is a helper that will take a ‘SELECT’ query
and a connection, perform the query on that connection and then print the
results.
The show_authors
and show_books
functions are simple one-stage wrappers
that perform the correct query using show_query_results
.
The final function, populate_db
, inserts authors and books into our
database as two separate queries. Note the nested generator expression that
provides all books by all authors for inserting into the book table. Python can
be fun!
Note: The con.cursor()
call in show_query_results
and
populate_db
above is being used as a context manager. What this means is
that when the block defined by the with
statement exits, the cursor will be
cleanly closed.
Finally, in order to actually use all of this, we need a __main__
block
that will try to run our code and explicitly roll back in case of error.
Add the following to the bottom of the populatedb.py
file:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | if __name__ == '__main__':
conn1 = psycopg2.connect(**DB_CONNECTION_PARAMS)
conn2 = psycopg2.connect(**DB_CONNECTION_PARAMS)
try:
populate_db(conn1)
print "\nauthors and books on conn2 before commit:"
show_authors(conn2)
show_books(conn2)
except psycopg2.Error:
conn1.rollback()
print "\nauthors and books on conn2 after rollback:"
show_authors(conn2)
show_books(conn2)
raise
else:
conn1.commit()
print "\nauthors and books on conn2 after commit:"
show_authors(conn2)
show_books(conn2)
finally:
conn1.close()
conn2.close()
|
(L3-4) In this code we set up two separate connections to the database. We will do our write operations using the first, and our read operations on the second to illustrate the effect of commit and rollback.
(L5-9) First, we try to write our data to the database. If that is successfull, we read the author and book tables from our second connection to show that before committing, the tables remain empty.
(L16-20) In the case that no error occurs, we hit the else:
block. This
allows us to commit our transaction on the first connection and demonstrate
that afterward we can read our data back from the second connection.
(L10-15) If an error is raised, we enter the except
block. Here, we roll
back our transaction, and demonstrate that after rollback no data has hit our
database. In the end, we re-raise the exception so that our script will fail
visibly.
Note: We are catching the base exception class for all psycopg2 database errors. There are a number of more specific errors you can use to determine if perhaps a transaction might be retried or must be rolled back. That’s more involved than we need to get for this demonstration, though.
(L21-23) At the last, we add a finally
block that will happen even if
errors occur. Here we safely close the two connections we’ve opened to our
database so that we don’t leave them hanging when the script exits.
Now that we have all that in place, let’s execute our populateddb.py
script
from a terminal. In your active psycopg2
virtualenv, try the following:
[psycopg2]
heffalump:psycopg2 cewing$ python populatedb.py
authors and books on conn2 after rollback:
no rows returned
no rows returned
Traceback (most recent call last):
File "populatedb.py", line 64, in <module>
populate_db(conn1)
File "populatedb.py", line 56, in populate_db
cur.executemany(BOOK_INSERT, params)
psycopg2.ProgrammingError: column "authorid" is of type integer but expression is of type author
LINE 2: ...TO book (title, authorid) VALUES('Perdido Street Station', (
^
HINT: You will need to rewrite or cast the expression.
[psycopg2]
heffalump:psycopg2 cewing$
Notice first that the initial write operation worked. The error that is raised
comes from the point in populate_db
where we are inserting books. Despite
this, the conn.rollback()
in our except
block removes all changes to
the database made since the last commit. This means that when we look at the
database with our second connection, no data is available in either table.
Let’s fix our SQL error and retry the process.
Edit the BOOK_INSERT
constant at the top of our script as follows (change
the ‘author’ after SELECT
in the second line to ‘authorid’):
BOOK_INSERT = """
INSERT INTO book (title, authorid) VALUES(%s, (
SELECT authorid FROM author WHERE name=%s ));
"""
Now you can re-run the script and see what success looks like:
[psycopg2]
heffalump:psycopg2 cewing$ python populatedb.py
authors and books on conn2 before commit:
no rows returned
no rows returned
authors and books on conn2 after commit:
(62, 'China Mieville')
(63, 'Frank Herbert')
(64, 'Susan Cooper')
(65, 'J.R.R. Tolkien')
(66, "Madeline L'Engle")
(45, 'Perdido Street Station', 62)
(46, 'The Scar', 62)
(47, 'King Rat', 62)
(48, 'Dune', 63)
(49, "Hellstrom's Hive", 63)
(50, 'The Dark is Rising', 64)
(51, 'The Greenwitch', 64)
(52, 'The Hobbit', 65)
(53, 'The Silmarillion', 65)
(54, 'A Wrinkle in Time', 66)
(55, 'A Swiftly Tilting Planet', 66)
[psycopg2]
heffalump:psycopg2 cewing$
Wrap-Up¶
The Python DBAPI2 specification provides for a uniform interface between Python programs and the Relational databases they might use for persistence.
In this tutorial you’ve learned a bit about the general operations of DBAPI2
using one particular implementation, psycopg2
.
There are small variations between implementations, particularly in the arena of placeholders in parameterized SQL statments and how they should be formatted. But the general shape of a DB interaction should be very consistent from one API packge to another.
Next, we’ll learn about how to use these underlying API packages through the lens of an Object Relational Manager, providing us with more automatic connections between our Python object layer and the underlying persistence model.
For reference, please download the ORM code
required to interact with this book/author database.