Session Four: Dictionaries, Sets, Exceptions, and Files

Review/Questions

Review of Previous Classes

  • Sequences
    • Slicing
    • Lists
    • Tuples
    • tuple vs lists - which to use?
  • iterating
    • for
    • while
      • break and continue
    • else with loops

Any questions?

A couple of other nifty utilities for use with for loops:

tuple unpacking:

remember this?

x, y = 3, 4

You can do that in a for loop, also:

In [3]: from __future__ import print_function
In [4]: l = [(1, 2), (3, 4), (5, 6)]
In [5]: for i, j in l:
            print("i:%i, j:%i" % (i, j))

i:1, j:2
i:3, j:4
i:5, j:6

Looping through two lists at once:

zip:

In [10]: l1 = [1, 2, 3]
In [11]: l2 = [3, 4, 5]
In [12]: for i, j in zip(l1, l2):
   ....:     print("i:%i, j:%i" % (i, j))
   ....:
i:1, j:3
i:2, j:4
i:3, j:5

Homework comments

Building up a long string.

The obvious thing to do is something like:

msg = u""
for piece in list_of_stuff:
    msg += piece

But strings are immutable: Python has to create a new string each time you add a piece, which is not efficient. Better:

msg = []
for piece in list_of_stuff:
    msg.append(piece)
msg = u"".join(msg)

Appending to lists is efficient – and so is the join() method of strings.

What is assert for?

Testing – NOT for issues expected to happen operationally:

assert m >= 0

in operational code should be:

if m < 0:
    raise ValueError

I’ll cover Exceptions later this class...

(Asserts get ignored if optimization is turned on!)
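
A minimal sketch of the operational version, assuming a made-up function called countdown() (it is not part of the homework). The check is an ordinary if/raise, so it still runs when Python is started with -O:

```python
def countdown(m):
    """Return [m, m-1, ..., 0]; m must be non-negative."""
    # Operational check: unlike `assert m >= 0`, this is never skipped.
    if m < 0:
        raise ValueError("m must be >= 0, got %r" % (m,))
    return list(range(m, -1, -1))


try:
    countdown(-1)
except ValueError as err:
    message = str(err)
else:
    message = "no error"

print(message)
```

The caller gets a specific exception type and a readable message, instead of a bare AssertionError that may or may not fire.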

A little warm up

Fun with strings

  • Rewrite: "the first 3 numbers are: %i, %i, %i" % (1, 2, 3)
    • for an arbitrary number of numbers...
  • Write a format string that will take:
    • ( 2, 123.4567, 10000)
    • and produce:
    • "file_002 : 123.46, 1e+04"

Homework Review

Someone volunteer to have their homeworks (Task 6 and 7) debugged in-class.

Free programming help!

Prepare your Questions

Open up your task 7 files in your text editor.

  • dictionaries.py
  • exceptions.py
  • paths.py
  • files.py

Today’s Puzzle: Trigrams

N-grams are a way to study word associations

https://books.google.com/ngrams

  • Our task today: read in the words from a large text file, create a dictionary of trigrams.
  • Write pseudo code and create a design.
  • Use dictionaries, exceptions, file reading/writing.

Announcements

  • Enter your attendance in Canvas.
  • When are office hours?
  • Tell us when you prefer TA office hours
  • Collaboration is okay, but not copying.

Dictionaries and Sets

Dictionary

Python calls it a dict

Other languages call it:

  • dictionary
  • associative array
  • map
  • hash table
  • hash
  • key-value pair

Dictionary Constructors

>>> {'key1': 3, 'key2': 5}
{'key1': 3, 'key2': 5}
>>> dict([('key1', 3),('key2', 5)])
{'key1': 3, 'key2': 5}
>>> dict(key1=3, key2=5)
{'key1': 3, 'key2': 5}
>>> d = {}
>>> d['key1'] = 3
>>> d['key2'] = 5
>>> d
{'key1': 3, 'key2': 5}

Dictionary Indexing

>>> d = {'name': 'Brian', 'score': 42}
>>> d['score']
42
>>> d = {1: 'one', 0: 'zero'}
>>> d[0]
'zero'
>>> d['non-existing key']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'non-existing key'

Keys can be any immutable object:

  • number
  • string
  • tuple

In [325]: d[3] = 'string'
In [326]: d[3.14] = 'pi'
In [327]: d['pi'] = 3.14
In [328]: d[ (1,2,3) ] = 'a tuple key'
In [329]: d[ [1,2,3] ] = 'a list key'
   TypeError: unhashable type: 'list'

Actually – any “hashable” type.

Hash functions convert arbitrarily large data to a small proxy (usually int)

Always return the same proxy for the same input

MD5, SHA, etc

Dictionaries hash the key to an integer proxy and use it to find the key and value.

Key lookup is efficient because the hash function leads directly to a bucket with very few keys (often just one)

What would happen if the proxy changed after storing a key?

Hashability requires immutability
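
A small sketch of what that means in practice, using only the built-in hash() function. The variable names are made up for the example:

```python
key_a = (1, 2, 3)
key_b = (1, 2, 3)   # a separate object with an equal value

# Equal hashable objects always hash to the same proxy,
# so either one finds the same dictionary entry.
same_hash = hash(key_a) == hash(key_b)

d = {key_a: 'a tuple key'}
found = d[key_b]

# A list can change after it is stored, so Python refuses to hash it.
try:
    hash([1, 2, 3])
except TypeError as err:
    list_error = str(err)

print(same_hash)
print(found)
print(list_error)
```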

Key lookup is very efficient

Same average time regardless of size

Note: Python name look-ups are implemented with dict – it’s highly optimized

Key to value:

  • lookup is one way

Value to key:

  • requires visiting the whole dict

If you need to check dict values often, create another dict or set

(up to you to keep them in sync)
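
One way this could look, as a sketch with made-up data. The helper keys_for_value() and the by_score index are names invented for this example:

```python
scores = {'Brian': 42, 'Chris': 17, 'Dana': 42}


def keys_for_value(d, target):
    """Value -> key: has to look at every item in the dict."""
    return sorted(k for k, v in d.items() if v == target)


# If you need this often, build a reverse index once.
# Values can repeat, so each value maps to a set of keys.
by_score = {}
for name, score in scores.items():
    if score not in by_score:
        by_score[score] = set()
    by_score[score].add(name)

print(keys_for_value(scores, 42))
print(sorted(by_score[42]))
```

After that, by_score[42] is a single fast lookup, but any change to scores has to be mirrored in by_score by hand.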

Dictionary Ordering (not)

Dictionaries have no defined order

In [352]: d = {'one':1, 'two':2, 'three':3}
In [353]: d
Out[353]: {'one': 1, 'three': 3, 'two': 2}
In [354]: d.keys()
Out[354]: ['three', 'two', 'one']

You will be fooled by what you see into thinking that the order of pairs can be relied on.

It cannot.
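
If you need a repeatable order, ask for one explicitly. A short sketch using sorted(). (Python 3.7 and later do preserve insertion order, but the Python 2 used in this class does not, and sorting works everywhere.)

```python
d = {'one': 1, 'two': 2, 'three': 3}

# Sort the keys to get the same order on every run...
for key in sorted(d):
    print("%s: %i" % (key, d[key]))

# ...or sort the (key, value) pairs by value instead.
by_value = sorted(d.items(), key=lambda pair: pair[1])
print(by_value)
```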

Dictionary Iterating

for iterates over the keys

In [15]: d = {'name': 'Brian', 'score': 42}

In [16]: for x in d:
   ....:     print(x)
   ....:
score
name

(note the different order...)

dict keys and values

In [20]: d = {'name': 'Brian', 'score': 42}

In [21]: d.keys()
Out[21]: ['score', 'name']

In [22]: d.values()
Out[22]: [42, 'Brian']

In [23]: d.items()
Out[23]: [('score', 42), ('name', 'Brian')]

Iterating on everything

In [26]: d = {'name': 'Brian', 'score': 42}

In [27]: for k, v in d.items():
   ....:     print("%s: %s" % (k,v))
   ....:
score: 42
name: Brian

Dictionary Performance

  • indexing is fast and constant time: O(1)
  • Membership (x in s) constant time: O(1)
  • visiting all is proportional to n: O(n)
  • inserting is constant time: O(1)
  • deleting is constant time: O(1)

http://wiki.python.org/moin/TimeComplexity

Other dict operations:

See them all here:

https://docs.python.org/2/library/stdtypes.html#mapping-types-dict

Is it in there?

In [5]: d
Out[5]: {'that': 7, 'this': 5}

In [6]: 'that' in d
Out[6]: True

In [7]: 'this' not in d
Out[7]: False

Membership is on the keys.

Getting something (like indexing):

In [9]: d.get('this')
Out[9]: 5

But you can specify a default

In [11]: d.get(u'something', u'a default')
Out[11]: u'a default'

Never raises an Exception (default default is None)
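
A common use for the default is counting things. A minimal sketch with a made-up sentence:

```python
words = "the quick brown fox jumps over the lazy dog the end".split()

counts = {}
for word in words:
    # get() returns 0 the first time a word is seen,
    # so there is no KeyError to handle.
    counts[word] = counts.get(word, 0) + 1

print(counts['the'])
print(counts.get('cat'))   # missing key and no default given
```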

In [13]: for item in d.iteritems():
   ....:     print(item)
   ....:
('this', 5)
('that', 7)
In [15]: for key in d.iterkeys():
   ....:     print(key)
   ....:
this
that
In [16]: for val in d.itervalues():
   ....:     print(val)
   ....:
5
7

The iter* methods don’t actually create the lists; they produce the items one at a time.

pop() gets the value at a given key while removing it.

Pop just a key:

In [19]: d.pop('this')
Out[19]: 5
In [20]: d
Out[20]: {'that': 7}

popitem() pops out an arbitrary (key, value) pair:

In [23]: d.popitem()
Out[23]: ('that', 7)
In [24]: d
Out[24]: {}

setdefault(key[, default])

gets the value if it’s there, sets it if it’s not

In [26]: d = {}

In [27]: d.setdefault(u'something', u'a value')
Out[27]: u'a value'
In [28]: d
Out[28]: {u'something': u'a value'}
In [29]: d.setdefault(u'something', u'a different value')
Out[29]: u'a value'
In [30]: d
Out[30]: {u'something': u'a value'}
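
Where setdefault() really pays off is building a dict of lists. A sketch with made-up data that groups words by their first letter:

```python
words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']

by_letter = {}
for word in words:
    # First time a letter shows up: store an empty list and return it.
    # After that: return the list that is already there.
    by_letter.setdefault(word[0], []).append(word)

print(by_letter['a'])
print(by_letter['c'])
```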

dict View objects:

Like keys(), values(), items(), but maintain a link to the original dict

In [47]: d
Out[47]: {u'something': u'a value'}
In [48]: item_view = d.viewitems()
In [49]: item_view
Out[49]: dict_items([(u'something', u'a value')])
In [50]: d['something else'] = u'another value'

In [51]: item_view
Out[51]: dict_items([('something else', u'another value'), (u'something', u'a value')])

Sets

A set is an unordered collection of distinct values

Essentially a dict with only keys

Set Constructors

>>> set()
set([])

>>> set([1, 2, 3])
set([1, 2, 3])

>>> {1, 2, 3}
set([1, 2, 3])

>>> s = set()

>>> s.update([1, 2, 3])
>>> s
set([1, 2, 3])

Set Properties

Set members must be hashable

Like dictionary keys – and for same reason (efficient lookup)

No indexing (unordered)

>>> s[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object does not support indexing

Set Methods

>>> s = set([1])
>>> s.pop() # an arbitrary member
1
>>> s.pop()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'pop from an empty set'
>>> s = set([1, 2, 3])
>>> s.remove(2)
>>> s.remove(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 2

All the “set” operations from math class...

s.isdisjoint(other)

s.issubset(other)

s.union(other, ...)

s.intersection(other, ...)

s.difference(other, ...)

s.symmetric_difference( other, ...)
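
A quick sketch of these methods on two small made-up sets. Results are sorted only to make the printed output repeatable. The operators |, &, - and ^ do the same jobs as union, intersection, difference and symmetric_difference when both sides are sets.

```python
evens = set([0, 2, 4, 6, 8])
small = set([0, 1, 2, 3])

union = sorted(evens.union(small))
common = sorted(evens.intersection(small))
only_evens = sorted(evens.difference(small))
in_one_only = sorted(evens.symmetric_difference(small))

disjoint = evens.isdisjoint(small)         # do they share nothing?
subset = set([0, 2]).issubset(evens)       # is every member also in evens?

print(union)
print(common)
print(only_evens)
print(in_one_only)
```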

Frozen Set

Another kind of set: frozenset

immutable – for use as a key in a dict (or another set...)

>>> fs = frozenset((3,8,5))
>>> fs.add(9)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'frozenset' object has no attribute 'add'
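
A sketch of the "key in a dict" use, with made-up data: travel times keyed by an unordered pair of cities.

```python
# Hypothetical data for illustration only.
travel_hours = {
    frozenset(['Seattle', 'Portland']): 3,
    frozenset(['Seattle', 'Spokane']): 4.5,
}

# The order of the pair doesn't matter: equal frozensets hash alike.
hours = travel_hours[frozenset(['Portland', 'Seattle'])]

# A regular (mutable) set is rejected as a key.
try:
    {set(['Seattle', 'Portland']): 3}
except TypeError as err:
    set_error = str(err)

print(hours)
print(set_error)
```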

Exceptions

Another Branching structure:

try:
    do_something()
    f = open('missing.txt')
    process(f)   # never called if file missing
except IOError:
    print("couldn't open missing.txt")

Exceptions

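Never do this:

try:
    do_something()
    f = open('missing.txt')
    process(f)   # never called if file missing
except:
    print("couldn't open missing.txt")

A bare except: catches every exception, including KeyboardInterrupt and the NameError from a simple typo, so unrelated bugs get reported as a missing file.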

Exceptions

Use Exceptions, rather than your own tests:

Don’t do this:

do_something()
if os.path.exists('missing.txt'):
    f = open('missing.txt')
    process(f)   # never called if file missing

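It will almost always work, but the "almost" will drive you crazy: the file can disappear (or turn out to be unreadable) between the exists() check and the open() call.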

Example from homework

if num_in.isdigit():
    num_in = int(num_in)

But int(num_in) only works if the string can be converted to an integer, and isdigit() is not quite the same test: "-3" fails isdigit() but converts just fine.

So you can do

try:
    num_in = int(num_in)
except ValueError:
    print(u"Input must be an integer, try again.")

Or let the Exception be raised....

"it's Easier to Ask Forgiveness than Permission"

-- Grace Hopper

http://www.youtube.com/watch?v=AZDWveIdqjY

(Pycon talk by Alex Martelli)

For simple scripts, let exceptions happen.

Only handle the exception if the code can and will do something about it.

(much better debugging info when an error does occur)

Exceptions – finally

try:
    do_something()
    f = open('missing.txt')
    process(f)   # never called if file missing
except IOError:
    print(u"couldn't open missing.txt")
finally:
    do_some_clean_up()

The finally: clause will always run
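
A small sketch that records which clauses run. The function run() and the simulated failure are made up for the example; the raise stands in for a failed open():

```python
def run(should_fail):
    """Return the list of clauses that executed, in order."""
    steps = []
    try:
        steps.append('try')
        if should_fail:
            raise IOError("simulated failure")
    except IOError:
        steps.append('except')
    finally:
        # Runs whether or not an exception was raised.
        steps.append('finally')
    return steps


print(run(False))
print(run(True))
```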

Exceptions – else

try:
    do_something()
    f = open('missing.txt')
except IOError:
    print(u"couldn't open missing.txt")
else:
    process(f) # only called if there was no exception

Advantage: you know where the Exception came from.

Exceptions – using them

try:
    do_something()
    f = open('missing.txt')
except IOError as the_error:
    print(the_error)
    the_error.extra_info = "some more information"
    raise

Particularly useful if you catch more than one exception:

except (IOError, BufferError, OSError) as the_error:
    do_something_with(the_error)

Raising Exceptions

def divide(a,b):
    if b == 0:
        raise ZeroDivisionError("b can not be zero")
    else:
        return a / b

when you call it:

In [515]: divide(12, 0)
ZeroDivisionError: b can not be zero

Built in Exceptions

You can create your own custom exceptions, but...

import __builtin__
exp = [name for name in dir(__builtin__) if "Error" in name]
len(exp)
32

For the most part, you can/should use a built in one

Choose the best match you can for the built in Exception you raise.

Example (for last week’s Ackermann homework):

if (not isinstance(m, int)) or (not isinstance(n, int)):
    raise ValueError

Is the value of the input the problem here?

Nope: the type is the problem:

if (not isinstance(m, int)) or (not isinstance(n, int)):
    raise TypeError

but should you be checking type anyway? (EAFP)

File Reading and Writing

Files

Text Files

import io
f = io.open('secrets.txt', encoding='utf-8')
secret_data = f.read()
f.close()

secret_data is a (unicode) string

encoding defaults to locale.getpreferredencoding() – often NOT what you want.

(There is also the regular open() built in, but it won’t handle Unicode for you...)

Binary Files

f = io.open('secrets.bin', 'rb')
secret_data = f.read()
f.close()

secret_data is a byte string

(with arbitrary bytes in it – well, not arbitrary – whatever is in the file.)

(See the struct module to unpack formatted binary data)
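
A minimal sketch of struct in action. The record layout here is an assumption made up for the example (little-endian, an unsigned 16-bit id, a 32-bit float, then 4 raw bytes); a real file format defines its own layout.

```python
import struct

# Hypothetical layout: '<' little-endian, 'H' uint16, 'f' float32, '4s' 4 bytes
record_format = '<Hf4s'

# pack() builds the bytes you would normally get from f.read()
packed = struct.pack(record_format, 7, 1.5, b'abcd')

# unpack() turns those bytes back into Python values
record_id, value, tag = struct.unpack(record_format, packed)
size = struct.calcsize(record_format)

print(record_id)
print(value)
print(size)
```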

File Opening Modes

f = io.open('secrets.txt', [mode])
'r', 'w', 'a'
'rb', 'wb', 'ab'
r+, w+, a+
r+b, w+b, a+b
U
U+

These follow the Unix conventions, and aren’t all that well documented in the Python docs. But these BSD docs make it pretty clear:

http://www.manpagez.com/man/3/fopen/

Gotcha – ‘w’ modes always clear the file
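
A sketch of the gotcha, using a throwaway file in a temporary directory. The helpers write() and read() are made up for the example:

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')


def write(mode, text):
    f = io.open(path, mode, encoding='utf-8')
    f.write(text)
    f.close()


def read():
    f = io.open(path, encoding='utf-8')
    data = f.read()
    f.close()
    return data


write('w', u'first\n')
write('a', u'second\n')     # 'a' keeps what was there and adds to the end
after_append = read()

write('w', u'third\n')      # 'w' throws away the old contents
after_rewrite = read()

# Clean up the throwaway file and directory.
os.remove(path)
os.rmdir(os.path.dirname(path))

print(repr(after_append))
print(repr(after_rewrite))
```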

Text is default

  • Newlines are translated: \r\n -> \n (on both reading and writing!)
  • Use *nix-style in your code: \n
  • io.open() returns various “stream” objects – but they act like file objects.
  • In text mode, io.open() defaults to “Universal” newline mode.

Gotcha:

  • no difference between text and binary on *nix
  • breaks on Windows

io.open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True)

  • file is generally a file name or full path
  • mode is the mode for opening: ‘r’, ‘w’, etc.
  • buffering controls the buffering mode (0 for no buffering)
  • encoding sets the unicode encoding – only for text files – when set, you can ONLY write unicode objects to the file.
  • errors sets the encoding error mode: ‘strict’, ‘ignore’, ‘replace’,...
  • newline controls Universal Newline mode: lets you write DOS-type files on *nix, for instance (text mode only).
  • closefd controls close() behavior if a file descriptor, rather than a name, is passed in (advanced usage!)

(https://docs.python.org/2/library/io.html?highlight=io.open#io.open)

File Reading

Reading part of a file

header_size = 4096
f = open('secrets.txt')
secret_header = f.read(header_size)
secret_rest = f.read()
f.close()

Common Idioms

for line in io.open('secrets.txt'):
    print(line)

(the file object is an iterator!)

f = io.open('secrets.txt')
while True:
    line = f.readline()
    if not line:
        break
    do_something_with_line(line)
f.close()

File Writing

outfile = io.open('output.txt', 'w')
for i in range(10):
    # io.open() text files only accept unicode strings
    outfile.write(u"this is line: %i\n" % i)
outfile.close()
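
If something goes wrong part-way through writing, a plain close() call can get skipped. One way to guard against that, sketched here with the finally clause from the Exceptions section (the temporary path is only there to keep the example self-contained):

```python
import io
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), 'output.txt')

outfile = io.open(out_path, 'w', encoding='utf-8')
try:
    for i in range(3):
        # text-mode io.open() files take unicode strings, hence the u prefix
        outfile.write(u"this is line: %i\n" % i)
finally:
    # runs even if write() raised, so the file is always closed
    outfile.close()

closed = outfile.closed

infile = io.open(out_path, encoding='utf-8')
try:
    lines = infile.readlines()
finally:
    infile.close()

print(lines)
```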

File Methods

Commonly Used Methods

f.read() f.readline() f.readlines()

f.write(str) f.writelines(seq)

f.seek(offset) f.tell()

f.flush()

f.close()

File Like Objects

Many classes implement the file interface:

  • loggers
  • sys.stdout
  • urllib.urlopen()
  • pipes, subprocesses
  • StringIO

https://docs.python.org/2/library/stdtypes.html#file-objects

StringIO

In [417]: import StringIO
In [420]: f = StringIO.StringIO()
In [421]: f.write(u"somestuff")
In [422]: f.seek(0)
In [423]: f.read()
Out[423]: 'somestuff'

(handy for testing file handling code...)
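
A sketch of that testing idea. The function count_lines() is a made-up example of file handling code. This version uses io.StringIO rather than the StringIO module shown above, because io.StringIO is available in both Python 2 and Python 3 (it only accepts unicode strings).

```python
import io


def count_lines(f):
    """Count the non-blank lines in any file-like object."""
    return sum(1 for line in f if line.strip())


# No real file needed: the "file" is built in memory.
fake_file = io.StringIO(u"first\n\nsecond\nthird\n")
n = count_lines(fake_file)

print(n)
```

Because count_lines() only iterates over its argument, the same function works unchanged on an object returned by io.open().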

Paths and Directories

Paths

Paths are generally handled with simple strings (or Unicode strings)

Relative paths:

u'secret.txt'
u'./secret.txt'

Absolute paths:

u'/home/chris/secret.txt'

Either works with open(), etc.

(working directory only makes sense with command-line programs...)

os module

os.getcwd() -- os.getcwdu() (u for Unicode)
os.chdir(path)
os.path.abspath()
os.path.relpath()
os.path.split()
os.path.splitext()
os.path.basename()
os.path.dirname()
os.path.join()

(all platform independent)
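
A short sketch of a few of these working together on a made-up relative path. Nothing here touches the disk; the functions only manipulate strings.

```python
import os

# join() uses the right separator for the current platform
path = os.path.join('data', 'raw', 'secrets.txt')

directory, filename = os.path.split(path)      # split off the last component
root, ext = os.path.splitext(filename)         # split off the extension

print(path)
print(directory)
print(filename)
print(root)
print(ext)
print(os.path.abspath(path))   # depends on the current working directory
```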

os.listdir()
os.mkdir()
os.walk()

(higher level stuff in shutil module)

pathlib

pathlib is a new package for handling paths in an OO way:

http://pathlib.readthedocs.org/en/pep428/

It is now part of the Python3 standard library, and has been back-ported for use with Python2:

$ pip install pathlib

All the stuff in os.path and more:

In [64]: import pathlib
In [65]: pth = pathlib.Path('./')
In [66]: pth.is_dir()
Out[66]: True
In [67]: pth.absolute()
Out[67]: PosixPath('/Users/Chris/PythonStuff/CodeFellowsClass/sea-f2-python-sept14/Examples/Session04')
In [68]: for f in pth.iterdir():
             print(f)
junk2.txt
junkfile.txt
...

Homework

Assignments:

  • Task 9: dict/sets lab
  • Task 10: Exceptions
  • Task 11: Mailroom Madness
  • Task 12: Investigate Session 5

Task 9: Dictionaries and Sets

In your student folder, create a subdirectory called session04. Create a new branch called task9 and switch to it (git checkout -b task9).

Within the session04 subdirectory, create a new file called dict_lab.py.

Add the file to your clone of the repository and commit changes frequently while working on the following tasks. When you are done, push your changes to GitHub and issue a pull request.

  • Create a dictionary containing “name”, “city”, and “cake” for “Chris” from “Seattle” who likes “Chocolate”.
  • Display the dictionary.
  • Delete the entry for “cake”.
  • Display the dictionary.
  • Add an entry for “fruit” with “Mango” and display the dictionary.
    • Display the dictionary keys.
    • Display the dictionary values.
    • Display whether or not “cake” is a key in the dictionary (it should now be False).
    • Display whether or not “Mango” is a value in the dictionary.
  • Using the dict constructor and zip, build a dictionary of numbers from zero to fifteen and the hexadecimal equivalent (string is fine).
  • Using the dictionary from item 1: Make a dictionary using the same keys but with the number of ‘a’s in each value.
  • Create sets s2, s3 and s4 that contain the numbers from zero through twenty that are divisible by 2, 3 and 4, respectively.
  • Display the sets.
  • Display if s3 is a subset of s2 (False)
  • and if s4 is a subset of s2 (True).
  • Create a set with the letters in ‘Python’ and add ‘i’ to the set.
  • Create a frozenset with the letters in ‘marathon’
  • display the union and intersection of the two sets.

Task 10: Exceptions

  • Improving raw_input:
    • Create a new file, safe_input.py, add it to your repo, and submit a pull request. Make sure to make frequent commits with good commit messages.
  • The raw_input() function can generate two exceptions:
    • EOFError on end-of-file (EOF)
    • KeyboardInterrupt on canceled input
    • Create a wrapper function, perhaps safe_input(), that returns None rather than raising these exceptions.
  • Note:
    • ^C causes a KeyboardInterrupt
    • ^D (^Z on Windows) causes an EOFError
    • ^ is the Control key
  • The next step should be done in your mailroom.py file:
    • Update your mailroom.py program to use exceptions (and EAFP) to handle malformed numeric input (and other malformed input).
    • Make sure your commit message reflects that you’ve added this feature.

Task 11: Mailroom Madness

  • Using all you’ve learned so far, complete your mailroom program according to the pseudocode and flow chart you created last session.
    • use dicts where appropriate
    • see if you can use a dict to switch between the users selections
    • Try to use a dict and the .format() method to do the letter as one big template – rather than building up a big string in parts.
    • For extra fun, see if you can use a file to preserve the donation list and changes made to it while the program is running.

Task 12: Investigate Session 5

Read through the Session 05 slides.

http://codefellows.github.io/sea-c34-python/session05.html

There are three sections. Come up with three questions for each one.

  • Arguments (3 questions)
  • Comprehensions (3 questions)
  • Lambdas and Functional Programming (3 questions)

Write some Python code to help you answer them, one function per question.

For each function, write a good docstring describing what question you are trying to answer.

Put the functions in three separate modules (files) called arguments.py, comprehensions.py, and functional.py in the session05 subdirectory of your student directory.

That is, you should have nine questions, and nine functions, total, spread out across three files.

Use everything you’ve learned so far as needed (including lists, tuples, slicing, iteration, functions, booleans, printing, modules, assertions, dictionaries, sets, exceptions, file reading/writing, and paths).

Create a branch in your local repo called task12 and switch to it (git checkout -b task12).

Add your files to that branch, commit and push, then submit a pull request to the main class repo.

Finally, submit your assignment in Canvas by giving the URL of the pull request.