Introduction To Python: Part 3

During this lecture, we’ll learn a bit about Python sequences, strings and dictionaries. We’ll also cover looping, both determinate and indeterminate. Finally, we’ll talk about how to interact with data from the file system.

Sequences

Python is both strongly typed and dynamically typed. This combination leads to an approach to programming we call “Duck Typing”. So long as an object behaves like the kind of thing we want, we can assume it is the kind of thing we want.

Sequences are a prime example of this type of thinking.

In Python, a sequence refers to an ordered collection of objects. To be counted as a sequence, the object should support at least the following operations:

  • Indexing
  • Slicing
  • Membership
  • Concatenation
  • Length
  • Iteration

There are a number of standard data types in Python that fulfill this contract.

Python 2                    Python 3
byte string (str)           byte string (bytes)
unicode string (unicode)    unicode string (str)
list                        list
tuple                       tuple
bytearray                   bytearray
buffer                      memoryview
xrange object               range object

Of these types, the ones you will most often use are the string types, lists and tuples. The others are largely crafted for special purposes and you will rarely see them. However, the operations we will discuss next apply to all of them (with a few caveats).

Indexing

We can look up an object from within a sequence using the subscription operator: []. We use the index (position) of the object in the sequence to look it up. In Python, indexing always starts at 0.

In [98]: s = u"this is a string"
In [99]: s[0]
Out[99]: u't'
In [100]: s[5]
Out[100]: u'i'

We can also pass a negative integer as the index. This returns the object n positions from the end of the sequence:

In [105]: s = u"this is a string"
In [106]: s[-1]
Out[106]: u'g'
In [107]: s[-6]
Out[107]: u's'

If you ask for an object by an index that is beyond the end of the sequence, this causes an IndexError:

In [4]: s = [0, 1, 2, 3]
In [5]: s[4]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-5-42efaba84d8b> in <module>()
----> 1 s[4]

IndexError: list index out of range

Slicing

Indexing returns one object from a sequence. To get a new sequence containing elements from the original, we use slicing. This also uses the subscription operator, but with a bit of a syntactic twist. We use one or more colons (:) to separate the three available arguments, start, stop, and step:

seq[start:stop:step]

In slicing, asking for seq[start:stop] will return a new sequence (of the same type) containing all the elements of the original where start <= index < stop.

In [121]: s = u"a bunch of words"
In [122]: s[2]
Out[122]: u'b'
In [123]: s[6]
Out[123]: u'h'
In [124]: s[2:6]
Out[124]: u'bunc'
In [125]: s[2:7]
Out[125]: u'bunch'

It can often be helpful in slicing to think of the index values as pointing to the spaces between the items in the sequence:

  a       b   u   n   c   h       o   f
|   |   |   |   |   |   |   |   |   |
0   1   2   3   4   5   6   7   8   9

So why do we start with zero? Why is the stop index in the slice not included? Because doing things this way leads to some very nice properties for slices:

len(seq[a:b]) == b-a

seq[:b] + seq[b:] == seq

len(seq[:b]) == b

len(seq[-b:]) == b

As a result of these properties, it’s easier to avoid off-by-one errors in Python.
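The properties above can be checked mechanically. The following is a minimal sketch, assuming a sample string and indexes that stay within 0 <= a <= b <= len(seq); the variable names are only illustrative:

```python
# Check the slice identities on a sample string for every valid a and b.
seq = "a bunch of words"

for a in range(len(seq) + 1):
    for b in range(a, len(seq) + 1):
        assert len(seq[a:b]) == b - a
        assert seq[:b] + seq[b:] == seq
        assert len(seq[:b]) == b

# len(seq[-b:]) == b holds for 1 <= b <= len(seq).
# The case b == 0 is special: seq[-0:] is the same as seq[0:], the whole sequence.
for b in range(1, len(seq) + 1):
    assert len(seq[-b:]) == b

checked = True
```

Note the one edge case: because -0 is just 0, seq[-0:] returns the entire sequence rather than an empty one.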

The third argument to the slice operation is the step. It is used to control which items between start and stop are returned.

In [289]: string = u"a fairly long string"
In [290]: string[0:15]
Out[290]: u'a fairly long s'
In [291]: string[0:15:2]
Out[291]: u'afil ogs'
In [292]: string[0:15:3]
Out[292]: u'aallg'

Using a negative value for step can lead to a nifty way to reverse a sequence:

In [293]: string[::-1]
Out[293]: u'gnirts gnol ylriaf a'

As we’ve mentioned before, indexing a sequence returns a single object. Slicing returns a new sequence. There’s one other major difference between the two. Slicing past the end of a sequence does not cause an error:

In [129]: s = "a bunch of words"
In [130]: s[17]
----> 1 s[17]
IndexError: string index out of range
In [131]: s[10:20]
Out[131]: ' words'
In [132]: s[20:30]
Out[132]: ''

Membership

Sequence types support using the membership operators: in (py3) and not in (py3). These allow us to test for the presence (or absence) of an object in a sequence.

In [15]: s = [1, 2, 3, 4, 5, 6]
In [16]: 5 in s
Out[16]: True
In [17]: 42 in s
Out[17]: False
In [18]: 42 not in s
Out[18]: True

When used with the string types, the membership operators behave like substring in other languages. Use them to test whether a string contains another, shorter string:

In [20]: s = u"This is a long string"
In [21]: u"long" in s
Out[21]: True

This is only true for the string-type sequences. Can you think of why that might be?
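One way to explore that question is to compare what in does for a string and for a list. The sketch below uses made-up sample values:

```python
text = "This is a long string"
numbers = [1, 2, 3, 4]

# Strings: 'in' tests for a substring.
substring_found = "long" in text

# Lists: 'in' tests whether the left operand is one of the elements.
# [2, 3] is not an element of numbers, even though 2 and 3 both are.
sublist_found = [2, 3] in numbers

# It only matches when the list itself is stored as an element.
nested_found = [2, 3] in [1, [2, 3], 4]
```

Every element of a string is itself a string, so a substring test is unambiguous there. A list can hold other lists as elements, so in has to mean "is this object one of the elements".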

Concatenation

When used with sequences as operands, the + operator concatenates two sequences, and the * operator repeats a sequence a given number of times.

In [25]: s1 = u"left"
In [26]: s2 = u"right"
In [27]: s1 + s2
Out[27]: u'leftright'
In [28]: (s1 + s2) * 3
Out[28]: u'leftrightleftrightleftright'

Since slicing returns a new sequence, this applies to slices as well. This fact can allow for some very concise code.

For example (from CodingBat), let’s assume you need to create a new string that contains three repetitions of a given string. But if the given string is longer than three characters, you only want to use the first three.

A not-particularly-Pythonic solution to the problem might look like this:

def front3(str):
  if len(str) < 3:
    return str+str+str
  else:
    return str[:3]+str[:3]+str[:3]

But the truly Pythonic programmer can express the same thing this way:

def front3(str):
    return str[:3] * 3

Length

Sequences have length. To get the length of a sequence we use the len builtin (py3).

In [36]: s = u"how long is this, anyway?"
In [37]: len(s)
Out[37]: 25

Because of zero-based indexing, you must remember that the last index in a sequence is always len(s) - 1:

In [38]: count = len(s)
In [39]: s[count]
------------------------------------------------------------
IndexError                Traceback (most recent call last)
<ipython-input-39-5a33b9d3e525> in <module>()
----> 1 s[count]
IndexError: string index out of range

But honestly, using that is not Pythonic anyway. Always use seq[-1] to find the last item in a sequence.

If you care (and some do) about why Python uses len(x) instead of x.length(), you can read this post with an explanation of the rationale from BDFL Guido van Rossum.

Miscellaneous

There are a few other common operations (py3) on sequences you’ll want to know about.

The min (py3) and max (py3) builtins work as you might expect:

In [42]: all_letters = u"thequickbrownfoxjumpedoverthelazydog"
In [43]: min(all_letters)
Out[43]: u'a'
In [44]: max(all_letters)
Out[44]: u'z'

The index method returns the position of an object in a sequence. If the object is not in the sequence, this causes a ValueError:

In [46]: all_letters.index(u'd')
Out[46]: 21
In [47]: all_letters.index(u'A')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-47-2db728a46f78> in <module>()
----> 1 all_letters.index(u'A')

ValueError: substring not found

Finally, the count method will count the total number of occurrences of an object within a sequence. With strings, the object can be a single letter or a substring. If the object is not in the sequence, count does not raise an error. It simply returns 0:

In [52]: all_letters.count(u'o')
Out[52]: 4
In [53]: all_letters.count(u'the')
Out[53]: 2
In [54]: all_letters.count(u'A')
Out[54]: 0

Iteration

Repetition, Repetition, Repetition, Repe...

For Loops

We’ve already seen simple iteration over a sequence using for ... in:

In [170]: for x in "a string":
   .....:         print(x)
   .....:
a

s
t
r
i
n
g

Other languages build and use an index, which is then used to extract each item from the sequence:

for (var i = 0; i < arr.length; i++) {
    var value = arr[i];
    console.log(i + ") " + value);
}

Python does not require this. But if you need to have the index for some reason, you can use the enumerate builtin (py3):

In [140]: for idx, letter in enumerate(u'Python'):
   .....:     print(idx, letter, end=' ')
   .....:
0 P 1 y 2 t 3 h 4 o 5 n

We’ve seen how the range function (it’s a type in Python 3) can be useful for looping a known number of times. This is especially true when you don’t care about the value of the item from the sequence:

In [171]: for i in range(5):
   .....:     print('hello')
   .....:
hello
hello
hello
hello
hello

Remember that in Python, loops do not create a local namespace. The loop variable you use is still in scope after the loop terminates:

In [172]: x = 10
In [173]: for x in range(3):
   .....:     pass
   .....:
In [174]: x
Out[174]: 2

Loop control

Sometimes you want to interrupt or alter the flow of control through a loop. Loops can be controlled in two ways, with break and continue.

The break statement causes a loop to terminate immediately:

In [141]: for i in range(101):
   .....:     print(i, end=' ')
   .....:     if i > 50:
   .....:         break
   .....:
0 1 2 3 4 5... 46 47 48 49 50 51

And continue returns you immediately to the head of the loop. It allows you to skip statements later in the loop block while continuing the loop itself:

In [143]: for i in range(101):
   .....:     if i > 50:
   .....:         break
   .....:     if i < 25:
   .....:         continue
   .....:     print(i, end=' ')
   .....:
   25 26 27 28 29 ... 41 42 43 44 45 46 47 48 49 50

An interesting feature of Python loops is that there is an optional else clause. The statements in this optional block are only executed if the loop exits normally. That means only if break was not used to stop iteration:

In [147]: for x in range(10):
   .....:     if x == 11:
   .....:         break
   .....: else:
   .....:     print(u'finished')
finished
In [148]: for x in range(10):
   .....:     if x == 5:
   .....:         print(x)
   .....:         break
   .....: else:
   .....:     print(u'finished')
5

This can be surprisingly useful, even if the name is a bit hard to remember.
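A typical use is a search loop, where the else block handles the "nothing found" case. Here is a minimal sketch; the function name first_negative is hypothetical and only serves the illustration:

```python
def first_negative(numbers):
    """Describe the first negative number in numbers, if there is one."""
    for n in numbers:
        if n < 0:
            result = "found {0}".format(n)
            break
    else:
        # Runs only when the loop finished without hitting break,
        # which includes the case where numbers is empty.
        result = "no negatives"
    return result
```

Without the else clause you would need a separate flag variable to remember whether the loop ended early.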

While Loops

The while keyword is for when you don’t know how many loops you need. It continues to execute the body until the condition is no longer true:

while a_condition:
   some_code
   in_the_body

While loops are more general than for loops. You can always express a for loop using the while structure, but the reverse is not always true. On the other hand, while is more error prone. You must remember to make progress in the body of the loop in order to allow the condition to become False. Otherwise you can fall victim to infinite loops, as in this example, where i is never incremented:

i = 0
while i < 5:
    print(i)  # i never changes, so this loop never ends
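To illustrate the claim that a for loop can always be rewritten with while, here is a small sketch that processes the same sample string both ways. The variable names are illustrative only:

```python
letters = "abc"

# for-loop version: Python hands us each item in turn.
from_for = []
for letter in letters:
    from_for.append(letter.upper())

# Equivalent while-loop version: we manage the index ourselves.
from_while = []
i = 0
while i < len(letters):
    from_while.append(letters[i].upper())
    i += 1  # forgetting this line would make the loop run forever
```

The while version needs three extra pieces (the counter, the test, and the increment), and each one is a chance to make a mistake.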

There are three approaches to terminating a while loop. You can use the break statement to end iteration:

In [149]: i = 0
In [150]: while True:
   .....:     i += 1
   .....:     if i > 10:
   .....:         break
   .....:     print(i, end=' ')
   .....:
1 2 3 4 5 6 7 8 9 10

Another approach is to set a flag variable. The boolean value of this variable starts as True. Operations inside the loop update it to False, terminating the loop:

In [156]: import random
In [157]: keep_going = True
In [158]: while keep_going:
   .....:     num = random.choice(range(5))
   .....:     print(num)
   .....:     if num == 3:
   .....:         keep_going = False
   .....:
3

Finally, you can use a straight conditional statement as the test. Here, you update the value of the test variable such that the condition will evaluate to False:

In [160]: i = 0
In [161]: while i < 10:
   .....:     i += random.choice(range(4))
   .....:     print(i)
   .....:
0 0 2 3 4 6 8 8 8 9 12

Similarities

Both for and while loops can use break and continue for internal flow control. Both for and while loops can have an optional else block. In both loops, the statements in the else block are only executed if the loop terminates normally (no break).

String Features

Fun with Strings

Unicode v. Bytes

Python has two string types: byte strings and unicode objects.

Unicode is a standard intended to allow a representation of all possible characters in all possible languages. Each character has a code point, a number that identifies that character. An encoding such as UTF-8 maps each code point to one or more bytes. When printed, these code points are translated into appropriate glyphs by the operating system.

When working in Python, you should always handle text as unicode objects. Text can be defined as any string meant to be read by a human via some output device.

Handling of unicode and bytes in Python3 is significantly different from Python2. In order to create compatible code (that will run the same in both systems), you should use one of the following two strategies:

You can import unicode_literals from the __future__ library. This must be the first line of code in your Python module.

from __future__ import unicode_literals
'this is a unicode string with élan'

Another approach is to be explicit about what type of string you are writing, using object literals:

u'this is a unicode string with élan'

The former strategy is a bit easier, but is not always safe in older legacy code bases, as it is an all-or-nothing operation. It makes every single string in the file a unicode object. The latter strategy is safer in this respect, as you get to choose which is which.

You can read more about compatible string handling at the Python-Future website.

Byte strings are strings that are composed entirely of numbers. This can be a bit confusing because they often appear to be letters. The string b"a" appears to contain the letter a, but really it contains the number 97 (or 01100001). Your terminal, your text editor, your OS is responsible for translating those numbers into characters when showing you the content of the string. But it’s still the number underneath. Be cautious about your assumptions.

Again, you have two strategies to work with bytestrings safely in Python 2 and Python 3. You can import unicode_literals and then mark specific strings as bytestrings, or you can skip the import and mark them anyway. In either case, you have to mark bytestrings with the b prefix:

from __future__ import unicode_literals
b'polishing my resum\xc3\xa9 this week'

b'polishing my resum\xc3\xa9 this week'

The conversion of bytes to unicode and vice-versa should always take place at the I/O boundary. That means at the point where data passes out of Python to the filesystem or network, or at the point where data enters Python from the filesystem or network.

At the point of crossing outbound, we can use the encode method of unicode objects to convert them to bytes. The argument to this function controls which codec is used to make the conversion. UTF8 is the most common codec in web work.

In [1]: fancy = u"Resumé"
In [2]: fancy
Out[2]: 'Resumé'
In [3]: fancy.encode('utf8')
Out[3]: b'Resum\xc3\xa9'

When data is inbound to Python, we can use the decode method of a byte string to convert it to Unicode. Again, passing a codec name selects which should be used for the conversion:

In [4]: bytes = _
In [5]: bytes
Out[5]: b'Resum\xc3\xa9'
In [6]: bytes.decode('utf8')
Out[6]: 'Resumé'

If no codec is specified, Python uses a default encoding. In Python 2 this is usually ascii, which is almost never the thing you really want. In Python 3, encode and decode default to utf-8. Be specific.

In Python 2, conversion of bytes to unicode and back was one of the largest sources of problems in programs. Both the encode and decode methods were supported by both byte strings and unicode objects. This led to a lot of implicit conversion, which of course uses default encoding.

It’s very easy when working entirely in English to have these types of problems and not know about them. If the characters in a string fall entirely within the ascii set, then no errors will occur. But as soon as characters beyond ascii are used, all sorts of trouble pops up.

Watch for UnicodeDecodeError and UnicodeEncodeError and write tests that use non-ascii characters.
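The following sketch shows both errors being triggered on purpose with a non-ascii character, and the round trip that works when the same codec is used in both directions. The sample text is arbitrary:

```python
text = u"Resum\u00e9"  # the last character is e with an acute accent

# Encoding with a codec that cannot represent the character raises an error.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

utf8_bytes = text.encode("utf-8")

# Decoding those UTF-8 bytes with the wrong codec fails in the other direction.
try:
    utf8_bytes.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Using the same codec both ways returns the original text.
round_trip = utf8_bytes.decode("utf-8")
```

A test suite that only ever uses plain ascii input would never exercise either except branch, which is why non-ascii test data matters.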

String Manipulation

You can break strings apart using the split (py3) method. You have to make sure that the string you are splitting and the string you are using to split it are of the same type (bytes or unicode). The result is a list of the pieces:

In [167]: csv = "comma, separated, values"
In [168]: csv.split(', ')
Out[168]: ['comma', 'separated', 'values']

In the other direction, calling the join (py3) method will connect a sequence of pieces using the string on which it is called:

In [169]: psv = '|'.join(csv.split(', '))
In [170]: psv
Out[170]: 'comma|separated|values'

There are methods that allow us to change the case of text:

In [171]: sample = u'A long string of words'
In [172]: sample.upper()
Out[172]: u'A LONG STRING OF WORDS'
In [173]: sample.lower()
Out[173]: u'a long string of words'
In [174]: sample.swapcase()
Out[174]: u'a LONG STRING OF WORDS'
In [175]: sample.title()
Out[175]: u'A Long String Of Words'

And there are methods that allow us to test the nature of the characters in the text:

In [181]: number = u"12345"
In [182]: number.isnumeric()
Out[182]: True
In [183]: number.isalnum()
Out[183]: True
In [184]: number.isalpha()
Out[184]: False
In [185]: fancy = u"Th!$ $tr!ng h@$ $ymb0l$"
In [186]: fancy.isalnum()
Out[186]: False

Every character in a string has a numeric value. To see this value, use the ord (py3) builtin. The chr (py3) builtin reverses the process:

In [109]: for i in 'Cris':
   .....:     print(ord(i), end=' ')
67 114 105 115
In [110]: for i in (67,114,105,115):
   .....:     print(chr(i), end=' ')
C r i s

Building Strings

The concatenation operator + works for building strings out of fragments. But every + creates a new intermediate string, so building a long string from many fragments this way is not efficient. Avoid it.

Instead, use string formatting:

'Hello {0}!'.format(name)

It’s easier to read, and easier to maintain over time.

When building a format string, the placeholder is a pair of curly braces. They can be empty, but it’s better to put an integer into one, indicating the index of the argument to format (py3) to use. You can also pass keyword arguments to format, if the placeholders contain names instead of integers:

"My name is {1} {0}".format('Ewing', 'Cris')
"The {name} are {status}!".format(
    name='Seahawks', status='awesome'
)

Especially in legacy code you will see another method of formatting, using the % operator.

"This is a %s %s" % ('format', 'template')

This is still a functioning alternative and there is no pressing need to update legacy code. But you should prefer the new style when writing new code. The one real difference is that the % operator supports both bytes and unicode objects, whereas in Python 3 .format is only a method on unicode objects.

There is a good website available that will help you learn everything you want to know about the formatting mini-language you can use to control these format specifiers.
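As a taste of that mini-language, here is a short sketch of a few common format specifiers. The values are made up for illustration:

```python
price = 3.14159

padded = "{0:>8}".format("right")        # right-align in 8 columns
rounded = "{0:.2f}".format(price)         # fixed point, two decimal places
zero_filled = "{0:05d}".format(42)        # pad an integer with zeros to width 5
named = "{name:<6}|".format(name="ab")    # left-align a named field in 6 columns
```

Everything after the colon inside the braces is the format specifier. The part before the colon is still the index or name of the argument to use.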

Dictionaries and Sets

Dictionaries in Python are a mapping of keys to values. In other languages, they are called:

  • associative array
  • map
  • hash table
  • hash
  • key-value pair

The correct name of the type in Python is dict (py3).

You can build a new dict in a number of ways.

You can use the object literal:

{'key1': 3, 'key2': 5}

You can call the dict type object with a sequence of two-tuples. The first in each will become the key, the second the value:

>>> dict([('key1', 3),('key2', 5)])
{'key1': 3, 'key2': 5}

You can also use keyword arguments to the dict type object. In this case, you are limited to keys which are legal python names:

>>> dict(key1=3, key2=5)
{'key1': 3, 'key2': 5}

Indexing

To look up a value in a dict, we use the subscription operator, just like with sequences:

>>> d = {'name': 'Brian', 'score': 42}
>>> d['score']
42
>>> d = {1: 'one', 0: 'zero'}
>>> d[0]
'zero'

If you provide a key that is not in the dictionary, a KeyError is caused:

>>> d['non-existing key']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'non-existing key'

In a certain sense, Python is built on dicts. Namespaces are implemented as dicts. For this reason, the performance of lookup is highly optimized. Lookup time for any object is constant, regardless of the size of the dict.

When storing a value in a dict, you use a key. This key can be any immutable object (more on that later). In actuality, any object that is hashable can be used. What does that mean, though?

Hashing

Hashing is the process of converting arbitrarily large data to a small proxy (usually an integer). You can use any number of different algorithms to do this: MD5, SHA, etc. The key (if you’ll forgive the pun) is that the algorithm must always return the same proxy for the same input. In a dict, keys are hashed to an integer proxy, which is used to find a location in an array behind the scenes. This is efficient because a good hashing algorithm means only a very few key/value pairs correlate to any proxy.

What would happen if the proxy changed after a value was stored in the dict? Hashability requires that the object being hashed be immutable.
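A quick way to see the rule in action is to try both kinds of object as keys. This is a minimal sketch using the hash builtin and arbitrary sample keys:

```python
# Equal objects produce equal hashes (within a single run of Python).
same_hash = hash(u"key") == hash(u"key")

d = {}
d[(1, 2)] = "a tuple works as a key"   # a tuple of hashable items is hashable

# A list is mutable, so Python refuses to hash it.
try:
    d[[1, 2]] = "a list does not"
    list_allowed = True
except TypeError:
    list_allowed = False
```

The TypeError raised for the list reports that the type is unhashable, which is Python protecting you from a key whose proxy could change after it was stored.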

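In Python 2, and in Python 3 before version 3.7, dicts are unordered collections. When you print them out, or look at them in the interpreter, this is not apparent. You will be fooled into thinking that you can rely on the order of the pairs. This is not true. Since Python 3.7, dicts do preserve insertion order, but code that must also run on older versions cannot rely on it.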

In [352]: d = {'one':1, 'two':2, 'three':3}
In [353]: d
Out[353]: {'one': 1, 'three': 3, 'two': 2}
In [354]: d.keys()
Out[354]: ['three', 'two', 'one']

Iteration and Dicts

You can use a dict with a for loop. By default, the keys are what are iterated over.

In [15]: d = {'name': 'Brian', 'score': 42}

In [16]: for x in d:
   ....:     print(x)
   ....:
score
name

If you want to iterate over values, or perhaps over the key/value pairs in the dict there are methods to support that.

In [2]: d.keys()
Out[2]: dict_keys(['score', 'name'])
In [3]: d.values()
Out[3]: dict_values([42, 'Brian'])
In [4]: d.items()
Out[4]: dict_items([('score', 42), ('name', 'Brian')])

In Python 2, there were nine methods on dicts that supplied these behaviors. The keys, values and items methods returned lists. The iter... methods (iterkeys, etc.) returned iterators, which were much more efficient for large dicts. The view... methods (viewkeys, etc.) returned dict views, which could be iterated over like iterators but also updated themselves as the dictionary changed.

In Python 3, the three remaining methods operate like the last of those. To get semantically equivalent code in Python 3, use the following map:

Python 2            Python 3
d.keys()            list(d.keys())
d.values()          list(d.values())
d.items()           list(d.items())
d.iterkeys()        iter(d.keys())
d.itervalues()      iter(d.values())
d.iteritems()       iter(d.items())
d.viewkeys()        d.keys()
d.viewvalues()      d.values()
d.viewitems()       d.items()

You should also refer to Python Futures for additional compatible idioms.
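The practical difference between a view and a list copy shows up when the dict changes after you ask for its keys. A minimal Python 3 sketch, with sample data only:

```python
d = {"name": "Brian", "score": 42}

keys_view = d.keys()             # a live view in Python 3
keys_snapshot = list(d.keys())   # a list copy, like Python 2's d.keys()

d["level"] = 3                   # change the dict afterwards

view_sees_new_key = "level" in keys_view          # the view follows the dict
snapshot_sees_new_key = "level" in keys_snapshot  # the copy does not
```

If your code relies on a stable list of keys, for example because it modifies the dict while looping, take the list(...) copy explicitly.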

Performance

Dictionaries are optimized for inserting and retrieving values:

  • indexing is fast and constant time: O(1)
  • membership (x in d) is constant time: O(1)
  • visiting all items is proportional to n: O(n)
  • inserting is constant time: O(1)
  • deleting is constant time: O(1)

More on what exactly that means soon.

Miscellaneous

You can find all the methods of the dict type in the Python standard library documentation. But here are a number of interesting methods you may find useful:

Membership (on keys):

In [5]: d
Out[5]: {'that': 7, 'this': 5}

In [6]: 'that' in d
Out[6]: True

In [7]: 'this' not in d
Out[7]: False

The get method (py3) allows you to get a value or returns a default if the key you seek is not in the dict. The default value returned is None, but you can control it. It has the advantage of never causing a KeyError:

In [9]: d.get('this')
Out[9]: 5
In [11]: d.get(u'something', u'a default')
Out[11]: u'a default'

To remove a key/value pair from a dict, we use the pop method (py3). It takes the key as a required argument. The value corresponding to the key is returned and the key/value pair is removed. To remove an arbitrary key/value pair instead, use the popitem method, which takes no arguments and returns the removed pair as a (key, value) tuple.

In [19]: d.pop('this')
Out[19]: 5
In [20]: d
Out[20]: {'that': 7}
In [23]: d.popitem()
Out[23]: ('that', 7)
In [24]: d
Out[24]: {}

One of the most useful methods on the dict type is setdefault (py3). You pass it a key and a default value. If the key is present in the dict, the stored value is returned. If the key is not present, then the default value is stored and returned.

In [26]: d = {}
In [27]: d.setdefault(u'something', u'a value')
Out[27]: u'a value'
In [28]: d
Out[28]: {u'something': u'a value'}
In [29]: d.setdefault(u'something', u'a different value')
Out[29]: u'a value'
In [30]: d
Out[30]: {u'something': u'a value'}
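A common use of setdefault is grouping items under a key without first checking whether the key exists. The sketch below groups a made-up list of words by first letter:

```python
words = ["apple", "avocado", "banana", "blueberry", "cherry"]

by_letter = {}
for word in words:
    # If the letter is new, store an empty list for it.
    # Either way, setdefault returns the stored list, so we can append to it.
    by_letter.setdefault(word[0], []).append(word)
```

Because setdefault returns the stored value, the lookup, the initialization and the append all fit on one line.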

Sets

A set is an unordered collection of distinct values. You can think of a set as a dict which has only keys and no values. You can create a set using the set literal ({1, 2, 3}) or the set type object. Note that empty braces ({}) create an empty dict, so use set() to make an empty set:

In [4]: {1, 2, 3}
Out[4]: {1, 2, 3}
In [5]: set()
Out[5]: set()
In [6]: set([1, 2, 3])
Out[6]: {1, 2, 3}
In [8]: s = set()
In [9]: s.update([1, 2, 3])
In [10]: s
Out[10]: {1, 2, 3}
In [11]: s.add(4)
In [12]: s
Out[12]: {1, 2, 3, 4}

Sets share a lot of properties with dicts. Members of a set must be hashable, like dictionary keys, and for the same reason. Sets are also unordered, and so you cannot index them:

>>> s[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object does not support indexing

Sets support similar operations to dicts as well.

In [1]: s = set([1])
In [2]: s.pop()
Out[2]: 1
In [3]: s.pop()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-e76f41daca5e> in <module>()
----> 1 s.pop()
KeyError: 'pop from an empty set'

In [4]: s = set([1,2,3])
In [5]: s.remove(2)
In [6]: s.remove(2)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-542ac1b736c7> in <module>()
----> 1 s.remove(2)
KeyError: 2

Beyond all this, sets also operate as traditional mathematical sets. You get all the operations you might remember from set theory class:

s.isdisjoint(other)

s.issubset(other)

s.union(other, ...)

s.intersection(other, ...)

s.difference(other, ...)

s.symmetric_difference(other)
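Here is a short sketch of those operations applied to two small sample sets:

```python
evens = {0, 2, 4, 6, 8}
small = {0, 1, 2, 3, 4}

both = evens.intersection(small)                  # members of both sets
either = evens.union(small)                       # members of at least one set
only_evens = evens.difference(small)              # in evens but not in small
exactly_one = evens.symmetric_difference(small)   # in one set but not both

is_sub = {2, 4}.issubset(evens)
disjoint = evens.isdisjoint({1, 3, 5})
```

Each of these methods returns a new set (or a boolean) and leaves the original sets unchanged.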

Finally, if you need to have an immutable object that functions like a set, Python provides the frozenset type. It works just like a set, except that once constructed it may not be altered:

>>> fs = frozenset((3,8,5))
>>> fs.add(9)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'frozenset' object has no attribute 'add'

File Reading and Writing

It’s often useful in programming to be able to open files in order to read data from them, or write data to them. Python has, for a long time, had a built-in function open which handled this operation. However, this builtin was created in the days before unicode was widely used, and it does not handle text that contains unicode particularly well.

For this reason, we have moved away from using the built-in open and toward using the open function from the io module. The io.open function is available in both Python 2 (2.6 and 2.7) and Python 3 (where the built-in open is the same function) and so provides a cross-compatible approach to opening files.

import io
f = io.open('secrets.txt', encoding='utf-8')
secret_data = f.read()
f.close()

By default, files are opened in “read text” mode, which automatically decodes the bytes contained in the file to unicode. You may provide an encoding keyword argument to control the “codec” that is used to perform this conversion. The most common codec you will use is “utf-8”. If you do not provide an encoding, then io.open will default to the value of locale.getpreferredencoding(), which depends on the platform and may not be what you want. Be specific.

If you have needs other than reading unicode text, you may use the mode argument ('rb' below) to control aspects of your interaction with the file.

For example, the rb mode opens a file for reading bytes. When you read data from a file opened in rb mode, it will be a bytestring. If you need to convert it to unicode, you can call decode on it at that point.

f = io.open('secrets.bin', 'rb')
secret_data = f.read()
f.close()

There are a number of modes available for files. Unfortunately, Python’s own documentation of the meaning of these modes is not very clear. You can use the man page for the unix command fopen, which supports the same modes and has much better information. One thing you should be careful of: merely opening a file in w mode will always truncate the file, rendering it empty.

When you open a file with io.open it defaults to using “Universal Newline” mode. This means that while some operating systems use different characters as line endings, Python will always translate them into the *nix-style "\n" when you read data from the file. The "\n" characters are translated back to OS-native line endings when you write data out to a file. You should always use the "\n" character as a line ending when writing strings in Python.

You should be aware that although there is no difference between reading in bytes or text mode on Unix operating systems, the two are quite different in Windows. You’ll be tempted to accept the default and always open files in text mode. This will break binary files on Windows. Don’t do it. Get in the habit of thinking carefully about whether the content of the file you are reading is text or binary data.

Beyond the mode and encoding parameters we’ve discussed, there are a number of other parameters to the io.open function.

The required first argument is the path to the file you wish to open.

The errors parameter allows you to control how errors in decoding file bytes to Python unicode are handled. You may specify that you wish to ignore errors, replace broken characters with a specific identifier, or use strict mode to force errors to terminate reading.
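The three error-handling choices can be compared side by side. This is a minimal sketch, assuming a temporary file that deliberately contains a byte sequence that is not valid UTF-8:

```python
import io
import os
import tempfile

# Create a scratch file holding bytes that are valid Latin-1 but not UTF-8.
fd, path = tempfile.mkstemp()
os.close(fd)
f = io.open(path, 'wb')
f.write(b'caf\xe9!')
f.close()

# 'replace' substitutes the Unicode replacement character for the bad byte.
f = io.open(path, encoding='utf-8', errors='replace')
replaced = f.read()
f.close()

# 'ignore' silently drops the bad byte.
f = io.open(path, encoding='utf-8', errors='ignore')
ignored = f.read()
f.close()

# 'strict' raises an error when the bad byte is reached.
f = io.open(path, encoding='utf-8', errors='strict')
try:
    f.read()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
finally:
    f.close()
    os.remove(path)
```

Both 'replace' and 'ignore' lose information, so they are best reserved for cases where a damaged character is acceptable.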

The other parameters are more advanced and will not come up often in your work with files. You may read about them in the io module documentation (py3).

Once you have an open file, you can read from it. The read method accepts an optional argument giving the maximum amount to read: a number of characters in text mode, or a number of bytes in binary mode. If you provide no value, the entire file (starting at your current position) is read.

header_size = 4096
f = io.open('secrets.txt', encoding='utf-8')
secret_header = f.read(header_size)
secret_rest = f.read()
f.close()

Files are iterators, which means you can iterate through the lines of text they contain like so:

for line in io.open('secrets.txt', encoding='utf-8'):
    print(line)

In addition, the readline method will read a single line at a time. The readlines method will read a file and return a list containing the lines of the file.

If you wish to write text to a file you’ve opened in w mode, you may do so with the write method. Newlines are not automatically appended to text written this way. If you want lines in your file, you must write the newline characters yourself (or place them in your text).

outfile = io.open('output.txt', 'w', encoding='utf-8')
for i in range(10):
    outfile.write(u"this is line: {0}\n".format(i))
outfile.close()

When you’ve opened a file in a “read” mode, you have the following methods available:

f.read() f.readline() f.readlines()

In “write” mode, you have rough equivalents. The write method writes a string to an open file. The writelines method takes a sequence of strings and writes them to a file. Remember, newlines are not automatically added (despite the name).

f.write(str) f.writelines(seq)

In any mode, a file has a few methods you can use to navigate through the file.

The file.seek(offset) method will move the file pointer to the byte of the file given by ‘offset’. The file.tell() method will return the byte number of the current position of the file pointer.

f.seek(offset) f.tell()
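The effect of seek and tell is easiest to see on a small in-memory binary file. This sketch uses io.BytesIO as a stand-in for a real file opened in rb mode:

```python
import io

f = io.BytesIO(b"0123456789")

first = f.read(4)        # reads bytes 0 through 3
position = f.tell()      # the pointer now sits at byte 4

f.seek(2)                # jump back to byte 2
after_seek = f.read(3)   # reads bytes 2 through 4

f.seek(0)                # rewind to the start
everything = f.read()
f.close()
```

Every read advances the file pointer, so a second read() with no seek in between continues where the previous one stopped.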

Finally, whenever you open a file, you must also close it. On certain operating systems, if you fail to do so it can render the file unusable by any other process.

file.close()
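Python can also close a file for you. The with statement, which is not covered above, closes the file when its block ends, even if an error occurs inside it. A minimal sketch, assuming a scratch file in a temporary directory:

```python
import io
import os
import tempfile

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'output.txt')

# The file is closed automatically when the with block ends.
with io.open(path, 'w', encoding='utf-8') as outfile:
    for i in range(3):
        outfile.write(u"this is line: {0}\n".format(i))

closed_after_block = outfile.closed

with io.open(path, encoding='utf-8') as infile:
    lines = infile.readlines()

# Clean up the scratch file and directory.
os.remove(path)
os.rmdir(tmpdir)
```

This removes the risk of forgetting close() or of skipping it when an exception interrupts your code.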

In Python, we say that anything which implements both the read and write method is File-like. There are a number of types which are file-like:

  • loggers
  • sys.stdout
  • urllib.urlopen() (urllib.request.urlopen() in Python 3)
  • pipes, subprocesses
  • StringIO

When you have a file-like object you can treat it as if it were a file. A common use-case for this involves using the io.StringIO class. This class constructs an in-memory buffer that operates just like a file:

In [417]: from io import StringIO
In [420]: f = StringIO()
In [421]: f.write(u"somestuff")
In [422]: f.seek(0)
In [423]: f.read()
Out[423]: 'somestuff'

When writing tests for file-handling code this can be very useful. It allows you to make “fake files” that operate just like the real thing.
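As an illustration, suppose you have a function that takes an open file and counts its non-blank lines. The function below is hypothetical, written only for this sketch, and the test feeds it a StringIO instead of touching the disk:

```python
from io import StringIO


def count_nonblank_lines(fileobj):
    """Count the lines in fileobj that contain more than whitespace."""
    return sum(1 for line in fileobj if line.strip())


# A "fake file" with three real lines and two blank ones.
fake_file = StringIO(u"first\n\nsecond\n   \nthird\n")
result = count_nonblank_lines(fake_file)
```

Because the function only iterates over its argument, it cannot tell the difference between the StringIO and a file returned by io.open.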

Legacy code will often contain references to modules named StringIO or cStringIO. These modules should be considered superseded by the io.StringIO class.

Paths and Directories

In Python, paths are often handled with simple strings (or Unicode strings). You can make absolute paths:

b"/home/cris/stuff.txt"
u'/usr/local/bin/python3'

or relative paths:

u'./secret.txt'
b'src/test_ack.py'

Either relative or absolute paths, bytes or unicode objects will work as the path argument to io.open().

The os module from the Python standard library gives you a number of useful tools for interacting with paths. You can get the current working directory with os.getcwd. You can change directories with os.chdir. You can turn any relative path into an absolute path with os.path.abspath. You can obtain the relative path from your current location to any absolute path with os.path.relpath().

os.getcwd() -- os.getcwdu() (u for Unicode, Python 2 only)
os.chdir(path)
os.path.abspath()
os.path.relpath()

It’s possible to list the contents of a directory with os.listdir. You can make new directories with os.mkdir. You can even walk an entire file system using os.walk.
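The sketch below builds a tiny directory tree in a temporary location, lists the top level with os.listdir, and then walks the whole tree with os.walk. The file and directory names are invented for the example:

```python
import os
import shutil
import tempfile

# Build a small scratch tree: root/a.txt and root/sub/b.txt
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, 'sub'))
for name in ('a.txt', os.path.join('sub', 'b.txt')):
    f = open(os.path.join(root, name), 'w')
    f.write('x')
    f.close()

# listdir returns only the names directly inside root.
top_level = sorted(os.listdir(root))

# walk visits root and every directory below it.
found = []
for dirpath, dirnames, filenames in os.walk(root):
    for filename in filenames:
        full = os.path.join(dirpath, filename)
        found.append(os.path.relpath(full, root))
found.sort()

shutil.rmtree(root)  # remove the scratch tree
```

Each step of os.walk yields the current directory path, the names of its subdirectories, and the names of its files, so joining dirpath and filename gives a usable path.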

There’s much, much more to learn. Check out the documentation (py3).

Finally, if you’d prefer to work with paths in an Object-oriented style, the Python 3 standard library has added a new module, pathlib. This module can also be pip installed in Python 2 for compatibility.

With the module, you can create paths as objects, and then work with methods on them. This allows you access to all the operations in os.path and more.

In [1]: import pathlib
In [2]: pth = pathlib.Path('.')
In [3]: pth.is_dir()
Out[3]: True
In [4]: pth.absolute()
Out[4]: PosixPath('/Users/nmhw/projects/training/codefellows/existing_course_repos/python-dev-accelerator')
In [5]: for f in pth.iterdir():
   ...:     print(f)
   ...:
.git
.gitignore
bin
build
cfpython.sublime-project
...