Introduction To Python: Part 3¶
During this lecture, we’ll learn a bit about Python sequences, strings and dictionaries. We’ll also cover looping, both determinate and indeterminate. Finally, we’ll talk about how to interact with data from the file system.
Sequences¶
Python is both strongly typed and dynamically typed. This combination leads to an approach to programming we call “Duck Typing”. So long as an object behaves like the kind of thing we want, we can assume it is the kind of thing we want.
Sequences are a prime example of this type of thinking.
In Python, a sequence refers to an ordered collection of objects. To be counted as a sequence, the object should support at least the following operations:
- Indexing
- Slicing
- Membership
- Concatenation
- Length
- Iteration
There are a number of standard data types in Python that fulfill this contract.
Python 2 | Python 3 |
---|---|
byte string (str) | byte string (bytes) |
unicode string (unicode) | unicode string (str) |
list | list |
tuple | tuple |
bytearray | bytearray |
buffer | memoryview |
xrange object | range object |
Of these types, the ones you will most often use are the string types, lists and tuples. The others are largely crafted for special purposes and you will rarely see them. However, the operations we will discuss next apply to all of them (with a few caveats).
Indexing¶
We can look up an object from within a sequence using the subscription operator: [].
We use the index (position) of the object in the sequence to look it up.
In Python, indexing always starts at 0.
In [98]: s = u"this is a string"
In [99]: s[0]
Out[99]: u't'
In [100]: s[5]
Out[100]: u'i'
We can also pass a negative integer as the index.
This returns the object n positions from the end of the sequence:
In [105]: s = u"this is a string"
In [106]: s[-1]
Out[106]: u'g'
In [107]: s[-6]
Out[107]: u's'
If you ask for an object by an index that is beyond the end of the sequence, this raises an IndexError:
In [4]: s = [0, 1, 2, 3]
In [5]: s[4]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-5-42efaba84d8b> in <module>()
----> 1 s[4]
IndexError: list index out of range
Slicing¶
Indexing returns one object from a sequence.
To get a new sequence containing elements from the original, we use slicing.
This also uses the subscription operator, but with a bit of a syntactic twist.
We use one or more colons (:) to separate the three available arguments, start, stop, and step:
seq[start:stop:step]
In slicing, asking for seq[start:stop] will return a new sequence (of the same type) containing all the elements of the original where start <= index < stop.
In [121]: s = u"a bunch of words"
In [122]: s[2]
Out[122]: u'b'
In [123]: s[6]
Out[123]: u'h'
In [124]: s[2:6]
Out[124]: u'bunc'
In [125]: s[2:7]
Out[125]: u'bunch'
It can often be helpful in slicing to think of the index values as pointing to the spaces between the items in the sequence:
  a       b   u   n   c   h       o   f
|   |   |   |   |   |   |   |   |   |
0   1   2   3   4   5   6   7   8   9
So why do we start with zero?
Why is the stop index in the slice not included?
Because doing things this way leads to some very nice properties for slices:
len(seq[a:b]) == b-a
seq[:b] + seq[b:] == seq
len(seq[:b]) == b
len(seq[-b:]) == b
As a result of these properties, it’s easier to avoid off-by-one errors in Python.
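The properties above can be checked directly. The following is a minimal sketch, not part of the original lecture; the helper name check_slice_properties is made up for illustration, and it assumes 0 <= a <= b <= len(seq) with b greater than zero.

```python
def check_slice_properties(seq, a, b):
    """Return True if the slice identities hold.

    Assumes 0 <= a <= b <= len(seq) and b > 0 (an assumption of this sketch).
    """
    return (
        len(seq[a:b]) == b - a
        and seq[:b] + seq[b:] == seq
        and len(seq[:b]) == b
        and len(seq[-b:]) == b
    )


# The same checks work for any sequence type, which is duck typing at work.
print(check_slice_properties(u"a bunch of words", 2, 7))
print(check_slice_properties([0, 1, 2, 3, 4], 1, 4))
```

Both calls should print True for inputs that satisfy the stated assumption.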
The third argument to the slice operation is the step. It is used to control which items between start and stop are returned.
In [289]: string = u"a fairly long string"
In [290]: string[0:15]
Out[290]: u'a fairly long s'
In [291]: string[0:15:2]
Out[291]: u'afil ogs'
In [292]: string[0:15:3]
Out[292]: u'aallg'
Using a negative value for step can lead to a nifty way to reverse a sequence:
In [293]: string[::-1]
Out[293]: u'gnirts gnol ylriaf a'
As we’ve mentioned before, indexing a sequence returns a single object. Slicing returns a new sequence. There’s one other major difference between the two. Slicing past the end of a sequence does not cause an error:
In [129]: s = "a bunch of words"
In [130]: s[17]
----> 1 s[17]
IndexError: string index out of range
In [131]: s[10:20]
Out[131]: ' words'
In [132]: s[20:30]
Out[132]: ''
Membership¶
Sequence types support using the membership operators: in (py3) and not in (py3).
These allow us to test for the presence (or absence) of an object in a sequence.
In [15]: s = [1, 2, 3, 4, 5, 6]
In [16]: 5 in s
Out[16]: True
In [17]: 42 in s
Out[17]: False
In [18]: 42 not in s
Out[18]: True
When used with the string types, the membership operators behave like substring in other languages.
Use them to test whether a string contains another, shorter string:
In [20]: s = u"This is a long string"
In [21]: u"long" in s
Out[21]: True
This is only true for the string-type sequences. Can you think of why that might be?
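One way to explore that question is to run the same kind of test against a list and against a string. This is a small sketch with made-up variable names, not an official answer:

```python
numbers = [1, 2, 3, 4]

# A list is compared against each *element*, so a sub-list is not "found".
sublist_found = [2, 3] in numbers
element_found = 2 in numbers

# It is found only when the list itself is one of the elements.
nested_found = [2, 3] in [1, [2, 3], 4]

# The elements of a string are themselves strings, so matching a run of
# characters is a natural extension for string types.
substring_found = u"long" in u"This is a long string"

print(sublist_found, element_found, nested_found, substring_found)
```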
Concatenation¶
When used with sequences as operands, the + operator concatenates sequences and the * operator repeats a sequence a given number of times.
In [25]: s1 = u"left"
In [26]: s2 = u"right"
In [27]: s1 + s2
Out[27]: u'leftright'
In [28]: (s1 + s2) * 3
Out[28]: u'leftrightleftrightleftright'
Since slicing returns a new sequence, this applies to slices as well. This fact can allow for some very concise code.
For example (from CodingBat) let's assume you need to create a new string that contains three repetitions of a given string. But if the given string is longer than three characters, you only want to use the first three.
A not-particularly-Pythonic solution to the problem might look like this:
def front3(str):
    if len(str) < 3:
        return str+str+str
    else:
        return str[:3]+str[:3]+str[:3]
But the truly Pythonic programmer can express the same thing this way:
def front3(str):
    return str[:3] * 3
Length¶
Sequences have length.
To get the length of a sequence we use the len builtin (py3).
In [36]: s = u"how long is this, anyway?"
In [37]: len(s)
Out[37]: 25
Because of zero-based indexing, you must remember that the last index in a sequence is always len(s) - 1:
In [38]: count = len(s)
In [39]: s[count]
------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-39-5a33b9d3e525> in <module>()
----> 1 s[count]
IndexError: string index out of range
But honestly, using that is not Pythonic anyway.
Always use seq[-1] to find the last item in a sequence.
If you care (and some do) about why Python uses len(x) instead of x.length(), you can read this post with an explanation of the rationale from BDFL Guido van Rossum.
Miscellaneous¶
There are a few other common operations (py3) on sequences you’ll want to know about.
The min (py3) and max (py3) builtins work as you might expect:
In [42]: all_letters = u"thequickbrownfoxjumpedoverthelazydog"
In [43]: min(all_letters)
Out[43]: u'a'
In [44]: max(all_letters)
Out[44]: u'z'
The index method returns the position of an object in a sequence.
If the object is not in the sequence, this raises a ValueError:
In [46]: all_letters.index(u'd')
Out[46]: 21
In [47]: all_letters.index(u'A')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-47-2db728a46f78> in <module>()
----> 1 all_letters.index(u'A')
ValueError: substring not found
Finally, the count method will count the total number of occurrences of an object within a sequence.
With strings, the object can be a single letter, or a substring.
With the count method, if the object is not in the sequence, then no error is raised.
The return value is 0:
In [52]: all_letters.count(u'o')
Out[52]: 4
In [53]: all_letters.count(u'the')
Out[53]: 2
In [54]: all_letters.count(u'A')
Out[54]: 0
Iteration¶
Repetition, Repetition, Repetition, Repe...
For Loops¶
We’ve already seen simple iteration over a sequence using for ... in:
In [170]: for x in "a string":
   .....:     print(x)
   .....:
a
s
t
r
i
n
g
Other languages build and use an index, which is then used to extract each item from the sequence:
for (var i = 0; i < arr.length; i++) {
    var value = arr[i];
    console.log(i + ") " + value);
}
Python does not require this.
But if you need to have the index for some reason, you can use the enumerate builtin (py3):
In [140]: for idx, letter in enumerate(u'Python'):
   .....:     print(idx, letter, end=' ')
   .....:
0 P 1 y 2 t 3 h 4 o 5 n
We’ve seen how the range function (it’s a type in Python 3) can be useful for looping a known number of times.
This is especially true when you don’t care about the value of the item from the sequence:
In [171]: for i in range(5):
   .....:     print('hello')
   .....:
hello
hello
hello
hello
hello
Remember that in Python, loops do not create a local namespace. The loop variable you use is still in scope after the loop terminates:
In [172]: x = 10
In [173]: for x in range(3):
   .....:     pass
   .....:
In [174]: x
Out[174]: 2
Loop control¶
Sometimes you want to interrupt or alter the flow of control through a loop.
Loops can be controlled in two ways, with break and continue.
The break statement causes a loop to terminate immediately:
In [141]: for i in range(101):
   .....:     print(i, end=' ')
   .....:     if i > 50:
   .....:         break
   .....:
0 1 2 3 4 5... 46 47 48 49 50 51
And continue returns you immediately to the head of the loop.
It allows you to skip statements later in the loop block while continuing the loop itself:
In [143]: for i in range(101):
   .....:     if i > 50:
   .....:         break
   .....:     if i < 25:
   .....:         continue
   .....:     print(i, end=' ')
   .....:
25 26 27 28 29 ... 41 42 43 44 45 46 47 48 49 50
An interesting feature of Python loops is that there is an optional else clause.
The statements in this optional block are only executed if the loop exits normally.
That means only if break was not used to stop iteration:
In [147]: for x in range(10):
   .....:     if x == 11:
   .....:         break
   .....: else:
   .....:     print(u'finished')
finished
In [148]: for x in range(10):
   .....:     if x == 5:
   .....:         print(x)
   .....:         break
   .....: else:
   .....:     print(u'finished')
5
This can be surprisingly useful, even if the name is a bit hard to remember.
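A typical use is a search loop, where the else block handles the "nothing found" case. The sketch below is illustrative only; the function name first_factor is made up, and it assumes n is an integer of at least 2.

```python
def first_factor(n):
    """Return the smallest factor of n greater than 1, or None if there is none.

    Assumes n >= 2 (an assumption of this sketch).
    """
    for candidate in range(2, n):
        if n % candidate == 0:
            break        # found a factor, so the else block is skipped
    else:
        return None      # the loop ran to completion without a break
    return candidate


print(first_factor(15))
print(first_factor(13))
```

Without the else clause you would need a separate flag variable to remember whether the loop found anything.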
While Loops¶
The while keyword is for when you don’t know how many loops you need.
It continues to execute the body until the condition is no longer True:
while a_condition:
    some_code
    in_the_body
While loops are more general than for loops.
You can always express a for loop using the while structure, but the reverse is not always true.
On the other hand, while is more error prone.
You must remember to make progress in the body of the loop in order to allow the condition to become False.
Otherwise you can fall victim to infinite loops:
i = 0
while i < 5:
    print(i)  # i never changes, so this loop never ends
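To illustrate the claim that a for loop can always be written as a while loop, here is a minimal sketch of the same work done both ways. The variable names are made up for the example.

```python
words = [u"spam", u"eggs", u"ham"]

# for loop: Python advances through the sequence for us
for_result = []
for word in words:
    for_result.append(word.upper())

# while loop: we manage the index and must advance it ourselves
while_result = []
i = 0
while i < len(words):
    while_result.append(words[i].upper())
    i += 1  # this is the "progress" step; without it the loop never ends

print(for_result == while_result)
```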
There are three approaches to terminating a while loop.
You can use the break statement to end iteration:
In [150]: while True:
   .....:     i += 1
   .....:     if i > 10:
   .....:         break
   .....:     print(i, end=' ')
   .....:
1 2 3 4 5 6 7 8 9 10
Another approach is to set a flag variable.
The boolean value of this variable starts as True.
Operations inside the loop update it to False, terminating the loop:
In [156]: import random
In [157]: keep_going = True
In [158]: while keep_going:
   .....:     num = random.choice(range(5))
   .....:     print(num)
   .....:     if num == 3:
   .....:         keep_going = False
   .....:
3
Finally, you can use a straight conditional statement as the test.
Here, you update the value of the test variable such that the condition will evaluate to False:
In [161]: while i < 10:
   .....:     i += random.choice(range(4))
   .....:     print(i)
   .....:
0 0 2 3 4 6 8 8 8 9 12
Similarities¶
Both for and while loops can use break and continue for internal flow control.
Both for and while loops can have an optional else block.
In both loops, the statements in the else block are only executed if the loop terminates normally (no break).
String Features¶
Fun with Strings
Unicode v. Bytes¶
Python has two string types: byte strings and unicode objects.
Unicode is a classification system intended to allow a representation of all possible characters in all possible languages. Each character has a code point, a number that identifies that character. An encoding such as UTF-8 turns each code point into a byte or bytes. When printed, these code points are translated into appropriate glyphs by the operating system.
When working in Python, you should always handle text as unicode objects.
Text can be defined as any string meant to be read by a human via some output device.
Handling of unicode and bytes in Python3 is significantly different from Python2. In order to create compatible code (that will run the same in both systems), you should use one of the following two strategies:
You can import unicode_literals from the __future__ library.
This must be the first line of code in your Python module.
from __future__ import unicode_literals
'this is a unicode string with élan'
Another approach is to be explicit about what type of string you are writing, using object literals:
u'this is a unicode string with élan'
The former strategy is a bit easier, but is not always safe in older legacy code bases, as it is an all-or-nothing operation. It makes every single string in the file a unicode object. The latter strategy is safer in this respect, as you get to choose which is which.
You can read more about compatible string handling at the Python-Future website.
Byte strings are strings that are composed entirely of numbers.
This can be a bit confusing because they often appear to be letters.
The string b"a" appears to contain the letter a, but really it contains the number 97 (or 01100001).
Your terminal, your text editor, your OS is responsible for translating those numbers into characters when showing you the content of the string.
But it’s still the number underneath.
Be cautious about your assumptions.
Again, you have two strategies to work with bytestrings safely in Python 2 and Python 3.
You can import unicode_literals and then specifically mark certain strings as bytestrings.
Or you can mark certain strings as bytestrings.
In either case, you have to mark bytestrings:
from __future__ import unicode_literals
b'polishing my resum\xc3\xa9 this week'
b'polishing my resum\xc3\xa9 this week'
The conversion of bytes to unicode and vice-versa should always take place at the I/O boundary. That means on the point where data is passing out of Python to the filesystem or network. Or the point where data enters Python from the filesystem or network.
At the point of crossing outbound, we can use the encode method of unicode objects to convert them to bytes.
The argument to this function controls which codec is used to make the conversion.
UTF8 is the most common codec in web work.
In [1]: fancy = u"Resumé"
In [2]: fancy
Out[2]: 'Resumé'
In [3]: fancy.encode('utf8')
Out[3]: b'Resum\xc3\xa9'
When data is inbound to Python, we can use the decode method of a byte string to convert it to Unicode.
Again, passing a codec name selects which should be used for the conversion:
In [4]: bytes = _
In [5]: bytes
Out[5]: b'Resum\xc3\xa9'
In [6]: bytes.decode('utf8')
Out[6]: 'Resumé'
If no codec is specified, Python falls back to a default encoding.
In Python 2 this is usually ascii and is almost never the thing you really want; in Python 3 the encode and decode methods default to utf-8.
Be specific.
In Python 2, conversion of bytes to unicode and back was one of the largest sources of problems in programs.
Both the encode and decode methods were supported by both byte strings and unicode objects.
This led to a lot of implicit conversion, which of course uses default encoding.
It’s very easy when working entirely in English to have these types of problems and not know about them. If the characters in a string fall entirely within the ascii set, then no errors will occur. But as soon as characters beyond ascii are used, all sorts of trouble pops up.
Watch for UnicodeDecodeError and UnicodeEncodeError and write tests that use non-ascii characters.
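As a rough sketch of what such a test might exercise, the snippet below round-trips a non-ascii string through utf8 and shows both error types appearing when the ascii codec is used instead. The variable names are made up for illustration.

```python
text = u"Resum\xe9"  # the same string as u"Resumé", written with an escape

# Outbound: unicode -> bytes, naming the codec explicitly
data = text.encode('utf8')

# Inbound: bytes -> unicode, again naming the codec
round_trip = data.decode('utf8')

# A codec that cannot represent the character fails loudly in both directions.
try:
    text.encode('ascii')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

try:
    data.decode('ascii')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(round_trip == text, encode_failed, decode_failed)
```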
String Manipulation¶
You can break strings apart using the split (py3) method.
You have to make sure that the string you are splitting and the string you are using to split it are of the same type (bytes or unicode).
The result is a list of the pieces:
In [167]: csv = "comma, separated, values"
In [168]: csv.split(', ')
Out[168]: ['comma', 'separated', 'values']
In the other direction, calling the join (py3) method will connect a sequence of pieces using the string on which it is called:
In [169]: psv = '|'.join(csv.split(', '))
In [170]: psv
Out[170]: 'comma|separated|values'
There are methods that allow us to change the case of text:
In [171]: sample = u'A long string of words'
In [172]: sample.upper()
Out[172]: u'A LONG STRING OF WORDS'
In [173]: sample.lower()
Out[173]: u'a long string of words'
In [174]: sample.swapcase()
Out[174]: u'a LONG STRING OF WORDS'
In [175]: sample.title()
Out[175]: u'A Long String Of Words'
And there are methods that allow us to test the nature of the characters in the text:
In [181]: number = u"12345"
In [182]: number.isnumeric()
Out[182]: True
In [183]: number.isalnum()
Out[183]: True
In [184]: number.isalpha()
Out[184]: False
In [185]: fancy = u"Th!$ $tr!ng h@$ $ymb0l$"
In [186]: fancy.isalnum()
Out[186]: False
Every character in a string has a numeric value.
To see this value, use the ord (py3) builtin.
The chr (py3) builtin reverses the process:
In [109]: for i in 'Cris':
   .....:     print(ord(i), end=' ')
67 114 105 115
In [110]: for i in (67,114,105,115):
   .....:     print(chr(i), end=' ')
C r i s
Building Strings¶
The concatenation operator + works for building strings out of fragments.
But it’s not an efficient way to work.
Avoid it.
Instead, use string formatting:
'Hello {0}!'.format(name)
It’s faster, and easier to maintain over time.
When building a format string, the placeholder is a pair of curly braces.
They can be empty, but it’s better to put an integer into one, indicating the index of the argument to format (py3) to use.
You can also pass keyword arguments to format, if the placeholders contain names instead of integers:
"My name is {1} {0}".format('Ewing', 'Cris')
"The {name} are {status}!".format(
name='Seahawks', status='awesome'
)
Especially in legacy code you will see another method of formatting, using the % operator.
"This is a %s %s" % ('format', 'template')
This is still a functioning alternative and there is no pressing need to update.
But you should prefer the new style in writing new code.
The only dividing line is that the % operator supports both bytes and unicode objects, whereas in Python 3, .format is only a method on unicode objects.
There is a good website available that will help you learn everything you want to know about the formatting mini-language you can use to control these format specifiers.
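To give a flavor of that mini-language, here are a few common format specifiers. This is a small sketch, far from complete; the variable names are made up for illustration.

```python
# Everything after the colon inside the braces is the format specifier.
padded = '{0:>8.2f}'.format(3.14159)       # right-aligned, width 8, 2 decimals
zero_filled = '{0:05d}'.format(42)         # zero-padded to width 5
grouped = '{0:,}'.format(1234567)          # comma as a thousands separator
labeled = '{name:<6}|'.format(name='ab')   # left-aligned in a field of 6

print(repr(padded))
print(repr(zero_filled))
print(repr(grouped))
print(repr(labeled))
```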
Dictionaries and Sets¶
Dictionaries in Python are a mapping of keys to values. In other languages, they are called:
- associative array
- map
- hash table
- hash
- key-value pair
The correct name of the type in Python is dict (py3).
You can build a new dict in a number of ways.
You can use the object literal:
{'key1': 3, 'key2': 5}
You can call the dict type object with a sequence of two-tuples.
The first in each will become the key, the second the value:
>>> dict([('key1', 3),('key2', 5)])
{'key1': 3, 'key2': 5}
You can also use keyword arguments to the dict type object.
In this case, you are limited to keys which are legal Python names:
>>> dict(key1=3, key2=5)
{'key1': 3, 'key2': 5}
Indexing¶
To look up a value in a dict, we use the subscription operator, just like with sequences:
>>> d = {'name': 'Brian', 'score': 42}
>>> d['score']
42
>>> d = {1: 'one', 0: 'zero'}
>>> d[0]
'zero'
If you provide a key that is not in the dictionary, a KeyError is raised:
>>> d['non-existing key']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'non-existing key'
In a certain sense, Python is built on dicts.
Namespaces are implemented as dicts.
For this reason, the performance of lookup is highly optimized.
Lookup time for any object is constant, regardless of the size of the dict.
When storing a value in a dict, you use a key.
This key can be any immutable object (more on that later).
In actuality, any object that is hashable can be used.
What does that mean, though?
Hashing¶
Hashing is the process of converting arbitrarily large data to a small proxy (usually an integer).
You can use any number of different algorithms to do this, MD5, SHA, etc.
The key (if you’ll forgive the pun) is that the algorithm must always return the same proxy for the same input.
In a dict, keys are hashed to an integer proxy, which is used to find a location in an array behind the scenes.
This is efficient because a good hashing algorithm means only a very few key/value pairs correlate to any proxy.
What would happen if the proxy changed after a value was stored in the dict?
Hashability requires that the object being hashed be immutable.
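The difference is easy to see by trying a tuple and a list as keys. This is a minimal sketch; the dictionary and its contents are made up for illustration.

```python
locations = {}

# A tuple of immutable values is hashable, so it works as a key.
locations[(47.6, -122.3)] = u"Seattle"

# A list is mutable and therefore unhashable; using it as a key fails.
try:
    locations[[47.6, -122.3]] = u"Seattle"
    list_key_rejected = False
except TypeError:
    list_key_rejected = True

# Equal objects hash equally, which is what lets a lookup find the
# same location again later.
same_hash = hash((47.6, -122.3)) == hash((47.6, -122.3))

print(list_key_rejected, same_hash)
```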
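Historically, dicts were unordered collections.
When you print them out, or look at them in the interpreter, this is not apparent.
You will be fooled into thinking that you can rely on the order of the pairs.
In Python 2, and in Python 3 before version 3.7, this is not true; only from Python 3.7 on does a dict guarantee that it preserves insertion order.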
In [352]: d = {'one':1, 'two':2, 'three':3}
In [353]: d
Out[353]: {'one': 1, 'three': 3, 'two': 2}
In [354]: d.keys()
Out[354]: ['three', 'two', 'one']
Iteration and Dicts¶
You can use a dict with a for loop.
By default, the keys are what are iterated over.
In [15]: d = {'name': 'Brian', 'score': 42}
In [16]: for x in d:
   ....:     print(x)
   ....:
score
name
If you want to iterate over values, or perhaps over the key/value pairs in the dict, there are methods to support that.
In [2]: d.keys()
Out[2]: dict_keys(['score', 'name'])
In [3]: d.values()
Out[3]: dict_values([42, 'Brian'])
In [4]: d.items()
Out[4]: dict_items([('score', 42), ('name', 'Brian')])
In Python 2, there were nine methods on dicts that supplied these behaviors.
The keys, values and items methods returned lists.
The iter... methods (iterkeys, etc.) returned iterators, which were much more efficient for large dicts.
The view... methods (viewkeys, etc.) returned dict views, which behaved as iterators, but also updated themselves as the dictionary changed.
In Python 3, the three remaining methods operate like the last of those. To get semantically equivalent code in Python 3, use the following map:
Python 2 | Python 3 |
---|---|
d.keys() | list(d.keys()) |
d.values() | list(d.values()) |
d.items() | list(d.items()) |
d.iterkeys() | iter(d.keys()) |
d.itervalues() | iter(d.values()) |
d.iteritems() | iter(d.items()) |
d.viewkeys() | d.keys() |
d.viewvalues() | d.values() |
d.viewitems() | d.items() |
You should also refer to the Python-Future website for additional compatible idioms.
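The difference between a view and a list copy shows up as soon as the dict changes. The sketch below assumes Python 3, where keys() returns a view; the variable names are made up for illustration.

```python
d = {'name': 'Brian'}

live_keys = d.keys()        # a view in Python 3
snapshot = list(d.keys())   # a plain list, fixed at this moment

d['score'] = 42

# The view follows the dict; the list copy does not.
view_sees_change = 'score' in live_keys
snapshot_sees_change = 'score' in snapshot

print(view_sees_change, snapshot_sees_change)
```

This is why the table maps the Python 2 list-returning methods to an explicit list(...) call in Python 3.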
Performance¶
Dictionaries are optimized for inserting and retrieving values:
- indexing is fast and constant time: O(1)
- membership (x in s) is constant time: O(1)
- visiting all is proportional to n: O(n)
- inserting is constant time: O(1)
- deleting is constant time: O(1)
More on what exactly that means soon.
Miscellaneous¶
You can find all the methods of the dict type in the Python standard library documentation.
But here are a number of interesting methods you may find useful:
Membership (on keys):
In [5]: d
Out[5]: {'that': 7, 'this': 5}
In [6]: 'that' in d
Out[6]: True
In [7]: 'this' not in d
Out[7]: False
The get method (py3) allows you to get a value or returns a default if the key you seek is not in the dict.
The default value returned is None, but you can control it.
It has the advantage of never causing a KeyError:
In [9]: d.get('this')
Out[9]: 5
In [11]: d.get(u'something', u'a default')
Out[11]: u'a default'
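To remove a key/value pair from a dict, we use the pop method (py3).
It takes the key as a required argument, plus an optional default.
The value corresponding to the key is returned and the key/value pair is removed.
If the key is missing, the default is returned when one was supplied; otherwise a KeyError is raised.
To remove a pair without naming a key, use the popitem method, which removes one key/value pair (an arbitrary one in older Pythons, the most recently inserted one since Python 3.7) and returns it as a tuple.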
In [19]: d.pop('this')
Out[19]: 5
In [20]: d
Out[20]: {'that': 7}
In [23]: d.popitem()
Out[23]: ('that', 7)
In [24]: d
Out[24]: {}
One of the most useful methods on the dict type is setdefault (py3).
You pass it a key and a default value.
If the key is present in the dict, the stored value is returned.
If the key is not present, then the default value is stored and returned.
In [26]: d = {}
In [27]: d.setdefault(u'something', u'a value')
Out[27]: u'a value'
In [28]: d
Out[28]: {u'something': u'a value'}
In [29]: d.setdefault(u'something', u'a different value')
Out[29]: u'a value'
In [30]: d
Out[30]: {u'something': u'a value'}
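A common place where setdefault pays off is grouping items under a key. The following is a small sketch with made-up data, grouping words by their first letter:

```python
words = [u"apple", u"avocado", u"banana", u"blueberry", u"cherry"]

by_letter = {}
for word in words:
    # The first time a letter is seen, an empty list is stored for it.
    # Either way, setdefault returns the stored list, so we can append.
    by_letter.setdefault(word[0], []).append(word)

print(by_letter[u"a"])
print(by_letter[u"c"])
```

Without setdefault, the loop body would need an explicit "is this key present yet?" check before each append.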
Sets¶
A set is an unordered collection of distinct values.
You can think of a set as a dict which has only keys and no values.
You can create a set using the set literal ({1, 2, 3}) or the set type object.
An empty pair of braces ({}) creates an empty dict, so use set() to create an empty set:
In [4]: {1, 2, 3}
Out[4]: {1, 2, 3}
In [5]: set()
Out[5]: set()
In [6]: set([1, 2, 3])
Out[6]: {1, 2, 3}
In [7]: {1, 2, 3}
Out[7]: {1, 2, 3}
In [8]: s = set()
In [9]: s.update([1, 2, 3])
In [10]: s
Out[10]: {1, 2, 3}
In [11]: s.add(4)
In [12]: s
Out[12]: {1, 2, 3, 4}
Sets share a lot of properties with dicts. Members of a set must be hashable, like dictionary keys, and for the same reason. Sets are also unordered, and so you cannot index them:
>>> s[1]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'set' object does not support indexing
Sets support similar operations to dicts as well.
In [1]: s = set([1])
In [2]: s.pop()
Out[2]: 1
In [3]: s.pop()
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-3-e76f41daca5e> in <module>()
----> 1 s.pop()
KeyError: 'pop from an empty set'
In [4]: s = set([1,2,3])
In [5]: s.remove(2)
In [6]: s.remove(2)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-6-542ac1b736c7> in <module>()
----> 1 s.remove(2)
KeyError: 2
Beyond all this, sets also operate as traditional mathematical sets. You get all the operations you might remember from set theory class:
s.isdisjoint(other)
s.issubset(other)
s.union(other, ...)
s.intersection(other, ...)
s.difference(other, ...)
s.symmetric_difference(other)
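Here is a brief sketch of those operations on two small sets. The set names and contents are made up for illustration.

```python
evens = {0, 2, 4, 6, 8}
small = {0, 1, 2, 3}

both = evens.intersection(small)                  # members of both sets
either = evens.union(small)                       # members of at least one set
only_evens = evens.difference(small)              # in evens but not in small
exactly_one = evens.symmetric_difference(small)   # in one set but not both

no_overlap = evens.isdisjoint({1, 3, 5})          # True when nothing is shared
contained = {2, 4}.issubset(evens)                # True when all members are in evens

print(sorted(both), sorted(only_evens), no_overlap, contained)
```

Each of these methods returns a new object and leaves the original sets unchanged.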
Finally, if you need to have an immutable object that functions like a set, Python provides the frozenset type.
It works just like a set, except that once constructed it may not be altered:
>>> fs = frozenset((3,8,5))
>>> fs.add(9)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'frozenset' object has no attribute 'add'
File Reading and Writing¶
It’s often useful in programming to be able to open files in order to read data from them, or write data to them.
Python has, for a long time, had a built-in function open which handled this operation.
However, this builtin was created in the days before unicode was widely used, and it does not handle text that contains unicode particularly well.
For this reason, we have moved away from using the built-in open and toward using the open function from the io module.
The io.open function is available in both Python 2 (2.6 and 2.7) and Python 3 and so provides a cross-compatible approach to opening files.
In Python 3, the built-in open is the same function as io.open.
import io
f = io.open('secrets.txt', encoding='utf-8')
secret_data = f.read()
f.close()
By default, files are opened in “read text” mode, which automatically decodes the bytes contained in the file to unicode.
You may provide an encoding keyword argument to control the “codec” that is used to perform this conversion.
The most common codec you will use is “utf-8”.
If you do not provide an encoding, then io.open falls back to a platform-dependent default (the value of locale.getpreferredencoding()), which may not match the file you are reading.
If you have needs other than reading unicode text, you may use the mode argument ('rb' below) to control aspects of your interaction with the file.
For example, the rb mode opens a file for reading bytes.
When you read data from a file opened in rb mode, it will be a bytestring.
If you need to convert it to unicode, you can call decode on it at that point.
f = io.open('secrets.bin', 'rb')
secret_data = f.read()
f.close()
There are a number of modes available for files. Unfortunately, Python’s own documentation of the meaning of these modes is not very clear. You can use the man page for the C library function fopen, which supports the same modes and has much better information. One thing you should be careful of: merely opening a file in w mode will always truncate the file, rendering it empty.
When you open a file with io.open it defaults to using “Universal Newline” mode.
This means that while some operating systems use different characters as line endings, Python will always translate them into the *nix-style "\n" when you read data from the file.
The "\n" characters are translated back to OS-Native line endings when you write data out to a file.
You should always use the "\n" character as a line ending when writing strings in Python.
You should be aware that although there is no difference between reading in bytes or text mode on Unix operating systems, the two are quite different in Windows.
You’ll be tempted to accept the default and always open files in text mode.
This will break binary files on Windows.
Don’t do it.
Get in the habit of thinking carefully about whether the content of the file you are reading is text or binary data.
Beyond the mode and encoding parameters we’ve discussed, there are a number of other parameters to the io.open function.
The required first argument is the path to the file you wish to open.
The errors parameter allows you to control how errors in decoding file bytes to Python unicode are handled.
You may specify that you wish to ignore errors, replace broken characters with a specific identifier, or use strict mode to force errors to terminate reading.
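The three error handlers can be compared on a file that contains a byte which is not valid utf-8. This is a sketch only: the scratch file and the read_with helper are made up for the example, and the file is created in a temporary location and removed afterwards.

```python
import io
import os
import tempfile

# Write bytes that are valid latin-1 but invalid utf-8 to a scratch file.
fd, path = tempfile.mkstemp()
os.close(fd)
f = io.open(path, 'wb')
f.write(b'caf\xe9!')
f.close()


def read_with(errors):
    """Read the scratch file as utf-8 using the given error handler."""
    f = io.open(path, encoding='utf-8', errors=errors)
    try:
        return f.read()
    finally:
        f.close()


ignored = read_with('ignore')     # the bad byte is silently dropped
replaced = read_with('replace')   # the bad byte becomes U+FFFD

try:
    read_with('strict')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

os.remove(path)
print(repr(ignored), repr(replaced), strict_failed)
```

Note that ignore and replace both lose information, so strict is usually the safer starting point while you work out what encoding a file really uses.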
The other parameters are more advanced and will not come up often in your work with files.
You may read about them in the io module documentation (py3).
Once you have an open file, you can read from it.
The read method accepts an optional argument giving the number of characters (in text mode) or bytes (in binary mode) to read.
If you provide no value, the entire file (starting at your current position) is read.
header_size = 4096
f = io.open('secrets.txt', encoding='utf-8')
secret_header = f.read(header_size)
secret_rest = f.read()
f.close()
Files are iterators, which means you can iterate through the lines of text they contain like so:
for line in io.open('secrets.txt'):
    print(line)
In addition, the readline method will read a single line at a time.
The readlines method will read a file and return a list containing the lines of the file.
If you wish to write text to a file you’ve opened in w mode, you may do so with the write method.
Newlines are not automatically appended to text written this way.
If you want lines in your file, you must write the newline characters yourself (or place them in your text):
outfile = io.open('output.txt', 'w')
for i in range(10):
    outfile.write(u"this is line: {0}\n".format(i))
outfile.close()
When you’ve opened a file in a “read” mode, you have the following methods available:
f.read()
f.readline()
f.readlines()
In “write” mode, you have rough equivalents.
The write method writes a string to an open file.
The writelines method takes a sequence of strings and writes them to a file.
Remember, newlines are not automatically added (despite the name).
f.write(str)
f.writelines(seq)
In any mode, a file has a few methods you can use to navigate through the file.
The file.seek(offset) method will move the file pointer to the byte of the file given by ‘offset’.
The file.tell() method will return the byte number of the current position of the file pointer.
f.seek(offset)
f.tell()
Finally, whenever you open a file, you must also close it. On certain operating systems, if you fail to do so it can render the file unusable by any other process.
file.close()
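One common way to make sure close always happens is to open the file in a with block, which closes the file when the block ends. This goes slightly beyond what is covered above, so treat it as a sketch; the scratch path is made up so the example does not touch real files.

```python
import io
import os
import tempfile

# Scratch location for the example.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# The file is closed when the block ends, even if an exception is raised.
with io.open(path, 'w', encoding='utf-8') as outfile:
    for i in range(3):
        outfile.write(u"this is line: {0}\n".format(i))

with io.open(path, encoding='utf-8') as infile:
    lines = infile.readlines()

# No explicit close() call was needed.
closed_after_block = infile.closed
print(lines, closed_after_block)
```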
In Python, we say that anything which implements both the read and write method is File-like.
There are a number of types which are file-like:
- loggers
- sys.stdout
- urllib.request.urlopen() (urllib.urlopen() in Python 2)
- pipes, subprocesses
- StringIO
When you have a file-like object you can treat it as if it were a file.
A common use-case for this involves using the io.StringIO class.
This class constructs an in-memory buffer that operates just like a file:
In [417]: from io import StringIO
In [420]: f = StringIO()
In [421]: f.write(u"somestuff")
In [422]: f.seek(0)
In [423]: f.read()
Out[423]: 'somestuff'
When writing tests for file-handling code this can be very useful. It allows you to make “fake files” that operate just like the real thing.
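As a sketch of that idea, the function below accepts any file-like object, so a test can hand it a StringIO instead of a real file. The function name count_lines is made up for illustration.

```python
from io import StringIO


def count_lines(file_like):
    """Count the lines in any object that can be iterated line by line."""
    return sum(1 for _ in file_like)


# In a test, a StringIO stands in for a file on disk.
fake_file = StringIO(u"first\nsecond\nthird\n")
line_count = count_lines(fake_file)
print(line_count)
```

The same function works unchanged when it is given an object returned by io.open.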
Legacy code will often contain references to modules named StringIO or cStringIO.
These modules should be considered superseded by the io.StringIO class.
Paths and Directories¶
In Python, paths are often handled with simple strings (or Unicode strings). You can make absolute paths:
b"/home/cris/stuff.txt"
u'/usr/local/bin/python3'
or relative paths:
u'./secret.txt'
b'src/test_ack.py'
Either relative or absolute paths, bytes or unicode objects will work as the path argument to io.open().
The os module from the Python standard library gives you a number of useful tools for interacting with paths.
You can get the current working directory with os.getcwd.
You can change directories with os.chdir.
You can turn any relative path into an absolute path with os.path.abspath.
You can obtain the relative path from your current location to any absolute path with os.path.relpath().
os.getcwd() -- os.getcwdu() (u for Unicode)
os.chdir(path)
os.path.abspath()
os.path.relpath()
It’s possible to list the contents of a directory with os.listdir.
You can make new directories with os.mkdir.
You can even walk an entire file system using os.walk.
There’s much, much more to learn. Check out the documentation (py3).
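To see os.mkdir, os.listdir and os.walk working together, the sketch below builds a tiny directory tree in a temporary location and then explores it. The file and directory names are made up for illustration.

```python
import io
import os
import tempfile

# Build a small scratch tree:  root/notes.txt  and  root/sub/data.txt
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, 'sub'))
for relative in ('notes.txt', os.path.join('sub', 'data.txt')):
    f = io.open(os.path.join(root, relative), 'w', encoding='utf-8')
    f.write(u"placeholder\n")
    f.close()

# os.listdir only looks at one directory level.
top_level = sorted(os.listdir(root))

# os.walk yields (directory, subdirectory names, file names) for every
# directory beneath the starting point.
found = []
for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        found.append(os.path.relpath(os.path.join(dirpath, name), root))
found.sort()

print(top_level)
print(found)
```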
Finally, if you’d prefer to work with paths in an object-oriented style, the Python 3 standard library has added a new module, pathlib.
This module can also be pip installed in Python 2 for compatibility.
With the module, you can create paths as objects, and then work with methods on them.
This allows you access to all the operations in os.path and more.
In [1]: import pathlib
In [2]: pth = pathlib.Path('.')
In [3]: pth.is_dir()
Out[3]: True
In [4]: pth.absolute()
Out[4]: PosixPath('/Users/nmhw/projects/training/codefellows/existing_course_repos/python-dev-accelerator')
In [5]: for f in pth.iterdir():
   ...:     print(f)
   ...:
.git
.gitignore
bin
build
cfpython.sublime-project
...