A quick run-down of Unicode, its use in Python 2, and some of the gotchas that arise.
The Unicode idea is pretty simple: one “code point” for every character in every language.
A good start:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html
Everything is Bytes
Unicode is a biggie
(actually, dealing with numbers rather than bytes is big – but we take that for granted)
Py2 strings are sequences of bytes
Unicode strings are sequences of platonic characters
It’s almost one code point per character – but there are complications with combined characters: accents, etc.
Platonic characters cannot be written to disk or network!
(ANSI: one character == one byte – so easy!)
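The “almost one code point per character” caveat can be seen with the standard unicodedata module. This is only a minimal sketch: the variable names are made up for illustration, and it is written so that it should run on both Python 2.7 and Python 3.

```python
import unicodedata

# Two ways to spell "e with an acute accent"
composed = u"\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
combined = u"e\u0301"     # two code points: "e" + COMBINING ACUTE ACCENT

# They look the same on screen, but are different sequences of code points
assert len(composed) == 1
assert len(combined) == 2
assert composed != combined

# Normalizing to NFC composes the pair into the single code point
normalized = unicodedata.normalize("NFC", combined)
assert normalized == composed
```

So comparing or measuring text by code points can surprise you unless both sides are normalized the same way.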
Python 2 has two types that let you work with text: str and unicode.
And two ways to work with binary data: str and bytes.
but:
In [86]: str is bytes
Out[86]: True
bytes is there for py3 compatibility, but it’s good for making your intentions clear, too.
The unicode object lets you work with characters
It has all the same methods as the string object.
“encoding” is converting from a unicode object to bytes
“decoding” is converting from bytes to a unicode object
(sometimes this feels backwards...)
Built in functions
ord()
chr()
unichr()
str()
unicode()
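A short sketch of how ord(), chr() and unichr() relate to each other. unichr() only exists in Python 2, so the fallback at the top is an assumption added to let the same snippet run on Python 3 as well.

```python
# unichr() exists only in Python 2; on Python 3, chr() does the same job.
try:
    unichr
except NameError:
    unichr = chr

code_point = ord(u"\u222b")     # unicode character -> integer code point
assert code_point == 0x222B

integral = unichr(0x222B)       # integer code point -> unicode character
assert integral == u"\u222b"

# ord() and chr() also work on plain ASCII characters
assert ord("A") == 65
assert chr(65) == "A"
```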
The codecs module
import codecs
codecs.encode()
codecs.decode()
codecs.open() # better to use ``io.open``
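The function forms in codecs do the same job as the .encode() and .decode() methods. A minimal round trip, assuming UTF-8 as the encoding (the variable names are just for illustration):

```python
import codecs

text = u"\u222e\u222bx\u00b2"               # contour integral, integral, x, superscript two

data = codecs.encode(text, "utf-8")         # unicode -> bytes
assert isinstance(data, bytes)              # bytes is str on Python 2

round_trip = codecs.decode(data, "utf-8")   # bytes -> unicode
assert round_trip == text
```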
Encoding
In [17]: u"this".encode('utf-8')
Out[17]: 'this'
In [18]: u"this".encode('utf-16')
Out[18]: '\xff\xfet\x00h\x00i\x00s\x00'
Decoding
In [99]: print '\xff\xfe."+"x\x00\xb2\x00'.decode('utf-16')
∮∫x²
# -*- coding: utf-8 -*-
print u"The integral sign: \u222B"
print u"The integral sign: \N{integral}"
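If you have a character and want its name (or the other way around), the standard unicodedata module can help. A small sketch, again meant to run on both Python 2.7 and 3:

```python
import unicodedata

integral = u"\u222b"

# Character -> official Unicode name
assert unicodedata.name(integral) == "INTEGRAL"

# Name -> character (the same kind of lookup the \N{...} escape relies on)
assert unicodedata.lookup("INTEGRAL") == integral
assert u"\N{INTEGRAL}" == integral
```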
Lots of tables of code points online:
Use unicode objects in all your code
Decode on input
Encode on output
Many packages do this for you: XML processing, databases, ...
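The decode-on-input, encode-on-output pattern can be sketched like this. The incoming bytes are made up, and UTF-8 is an assumption: in real code you need to know (or be told) what encoding your input actually uses.

```python
# Hypothetical input: bytes as they might arrive from a file or socket,
# assumed (for this sketch) to be UTF-8 encoded.
incoming = b"caf\xc3\xa9"

# 1. Decode on input: bytes -> unicode
text = incoming.decode("utf-8")

# 2. Work with unicode objects inside the program
shouted = text.upper()
assert len(shouted) == 4          # four characters, although five bytes came in

# 3. Encode on output: unicode -> bytes
outgoing = shouted.encode("utf-8")
assert outgoing == b"CAF\xc3\x89"
```

Everything between steps 1 and 3 only ever sees unicode objects, so there is exactly one place where the encoding matters on the way in and one on the way out.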
Gotcha:
Python has a default encoding (usually ascii)
In [2]: sys.getdefaultencoding()
Out[2]: 'ascii'
The default encoding will get used in unexpected places!
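One of those unexpected places is mixing the two string types. In Python 2, an expression like u"I like " + some_byte_string quietly decodes the byte string with the default encoding first. The sketch below spells out that hidden step explicitly (assuming the default encoding is ascii), so it behaves the same on Python 2.7 and 3:

```python
utf8_bytes = b"caf\xc3\xa9"     # UTF-8 encoded bytes (a plain str on Python 2)

# This is the step Python 2 performs behind the scenes when a byte string
# is combined with a unicode object: a decode with the default encoding.
try:
    utf8_bytes.decode("ascii")
    implicit_step_failed = False
except UnicodeDecodeError:
    implicit_step_failed = True     # 0xc3 is not a valid ASCII byte

assert implicit_step_failed

# Decoding explicitly with the right encoding avoids the surprise
message = u"I like " + utf8_bytes.decode("utf-8")
assert message == u"I like caf\u00e9"
```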
Python 2.6 and above have a nice feature to make it easier to use unicode everywhere
from __future__ import unicode_literals
After running that line, the u'' is assumed
In [1]: s = "this is a regular py2 string"
In [2]: print type(s)
<type 'str'>
In [3]: from __future__ import unicode_literals
In [4]: s = "this is now a unicode string"
In [5]: type(s)
Out[5]: unicode
NOTE: You can still get py2 strings from other sources!
What encoding should I use???
There are a lot:
http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
But only a couple you are likely to need:
and of course, still the one-byte ones.
UTF-8: probably the one you’ll use most; it is the most common encoding in Internet protocols (XML, JSON, etc.)
Nice properties:
Gotchas:
UTF-16: kind of like UTF-8, except it uses at least 16 bits (2 bytes) for each character, so it is not ASCII compatible.
But it still needs more than two bytes for some code points, so you still can’t process
In C/C++ held in a “wide char” or “wide string”.
MS Windows uses UTF-16, as does (I think) Java.
There is a lot of criticism on the net about UTF-16 – it’s kind of the worst of both worlds:
But to be fair:
Early versions of Unicode: everything fit into two bytes (65536 code points). MS and Java were fairly early adopters, and it seemed simple enough to just use 2 bytes per character.
When it turned out that 4 bytes were really needed, they were kind of stuck in the middle.
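To make the variable-width point concrete, here is a small sketch that counts the bytes each encoding needs for a few sample characters. The samples are arbitrary picks for illustration.

```python
samples = [
    u"a",            # U+0061, plain ASCII
    u"\u00e9",       # U+00E9, e with acute accent
    u"\u221e",       # U+221E INFINITY, still within the first 65536 code points
    u"\U0001F600",   # U+1F600, beyond the first 65536 code points
]

# "utf-16-le" is used instead of "utf-16" so that no 2-byte BOM is prepended
utf8_sizes = [len(s.encode("utf-8")) for s in samples]
utf16_sizes = [len(s.encode("utf-16-le")) for s in samples]

assert utf8_sizes == [1, 2, 3, 4]
assert utf16_sizes == [2, 2, 2, 4]
```

Neither encoding is fixed-width: UTF-8 ranges from one to four bytes here, and UTF-16 needs four bytes as soon as a code point no longer fits in two.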
NOT Unicode:
a 1-byte per char encoding.
Python Docs Unicode HowTo:
http://docs.python.org/howto/unicode.html
“Reading Unicode from a file is therefore simple”
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)
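The same idea with io.open, which behaves the same way on Python 2.7 and 3. This is a sketch only: the file name is made up and the file is written to a temporary directory first so the example is self-contained.

```python
import io
import os
import tempfile

text = u"to \u221e and beyond!"

# Write the text to a temporary file (the path is made up for this sketch)
path = os.path.join(tempfile.mkdtemp(), "unicode_demo.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)            # io.open encodes on the way out

with io.open(path, encoding="utf-8") as f:
    lines = list(f)          # io.open decodes on the way in

assert lines == [text]

# On disk, the file holds the UTF-8 encoded bytes
with io.open(path, "rb") as f:
    assert f.read() == text.encode("utf-8")
```

Note that in text mode io.open insists on unicode objects for write(); passing a py2 string raises a TypeError, which is a handy way to catch mix-ups early.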
file names, etc:
If you pass in unicode, you get unicode
In [9]: os.listdir('./')
Out[9]: ['hello_unicode.py', 'text.utf16', 'text.utf32']
In [10]: os.listdir(u'./')
Out[10]: [u'hello_unicode.py', u'text.utf16', u'text.utf32']
Python deals with the file system encoding for you...
But: some more obscure calls don’t support unicode filenames:
os.statvfs() (http://bugs.python.org/issue18695)
Exception messages:
NOPE: it swallows it instead.
The “string” object is unicode.
Py3 has two distinct concepts:
Everything that’s about text is unicode.
Everything that requires binary data uses bytes.
It’s all much cleaner.
(by the way, the recent implementations are very efficient...)
Find some nifty non-ascii characters you might use.
Read the contents into unicode objects:
and/or
write some of the text from the first exercise to file – read that file back in.
reference: http://inamidst.com/stuff/unidata/
NOTE: if your terminal does not support unicode – you’ll get an error trying to print. Try a different terminal or IDE, or google for a solution.
We saw this earlier
In [38]: u'to \N{INFINITY} and beyond!'.decode('utf-8')
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-38-7f87d44dfcfa> in <module>()
----> 1 u'to \N{INFINITY} and beyond!'.decode('utf-8')
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.pyc in decode(input, errors)
14
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_8_decode(input, errors, True)
17
18 class IncrementalEncoder(codecs.IncrementalEncoder):
UnicodeEncodeError: 'ascii' codec can't encode character u'\u221e' in position 3: ordinal not in range(128)
But why would you decode a unicode object?
And it should be a no-op – why the exception?
And why ‘ascii’? I specified ‘utf-8’!
It’s there for backward compatibility
What’s happening under the hood
u'to \N{INFINITY} and beyond!'.encode().decode('utf-8')
It encodes with the default encoding (ascii), then decodes
In this case, it barfs on attempting to encode to ‘ascii’
So never call decode on a unicode object!
But what if someone passes one into a function of yours that’s expecting a py2 string?
Type checking and converting – yeach!
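One way to contain the type checking is to do it in a single small helper. The function below is a hypothetical sketch, not something from the standard library: the name to_unicode and the utf-8 default are assumptions.

```python
def to_unicode(value, encoding="utf-8"):
    """Return value as a unicode object, decoding only if it is bytes.

    Hypothetical helper; the name and the utf-8 default are assumptions.
    """
    if isinstance(value, bytes):       # bytes is str on Python 2
        return value.decode(encoding)
    return value                       # already unicode: leave it alone


assert to_unicode(b"caf\xc3\xa9") == u"caf\u00e9"
assert to_unicode(u"to \u221e and beyond!") == u"to \u221e and beyond!"
```

Because the helper never calls decode on a unicode object, it sidesteps the implicit ascii encode described above.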
Read:
http://axialcorps.com/2014/03/20/unicode-str/
See if you can figure out the decorators:
(This is advanced Python JuJu: Aren’t you glad I didn’t ask you to write that yourself?)