第四章-文本与字节
人类使用文本交流,而计算机则使用字节通信。———埃斯特。纳姆,和特雷斯。吠舍。字符解码与Python中的Unicode
Python 3在人类可读文本字符串和原始字节之间引入了非常明显的差异。含混的字节序列转换到Unicode文本已经成为往事。本章着眼于Unicode字符串,二进制序列以及用于转换前两者的编码。
依据Python编程的上下文不同,更深入的理解Unicode对你来说或许很重要或许也不是很重要。最后,本章中遇到的大多数问题都不会影响到那些仅处理ASCII文本的程序员。不过要是你遇到此类问题,也没有必要去讲str转义为byte分开来。这样做带来的好处是,你会发现专门的二进制序列类型提供了Python 2 中的str类型没有提供的“万能”功能。
在这一章我们会谈到以下话题:
- characters, code points and byte representations;
- 二进制序列独一无二的功能以及早期字符集合;
- 避免并处理编码错误;
- 处理文本的最佳实践;
- 默认的编码陷阱与标准I/O问题;
- 安全的Unicode文本正规化比较;
- 正规化的多用途函数,case folding and brute-force diacritic removal;
- proper sorting of Unicode text with locale and the PyUCA library;
- Unicode数据库中的字符元类;
- 能够处理str和bytes的双模式API;
Let’s start with the characters, code points and bytes.
“字符串“的概念太简单了:字符串是一个字符列表。而问题则出现在”字符“定义中。
在2014年我们所知道的最佳“字符”定义就是Unicode字符。因此,你从Python 3 的str得到项便是Unicode字符,就像从Python 2中得到的unicode对象一样————而且不会是从Python str中得到的原始字节。
- The identity of a character—its code point—is a number from 0 to 1,114,111 (base 10), shown in the Unicode standard as 4 to 6 hexadecimal digits with a “U+” prefix. For example, the code point for the letter A is U+0041, the Euro sign is U+20AC and the musical symbol G clef is assigned to code point U+1D11E. About 10% of the valid code points have characters assigned to them in Unicode 6.3, the standard used in Python 3.4.
- The actual bytes that represent a character depend on the encoding in use. An encoding is an algorithm that converts code points to byte sequences and vice-versa. The code point for A (U+0041) is encoded as the single byte \x41 in the UTF-8 encoding, or as the bytes \x41\x00 in UTF-16LE encoding. As another example, the Euro sign (U+20AC) becomes three bytes in UTF-8—\xe2\x82\xac—but in UTF-16LE it is encoded as two bytes: \xac\x20.
转换代码片段到字节就是编码;由字节转换为代码片段便是解码。见例子4-1.
例子4-1. 编码与解码。
>>> s = 'café'
>>> len(s) # 1
4
>>> b = s.encode('utf8') # 2
>>> b
b'caf\xc3\xa9' # 3
>>> len(b) # 4
5
>>> b.decode('utf8') # 5
'café'
#1str 'café'拥有4个Unicode字符。
#2: 利用UTF-8编码将str编码为bytes
#3: bytes literals start with a b prefix.
#4: bytes b拥有5个字节(在UTF-8中代码片段“é”被编码为两个字节)。
#5 利用UTF-8编码将bytes解码为str。
If you need a memory aid to help distinguish .decode() from .encode(), convince yourself that byte sequences can be cryptic machine core dumps while Unicode str objects are “hu‐ man” text. Therefore, it makes sense that we decode bytes to str to get human-readable text, and we encode str to bytes for stor‐ age or transmission.
Although the Python 3 str is pretty much the Python 2 unicode type with a new name, the Python 3 bytes is not simply the old str renamed, and there is also the closely related bytearray type. So it is worthwhile to take a look at the binary sequence types before advancing to encoding/decoding issues.
The new binary sequence types are unlike the Python 2 str in many regards. The first thing to know is that there are two basic built-in types for binary sequences: the im‐ mutable bytes type introduced in Python 3 and the mutable bytearray, added in Python 2.6. (Python 2.6 also introduced bytes, but it’s just an alias to the str type, and does not behave like the Python 3 bytes type.)
Each item in bytes or bytearray is an integer from 0 to 255, and not a one-character string like in the Python 2 str. However, a slice of a binary sequence always produces a binary sequence of the same type—including slices of length 1. See Example 4-2.
Example 4-2. A five-byte sequence as bytes and as bytearray
>>> cafe = bytes('café', encoding='utf_8')
>>> cafe
b'caf\xc3\xa9'
>>> cafe[0]
99
>>> cafe[:1]
b'c'
>>> cafe_arr = bytearray(cafe)
>>> cafe_arr
bytearray(b'caf\xc3\xa9')
>>> cafe_arr[-1:] bytearray(b'\xa9')
bytes can be built from a str, given an encoding. Each item is an integer in range(256). Slices of bytes are also bytes—even slices of a single byte. There is no literal syntax for bytearray: they are shown as bytearray() with a bytes literal as argument. A slice of bytearray is also a bytearray.
The fact that my_bytes[0] retrieves an int but my_bytes[:1] returns a bytes object of length 1 should not be surprising. The only sequence type where s[0] == s[:1] is the str type. Al‐ though practical, this behavior of str is exceptional. For every other sequence, s[i] returns one item, and s[i:i+1] returns a sequence of the same type with the s[1] item inside it.
Although binary sequences are really sequences of integers, their literal notation reflects the fact that ASCII text is often embedded in them. Therefore, three different displays are used, depending on each byte value:
- For bytes in the printable ASCII range—from space to ~—the ASCII character itself is used.
- For bytes corresponding to tab, newline, carriage return, and , the escape sequences \t, \n, \r, and \ are used.
- For every other byte value, a hexadecimal escape sequence is used (e.g., \x00 is the null byte).
That is why in Example 4-2 you see b'caf\xc3\xa9': the first three bytes b'caf' are in the printable ASCII range, the last two are not.
Both bytes and bytearray support every str method except those that do formatting (format, format_map) and a few others that depend on Unicode data, including case fold, isdecimal, isidentifier, isnumeric, isprintable, and encode. This means that you can use familiar string methods like endswith, replace, strip, translate, upper, and dozens of others with binary sequences—only using bytes and not str arguments. In addition, the regular expression functions in the re module also work on binary sequences, if the regex is compiled from a binary sequence instead of a str. The % operator does not work with binary sequences in Python 3.0 to 3.4, but should be sup‐ported in version 3.5 according to PEP 461 — Adding % formatting to bytes and byte‐ array.
Binary sequences have a class method thatstrdoesn’t have, calledfromhex, which builds a binary sequence by parsing pairs of hex digits optionally separated by spaces:
>>> bytes.fromhex('31 4B CE A9') b'1K\xce\xa9'
The other ways of building bytes or bytearray instances are calling their constructors with:
- A str and an encoding keyword argument.
- An iterable providing items with values from 0 to 255.
- A single integer, to create a binary sequence of that size initialized with null bytes. (This signature will be deprecated in Python 3.5 and removed in Python 3.6. See PEP 467 — Minor API improvements for binary sequences.)
- An object that implements the buffer protocol (e.g., bytes, bytearray, memory view, array.array); this copies the bytes from the source object to the newly cre‐ ated binary sequence.
Building a binary sequence from a buffer-like object is a low-level operation that may involve type casting. See a demonstration in Example 4-3.
Example 4-3. Initializing bytes from the raw data of an array
>>> import array
>>> numbers = array.array('h', [-2, -1, 0, 1, 2])
>>> octets = bytes(numbers)
>>> octets b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'
Typecode 'h' creates an array of short integers (16 bits). octets holds a copy of the bytes that make up numbers. These are the 10 bytes that represent the five short integers.
Creating a bytes or bytearray object from any buffer-like source will always copy the bytes. In contrast, memoryview objects let you share memory between binary data struc‐ tures. To extract structured information from binary sequences, the struct module is invaluable. We’ll see it working along with bytes and memoryview in the next section.
The struct module provides functions to parse packed bytes into a tuple of fields of different types and to perform the opposite conversion, from a tuple into packed bytes. struct is used with bytes, bytearray, and memoryview objects.
As we’ve seen in “Memory Views” on page 51, the memoryview class does not let you create or store byte sequences, but provides shared memory access to slices of data from other binary sequences, packed arrays, and buffers such as Python Imaging Library (PIL) images,2 without copying the bytes.
Example 4-4 shows the use of memoryview and struct together to extract the width and height of a GIF image.
Example 4-4. Using memoryview and struct to inspect a GIF image header
>>> import struct
>>> fmt = '<3s3sHH' #
>>> with open('filter.gif', 'rb') as fp: ... img = memoryview(fp.read()) # ...
>>> header = img[:10] #
>>> bytes(header) # b'GIF89a+\x02\xe6\x00'
>>> struct.unpack(fmt, header) # (b'GIF', b'89a', 555, 230)
>>> del header #
>>> del img
struct format: < little-endian; 3s3s two sequences of 3 bytes; HH two 16-bit integers. Create memoryview from file contents in memory... ...then another memoryview by slicing the first one; no bytes are copied here. Convert to bytes for display only; 10 bytes are copied here. Unpack memoryview into tuple of: type, version, width, and height. Delete references to release the memory associated with the memoryview instances.
Note that slicing a memoryview returns a new memoryview, without copying bytes (Leo‐ nardo Rochael—one of the technical reviewers—pointed out that even less byte copying would happen if I used the mmap module to open the image as a memory-mapped file.
I will not cover mmap in this book, but if you read and change binary files frequently, learning more about mmap — Memory-mapped file support will be very fruitful).
We will not go deeper into memoryview or the struct module in this book, but if you work with binary data, you’ll find it worthwhile to study their docs: Built-in Types » Memory Views and struct — Interpret bytes as packed binary data.
After this brief exploration of binary sequence types in Python, let’s see how they are converted to/from strings.
The Python distribution bundles more than 100 codecs (encoder/decoder) for text to byte conversion and vice versa. Each codec has a name, like 'utf_8', and often aliases, such as 'utf8', 'utf-8', and 'U8', which you can use as the encoding argument in functions like open(), str.encode(), bytes.decode(), and so on. Example 4-5 shows the same text encoded as three different byte sequences.
Example 4-5. The string “El Niño” encoded with three codecs producing very different byte sequences
>>> for codec in ['latin_1', 'utf_8', 'utf_16']:
... print(codec, 'El Niño'.encode(codec), sep='\t')
...
latin_1 b'El Ni\xf1o'
utf_8 b'El Ni\xc3\xb1o'
utf_16 b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'
Figure 4-1 demonstrates a variety of codecs generating bytes from characters like the letter “A” through the G-clef musical symbol. Note that the last three encodings are variable-length, multibyte encodings.
Figure 4-1. Twelve characters, their code points, and their byte representation (in hex) in seven different encodings (asterisks indicate that the character cannot be represented in that encoding)
All those asterisks in Figure 4-1 make clear that some encodings, like ASCII and even the multibyte GB2312, cannot represent every Unicode character. The UTF encodings, however, are designed to handle every Unicode code point.
The encodings shown in Figure 4-1 were chosen as a representative sample:
latin1 a.k.a. iso8859_1 Important because it is the basis for other encodings, such as cp1252 and Unicode itself (note how the latin1 byte values appear in the cp1252 bytes and even in the code points). cp1252 A latin1 superset by Microsoft, adding useful symbols like curly quotes and the € (euro); some Windows apps call it “ANSI,” but it was never a real ANSI standard. cp437 The original character set of the IBM PC, with box drawing characters. Incompat‐ ible with latin1, which appeared later. gb2312 Legacy standard to encode the simplified Chinese ideographs used in mainland China; one of several widely deployed multibyte encodings for Asian languages. utf-8 The most common 8-bit encoding on the Web, by far;3 backward-compatible with ASCII (pure ASCII text is valid UTF-8). utf-16le One form of the UTF-16 16-bit encoding scheme; all UTF-16 encodings support code points beyond U+FFFF through escape sequences called “surrogate pairs.”
UTF-16 superseded the original 16-bit Unicode 1.0 encoding— UCS-2—way back in 1996. UCS-2 is still deployed in many sys‐ tems, but it only supports code points up to U+FFFF. As of Uni‐ code 6.3, more than 50% of the allocated code points are above U +10000, including the increasingly popular emoji pictographs.
With this overview of common encodings now complete, we move to handling issues in encoding and decoding operations.
Although there is a generic UnicodeError exception, the error reported is almost always more specific: either a UnicodeEncodeError (when converting str to binary sequences) or a UnicodeDecodeError (when reading binary sequences into str). Loading Python modules may also generate a SyntaxError when the source encoding is unexpected. We’ll show how to handle all of these errors in the next sections.
The first thing to note when you get a Unicode error is the exact type of the exception. Is it a UnicodeEncodeError, a UnicodeDeco deError, or some other error (e.g., SyntaxError) that mentions an encoding problem? To solve the problem, you have to under‐ stand it first.
Most non-UTF codecs handle only a small subset of the Unicode characters. When converting text to bytes, if a character is not defined in the target encoding, UnicodeEn codeError will be raised, unless special handling is provided by passing an errors argument to the encoding method or function. The behavior of the error handlers is shown in Example 4-6.
[3]. As of September, 2014, W3Techs: Usage of Character Encodings for Websites claims that 81.4% of sites use UTF-8, while Built With: Encoding Usage Statistics estimates 79.4%.
Example 4-6. Encoding to bytes: success and error handling
>>> city = 'São Paulo'
>>> city.encode('utf_8')
b'S\xc3\xa3o Paulo'
>>> city.encode('utf_16')
b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'
>>> city.encode('iso8859_1')
b'S\xe3o Paulo'
>>> city.encode('cp437') Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../lib/python3.4/encodings/cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map) UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>
>>> city.encode('cp437', errors='ignore')
b'So Paulo'
>>> city.encode('cp437', errors='replace')
b'S?o Paulo'
>>> city.encode('cp437', errors='xmlcharrefreplace')
b'São Paulo'
The 'utf_?' encodings handle any str. 'iso8859_1' also works for the 'São Paulo' str. 'cp437' can’t encode the 'ã' (“a” with tilde). The default error handler —'strict'—raises UnicodeEncodeError. The error='ignore' handler silently skips characters that cannot be encoded; this is usually a very bad idea. When encoding, error='replace' substitutes unencodable characters with '?'; data is lost, but users will know something is amiss. 'xmlcharrefreplace' replaces unencodable characters with an XML entity.
The codecs error handling is extensible. You may register extra strings for the errors argument by passing a name and an error handling function to the codecs.register_error function. See the codecs.register_error documentation.
Not every byte holds a valid ASCII character, and not every byte sequence is valid UTF-8 or UTF-16; therefore, when you assume one of these encodings while converting a binary sequence to text, you will get a UnicodeDecodeError if unexpected bytes are found.
On the other hand, many legacy 8-bit encodings like 'cp1252', 'iso8859_1', and 'koi8_r' are able to decode any stream of bytes, including random noise, without generating errors. Therefore, if your program assumes the wrong 8-bit encoding, it will silently decode garbage.