# Unicode HOWTO
Release 1.12
This HOWTO discusses Python's support for the Unicode specification for representing textual data, and explains various problems that people commonly encounter when trying to work with Unicode.
## Introduction to Unicode
### Definitions
Today's programs need to be able to handle a wide variety of characters. Applications are often internationalized to display messages and output in a variety of user-selectable languages; the same program might need to output an error message in English, French, Japanese, Hebrew, or Russian. Web content can be written in any of these languages and can also include a variety of emoji symbols. Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters.
The Unicode specification (<https://www.unicode.org/>) aims to list every character used by human languages and give each character its own unique code. The specification is continually revised and updated to add new languages and symbols.
A **character** is the smallest possible component of a text. 'A', 'B', 'C', etc., are all different characters. So are 'è' and 'í'. Characters vary depending on the language or context you're talking about. For example, there's a character for "Roman Numeral One", 'Ⅰ', that's separate from the uppercase letter 'I'. They'll usually look the same, but these are two different characters that have different meanings.
The Unicode standard describes how characters are represented by **code points**. A code point value is an integer in the range 0 to 0x10FFFF (about 1.1 million values, with some 110 thousand assigned so far). In the standard and in this document, a code point is written using the notation `U+265E` to mean the character with value `0x265e` (9,822 in decimal).
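As a quick illustration of this code point / integer correspondence, here is a minimal interactive sketch (it uses the built-in `ord()` and `chr()` functions, which are covered later in this HOWTO):
```
>>> ord('\N{BLACK CHESS KNIGHT}')   # the character at code point U+265E
9822
>>> hex(9822)
'0x265e'
>>> chr(0x265E)
'♞'
```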
The Unicode standard contains a lot of tables listing characters and their corresponding code points:
```
0061 'a'; LATIN SMALL LETTER A
0062 'b'; LATIN SMALL LETTER B
0063 'c'; LATIN SMALL LETTER C
...
007B '{'; LEFT CURLY BRACKET
...
2167 'Ⅷ'; ROMAN NUMERAL EIGHT
2168 'Ⅸ'; ROMAN NUMERAL NINE
...
265E '♞'; BLACK CHESS KNIGHT
265F '♟'; BLACK CHESS PAWN
...
1F600 '😀'; GRINNING FACE
1F609 '😉'; WINKING FACE
...
```
Strictly, these definitions imply that it's meaningless to say 'this is character `U+265E`'. `U+265E` is a code point, which represents some particular character; in this case, it represents the character 'BLACK CHESS KNIGHT', '♞'. In informal contexts, this distinction between code points and characters will sometimes be forgotten.
A character is represented on a screen or on paper by a set of graphical elements that's called a **glyph**. The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details will depend on the font being used. Most Python code doesn't need to worry about glyphs; figuring out the correct glyph to display is generally the job of a GUI toolkit or a terminal's font renderer.
### Encodings
To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through `0x10FFFF` (1,114,111 decimal). This sequence of code points needs to be represented in memory as a set of **code units**, and **code units** are then mapped to 8-bit bytes. The rules for translating a Unicode string into a sequence of bytes are called a **character encoding**, or just an **encoding**.
The first encoding you might think of is using 32-bit integers as the code unit, and then using the CPU's representation of 32-bit integers. In this representation, the string "Python" might look like this:
```
   P           y           t           h           o           n
0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
```
This representation is straightforward but using it presents a number of problems.
1. It's not portable; different processors order the bytes differently.
2. It's very wasteful of space. In most texts, the majority of the code points are less than 127, or less than 255, so a lot of space is occupied by `0x00` bytes. The above string takes 24 bytes compared to the 6 bytes needed for an ASCII representation. Increased RAM usage doesn't matter too much (desktop computers have gigabytes of RAM, and strings aren't usually that large), but expanding our usage of disk and network bandwidth by a factor of 4 is intolerable.
3. It's not compatible with existing C functions such as `strlen()`, so a new family of wide string functions would need to be used.
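The size claim in point 2 above is easy to check interactively. A minimal sketch: `'utf-32-le'` is used here as a fixed-width four-byte encoding without a byte-order mark, and UTF-8 stands in for the ASCII representation, since the two coincide for ASCII-only text.
```
>>> len("Python".encode('utf-32-le'))   # four bytes per code point
24
>>> len("Python".encode('utf-8'))       # plain ASCII text: one byte each
6
```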
Therefore this encoding isn't used very much, and people instead choose other encodings that are more efficient and convenient, such as UTF-8.
UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for "Unicode Transformation Format", and the '8' means that 8-bit values are used in the encoding. (There are also UTF-16 and UTF-32 encodings, but they are less frequently used than UTF-8.) UTF-8 uses the following rules:
1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.
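A short interactive sketch makes the two rules concrete; the byte values below follow directly from the UTF-8 rules and can be reproduced in any Python 3 interpreter:
```
>>> 'a'.encode('utf-8')            # U+0061 < 128: one byte
b'a'
>>> 'é'.encode('utf-8')            # U+00E9 >= 128: two bytes, each in the range 128-255
b'\xc3\xa9'
>>> '€'.encode('utf-8')            # U+20AC: three bytes
b'\xe2\x82\xac'
>>> '\U0001F600'.encode('utf-8')   # U+1F600 (GRINNING FACE): four bytes
b'\xf0\x9f\x98\x80'
```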
UTF-8 has several convenient properties:
1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes that contains embedded zero bytes only where they represent the null character (U+0000). This means that UTF-8 strings can be processed by C functions such as `strcpy()` and sent through protocols that can't handle zero bytes for anything other than end-of-string markers.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize. It's also unlikely that random 8-bit data will look like valid UTF-8.
6. UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes. This avoids the byte-ordering issues that can occur with integer and word oriented encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending on the hardware on which the string was encoded.
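Property 6 can be seen by comparing UTF-8 output with the two fixed byte orders of UTF-16 (a minimal sketch):
```
>>> 'Ω'.encode('utf-16-le'), 'Ω'.encode('utf-16-be')   # byte order depends on the variant chosen
(b'\xa9\x03', b'\x03\xa9')
>>> 'Ω'.encode('utf-8')                                # always the same byte sequence
b'\xce\xa9'
```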
### References
The [Unicode Consortium site](http://www.unicode.org) has character charts, a glossary, and PDF versions of the Unicode specification. Be prepared for some difficult reading. [A chronology](http://www.unicode.org/history/) of the origin and development of Unicode is also available on the site.
On the Computerphile Youtube channel, Tom Scott briefly discusses [the history of Unicode and UTF-8](https://www.youtube.com/watch?v=MijmeoH9LT4) (9 minutes 36 seconds).
To help understand the standard, Jukka Korpela has written [an introductory guide](http://jkorpela.fi/unicode/guide.html) to reading the Unicode character tables.
Another [good introductory article](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) was written by Joel Spolsky. If this introduction didn't make things clear to you, you should try reading this alternate article before continuing.
Wikipedia entries are often helpful; see the entries for "[character encoding](https://en.wikipedia.org/wiki/Character_encoding)" and "[UTF-8](https://en.wikipedia.org/wiki/UTF-8)", for example.
## Python's Unicode Support
Now that you've learned the rudiments of Unicode, we can look at Python's Unicode features.
### The String Type
Since Python 3.0, the language's [`str`](../library/stdtypes.xhtml#str "str") type contains Unicode characters, meaning any string created using `"unicode rocks!"`, `'unicode rocks!'`, or the triple-quoted string syntax is stored as Unicode.
The default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal:
```
try:
    with open('/tmp/input.txt', 'r') as f:
        ...
except OSError:
    # 'File not found' error message.
    print("Fichier non trouvé")
```
Side note: Python 3 also supports using Unicode characters in identifiers:
```
répertoire = "/tmp/records.log"
with open(répertoire, "w") as f:
    f.write("test\n")
```
If you can't enter a particular character in your editor or want to keep the source code ASCII-only for some reason, you can also use escape sequences in string literals. (Depending on your system, you may see the actual capital-delta glyph instead of a u escape.)
```
>>> "\N{GREEK CAPITAL LETTER DELTA}" # Using the character name
'\u0394'
>>> "\u0394" # Using a 16-bit hex value
'\u0394'
>>> "\U00000394" # Using a 32-bit hex value
'\u0394'
```
In addition, one can create a string using the [`decode()`](../library/stdtypes.xhtml#bytes.decode "bytes.decode") method of [`bytes`](../library/stdtypes.xhtml#bytes "bytes"). This method takes an *encoding* argument, such as `UTF-8`, and optionally an *errors* argument.
The *errors* argument specifies the response when the input string can't be converted according to the encoding's rules. Legal values for this argument are `'strict'` (raise a [`UnicodeDecodeError`](../library/exceptions.xhtml#UnicodeDecodeError "UnicodeDecodeError") exception), `'replace'` (use `U+FFFD`, `REPLACEMENT CHARACTER`), `'ignore'` (just leave the character out of the Unicode result), or `'backslashreplace'` (inserts a `\xNN` escape sequence). The following examples show the differences:
```
>>> b'\x80abc'.decode("utf-8", "strict")
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
invalid start byte
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'
```
Encodings are specified as strings containing the encoding's name. Python comes with roughly 100 different encodings; see the Python Library Reference at [Standard Encodings](../library/codecs.xhtml#standard-encodings) for a list. Some encodings have multiple names; for example, `'latin-1'`, `'iso_8859_1'` and `'8859'` are all synonyms for the same encoding.
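Because these aliases all name the same codec, decoding with any of them gives the same result (a minimal sketch):
```
>>> b'caf\xe9'.decode('latin-1')
'café'
>>> b'caf\xe9'.decode('iso_8859_1') == b'caf\xe9'.decode('latin-1')
True
```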
One-character Unicode strings can also be created with the [`chr()`](../library/functions.xhtml#chr "chr") built-in function, which takes integers and returns a Unicode string of length 1 that contains the corresponding code point. The reverse operation is the built-in [`ord()`](../library/functions.xhtml#ord "ord") function that takes a one-character Unicode string and returns the code point value:
```
>>> chr(57344)
'\ue000'
>>> ord('\ue000')
57344
```
### Converting to Bytes
The opposite method of [`bytes.decode()`](../library/stdtypes.xhtml#bytes.decode "bytes.decode") is [`str.encode()`](../library/stdtypes.xhtml#str.encode "str.encode"), which returns a [`bytes`](../library/stdtypes.xhtml#bytes "bytes") representation of the Unicode string, encoded in the requested *encoding*.
The *errors* parameter is the same as the parameter of the [`decode()`](../library/stdtypes.xhtml#bytes.decode "bytes.decode") method but supports a few more possible handlers. As well as `'strict'`, `'ignore'`, and `'replace'` (which in this case inserts a question mark instead of the unencodable character), there is also `'xmlcharrefreplace'` (inserts an XML character reference), `backslashreplace` (inserts a `\uNNNN` escape sequence) and `namereplace` (inserts a `\N{...}` escape sequence).
The following example shows the different results:
```
>>> u = chr(40960) + 'abcd' + chr(1972)
>>> u.encode('utf-8')
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
b'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;'
>>> u.encode('ascii', 'backslashreplace')
b'\\ua000abcd\\u07b4'
>>> u.encode('ascii', 'namereplace')
b'\\N{YI SYLLABLE IT}abcd\\u07b4'
```
The low-level routines for registering and accessing the available encodings are found in the [`codecs`](../library/codecs.xhtml#module-codecs "codecs: Encode and decode data and streams.") module. Implementing new encodings also requires understanding the [`codecs`](../library/codecs.xhtml#module-codecs "codecs: Encode and decode data and streams.") module. However, the encoding and decoding functions returned by this module are usually more low-level than is comfortable, and writing new encodings is a specialized task, so the module won't be covered in this HOWTO.
### Unicode Literals in Python Source Code
In Python source code, specific Unicode code points can be written using the `\u` escape sequence, which is followed by four hex digits giving the code point. The `\U` escape sequence is similar, but expects eight hex digits, not four:
```
>>> s = "a\xac\u1234\u20ac\U00008000"
... #     ^^^^ two-digit hex escape
... #         ^^^^^^ four-digit Unicode escape
... #                     ^^^^^^^^^^ eight-digit Unicode escape
>>> [ord(c) for c in s]
[97, 172, 4660, 8364, 32768]
```
Using escape sequences for code points greater than 127 is fine in small doses, but becomes an annoyance if you're using many accented characters, as you would in a program with messages in French or some other accent-using language. You can also assemble strings using the [`chr()`](../library/functions.xhtml#chr "chr") built-in function, but this is even more tedious.
Ideally, you'd want to be able to write literals in your language's natural encoding. You could then edit Python source code with your favorite editor which would display the accented characters naturally, and have the right characters used at runtime.
Python supports writing source code in UTF-8 by default, but you can use almost any encoding if you declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:
```
#!/usr/bin/env python
# -*- coding: latin-1 -*-
u = 'abcdé'
print(ord(u[-1]))
```
The syntax is inspired by Emacs's notation for specifying variables local to a file. Emacs supports many different variables, but Python only supports 'coding'. The `-*-` symbols indicate to Emacs that the comment is special; they have no significance to Python but are a convention. Python looks for `coding: name` or `coding=name` in the comment.
If you don't include such a comment, the default encoding used will be UTF-8 as already mentioned. See also [**PEP 263**](https://www.python.org/dev/peps/pep-0263) \[https://www.python.org/dev/peps/pep-0263\] for more information.
### Unicode Properties
The Unicode specification includes a database of information about code points. For each defined code point, the information includes the character's name, its category, the numeric value if applicable (for characters representing numeric concepts such as the Roman numerals, fractions such as one-third and four-fifths, etc.). There are also display-related properties, such as how to use the code point in bidirectional text.
The following program displays some information about several characters, and prints the numeric value of one particular character:
```
import unicodedata

u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

for i, c in enumerate(u):
    print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
    print(unicodedata.name(c))

# Get numeric value of second character
print(unicodedata.numeric(u[1]))
```
When run, this prints:
```
0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
1 0bf2 No TAMIL NUMBER ONE THOUSAND
2 0f84 Mn TIBETAN MARK HALANTA
3 1770 Lo TAGBANWA LETTER SA
4 33af So SQUARE RAD OVER S SQUARED
1000.0
```
The category codes are abbreviations describing the nature of the character. These are grouped into categories such as "Letter", "Number", "Punctuation", or "Symbol", which in turn are broken up into subcategories. To take the codes from the above output, `'Ll'` means 'Letter, lowercase', `'No'` means "Number, other", `'Mn'` is "Mark, nonspacing", and `'So'` is "Symbol, other". See [the General Category Values section of the Unicode Character Database documentation](http://www.unicode.org/reports/tr44/#General_Category_Values) \[http://www.unicode.org/reports/tr44/#General\_Category\_Values\] for a list of category codes.
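A quick way to explore the categories yourself is to ask `unicodedata.category()` about a few familiar characters (a minimal sketch):
```
>>> import unicodedata
>>> [unicodedata.category(c) for c in 'A1,€']
['Lu', 'Nd', 'Po', 'Sc']
```
Here `'Lu'` is "Letter, uppercase", `'Nd'` is "Number, decimal digit", `'Po'` is "Punctuation, other", and `'Sc'` is "Symbol, currency".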
### Comparing Strings
Unicode adds some complication to comparing strings, because the same set of characters can be represented by different sequences of code points. For example, a letter like 'ê' can be represented as a single code point U+00EA, or as U+0065 U+0302, which is the code point for 'e' followed by a code point for 'COMBINING CIRCUMFLEX ACCENT'. These will produce the same output when printed, but one is a string of length 1 and the other is of length 2.
One tool for a case-insensitive comparison is the [`casefold()`](../library/stdtypes.xhtml#str.casefold "str.casefold") string method that converts a string to a case-insensitive form following an algorithm described by the Unicode Standard. This algorithm has special handling for characters such as the German letter 'ß' (code point U+00DF), which becomes the pair of lowercase letters 'ss'.
```
>>> street = 'Gürzenichstraße'
>>> street.casefold()
'gürzenichstrasse'
```
A second tool is the [`unicodedata`](../library/unicodedata.xhtml#module-unicodedata "unicodedata: Access the Unicode Database.") module's [`normalize()`](../library/unicodedata.xhtml#unicodedata.normalize "unicodedata.normalize") function that converts strings to one of several normal forms, where letters followed by a combining character are replaced with single characters. `normalize()` can be used to perform string comparisons that won't falsely report inequality if two strings use combining characters differently:
```
import unicodedata

def compare_strs(s1, s2):
    def NFD(s):
        return unicodedata.normalize('NFD', s)

    return NFD(s1) == NFD(s2)

single_char = 'ê'
multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
print('length of first string=', len(single_char))
print('length of second string=', len(multiple_chars))
print(compare_strs(single_char, multiple_chars))
```
When run, this outputs:
```
$ python3 compare-strs.py
length of first string= 1
length of second string= 2
True
```
The first argument to the [`normalize()`](../library/unicodedata.xhtml#unicodedata.normalize "unicodedata.normalize") function is a string giving the desired normalization form, which can be one of 'NFC', 'NFKC', 'NFD', and 'NFKD'.
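The difference between the composed and decomposed forms, and between the canonical and compatibility ('K') forms, can be seen in a short interactive sketch:
```
>>> import unicodedata
>>> s = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
>>> len(unicodedata.normalize('NFC', s)), len(unicodedata.normalize('NFD', s))
(1, 2)
>>> unicodedata.normalize('NFKC', '\u2167')   # ROMAN NUMERAL EIGHT
'VIII'
```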
The Unicode Standard also specifies how to do caseless comparisons:
```
import unicodedata

def compare_caseless(s1, s2):
    def NFD(s):
        return unicodedata.normalize('NFD', s)

    return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())

# Example usage
single_char = 'ê'
multiple_chars = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
print(compare_caseless(single_char, multiple_chars))
```
This will print `True`. (Why is `NFD()` invoked twice? Because there are a few characters that make `casefold()` return a non-normalized string, so the result needs to be normalized again. See section 3.13 of the Unicode Standard for a discussion and an example.)
### Unicode Regular Expressions
The regular expressions supported by the [`re`](../library/re.xhtml#module-re "re: Regular expression operations.") module can be provided either as bytes or strings. Some of the special character sequences such as `\d` and `\w` have different meanings depending on whether the pattern is supplied as bytes or a string. For example, `\d` will match the characters `[0-9]` in bytes but in strings will match any character that's in the `'Nd'` category.
The string in this example has the number 57 written in both Thai and Arabic numerals:
```
import re
p = re.compile(r'\d+')
s = "Over \u0e55\u0e57 57 flavours"
m = p.search(s)
print(repr(m.group()))
```
When executed, `\d+` will match the Thai numerals and print them out. If you supply the [`re.ASCII`](../library/re.xhtml#re.ASCII "re.ASCII") flag to [`compile()`](../library/re.xhtml#re.compile "re.compile"), `\d+` will match the substring "57" instead.
Similarly, `\w` matches a wide variety of Unicode characters but only `[a-zA-Z0-9_]` in bytes or if [`re.ASCII`](../library/re.xhtml#re.ASCII "re.ASCII") is supplied, and `\s` will match either Unicode whitespace characters or `[ \t\n\r\f\v]`.
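For example, the same `\w+` pattern behaves quite differently on a string and on its UTF-8-encoded bytes (a minimal sketch):
```
>>> import re
>>> re.findall(r'\w+', 'déjà vu')                      # str pattern: Unicode word characters
['déjà', 'vu']
>>> re.findall(rb'\w+', 'déjà vu'.encode('utf-8'))     # bytes pattern: only [a-zA-Z0-9_]
[b'd', b'j', b'vu']
```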
### References
Some good alternative discussions of Python's Unicode support are:
- [Processing Text Files in Python 3](http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html), by Nick Coghlan.
- [Pragmatic Unicode](https://nedbatchelder.com/text/unipain.html), a PyCon 2012 presentation by Ned Batchelder.
The [`str`](../library/stdtypes.xhtml#str "str") type is described in the Python library reference at [Text Sequence Type — str](../library/stdtypes.xhtml#textseq).
The documentation for the [`unicodedata`](../library/unicodedata.xhtml#module-unicodedata "unicodedata: Access the Unicode Database.") module.
The documentation for the [`codecs`](../library/codecs.xhtml#module-codecs "codecs: Encode and decode data and streams.") module.
Marc-André Lemburg gave [a presentation titled "Python and Unicode" (PDF slides)](https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf) at EuroPython 2002. The slides are an excellent overview of the design of Python 2's Unicode features (where the Unicode string type is called `unicode` and literals start with `u`).
## Reading and Writing Unicode Data
Once you've written some code that works with Unicode data, the next problem is input/output. How do you get Unicode strings into your program, and how do you convert Unicode into a form suitable for storage or transmission?
It's possible that you may not need to do anything depending on your input sources and output destinations; you should check whether the libraries used in your application support Unicode natively. XML parsers often return Unicode data, for example. Many relational databases also support Unicode-valued columns and can return Unicode values from an SQL query.
Unicode data is usually converted to a particular encoding before it gets written to disk or sent over a socket. It's possible to do all the work yourself: open a file, read an 8-bit bytes object from it, and convert the bytes with `bytes.decode(encoding)`. However, the manual approach is not recommended.
One problem is the multi-byte nature of encodings; one Unicode character can be represented by several bytes. If you want to read the file in arbitrary-sized chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case where only part of the bytes encoding a single Unicode character are read at the end of a chunk. One solution would be to read the entire file into memory and then perform the decoding, but that prevents you from working with files that are extremely large; if you need to read a 2 GiB file, you need 2 GiB of RAM. (More, really, since for at least a moment you'd need to have both the encoded string and its Unicode version in memory.)
The solution would be to use the low-level decoding interface to catch the case of partial coding sequences. The work of implementing this has already been done for you: the built-in [`open()`](../library/functions.xhtml#open "open") function can return a file-like object that assumes the file's contents are in a specified encoding and accepts Unicode parameters for methods such as [`read()`](../library/io.xhtml#io.TextIOBase.read "io.TextIOBase.read") and [`write()`](../library/io.xhtml#io.TextIOBase.write "io.TextIOBase.write"). This works through [`open()`](../library/functions.xhtml#open "open")'s *encoding* and *errors* parameters which are interpreted just like those in [`str.encode()`](../library/stdtypes.xhtml#str.encode "str.encode") and [`bytes.decode()`](../library/stdtypes.xhtml#bytes.decode "bytes.decode").
Reading Unicode from a file is therefore simple:
```
with open('unicode.txt', encoding='utf-8') as f:
    for line in f:
        print(repr(line))
```
It's also possible to open files in update mode, allowing both reading and writing:
```
with open('test', encoding='utf-8', mode='w+') as f:
    f.write('\u4500 blah blah blah\n')
    f.seek(0)
    print(repr(f.readline()[:1]))
```
The Unicode character `U+FEFF` is used as a byte-order mark (BOM), and is often written as the first character of a file in order to assist with autodetection of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be present at the start of a file; when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as 'utf-16-le' and 'utf-16-be' for little-endian and big-endian encodings, that specify one particular byte ordering and don't skip the BOM.
In some areas, it is also convention to use a "BOM" at the start of UTF-8 encoded files; the name is misleading since UTF-8 is not byte-order dependent. The mark simply announces that the file is encoded in UTF-8. For reading such files, use the 'utf-8-sig' codec to automatically skip the mark if present.
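A short sketch shows the difference between the plain 'utf-8' codec and 'utf-8-sig' when a BOM is present (the three bytes EF BB BF are the UTF-8 encoding of U+FEFF):
```
>>> '\ufeffhello'.encode('utf-8')              # the BOM character encoded in UTF-8
b'\xef\xbb\xbfhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')   # 'utf-8-sig' silently drops a leading BOM
'hello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8')       # plain 'utf-8' keeps it as U+FEFF
'\ufeffhello'
```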
### Unicode filenames
Most of the operating systems in common use today support filenames that contain arbitrary Unicode characters. Usually this is implemented by converting the Unicode string into some encoding that varies depending on the system. Today Python is converging on using UTF-8: Python on MacOS has used UTF-8 for several versions, and Python 3.6 switched to using UTF-8 on Windows as well. On Unix systems, there will only be a filesystem encoding if you've set the `LANG` or `LC_CTYPE` environment variables; if you haven't, the default encoding is again UTF-8.
The [`sys.getfilesystemencoding()`](../library/sys.xhtml#sys.getfilesystemencoding "sys.getfilesystemencoding") function returns the encoding to use on your current system, in case you want to do the encoding manually, but there's not much reason to bother. When opening a file for reading or writing, you can usually just provide the Unicode string as the filename, and it will be automatically converted to the right encoding for you:
```
filename = 'filename\u4500abc'
with open(filename, 'w') as f:
    f.write('blah\n')
```
Functions in the [`os`](../library/os.xhtml#module-os "os: Miscellaneous operating system interfaces.") module such as [`os.stat()`](../library/os.xhtml#os.stat "os.stat") will also accept Unicode filenames.
The [`os.listdir()`](../library/os.xhtml#os.listdir "os.listdir") function returns filenames, which raises an issue: should it return the Unicode version of filenames, or should it return bytes containing the encoded versions? [`os.listdir()`](../library/os.xhtml#os.listdir "os.listdir") can do both, depending on whether you provided the directory path as bytes or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem's encoding and a list of Unicode strings will be returned, while passing a byte path will return the filenames as bytes. For example, assuming the default filesystem encoding is UTF-8, running the following program:
```
fn = 'filename\u4500abc'
f = open(fn, 'w')
f.close()
import os
print(os.listdir(b'.'))
print(os.listdir('.'))
```
will produce the following output:
```
$ python listdir-test.py
[b'filename\xe4\x94\x80abc', ...]
['filename\u4500abc', ...]
```
The first list contains UTF-8-encoded filenames, and the second list contains the Unicode versions.
Note that on most occasions, you can just stick with using Unicode with these APIs. The bytes APIs should only be used on systems where undecodable file names can be present; that's pretty much only Unix systems now.
### Tips for Writing Unicode-aware Programs
This section provides some suggestions on writing software that deals with Unicode.
The most important tip is:
> Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.
If you attempt to write processing functions that accept both Unicode and byte strings, you will find your program vulnerable to bugs wherever you combine the two different kinds of strings. There is no automatic encoding or decoding: if you do e.g. `str + bytes`, a [`TypeError`](../library/exceptions.xhtml#TypeError "TypeError") will be raised.
When using data coming from a web browser or some other untrusted source, a common technique is to check for illegal characters in a string before using the string in a generated command line or storing it in a database. If you're doing this, be careful to check the decoded string, not the encoded bytes data; some encodings may have interesting properties, such as not being bijective or not being fully ASCII-compatible. This is especially true if the input data also specifies the encoding, since the attacker can then choose a clever way to hide malicious text in the encoded bytestream.
#### Converting Between File Encodings
The [`StreamRecoder`](../library/codecs.xhtml#codecs.StreamRecoder "codecs.StreamRecoder") class can transparently convert between encodings, taking a stream that returns data in encoding #1 and behaving like a stream returning data in encoding #2.
For example, if you have an input file *f* that's in Latin-1, you can wrap it with a [`StreamRecoder`](../library/codecs.xhtml#codecs.StreamRecoder "codecs.StreamRecoder") to return bytes encoded in UTF-8:
```
import codecs

new_f = codecs.StreamRecoder(f,
    # en/decoder: used by read() to encode its results and
    # by write() to decode its input.
    codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

    # reader/writer: used to read and write to the stream.
    codecs.getreader('latin-1'), codecs.getwriter('latin-1') )
```
#### Files in an Unknown Encoding
What can you do if you need to make a change to a file, but don't know the file's encoding? If you know the encoding is ASCII-compatible and only want to examine or modify the ASCII parts, you can open the file with the `surrogateescape` error handler:
```
with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()

# make changes to the string 'data'

with open(fname + '.new', 'w',
          encoding="ascii", errors="surrogateescape") as f:
    f.write(data)
```
The `surrogateescape` error handler will decode any non-ASCII bytes as code points in a special range running from U+DC80 to U+DCFF. These code points will then turn back into the same bytes when the `surrogateescape` error handler is used to encode the data and write it back out.
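The round trip can be checked directly (a minimal sketch; the byte 0xF1 is not valid ASCII, so it is smuggled through as the code point U+DCF1):
```
>>> raw = b'ascii \xf1 more ascii'
>>> text = raw.decode('ascii', 'surrogateescape')
>>> text
'ascii \udcf1 more ascii'
>>> text.encode('ascii', 'surrogateescape') == raw
True
```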
### References
One section of [Mastering Python 3 Input/Output](http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o), a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.
The [PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware Applications in Python"](https://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf) discuss questions of character encodings as well as how to internationalize and localize an application. These slides cover Python 2.x only.
[The Guts of Unicode in Python](http://pyvideo.org/video/1768/the-guts-of-unicode-in-python) is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode representation in Python 3.3.
## Acknowledgements
The initial draft of this document was written by Andrew Kuchling. It has since been revised further by Alexander Belopolsky, Georg Brandl, Andrew Kuchling, and Ezio Melotti.
Thanks to the following people who have noted errors or offered suggestions on this article: Éric Araujo, Nicholas Bastin, Nick Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André Lemburg, Martin von Löwis, Terry J. Reedy, Serhiy Storchaka, Eryk Sun, Chad Whitacre, Graham Wideman.