# Chapter 13 Case study: data structure selection 案例學習:數據結構的選擇

At this point you have learned about Python's core data structures, and you have seen some of the algorithms that use them. If you would like to know more about algorithms, this might be a good time to read Appendix B. But you don't have to read it before you go on; you can read it whenever you are interested. This chapter presents a case study with exercises that let you think about choosing data structures and practice using them.

> 到現在為止,你已經學過 Python 中最核心的數據結構了,也學過了一些與之對應的各種算法了。如果你想要對算法進行深入的了解,就可以來讀一下附錄B。但不讀也可以繼續;無論什么時候讀都可以,感興趣了就來看即可。

> 本章通過一個案例和一些練習,來講解一下如何選擇和使用數據結構。

## 13.1 Word frequency analysis 詞頻統計

As usual, you should at least attempt the exercises before you read my solutions.

> 跟往常一樣,你最起碼也得先自己嘗試做一下這些練習,然后再看參考答案。

### Exercise 1 練習1

Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the words, and converts them to lowercase.

> 寫一個讀取文件的程序,把每一行拆分成一個個詞,去掉空白和標點符號,然后把所有單詞都轉換成小寫字母的。

Hint: The string module provides a string named whitespace, which contains space, tab, newline, etc., and punctuation which contains the punctuation characters. Let's see if we can make Python swear:

> 提示:字符串模塊 string 提供了一個名為 whitespace 的字符串,包含了空格、製表符、換行符等等,還有一個名為 punctuation 的字符串,包含了各種標點符號字符。咱們可以試試讓 Python 把標點符號都給顯示一下:

```Python
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
```

Also, you might consider using the string methods strip, replace and translate.

> 另外你也可以試試字符串的其他方法,比如 strip、replace 以及 translate。

### Exercise 2 練習2

Go to [Project Gutenberg](http://gutenberg.org) and download your favorite out-of-copyright book in plain text format.
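Putting the hints from Exercise 1 together, here is one minimal sketch of the cleaning step; `clean_word` and the sample line are made up for illustration, not the book's solution:

```Python
import string

def clean_word(word):
    # Strip surrounding punctuation and whitespace, then lowercase,
    # as suggested by the hint (string.punctuation + string.whitespace).
    return word.strip(string.punctuation + string.whitespace).lower()

line = 'Hello, World -- "testing" one.'
# Replacing hyphens with spaces before split() separates hyphenated words.
words = [clean_word(w) for w in line.replace('-', ' ').split()]
words = [w for w in words if w]   # drop tokens that were pure punctuation
print(words)   # ['hello', 'world', 'testing', 'one']
```

The same helper can be reused line by line when processing a whole downloaded book.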
> 訪問[古登堡計劃網站](http://gutenberg.org),然后下載一個你最喜歡的公有領域的書,要下載純文本格式的哈。

Modify your program from the previous exercise to read the book you downloaded, skip over the header information at the beginning of the file, and process the rest of the words as before.

> 修改一下剛才上一個練習你寫的程序,讓這個程序能讀取你下載的這本書,跳過文件開頭部分的信息,繼續以上個練習中的方式來處理一下整本書的正文。

Then modify the program to count the total number of words in the book, and the number of times each word is used.

> 然后再修改一下程序,讓程序能統計一下整本書的總單詞數目,以及每個單詞出現的次數。

Print the number of different words used in the book. Compare different books by different authors, written in different eras. Which author uses the most extensive vocabulary?

> 輸出一下這本書中不重復的單詞的個數。對比一下不同作者、不同時代的書籍。哪個作者的詞匯量最豐富?

### Exercise 3 練習3

Modify the program from the previous exercise to print the 20 most frequently-used words in the book.

> 再接著修改程序,輸出一下每本書中最頻繁出現的20個詞。

### Exercise 4 練習4

Modify the previous program to read a word list (see Section 9.1) and then print all the words in the book that are not in the word list. How many of them are typos? How many of them are common words that should be in the word list, and how many of them are really obscure?

> 接著修改,讓程序能讀取一個單詞列表(參考9.1),然后輸出一下所有包含在書中,但不包含于單詞列表中的單詞。看看這些單詞中有多少是排版錯誤的?有多少是本應被單詞列表包含的常用單詞?有多少是很晦澀艱深的罕見詞匯?

## 13.2 Random numbers 隨機數

Given the same inputs, most computer programs generate the same outputs every time, so they are said to be deterministic. Determinism is usually a good thing, since we expect the same calculation to yield the same result. For some applications, though, we want the computer to be unpredictable. Games are an obvious example, but there are more.

> 輸入相同的情況下,大多數計算機程序每次都會給出相同的輸出,這也叫做確定性。確定性通常是一件好事,因為我們都希望同樣的運算產生同樣的結果。但有時候為了一些特定用途,咱們就需要讓計算機能有不可預測性。比如游戲等等,有很多很多這樣的情景。

Making a program truly nondeterministic turns out to be difficult, but there are ways to make it at least seem nondeterministic. One of them is to use algorithms that generate pseudorandom numbers.
Pseudorandom numbers are not truly random because they are generated by a deterministic computation, but just by looking at the numbers it is all but impossible to distinguish them from random.

> 然而,想讓一個程序真正變成不可預測的,也是很難的,但好在有辦法能讓程序看上去不太確定。其中一種方法就是通過算法來產生假隨機數。假隨機數,顧名思義就知道不是真正隨機的了,因為它們是通過一種確定性的運算來得到的,但這些數字看上去是隨機的,很難與真正的隨機數區分。

> (譯者注:這里大家很容易一帶而過,而不去探究到底怎樣能確定是真隨機數。實際上隨機數是否能得到以及是否存在會影響哲學判斷,可知論和不可知論等等。那么就建議大家思考和搜索一下,隨機數算法產生的隨機數和真正隨機數有什么本質的區別,以及是否有辦法得到真正的隨機數。如果有,如何得到呢?)

The random module provides functions that generate pseudorandom numbers (which I will simply call "random" from here on).

> random 模塊提供了生成假隨機數的函數(從這里開始,咱們就用隨機數來簡稱假隨機數了哈)。

The function random returns a random float between 0.0 and 1.0 (including 0.0 but not 1.0). Each time you call random, you get the next number in a long series. To see a sample, run this loop:

> 函數 random 返回一個在0.0到1.0的前閉后開區間(就是包括0.0但不包括1.0,這個特性在 Python 隨處都是,比如序列的索引等等)的隨機數。每次調用 random,就會得到一個很長的數列中的下一個數。如下這個循環就是一個例子了:

```Python
import random

for i in range(10):
    x = random.random()
    print(x)
```

The function randint takes parameters low and high and returns an integer between low and high (including both).

> randint 函數接收兩個參數作為下界和上界,然后返回一個二者之間的整數,這個整數可以是下界或者上界。

```Python
>>> random.randint(5, 10)
5
>>> random.randint(5, 10)
9
```

To choose an element from a sequence at random, you can use choice:

> choice 函數可以用來從一個序列中隨機選出一個元素:

```Python
>>> t = [1, 2, 3]
>>> random.choice(t)
2
>>> random.choice(t)
3
```

The random module also provides functions to generate random values from continuous distributions including Gaussian, exponential, gamma, and a few more.
> random 模塊還提供了其他一些函數,可以計算某些連續分布的隨機值,比如 Gaussian 高斯分布、exponential 指數分布、gamma γ分布等等。

### Exercise 5 練習5

Write a function named choose_from_hist that takes a histogram as defined in Section 11.2 and returns a random value from the histogram, chosen with probability in proportion to frequency. For example, for this histogram:

> 寫一個名為 choose_from_hist 的函數,接收一個11.2中定義的那種 histogram(詞頻字典),然后從中隨機返回一個值,選中每個值的概率與其頻率成正比。比如下面這個 histogram:

```Python
>>> t = ['a', 'a', 'b']
>>> hist = histogram(t)
>>> hist
{'a': 2, 'b': 1}
```

your function should return 'a' with probability 2/3 and 'b' with probability 1/3.

> 你的函數就應該以2/3的概率返回 'a',以1/3的概率返回 'b'。

## 13.3 Word histogram 詞頻

You should attempt the previous exercises before you go on. You can download my solution from [here](http://thinkpython2.com/code/analyze_book1.py).

> 你得先把前面的練習做一下,然后再繼續哈。可以從[這里](http://thinkpython2.com/code/analyze_book1.py)下載我的樣例代碼。

You will also need [this](http://thinkpython2.com/code/emma.txt).

> 此外還要下載[這個](http://thinkpython2.com/code/emma.txt)。

Here is a program that reads a file and builds a histogram of the words in the file:

> 下面這個程序先讀取一個文件,然后對該文件中的詞頻進行了統計:

```Python
import string

def process_file(filename):
    hist = dict()
    fp = open(filename)
    for line in fp:
        process_line(line, hist)
    return hist

def process_line(line, hist):
    line = line.replace('-', ' ')
    for word in line.split():
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()
        hist[word] = hist.get(word, 0) + 1

hist = process_file('emma.txt')
```

This program reads emma.txt, which contains the text of Emma by Jane Austen.

> 上面這個程序讀取的是 emma.txt 這個文件,該文件是簡奧斯汀的小說《艾瑪》。

process_file loops through the lines of the file, passing them one at a time to process_line. The histogram hist is being used as an accumulator.
> process_file 這個函數遍歷整個文件,逐行讀取,然后把每行的內容發給 process_line 函數。詞頻統計結果 hist 在該程序中是一個累加器。

process_line uses the string method replace to replace hyphens with spaces before using split to break the line into a list of strings. It traverses the list of words and uses strip and lower to remove punctuation and convert to lower case. (It is a shorthand to say that strings are "converted"; remember that strings are immutable, so methods like strip and lower return new strings.)

> process_line 使用字符串的方法 replace 把各種連字符都用空格替換,然后用 split 方法把整行打散成一個字符串列表。程序遍歷整個單詞列表,然后用 strip 和 lower 這兩個方法移除了標點符號,并且把所有字母都轉換成小寫的。(一定要記住,這里說的『轉換』是圖方便而已,實際上并不能轉換,要記住字符串是不可以修改的,strip 和 lower 這些方法都是返回了新的字符串,一定要記得!)

Finally, process_line updates the histogram by creating a new item or incrementing an existing one. To count the total number of words in the file, we can add up the frequencies in the histogram:

> 最終,process_line 函數通過建立新項或者累加已有項,對詞頻統計 histogram 進行了更新。

> 要計算整個文件中的單詞總數,把 histogram 中的所有頻數加到一起就可以了:

```Python
def total_words(hist):
    return sum(hist.values())
```

The number of different words is just the number of items in the dictionary:

> 不重復的單詞的數目也就是字典中項的個數了:

```Python
def different_words(hist):
    return len(hist)
```

Here is some code to print the results:

> 輸出結果的代碼如下:

```Python
print('Total number of words:', total_words(hist))
print('Number of different words:', different_words(hist))
```

And the results:

> 結果如下所示:

```Python
Total number of words: 161080
Number of different words: 7214
```

## 13.4 Most common words 最常用的單詞

To find the most common words, we can make a list of tuples, where each tuple contains a word and its frequency, and sort it.
The following function takes a histogram and returns a list of word-frequency tuples:

> 要找到最常用的詞,可以做一個元組列表,每一個元組包含一個單詞和該單詞出現的次數,然后對這個列表排序就可以了。

> 下面的函數就接收了詞頻統計結果,然后返回一個『單詞-次數』元組組成的列表:

```Python
def most_common(hist):
    t = []
    for key, value in hist.items():
        t.append((value, key))
    t.sort(reverse=True)
    return t
```

In each tuple, the frequency appears first, so the resulting list is sorted by frequency. Here is a loop that prints the ten most common words:

> 這些元組中,詞頻在前,所以返回的列表就是按照詞頻排序的。下面是一個輸出最常用的十個單詞的循環體:

```Python
t = most_common(hist)
print('The most common words are:')
for freq, word in t[:10]:
    print(word, freq, sep='\t')
```

I use the keyword argument sep to tell print to use a tab character as a "separator", rather than a space, so the second column is lined up. Here are the results from Emma:

> 此處用了關鍵字參數 sep 來讓 print 輸出的時候以一個 tab 製表符來作為分隔,而不是一個空格,這樣第二列就會對齊。下面就是對《艾瑪》這本小說的統計結果:

> (譯者注:這個效果在 Python 下很明顯,此處 markdown 我剛開始熟悉,不清楚咋實現。)

```Python
The most common words are:
to      5242
the     5205
and     4897
of      4295
i       3191
a       3130
it      2529
her     2483
was     2400
she     2364
```

This code can be simplified using the key parameter of the sort function. If you are curious, you can read about it [here](https://wiki.python.org/moin/HowTo/Sorting).

> 如果使用 sort 函數的 key 參數,上面的代碼還可以進一步簡化。如果你好奇的話,可以進一步閱讀一下[說明](https://wiki.python.org/moin/HowTo/Sorting)。

## 13.5 Optional parameters 可選的參數

We have seen built-in functions and methods that take optional arguments. It is possible to write programmer-defined functions with optional arguments, too. For example, here is a function that prints the most common words in a histogram:

> 咱們已經見過接收可選參數的內置函數和方法了。實際上咱們自己也可以寫有可選參數的自定義函數。比如下面就是一個根據詞頻數據來輸出最常用單詞的函數:

```Python
def print_most_common(hist, num=10):
    t = most_common(hist)
    print('The most common words are:')
    for freq, word in t[:num]:
        print(word, freq, sep='\t')
```

The first parameter is required; the second is optional. The default value of num is 10.
If you only provide one argument:

> 上面這個函數中,第一個參數是必須輸入的;而第二個參數就是可選的了。第二個參數 num 的默認值是10。

> 如果只提供第一個參數:

```Python
print_most_common(hist)
```

num gets the default value. If you provide two arguments:

> 這樣 num 就用默認值了。如果提供兩個參數:

```Python
print_most_common(hist, 20)
```

num gets the value of the argument instead. In other words, the optional argument overrides the default value. If a function has both required and optional parameters, all the required parameters have to come first, followed by the optional ones.

> 這樣 num 就用參數值來賦值了。換句話說,可選參數的參數值可以覆蓋默認值。

> 如果一個函數同時含有必需參數和可選參數,就必須在定義函數的時候,把必需參數全都放到前面,而可選的參數要放到后面。

## 13.6 Dictionary subtraction 字典減法

Finding the words from the book that are not in the word list from words.txt is a problem you might recognize as set subtraction; that is, we want to find all the words from one set (the words in the book) that are not in the other (the words in the list).

> 找出那些出現在書中、但沒有出現在單詞列表文件 words.txt 中的單詞,你可能已經意識到了,這是一種集合減法;也就是要從一個集合(書中的單詞)中找出所有不包含于另一個集合(列表中的單詞)的單詞。

subtract takes dictionaries d1 and d2 and returns a new dictionary that contains all the keys from d1 that are not in d2. Since we don't really care about the values, we set them all to None.
> 下面的代碼中定義的 subtract 這個函數,接收兩個字典 d1 和 d2,然后返回一個新字典,這個新字典包含所有 d1 中包含而 d2 中不包含的鍵。因為我們并不關心鍵值,就都設置為 None 即可。

```Python
def subtract(d1, d2):
    res = dict()
    for key in d1:
        if key not in d2:
            res[key] = None
    return res
```

To find the words in the book that are not in words.txt, we can use process_file to build a histogram for words.txt, and then subtract:

> 要找到書中含有而 words.txt 中不含有的單詞,就可以用 process_file 函數來建立一個 words.txt 的詞頻統計,然后用 subtract 函數來相減:

```Python
words = process_file('words.txt')
diff = subtract(hist, words)

print("Words in the book that aren't in the word list:")
for word in diff.keys():
    print(word, end=' ')
```

Here are some of the results from Emma:

> 下面依然還是對《艾瑪》得到的結果:

```Python
Words in the book that aren't in the word list:
rencontre jane's blanche woodhouses disingenuousness
friend's venice apartment ...
```

Some of these words are names and possessives. Others, like "rencontre", are no longer in common use. But a few are common words that should really be in the list!

> 這些單詞有的是名字或者所有格之類的。另外的一些,比如『rencontre』,都是現在不怎么常用的了。不過也確實有一些單詞是挺常用的,挺應該被列表所包含的!

### Exercise 6 練習6

Python provides a data structure called set that provides many common set operations. You can read about them in Section 19.5, or read the documentation [here](http://docs.python.org/3/library/stdtypes.html#types-set).

> Python 提供了一個數據結構叫 set(集合),該類型提供了很多常見的集合運算。可以在19.5閱讀一下,或者閱讀一下[這里的官方文檔](http://docs.python.org/3/library/stdtypes.html#types-set)。

Write a program that uses set subtraction to find words in the book that are not in the word list. [Solution](http://thinkpython2.com/code/analyze_book2.py).
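As a hint for Exercise 6: dictionary key views already behave like sets, so the subtraction can be done without an explicit loop. A sketch, with made-up miniature dictionaries standing in for hist and words:

```Python
hist = {'emma': 860, 'the': 5205, "jane's": 3}   # word -> frequency, as in the text
words = {'emma': None, 'the': None}              # the word list, stored as dict keys

# dict.keys() supports set operations, so subtracting the views
# yields the keys of hist that are missing from words,
# the same result subtract(hist, words) would give.
diff = hist.keys() - words.keys()
print(diff)   # {"jane's"}
```

The result is a plain set, which is all the printing loop in the text needs.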
> 寫一個程序吧,用集合的減法,來找一下書中包含而列表中不包含的單詞吧。[樣例代碼](http://thinkpython2.com/code/analyze_book2.py)。

## 13.7 Random words 隨機單詞

To choose a random word from the histogram, the simplest algorithm is to build a list with multiple copies of each word, according to the observed frequency, and then choose from the list:

> 要從詞頻統計結果中隨機選一個單詞,最簡單的算法就是根據已知的詞頻,把每個單詞復制相應個數的副本,組建成一個列表,然后從列表中選擇單詞:

```Python
def random_word(h):
    t = []
    for word, freq in h.items():
        t.extend([word] * freq)
    return random.choice(t)
```

The expression [word] * freq creates a list with freq copies of the string word. The extend method is similar to append except that the argument is a sequence.

> 上面代碼中的 [word] * freq 表達式建立了一個列表,其中字符串 word 重復出現了 freq 次。extend 方法和 append 方法相似,區別是前者的參數是一個序列,而后者是單獨的元素。

This algorithm works, but it is not very efficient; each time you choose a random word, it rebuilds the list, which is as big as the original book. An obvious improvement is to build the list once and then make multiple selections, but the list is still big.

> 上面這個算法確實能用,但效率實在不怎么好;每次選擇隨機單詞的時候,程序都要重建列表,這個列表就和原書一樣大了。很顯然,一次性建立列表,而多次使用該列表來進行選擇,這樣能有明顯改善,但列表依然還是很大。

An alternative is:

> 備選的思路如下:

1. Use keys to get a list of the words in the book.

> 用 keys 方法得到書中單詞的列表。

2. Build a list that contains the cumulative sum of the word frequencies (see Exercise 2). The last item in this list is the total number of words in the book, n.

> 再建立一個列表,該列表包含所有詞頻的累加總和(參考練習2)。該列表的最后一個元素是書中所有單詞的總數 n。

3. Choose a random number from 1 to n. Use a bisection search (see Exercise 10) to find the index where the random number would be inserted in the cumulative sum.

> 選擇一個1到 n 之間的隨機數。使用折半查找(參考練習10),找到該隨機數在累加總和列表中應插入位置的索引值。

4. Use the index to find the corresponding word in the word list.

> 用該索引值來找到單詞列表中對應的單詞。

### Exercise 7 練習7

Write a program that uses this algorithm to choose a random word from the book. [Solution](http://thinkpython2.com/code/analyze_book3.py).
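The four steps above can be sketched with the standard bisect and itertools modules; this is an illustrative sketch, not the book's posted solution:

```Python
import random
import bisect
from itertools import accumulate

def random_word(hist):
    # Steps 1 and 2: the words, and the cumulative sums of their frequencies.
    words = list(hist.keys())
    cumulative = list(accumulate(hist.values()))   # last item is n
    # Step 3: a random number from 1 to n, located by bisection.
    x = random.randint(1, cumulative[-1])
    index = bisect.bisect_left(cumulative, x)
    # Step 4: map the index back to a word.
    return words[index]

hist = {'a': 2, 'b': 1}
print(random_word(hist))   # 'a' about 2/3 of the time
```

With cumulative sums [2, 3], the draws 1 and 2 land on 'a' and 3 lands on 'b', which gives exactly the frequency-proportional behavior Exercise 5 asked for, without building a book-sized list.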
> 寫一個程序,用上面說的算法來從一本書中隨機挑選單詞。[樣例代碼](http://thinkpython2.com/code/analyze_book3.py)。

## 13.8 Markov analysis 馬科夫分析法

If you choose words from the book at random, you can get a sense of the vocabulary, but you probably won't get a sentence:

> 如果從書中隨機挑選單詞,可以感受一下書的詞匯,但這些單詞估計難以組成一句話:

```Python
this the small regard harriet which knightley's it most things
```

A series of random words seldom makes sense because there is no relationship between successive words. For example, in a real sentence you would expect an article like "the" to be followed by an adjective or a noun, and probably not a verb or adverb.

> 一系列隨機詞很少能組成整句的話,因為這些單詞連接起來并沒有什么關系。例如,成句的話中,冠詞 the 后面應該跟著形容詞或者名詞,而不應該是動詞或者副詞。

One way to measure these kinds of relationships is Markov analysis, which characterizes, for a given sequence of words, the probability of the words that might come next. For example, the song Eric, the Half a Bee begins:

> 衡量單詞之間關系的一種方法就是馬科夫分析法:對給定的單詞序列,該方法刻畫接下來可能出現的各個單詞的概率。比如,Eric, the Half a Bee 這首歌的開頭:

```Python
Half a bee, philosophically,
Must, ipso facto, half not be.
But half the bee has got to be
Vis a vis, its entity. D'you see?

But can a bee be said to be
Or not to be an entire bee
When half the bee is not a bee
Due to some ancient injury?
```

In this text, the phrase "half the" is always followed by the word "bee", but the phrase "the bee" might be followed by either "has" or "is".

> 在上面的文本中,『half the』這個詞組后面總是跟著『bee』,但詞組『the bee』后面可以是『has』,也可以是『is』。

The result of Markov analysis is a mapping from each prefix (like "half the" and "the bee") to all possible suffixes (like "has" and "is").

> 馬科夫分析的結果是從每個前綴(比如『half the』和『the bee』)到所有可能的后綴(比如『has』和『is』)的映射。

Given this mapping, you can generate a random text by starting with any prefix and choosing at random from the possible suffixes. Next, you can combine the end of the prefix and the new suffix to form the next prefix, and repeat.
> 有了這一映射,你就可以生成隨機文本了:用任意的前綴開頭,然后從可能的后綴中隨機選一個。接下來把前綴的末尾和新的后綴結合起來,作為新的前綴,然后重復上面的步驟。

For example, if you start with the prefix "Half a", then the next word has to be "bee", because the prefix only appears once in the text. The next prefix is "a bee", so the next suffix might be "philosophically", "be" or "due". In this example the length of the prefix is always two, but you can do Markov analysis with any prefix length.

> 例如,如果用前綴『Half a』來開始,那接下來的單詞就必須是『bee』了,因為這個前綴在文本中只出現了一次。下一個前綴就是『a bee』了,所以接下來的后綴可以是『philosophically』、『be』或者『due』。

> 在這個例子中,前綴的長度總是兩個單詞,但你可以用任意長度的前綴來進行馬科夫分析。

### Exercise 8 練習8

Markov analysis:

> 馬科夫分析:

1. Write a program to read a text from a file and perform Markov analysis. The result should be a dictionary that maps from prefixes to a collection of possible suffixes. The collection might be a list, tuple, or dictionary; it is up to you to make an appropriate choice. You can test your program with prefix length two, but you should write the program in a way that makes it easy to try other lengths.

> 寫一個程序,讀取文件中的文本,然后進行馬科夫分析。結果應該是一個字典,從前綴映射到一個可能后綴組成的集合。這個集合可以是列表、元組,也可以是字典;你自己來選擇合適的類型來寫就好。你可以用兩個單詞長度的前綴來測試你的程序,但應該讓程序能夠兼容其他長度的前綴。

2. Add a function to the previous program to generate random text based on the Markov analysis. Here is an example from Emma with prefix length 2:

> 在上面的程序中添加一個函數,基于馬科夫分析來生成隨機文本。下面是用《艾瑪》使用兩個單詞長度的前綴來生成的一個隨機文本樣例:

```Python
He was very clever, be it sweetness or be angry, ashamed or only amused, at such a stroke. She had never thought of Hannah till you were never meant for me?" "I cannot make speeches, Emma:" he soon cut it all himself.
```

For this example, I left the punctuation attached to the words. The result is almost syntactically correct, but not quite. Semantically, it almost makes sense, but not quite.

> 這個例子中,我保留了與單詞相連的標點符號。得到的結果在語法上基本是正確的,但也還差點。語義上,這些單詞連起來也還能有一些意義,但也不咋對勁。

What happens if you increase the prefix length? Does the random text make more sense?

> 如果增加前綴的單詞長度會怎么樣?隨機文本是不是讀起來更通順呢?

3. Once your program is working, you might want to try a mash-up: if you combine text from two or more books, the random text you generate will blend the vocabulary and phrases from the sources in interesting ways.

> 一旦你的程序能用了,你可以試試混搭一下:如果你把兩本以上的書的文本合并起來,生成的隨機文本就會以很有趣的方式混合多種來源的單詞和短語。

Credit: This case study is based on an example from Kernighan and Pike, The Practice of Programming, Addison-Wesley, 1999.

> 引用:這個案例研究基于 Kernighan 和 Pike 所著、Addison-Wesley 出版社1999年出版的《The Practice of Programming》一書中的一個例子。

You should attempt this exercise before you go on; then you can download my solution from [here](http://thinkpython2.com/code/markov.py). You will also need [the text file of Emma](http://thinkpython2.com/code/emma.txt).

> 你應該自己獨立嘗試一下這些練習,然后再繼續;之后可以下載[我的樣例代碼](http://thinkpython2.com/code/markov.py)。另外你可能需要下載[《艾瑪》這本書的文本文件](http://thinkpython2.com/code/emma.txt)。

## 13.9 Data structures 數據結構

Using Markov analysis to generate random text is fun, but there is also a point to this exercise: data structure selection. In your solution to the previous exercises, you had to choose:

> 使用馬科夫分析來生成隨機文本挺有意思的,但這個練習還有另外一個要點:數據結構的選擇。在前面這些練習中,你必須要選擇以下內容:

* How to represent the prefixes.

> 如何表示前綴。

* How to represent the collection of possible suffixes.

> 如何表示可能后綴的集合。

* How to represent the mapping from each prefix to the collection of possible suffixes.

> 如何表示每個前綴與對應的后綴集合之間的映射。

The last one is easy: a dictionary is the obvious choice for a mapping from keys to corresponding values.

> 最后一個最簡單了:明顯就應該用字典了,這樣來把每個鍵映射到對應的值。

For the prefixes, the most obvious options are string, list of strings, or tuple of strings.

> 前綴的表示,明顯可以使用字符串、字符串列表,或者字符串元組。

For the suffixes, one option is a list; another is a histogram (dictionary).

> 后綴的選擇,要么是用列表,要么就用咱們之前寫過的詞頻統計 histogram(這個也是個字典)。

How should you choose? The first step is to think about the operations you will need to implement for each data structure.
For the prefixes, we need to be able to remove words from the beginning and add to the end. For example, if the current prefix is "Half a", and the next word is "bee", you need to be able to form the next prefix, "a bee".

> 該咋選呢?第一步就是想一下,每種數據結構都需要用到哪些運算。比如對前綴來說,咱們就得能刪掉頭部的單詞,然后在尾部添加新詞。例如,假如現在的前綴是『Half a』,接下來的單詞是『bee』,就得能夠組成下一個前綴,也就是『a bee』。

Your first choice might be a list, since it is easy to add and remove elements, but we also need to be able to use the prefixes as keys in a dictionary, so that rules out lists. With tuples, you can't append or remove, but you can use the addition operator to form a new tuple:

> 你的首選估計就是列表了,因為列表很容易增加和刪除元素,但我們還需要能用前綴做字典中的鍵,所以列表就不合格了。那就剩元組了,元組沒法添加和刪除元素,但可以用加法運算符來建立新的元組:

```Python
def shift(prefix, word):
    return prefix[1:] + (word,)
```

shift takes a tuple of words, prefix, and a string, word, and forms a new tuple that has all the words in prefix except the first, and word added to the end.

> 上面這個 shift 函數,接收一個單詞元組 prefix 和一個字符串 word,然后組成一個新的元組:把原有前綴去掉頭部,再把新單詞拼接到尾部。

For the collection of suffixes, the operations we need to perform include adding a new suffix (or increasing the frequency of an existing one), and choosing a random suffix.

> 對后綴的集合來說,我們需要進行的運算包括添加新的后綴(或者增加一個已有后綴的頻次),然后選擇一個隨機的后綴。

Adding a new suffix is equally easy for the list implementation or the histogram. Choosing a random element from a list is easy; choosing from a histogram is harder to do efficiently (see Exercise 7).

> 添加新后綴,無論是用列表還是用詞頻字典,實現起來都一樣容易。不過從列表中選擇一個隨機元素很容易;而要高效地從詞頻字典中選擇隨機元素就比較難了(參考練習7)。

So far we have been talking mostly about ease of implementation, but there are other factors to consider in choosing data structures. One is run time. Sometimes there is a theoretical reason to expect one data structure to be faster than another; for example, I mentioned that the in operator is faster for dictionaries than for lists, at least when the number of elements is large.
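Under the choices discussed above (tuple prefixes as dictionary keys, a histogram for each collection of suffixes), the analysis step might be sketched like this; markov_analysis and the sample text are hypothetical, not the book's solution:

```Python
def shift(prefix, word):
    # Drop the first word of the prefix and append the new word, as in the text.
    return prefix[1:] + (word,)

def markov_analysis(words, order=2):
    # Map each prefix tuple to a histogram of the words that follow it.
    suffix_map = {}
    prefix = ('',) * order
    for word in words:
        hist = suffix_map.setdefault(prefix, {})
        hist[word] = hist.get(word, 0) + 1   # add suffix or bump its frequency
        prefix = shift(prefix, word)
    return suffix_map

text = 'half a bee philosophically must ipso facto half not be'.split()
suffix_map = markov_analysis(text)
print(suffix_map[('half', 'a')])   # {'bee': 1}
```

Because prefixes are tuples, they work directly as dictionary keys; because each suffix collection is a histogram, repeated suffixes cost no extra storage.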
> 目前為止,我們主要討論了實現的難易程度,但選擇數據結構還要考慮一些其他的因素。其一就是運行時間。有時候會有理論上的原因,使得某種數據結構比其他的快;例如我之前提到過,in 運算符在字典中要比在列表中速度快,最起碼當元素數量增多的時候會體現出來。

But often you don't know ahead of time which implementation will be faster. One option is to implement both of them and see which is better. This approach is called benchmarking. A practical alternative is to choose the data structure that is easiest to implement, and then see if it is fast enough for the intended application. If so, there is no need to go on. If not, there are tools, like the profile module, that can identify the places in a program that take the most time.

> 但一般情況下,咱們不能提前知道哪一種實現方法的速度更快。所以就可以兩種都寫出來,然后運行對比一下,看看到底哪個快。這種方法就叫對比測試(benchmarking)。另外一種實用的方法是選一個實現起來最簡單的數據結構,然后看看運行速度是不是符合應用的要求。如果可以,就不用再改進了。如果速度不夠快,就可以用到一些工具,比如 profile 模塊,來判斷程序的哪些部分消耗了最多的運行時間。

The other factor to consider is storage space. For example, using a histogram for the collection of suffixes might take less space because you only have to store each word once, no matter how many times it appears in the text. In some cases, saving space can also make your program run faster, and in the extreme, your program might not run at all if you run out of memory. But for many applications, space is a secondary consideration after run time.

> 此外還要考慮的一個因素就是存儲空間了。比如用一個詞頻字典作為后綴集合就可能省一些存儲空間,因為無論這些單詞在文中出現了多少次,在該字典中每個單詞只存儲一次。有的情況下,節省空間也能讓你的程序運行更快,而在一些極端情況下,比如內存耗盡了,你的程序就根本無法運行了。不過對大多數應用來說,都是優先考慮運行時間,存儲空間只是次要因素。

One final thought: in this discussion, I have implied that we should use one data structure for both analysis and generation. But since these are separate phases, it would also be possible to use one structure for analysis and then convert to another structure for generation. This would be a net win if the time saved during generation exceeded the time spent in conversion.
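As a small illustration of benchmarking the in claim above, one could time membership tests with the standard timeit module; this is a sketch, and the exact numbers vary by machine:

```Python
import timeit

n = 10000
setup_list = f'container = list(range({n}))'
setup_dict = f'container = dict.fromkeys(range({n}))'
stmt = f'{n - 1} in container'   # worst case for the list: the last element

# Run each membership test 1000 times and compare total times.
t_list = timeit.timeit(stmt, setup=setup_list, number=1000)
t_dict = timeit.timeit(stmt, setup=setup_dict, number=1000)
print(f'list: {t_list:.4f}s  dict: {t_dict:.4f}s')
```

The dictionary wins by a wide margin, because in scans a list element by element but hashes directly into a dictionary.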
> 最后再考慮一下:上文中,我一直暗示咱們在分析和生成兩個階段應該用同一種數據結構。但這二者是分開的步驟,所以也可以分析的時候用一種數據結構,而生成的時候再轉換成另外一種結構。只要生成時節省的時間勝過轉換所花費的時間,相權衡之下依然是劃算的。

## 13.10 Debugging 調試

When you are debugging a program, and especially if you are working on a hard bug, there are five things to try:

> 調試一個程序的時候,尤其是遇到特別難的問題的時候,有以下五個方法值得一試:

* Reading:

> 閱讀代碼:

Examine your code, read it back to yourself, and check that it says what you meant to say.

> 好好檢查代碼,多讀幾次,好好看看代碼所表述的內容是不是跟你的設想相一致。

* Running:

> 運行程序:

Experiment by making changes and running different versions. Often if you display the right thing at the right place in the program, the problem becomes obvious, but sometimes you have to build scaffolding.

> 做一些修改,然后運行各個版本來對比實驗一下。通常來說,只要你在程序對應的位置加上輸出,問題就能比較明確了,不過有時候你還是得搭建一些腳手架代碼來幫忙找錯誤。

* Ruminating:

> 反復思考:

Take some time to think! What kind of error is it: syntax, runtime, or semantic? What information can you get from the error messages, or from the output of the program? What kind of error could cause the problem you're seeing? What did you change last, before the problem appeared?

> 多花點時間去思考!想下到底是什么類型的錯誤:語法,運行,還是語義錯誤?從錯誤信息以及程序的輸出能得到什么信息?想想哪種錯誤能引起你所看到的問題?問題出現之前的那一次你做了什么修改?

* Rubberducking:

> 小黃鴨調試法:

If you explain the problem to someone else, you sometimes find the answer before you finish asking the question. Often you don't need the other person; you could just talk to a rubber duck. And that's the origin of the well-known strategy called rubber duck debugging. I am not making this up; see [here](https://en.wikipedia.org/wiki/Rubber_duck_debugging).

> 如果你對另外一個人解釋問題,你有時候就能在問完問題之前就找到答案。通常你根本不用找另外一個人;跟一個橡膠鴨子說就可以了。這就是很著名的所謂小黃鴨調試法的起源了。我可不是瞎編的哈;看[這里的解釋](https://en.wikipedia.org/wiki/Rubber_duck_debugging)。

* Retreating:

> 以退為進:

At some point, the best thing to do is back off, undoing recent changes, until you get back to a program that works and that you understand. Then you can start rebuilding.
> 有時候,最佳的策略反而就是后撤,取消最近的修改,一直到程序恢復工作,并且你能清楚理解。然后再重頭來改進。

Beginning programmers sometimes get stuck on one of these activities and forget the others. Each activity comes with its own failure mode.

> 新手程序員經常會在上面這些方法中的某一項上卡殼,然后忘了其他的方法。上面的每一種方法都有各自失靈的情況。

For example, reading your code might help if the problem is a typographical error, but not if the problem is a conceptual misunderstanding. If you don't understand what your program does, you can read it 100 times and never see the error, because the error is in your head.

> 比如,問題是個筆誤的時候,閱讀代碼也許有效,但如果問題是概念上的誤解,這就沒啥用了。如果你不理解你程序的功能,你就算讀上一百遍也找不到錯誤,因為錯誤是在你腦中的理解里。

Running experiments can help, especially if you run small, simple tests. But if you run experiments without thinking or reading your code, you might fall into a pattern I call "random walk programming", which is the process of making random changes until the program does the right thing. Needless to say, random walk programming can take a long time.

> 進行試驗會有幫助,尤其是進行小規模的簡單測試的時候。但如果不思考、不閱讀代碼就做試驗,你就可能陷入我稱之為『隨機走路編程』的模式中,也就是隨機做一些修改,一直到程序正常工作為止。毋庸置疑,這種隨機修改肯定得浪費好多時間。

You have to take time to think. Debugging is like an experimental science. You should have at least one hypothesis about what the problem is. If there are two or more possibilities, try to think of a test that would eliminate one of them.

> 一定要花時間去思考。調試就像是一種實驗科學。你至少應該對問題的本質有一種假設。如果有兩種或者兩種以上的可能性,就要設計一個測試,來逐個排除可能性。

But even the best debugging techniques will fail if there are too many errors, or if the code you are trying to fix is too big and complicated. Sometimes the best option is to retreat, simplifying the program until you get to something that works and that you understand.

> 然而一旦錯誤特別多,或者要修改的代碼太大太復雜,再好的調試技術也不管用了。所以有時候最好的方法就是以退為進,簡化一下程序,直到能工作了,并且你能理解整個程序了為止。

Beginning programmers are often reluctant to retreat because they can't stand to delete a line of code (even if it's wrong). If it makes you feel better, copy your program into another file before you start stripping it down. Then you can copy the pieces back one at a time.

> 新手程序員經常不愿意后撤,因為他們不情愿刪掉一行代碼(哪怕是錯誤的代碼)。可以這樣:先復制整個程序到另外一個文件中做個備份,然后再刪減,這樣是不是感覺好些。之后你可以再一點點地把刪掉的代碼復制回來。

Finding a hard bug requires reading, running, ruminating, and sometimes retreating. If you get stuck on one of these activities, try the others.

> 找到一個困難問題的解決方法,需要閱讀、運行、反復思考,有時候還要后撤。如果你在某一步驟中卡住了,試試其他方法。

## 13.11 Glossary 術語列表

deterministic: Pertaining to a program that does the same thing each time it runs, given the same inputs.

> 確定性:給定同樣的輸入,程序每次運行結果都相同。

pseudorandom: Pertaining to a sequence of numbers that appears to be random, but is generated by a deterministic program.

> 假隨機數:一段數字序列中的數,看上去似乎是隨機的,但實際上也是由確定性的程序來生成的。

default value: The value given to an optional parameter if no argument is provided.

> 默認值:如果不對可選參數進行賦值的話,該參數所使用的預設值。

override: To replace a default value with an argument.

> 覆蓋:調用函數的時候給可選參數提供了參數值,這個參數值就覆蓋掉默認值。

benchmarking: The process of choosing between data structures by implementing alternatives and testing them on a sample of the possible inputs.

> 對比測試:把備選的數據結構都實現出來,然后用可能輸入的樣本進行測試,以此來選擇數據結構的過程。

rubber duck debugging: Debugging by explaining your problem to an inanimate object such as a rubber duck. Articulating the problem can help you solve it, even if the rubber duck doesn't know Python.

> 小黃鴨調試法:對一個無生命的對象(比如小黃鴨之類的)解釋你的問題,以此來調試。把問題描述清楚很有助于解決問題,所以雖然小黃鴨并不懂 Python 也不要緊。

## 13.12 Exercises 練習

### Exercise 9 練習9

The "rank" of a word is its position in a list of words sorted by frequency: the most common word has rank 1, the second most common has rank 2, etc.

> 單詞的『排名』就是在一個按照頻率排序的單詞列表中的位置:最常見的單詞排名第一,第二常見的排第二,依此類推。

[Zipf's law](http://en.wikipedia.org/wiki/Zipf's_law) describes a relationship between the ranks and frequencies of words in natural languages. Specifically, it predicts that the frequency, f, of the word with rank r is:

> [Zipf定律](http://en.wikipedia.org/wiki/Zipf's_law)描述了自然語言中排名和頻率的關系。該定律預言了排名 r 與詞頻 f 之間的關系如下:

```
f = c r^{-s}
```

where s and c are parameters that depend on the language and the text.
If you take the logarithm of both sides of this equation, you get:

> 這里的 s 和 c 都是參數,依據語言和文本而定。如果對等式兩邊同時取對數,得到如下公式:

```
log f = log c - s log r
```

> (譯者注:Zipf定律是美國學者G.K.齊普夫提出的。可以表述為:在自然語言的語料庫里,一個單詞出現的頻率與它在頻率表里的排名成反比。)

So if you plot log f versus log r, you should get a straight line with slope -s and intercept log c.

> 因此如果將 log f 和 log r 投到二維坐標系中,就應該得到一條直線,斜率是 -s,截距是 log c。

Write a program that reads a text from a file, counts word frequencies, and prints one line for each word, in descending order of frequency, with log f and log r.

> 寫一個程序,從一個文件中讀取文本,統計單詞頻率,然后每個單詞一行來輸出,按照詞頻的降序,同時輸出一下 log f 和 log r。

Use the graphing program of your choice to plot the results and check whether they form a straight line. Can you estimate the value of s?

> 選一種投圖程序,把結果進行投圖,然后檢查一下是否為一條直線。能否估計一下 s 的值呢?

[Solution](http://thinkpython2.com/code/zipf.py). To run my solution, you need the plotting module matplotlib. If you installed Anaconda, you already have matplotlib; otherwise you might have to install it.

> [樣例代碼](http://thinkpython2.com/code/zipf.py)。要運行這個代碼,你需要有繪圖模塊 matplotlib。如果你安裝了 Anaconda,就已經有 matplotlib 了;否則你就可能需要安裝一下。

> (譯者注:matplotlib 的安裝方法有很多,比如 pip install matplotlib 或者 easy_install -U matplotlib)
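A sketch of the rank/frequency computation from Exercise 9, without the plotting step; rank_freq and the miniature histogram are made up for illustration:

```Python
import math

def rank_freq(hist):
    # Frequencies in descending order, paired with ranks 1, 2, 3, ...
    freqs = sorted(hist.values(), reverse=True)
    return [(rank, freq) for rank, freq in enumerate(freqs, start=1)]

hist = {'to': 5242, 'the': 5205, 'and': 4897}   # stand-in word counts
for r, f in rank_freq(hist):
    print(math.log(r), math.log(f))
```

Feeding these (log r, log f) pairs to a plotting tool, one point per word of a real book, should produce the roughly straight line that Zipf's law predicts, with slope -s.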