Regular Expressions · UCB DS100 Principles and Techniques of Data Science

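# Regular Expressions

> Original: [Regular Expressions](https://www.textbook.ds100.org/ch/08/text_regex.html)
>
> Proofread by: [Kitty Du](https://github.com/miaoxiaozui2017)
>
> Proudly powered by [Google Translate](https://translate.google.cn/)

```python
# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/08'))
```

In this section we introduce regular expressions, an important tool for specifying patterns in strings.

## Motivation

In a larger body of text, many useful substrings appear in a specific format. For example, the sentence below contains a U.S. phone number.

`"give me a call, my number is 123-456-7890."`

The phone number follows this pattern:

1. Three digits
2. Followed by a dash
3. Followed by three digits
4. Followed by a dash
5. Followed by four digits

Given a piece of free-form text, we naturally want to detect and extract the phone numbers. We may also want to extract specific pieces of a phone number; for example, by extracting the area code we can infer the location of the people mentioned in the text.

To detect whether a string contains a phone number, we might try writing a method like the following:

```python
def is_phone_number(string):

    digits = '0123456789'

    def is_not_digit(token):
        return token not in digits

    # Three numbers
    for i in range(3):
        if is_not_digit(string[i]):
            return False

    # Followed by a dash
    if string[3] != '-':
        return False

    # Followed by three numbers
    for i in range(4, 7):
        if is_not_digit(string[i]):
            return False

    # Followed by a dash
    if string[7] != '-':
        return False

    # Followed by four numbers
    for i in range(8, 12):
        if is_not_digit(string[i]):
            return False

    return True
```

```python
is_phone_number("382-384-3840")
```

> True

```python
is_phone_number("phone number")
```

> False

The code above is unpleasant and verbose. Rather than looping over the characters of the string by hand, we would prefer to specify a pattern and tell Python to match it.

**Regular expressions** (often abbreviated **regex**) solve exactly this problem by letting us create general patterns for strings. Using a regular expression, we can re-implement the `is_phone_number` method in just two lines of Python:

```python
import re

def is_phone_number(string):
    regex = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
    return re.search(regex, string) is not None

is_phone_number("382-384-3840")
```

> True

In the code above, we use the regex `[0-9]{3}-[0-9]{3}-[0-9]{4}` to match phone numbers. Although cryptic at first glance, the syntax of regular expressions is fortunately much simpler than the Python language itself; we cover nearly all of it in this section alone.

We will also introduce `re`, the built-in Python module that performs string operations using regexes.

## Regex Syntax

We start with the syntax of regular expressions. In Python, regular expressions are most commonly stored as raw strings. Raw strings behave like normal Python strings, except that backslashes receive no special handling.

For example, to store the string `hello \ world` in a normal Python string, we must write:

```python
# Backslashes need to be escaped in normal Python strings
some_string = 'hello \\ world'
print(some_string)
```

> hello \ world

Using a raw string removes the need to escape the backslash:

```python
# Note the `r` prefix on the string
some_raw_string = r'hello \ world'
print(some_raw_string)
```

> hello \ world

Since backslashes appear frequently in regular expressions, we will use raw strings for all regexes in this section.

### Literals

A **literal** character in a regular expression matches the character itself. For example, the regex `r"a"` matches any `"a"` in `"Say! I like green eggs and ham!"`. All alphanumeric characters and most punctuation characters are regex literals.

```python
# HIDDEN
def show_regex_match(text, regex):
    """
    Prints the string with the regex match highlighted.
    """
    print(re.sub(f'({regex})', r'\033[1;30;43m\1\033[m', text))
```

```python
# The show_regex_match method highlights all regex matches in the input string
regex = r"green"
show_regex_match("Say! I like green eggs and ham!", regex)
```

> Say! I like **green** eggs and ham!

```python
show_regex_match("Say! I like green eggs and ham!", r"a")
```

> S**a**y! I like green eggs **a**nd h**a**m!

In the examples above, we observe that regular expressions can match patterns that appear anywhere in the input string. In Python, this behavior depends on the method used to match the regex: some methods only return a match if the regex appears at the start of the string, while others return a match anywhere in the string.

Also note that the `show_regex_match` method highlights every occurrence of the regex in the input string. Again, this depends on the Python method used: some methods return all matches, while others return only the first match.

Regular expressions are case-sensitive. In the example below, the regex matches only the lowercase `s` in `eggs`, not the uppercase `S` in `Say`.

```python
show_regex_match("Say! I like green eggs and ham!", r"s")
```

> Say! I like green egg**s** and ham!

### Wildcard Character

Some characters have special meaning in a regular expression. These meta characters allow regexes to match a variety of patterns.

In a regular expression, the period character `.` matches any character except a newline.

```python
show_regex_match("Call me at 382-384-3840.", r".all")
```

> **Call** me at 382-384-3840.

To match only the literal period character, we must escape it with a backslash:

```python
show_regex_match("Call me at 382-384-3840.", r"\.")
```

> Call me at 382-384-3840**.**

By using the period character to stand in for the parts of a pattern that vary, we can construct a regex that matches phone numbers. For example, we can take our original phone number `382-384-3840`, replace each digit with `.`, and leave the dashes as literals. The resulting regex is `...-...-....`.

```python
show_regex_match("Call me at 382-384-3840.", "...-...-....")
```

> Call me at **382-384-3840**.

However, since the period character matches all characters, the following input string produces a spurious match.

```python
show_regex_match("My truck is not-all-blue.", "...-...-....")
```

> My truck is **not-all-blue**.
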
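To make the newline exception for the wildcard concrete, here is a minimal sketch that calls `re.search` and `re.findall` directly instead of the `show_regex_match` helper. The sample strings and variable names are made up for illustration, and `re.DOTALL` is a flag from Python's `re` module that this section does not otherwise cover.

```python
import re

# The period matches any single character, including punctuation...
matches_letter = re.search(r"a.c", "abc") is not None
matches_dash = re.search(r"a.c", "a-c") is not None

# ...but by default it does not match a newline character.
matches_newline = re.search(r"a.c", "a\nc") is not None

# With the re.DOTALL flag, the period matches a newline as well.
matches_newline_dotall = re.search(r"a.c", "a\nc", re.DOTALL) is not None

# An escaped period matches only literal periods.
literal_periods = re.findall(r"\.", "v1.2.3")

print(matches_letter, matches_dash)             # True True
print(matches_newline, matches_newline_dotall)  # False True
print(literal_periods)                          # ['.', '.']
```
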
### Character Classes

A **character class** matches a specified set of characters, allowing us to create matches that are more restrictive than the `.` character alone. To create a character class, wrap the set of desired characters in brackets `[ ]`.

```python
show_regex_match("I like your gray shirt.", "gr[ae]y")
```

> I like your **gray** shirt.

```python
show_regex_match("I like your grey shirt.", "gr[ae]y")
```

> I like your **grey** shirt.

```python
# Does not match; a character class only matches one character from a set
show_regex_match("I like your graey shirt.", "gr[ae]y")
```

> I like your graey shirt.

```python
# In this example, repeating the character class will match
show_regex_match("I like your graey shirt.", "gr[ae][ae]y")
```

> I like your **graey** shirt.

Inside a character class, the `.` character is treated as a literal, not as a wildcard.

```python
show_regex_match("I like your grey shirt.", "irt[.]")
```

> I like your grey sh**irt.**

For commonly used character classes, there are a few special shorthand notations:

| Shorthand | Meaning |
| ---: | ---: |
| [0-9] | All the digits |
| [a-z] | Lowercase letters |
| [A-Z] | Uppercase letters |

```python
show_regex_match("I like your gray shirt.", "y[a-z]y")
```

> I like your gray shirt.

Character classes allow us to create a more specific regex for phone numbers.

```python
# We replaced every `.` character in ...-...-.... with [0-9] to restrict
# matches to digits.
phone_regex = r'[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
show_regex_match("Call me at 382-384-3840.", phone_regex)
```

> Call me at **382-384-3840**.

```python
# Now we no longer match this string:
show_regex_match("My truck is not-all-blue.", phone_regex)
```

> My truck is not-all-blue.

### Negated Character Classes

A **negated character class** matches any character **except** the characters in the class. To create a negated character class, wrap the excluded characters in `[^ ]`.

```python
show_regex_match("The car parked in the garage.", r"[^c]ar")
```

> The car **par**ked in the **gar**age.

### Quantifiers

To create a regex that matches phone numbers, we wrote:

```
[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
```

This matches 3 digits, a dash, 3 more digits, a dash, and 4 more digits.

Quantifiers allow us to match multiple consecutive occurrences of a pattern. We specify the number of repetitions by placing the number in curly braces `{ }`.

```python
phone_regex = r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
show_regex_match("Call me at 382-384-3840.", phone_regex)
```

> Call me at **382-384-3840**.

```python
# No match
phone_regex = r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
show_regex_match("Call me at 12-384-3840.", phone_regex)
```

> Call me at 12-384-3840.

A quantifier always modifies the character or character class immediately to its left. The following table shows the complete syntax for quantifiers.

| Quantifier | Meaning |
| ---: | ---: |
| {m,n} | Match the preceding character m to n times. |
| {m} | Match the preceding character exactly m times. |
| {m,} | Match the preceding character at least m times. |
| {,n} | Match the preceding character at most n times. |

**Shorthand Quantifiers**

Some commonly used quantifiers have a shorthand:

| Symbol | Quantifier | Meaning |
| ---: | ---: | ---: |
| * | {0,} | Match the preceding character 0 or more times |
| + | {1,} | Match the preceding character 1 or more times |
| ? | {0,1} | Match the preceding character 0 or 1 times |

In the examples below, we use the `*` character instead of `{0,}`.

```python
# 3 a's
show_regex_match('He screamed "Aaaah!" as the cart took a plunge.', "Aa*h!")
```

> He screamed "**Aaaah!**" as the cart took a plunge.

```python
# Lots of a's
show_regex_match(
    'He screamed "Aaaaaaaaaaaaaaaaaaaah!" as the cart took a plunge.',
    "Aa*h!"
)
```

> He screamed "**Aaaaaaaaaaaaaaaaaaaah!**" as the cart took a plunge.

```python
# No lowercase a's
show_regex_match('He screamed "Ah!" as the cart took a plunge.', "Aa*h!")
```

> He screamed "**Ah!**" as the cart took a plunge.

### Quantifiers are greedy

Quantifiers always return the longest match possible. This sometimes leads to surprising behavior:

```python
# We tried to match 311 and 911 but matched the ` and ` as well because
# `<311> and <911>` is the longest match possible for `<.+>`.
show_regex_match("Remember the numbers <311> and <911>", "<.+>")
```

> Remember the numbers **<311> and <911>**

In many cases, using a more specific character class prevents these false matches:

```python
show_regex_match("Remember the numbers <311> and <911>", "<[0-9]+>")
```

> Remember the numbers **<311>** and **<911>**

### Anchoring

Sometimes a pattern should only match at the beginning or end of a string. The special character `^` anchors the regex so that it matches only if the pattern appears at the start of the string; the special character `$` anchors the regex so that it matches only if the pattern appears at the end of the string. For example, the regex `well$` matches only the `well` at the end of the string.

```python
show_regex_match('well, well, well', r"well$")
```

> well, well, **well**

Using both `^` and `$` requires the regex to match the entire string.

```python
phone_regex = r"^[0-9]{3}-[0-9]{3}-[0-9]{4}$"
show_regex_match('382-384-3840', phone_regex)
```

> **382-384-3840**

```python
# No match
show_regex_match('You can call me at 382-384-3840.', phone_regex)
```

> You can call me at 382-384-3840.

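Anchors interact with the choice of matching function in Python's `re` module, a point the Literals subsection touched on when it noted that some methods match only at the start of a string and some return only the first match. The sketch below compares `re.search`, `re.match`, `re.fullmatch`, and `re.findall` on an unanchored phone regex; the sample sentence and variable names are made up for illustration.

```python
import re

phone_regex = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text = "Call 382-384-3840 or 123-456-7890."

# re.search returns the first match found anywhere in the string.
first = re.search(phone_regex, text)
print(first.group())   # 382-384-3840

# re.match only succeeds if the pattern matches at the start of the string,
# so it returns None here because the text starts with "Call".
at_start = re.match(phone_regex, text)
print(at_start)        # None

# re.fullmatch only succeeds if the pattern matches the entire string.
whole = re.fullmatch(phone_regex, "382-384-3840")
print(whole.group())   # 382-384-3840

# re.findall returns every non-overlapping match as a list of strings.
all_numbers = re.findall(phone_regex, text)
print(all_numbers)     # ['382-384-3840', '123-456-7890']
```
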
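### Escaping Meta Characters

All regex meta characters have a special meaning in a regular expression. To match a meta character as a literal, we escape it with the `\` character.

```python
# `[` is a meta character and requires escaping
show_regex_match("Call me at [382-384-3840].", r"\[")
```

> Call me at **\[**382-384-3840].

```python
# `.` is a meta character and requires escaping
show_regex_match("Call me at [382-384-3840].", r"\.")
```

> Call me at [382-384-3840]**.**

## Reference Tables

We have now covered the most important pieces of regex syntax and meta characters. For a more complete reference, we include the tables below.

**Meta Characters**

This table contains most of the important *meta characters*, which help us specify the patterns we want to match in a string.

| Char | Description | Example | Matches | Doesn't Match |
| ---: | ---: | ---: | ---: | ---: |
| . | Any character except \n | `...` | abc | ab abcd |
| [] | Any character inside the brackets | `[cb.]ar` | car .ar | jar |
| [^ ] | Any character *not* inside the brackets | `[^b]ar` | car par | bar ar |
| * | ≥ 0 of the previous character | `[pb]*ark` | bbark ark | dark |
| + | ≥ 1 of the previous character | `[pb]+ark` | bbpark bark | dark ark |
| ? | 0 or 1 of the previous character | `s?he` | she he | the |
| {n} | Exactly n of the previous character | `hello{3}` | hellooo | hello |
| \| | The pattern before \| or the pattern after \| | `we\|[ui]s` | we us is | e s |
| \ | Escapes the next character | `\[hi\]` | [hi] | hi |
| ^ | Beginning of line | `^ark` | ark&ensp;two | dark |
| $ | End of line | `ark$` | noahs&ensp;ark | noahs&ensp;arks |

**Shorthand Character Sets**

Some commonly used character sets have shorthands.

| Description | Bracket Form | Shorthand |
| ---: | ---: | ---: |
| Alphanumeric character or underscore | `[a-zA-Z0-9_]` | `\w` |
| Not an alphanumeric character or underscore | `[^a-zA-Z0-9_]` | `\W` |
| Digit | `[0-9]` | `\d` |
| Not a digit | `[^0-9]` | `\D` |
| Whitespace | `[\t\n\f\r\p{Z}]` | `\s` |
| Not whitespace | `[^\t\n\f\r\p{Z}]` | `\S` |

## Summary

Nearly every programming language has a library for matching patterns with regular expressions, which makes them useful regardless of the specific language. In this section, we introduced regex syntax and the most useful meta characters.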
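As a closing sketch, the shorthand character sets from the reference tables can be tried out against the phone-number regex used throughout this section. The sample text and variable names below are made up for illustration. One caveat: for ordinary `str` patterns in Python 3, `\d` also accepts digits from other scripts, so the two forms agree here only because the sample text contains ASCII digits alone; passing `re.ASCII` restricts `\d` to `[0-9]`.

```python
import re

text = "Call me at 382-384-3840 or 123-456-7890."

# The bracket form used throughout this section...
long_form = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
# ...and the same pattern written with the \d shorthand.
short_form = r"\d{3}-\d{3}-\d{4}"

long_matches = re.findall(long_form, text)
short_matches = re.findall(short_form, text)
print(short_matches)                  # ['382-384-3840', '123-456-7890']
print(long_matches == short_matches)  # True

# \w+ matches runs of word characters; \s matches the whitespace between them.
words = re.findall(r"\w+", "green eggs\tand ham")
print(words)                          # ['green', 'eggs', 'and', 'ham']
```
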