Python regular expressions · ZetCode Chinese tutorial series

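# Python regular expressions

> Original: [http://zetcode.com/python/regularexpressions/](http://zetcode.com/python/regularexpressions/)

The Python regular expressions tutorial shows how to use regular expressions in Python. For regular expressions in Python, we use the `re` module.

Regular expressions are used for text searching and more advanced text manipulation. They are built into tools such as `grep` and `sed`, text editors such as vi and emacs, and programming languages such as Tcl, Perl, and Python.

## Python `re` module

In Python, the `re` module provides regular expression matching operations.

A pattern is a regular expression that defines the text we are searching for or manipulating. It consists of text literals and metacharacters. The pattern is compiled with the `compile()` function. Because regular expressions often contain special characters, it is recommended to use raw strings. (Raw strings are prefixed with the `r` character.) This way, the characters are not interpreted before they are compiled into a pattern.

After a pattern is compiled, we can use one of the functions to apply it to a text string. The functions include `match()`, `search()`, `findall()`, and `finditer()`.

The following table shows some regular expressions:

| Regex | Meaning |
| --- | --- |
| `.` | Matches any single character (except a newline, by default). |
| `?` | Matches the preceding element once or not at all. |
| `+` | Matches the preceding element one or more times. |
| `*` | Matches the preceding element zero or more times. |
| `^` | Matches the starting position within the string. |
| `$` | Matches the ending position within the string. |
| <code>&#124;</code> | Alternation operator. |
| `[abc]` | Matches `a` or `b` or `c`. |
| `[a-c]` | Range; matches `a` or `b` or `c`. |
| `[^abc]` | Negation; matches everything except `a`, `b`, or `c`. |
| `\s` | Matches a whitespace character. |
| `\w` | Matches a word character; for ASCII text, equivalent to `[a-zA-Z0-9_]`. |

## Match function

The following code example demonstrates the usage of a simple regular expression in Python.

`match_fun.py`

```py
#!/usr/bin/python3

import re

words = ('book', 'bookworm', 'Bible',
    'bookish', 'cookbook', 'bookstore', 'pocketbook')

pattern = re.compile(r'book')

for word in words:
    if re.match(pattern, word):
        print('The {} matches'.format(word))
```

In the example, we have a tuple of words. The compiled pattern looks for the `"book"` string in each of the words.

```py
pattern = re.compile(r'book')
```

With the `compile()` function, we create a pattern. The regular expression is a raw string and consists of four ordinary characters.

```py
for word in words:
    if re.match(pattern, word):
        print('The {} matches'.format(word))
```

We go through the tuple and call the `match()` function, which applies the pattern to the word. The `match()` function returns a match object if there is a match at the beginning of the string, and `None` otherwise.

```py
$ ./match_fun.py
The book matches
The bookworm matches
The bookish matches
The bookstore matches
```

Four of the words in the tuple match the pattern. Note that the words that do not start with `"book"` do not match. To include these words as well, we use the `search()` function.

## Search function

The `search()` function looks for the first location where the regular expression pattern produces a match.

`search_fun.py`

```py
#!/usr/bin/python3

import re

words = ('book', 'bookworm', 'Bible',
    'bookish', 'cookbook', 'bookstore', 'pocketbook')

pattern = re.compile(r'book')

for word in words:
    if re.search(pattern, word):
        print('The {} matches'.format(word))
```

In the example, we use the `search()` function to look for the `"book"` term.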
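The difference between the matching functions can be checked directly. The following is a minimal sketch, not part of the original tutorial: the `classify` helper is a hypothetical name introduced here for illustration. It reports which of `match()`, `search()`, and `fullmatch()` succeed for a few of the words.

```python
import re

pattern = re.compile(r'book')

def classify(word):
    """Report which matching functions succeed for word (illustrative helper)."""
    return {
        'match': pattern.match(word) is not None,          # must match at the start
        'search': pattern.search(word) is not None,        # may match anywhere
        'fullmatch': pattern.fullmatch(word) is not None,  # must match the whole string
    }

for word in ('book', 'bookworm', 'cookbook'):
    print(word, classify(word))
```

Here `cookbook` fails `match()` but passes `search()`, which is why `search_fun.py` reports more words than `match_fun.py`.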
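```py
$ ./search_fun.py
The book matches
The bookworm matches
The bookish matches
The cookbook matches
The bookstore matches
The pocketbook matches
```

This time the words cookbook and pocketbook are included as well.

## Dot metacharacter

The dot (`.`) metacharacter stands for any single character in the text.

`dot_meta.py`

```py
#!/usr/bin/python3

import re

words = ('seven', 'even', 'prevent', 'revenge', 'maven',
    'eleven', 'amen', 'event')

pattern = re.compile(r'.even')

for word in words:
    if re.match(pattern, word):
        print('The {} matches'.format(word))
```

In the example, we have a tuple of eight words. We apply a pattern containing the dot metacharacter to each of the words.

```py
pattern = re.compile(r'.even')
```

The dot stands for any single character in the text. The character must be present.

```py
$ ./dot_meta.py
The seven matches
The revenge matches
```

Two words match the pattern: seven and revenge.

## Question mark metacharacter

The question mark (`?`) metacharacter is a quantifier that matches the previous element zero or one time.

`question_mark_meta.py`

```py
#!/usr/bin/python3

import re

words = ('seven', 'even', 'prevent', 'revenge', 'maven',
    'eleven', 'amen', 'event')

pattern = re.compile(r'.?even')

for word in words:
    if re.match(pattern, word):
        print('The {} matches'.format(word))
```

In the example, we add a question mark after the dot character. This means that the pattern allows either one arbitrary character or no character at that position.

```py
$ ./question_mark_meta.py
The seven matches
The even matches
The revenge matches
The event matches
```

This time, in addition to seven and revenge, the words even and event match as well.

## Anchors

Anchors match positions of characters inside a given text. When the `^` anchor is used, the match must occur at the beginning of the string; when the `$` anchor is used, the match must occur at the end of the string.

`anchors.py`

```py
#!/usr/bin/python3

import re

sentences = ('I am looking for Jane.',
    'Jane was walking along the river.',
    'Kate and Jane are close friends.')

pattern = re.compile(r'^Jane')

for sentence in sentences:
    if re.search(pattern, sentence):
        print(sentence)
```

In the example, we have three sentences. The search pattern is `^Jane`. The pattern checks whether the `"Jane"` string is located at the beginning of the text. The pattern `Jane\.$` would look for `"Jane"` at the end of a sentence.

## `fullmatch`

An exact match can be performed with the `fullmatch()` function or by placing the term between the anchors `^` and `$`.

`exact_match.py`

```py
#!/usr/bin/python3

import re

words = ('book', 'bookworm', 'Bible',
    'bookish', 'cookbook', 'bookstore', 'pocketbook')

pattern = re.compile(r'^book$')

for word in words:
    if re.search(pattern, word):
        print('The {} matches'.format(word))
```

In the example, we look for an exact match of the `"book"` term, using the anchor approach.

```py
$ ./exact_match.py
The book matches
```

This is the output.

## Character classes

A character class defines a set of characters, any one of which can occur in an input string for a match to succeed.

`character_class.py`

```py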
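#!/usr/bin/python3

import re

words = ('a gray bird', 'grey hair', 'great look')

pattern = re.compile(r'gr[ea]y')

for word in words:
    if re.search(pattern, word):
        print('{} matches'.format(word))
```

In the example, we use a character class to cover both spellings, gray and grey.

```py
pattern = re.compile(r'gr[ea]y')
```

The `[ea]` class allows either the 'e' or the 'a' character in the pattern.

## Named character classes

There are some predefined character classes. For ASCII text, `\s` matches a whitespace character `[ \t\n\r\f\v]`, `\d` matches a digit `[0-9]`, and `\w` matches a word character `[a-zA-Z0-9_]`.

`named_character_class.py`

```py
#!/usr/bin/python3

import re

text = 'We met in 2013. She must be now about 27 years old.'

pattern = re.compile(r'\d+')

found = re.findall(pattern, text)

if found:
    print('There are {} numbers'.format(len(found)))
```

In the example, we count the numbers in the text.

```py
pattern = re.compile(r'\d+')
```

The `\d+` pattern looks for runs of one or more digits in the text.

```py
found = re.findall(pattern, text)
```

With the `findall()` method, we look up all the numbers in the text.

```py
$ ./named_character_class.py
There are 2 numbers
```

This is the output.

## Case insensitive match

By default, pattern matching is case sensitive. By passing `re.IGNORECASE` to the `compile()` function, we can make it case insensitive.

`case_insensitive.py`

```py
#!/usr/bin/python3

import re

words = ('dog', 'Dog', 'DOG', 'Doggy')

pattern = re.compile(r'dog', re.IGNORECASE)

for word in words:
    if re.match(pattern, word):
        print('{} matches'.format(word))
```

In the example, we apply the pattern to the words regardless of case.

```py
$ ./case_insensitive.py
dog matches
Dog matches
DOG matches
Doggy matches
```

All four words match the pattern.

## Alternations

The alternation operator `|` creates a regular expression with several choices.

`alternations.py`

```py
#!/usr/bin/python3

import re

words = ("Jane", "Thomas", "Robert",
    "Lucy", "Beky", "John", "Peter", "Andy")

pattern = re.compile(r'Jane|Beky|Robert')

for word in words:
    if re.match(pattern, word):
        print(word)
```

We have eight names in the tuple.

```py
pattern = re.compile(r'Jane|Beky|Robert')
```

This regular expression looks for the `"Jane"`, `"Beky"`, or `"Robert"` strings.

## Finditer method

The `finditer()` method returns an iterator yielding match objects over all non-overlapping matches for the pattern in a string.

`find_iter.py`

```py
#!/usr/bin/python3

import re

text = ('I saw a fox in the wood. The fox had red fur.')

pattern = re.compile(r'fox')

found = re.finditer(pattern, text)

for item in found:

    s = item.start()
    e = item.end()
    print('Found {} at {}:{}'.format(text[s:e], s, e))
```

In the example, we search for the `"fox"` term in the text. We go over the iterator of the found matches and print them with their indexes.

```py
s = item.start()
e = item.end()
```

The `start()` and `end()` methods return the starting and ending index, respectively.

```py
$ ./find_iter.py
Found fox at 8:11
Found fox at 29:32
```

This is the output.

## Capturing groups

Capturing groups are a way to treat multiple characters as a single unit. They are created by placing characters inside a set of round brackets. For instance, `(book)` is a single group containing the characters `'b', 'o', 'o', 'k'`.

The capturing groups technique allows us to find out which parts of a string match the regular pattern.

`capturing_groups.py`

```py
#!/usr/bin/python3

import re

content = '''<p>The <code>Pattern</code> is a compiled
representation of a regular expression.</p>'''

pattern = re.compile(r'(</?[a-z]*>)')

found = re.findall(pattern, content)

for tag in found:
    print(tag)
```

The code example prints all HTML tags from the supplied string by capturing a group of characters.

```py
found = re.findall(pattern, content)
```

In order to find all tags, we use the `findall()` method.

```py
$ ./capturing_groups.py
<p>
<code>
</code>
</p>
```

We have found four HTML tags.

## Python regex email example

In the following example, we create a regex pattern for checking email addresses.

`emails.py`

```py
#!/usr/bin/python3

import re

emails = ("luke@gmail.com", "andy@yahoocom",
    "34234sdfa#2345", "f344@gmail.com")

pattern = re.compile(r'^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,18}$')

for email in emails:

    if re.match(pattern, email):
        print("{} matches".format(email))
    else:
        print("{} does not match".format(email))
```

This example provides one possible solution.

```py
pattern = re.compile(r'^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,18}$')
```

The leading `^` and the trailing `$` characters provide an exact pattern match. No characters are allowed before or after the pattern. The email is divided into five parts.

The first part is the local part. This is usually the name of a company, an individual, or a nickname. The `[a-zA-Z0-9._-]+` class lists all the characters that we can use in the local part. They can be used one or more times.

The second part consists of the literal `@` character. The third part is the domain part. It is usually the domain name of the email provider, such as yahoo or gmail. The `[a-zA-Z0-9-]+` is a character class providing all the characters that can be used in the domain name. The `+` quantifier allows one or more of these characters.

The fourth part is the dot character. It is preceded by the escape character (`\`) to get a literal dot.

The final part is the top-level domain: `[a-zA-Z.]{2,18}`. Top-level domains can have from 2 to 18 characters, such as `sk, net, info, travel, cleaning, travelinsurance`. The maximum length can be 63 characters, but most domains are shorter than 18 characters today. The class also contains a dot character, because some domain endings consist of two parts, for instance `co.uk`.

```py
$ ./emails.py
luke@gmail.com matches
andy@yahoocom does not match
34234sdfa#2345 does not match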
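f344@gmail.com matches
```

This is the output.

In this tutorial, we have covered regular expressions in Python.

You might also be interested in the following related tutorials: [Python CSV tutorial](/python/csv/) and [Python tutorial](/lang/python/).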
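The capturing groups technique can also be combined with the email pattern to pull out the parts described above. The following is a minimal sketch under stated assumptions: the group names `local`, `domain`, and `tld` and the `split_email` helper are hypothetical names introduced here, and `fullmatch()` is used in place of the `^` and `$` anchors. The character classes are the ones from the tutorial's pattern.

```python
import re

# Named groups (local, domain, tld) are illustrative additions;
# the character classes are taken from the tutorial's email pattern.
EMAIL = re.compile(
    r'(?P<local>[a-zA-Z0-9._-]+)'
    r'@'
    r'(?P<domain>[a-zA-Z0-9-]+)'
    r'\.'
    r'(?P<tld>[a-zA-Z.]{2,18})'
)

def split_email(address):
    """Return the parts of address as a dict, or None if it does not match."""
    m = EMAIL.fullmatch(address)  # the whole string must match, like ^...$
    return m.groupdict() if m else None

print(split_email('luke@gmail.com'))
print(split_email('andy@yahoocom'))
```

Like the original pattern, this is one possible check and not a complete validator for every valid email address.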