# Regex in Python and pandas
> Original: [Regex in Python and pandas](https://www.textbook.ds100.org/ch/08/text_re.html)
>
> Proofread by: [Kitty Du](https://github.com/miaoxiaozui2017)
```python
# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/08'))
```
```python
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
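# 'display.precision' is the full option name; the bare 'precision' alias is ambiguous in newer pandas
pd.set_option('display.precision', 2)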
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
```
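In this section, we introduce regex usage in Python's built-in `re` module. Since we cover only a few of the most commonly used methods, you will find it useful to consult [the official documentation on the `re` module](https://docs.python.org/3/library/re.html) as well.
#### `re.search`
`re.search(pattern, string)` searches for a match of the regex `pattern` anywhere in `string`. It returns a truthy match object if the pattern is found and `None` if not.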
```python
import re

phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text = "Call me at 382-384-3840."
match = re.search(phone_re, text)
match
```
```
<_sre.SRE_Match object; span=(11, 23), match='382-384-3840'>
```
Although the returned match object has a variety of useful attributes, we most commonly use `re.search` to test whether a pattern appears in a string.
```python
if re.search(phone_re, text):
    print("Found a match!")
```
```
Found a match!
```
```python
if re.search(phone_re, 'Hello world'):
    print("No match; this won't print")
```
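As a brief illustration of the attributes mentioned above, the sketch below reads the matched text and its position from the match object. The variable names `matched_text`, `start`, and `end` are chosen here for illustration only.

```python
import re

phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text = "Call me at 382-384-3840."

match = re.search(phone_re, text)

# group() with no arguments returns the whole substring that matched
matched_text = match.group()

# span() returns (start, end) such that text[start:end] is the match
start, end = match.span()

print(matched_text, start, end)
```

Slicing `text[start:end]` gives back the same substring as `match.group()`, which can be handy when the surrounding context of a match is needed.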
Another commonly used method, `re.match(pattern, string)`, behaves the same as `re.search` but only checks for a match at the start of `string` instead of a match anywhere in the string.
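The difference between the two functions can be seen in a small sketch. The two example strings below are made up for illustration; only the phone regex comes from the text above.

```python
import re

phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"

at_start = "382-384-3840 is my number."   # phone number begins the string
in_middle = "Call me at 382-384-3840."    # phone number appears later

# re.match only succeeds when the pattern matches at position 0
match_at_start = re.match(phone_re, at_start)
match_in_middle = re.match(phone_re, in_middle)

# re.search scans the whole string
search_in_middle = re.search(phone_re, in_middle)

print(match_at_start is not None)    # True
print(match_in_middle is None)       # True
print(search_in_middle is not None)  # True
```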
#### `re.findall`
We use `re.findall(pattern, string)` to extract substrings that match a regex. This method returns a list of all non-overlapping matches of `pattern` in `string`.
```python
gmail_re = r'[a-zA-Z0-9]+@gmail\.com'
text = '''
From: email1@gmail.com
To: email2@yahoo.com and email3@gmail.com
'''
re.findall(gmail_re, text)
```
```
['email1@gmail.com', 'email3@gmail.com']
```
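## Regex Groups
Using **regex groups**, we specify subpatterns to extract from a regex by wrapping the subpattern in parentheses `( )`. When a regex contains more than one group, `re.findall` returns a list of tuples that contain the subpattern contents; with exactly one group, it returns a list of strings holding that group's contents.
For example, the following familiar regex extracts phone numbers from a string: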
```python
phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."
re.findall(phone_re, text)
```
```
['382-384-3840', '123-456-7890']
```
To split apart the individual three- or four-digit components of a phone number, we can wrap each digit group in parentheses.
```python
# Same regex with parentheses around the digit groups
phone_re = r"([0-9]{3})-([0-9]{3})-([0-9]{4})"
text = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."
re.findall(phone_re, text)
```
```
[('382', '384', '3840'), ('123', '456', '7890')]
```
As promised, `re.findall` returns a list of tuples containing the individual components of the matched phone numbers.
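Two related behaviors are worth a quick sketch: with a single group, `re.findall` returns plain strings rather than tuples, and groups can be given names with the `(?P<name>...)` syntax. The group names `area`, `prefix`, and `line` below are illustrative choices, not names used elsewhere in this chapter.

```python
import re

text = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."

# Only one group: findall returns a list of strings (the group contents)
area_codes = re.findall(r"([0-9]{3})-[0-9]{3}-[0-9]{4}", text)

# Named groups: groupdict() maps each group name to its matched text
named_re = r"(?P<area>[0-9]{3})-(?P<prefix>[0-9]{3})-(?P<line>[0-9]{4})"
first_match = re.search(named_re, text)
parts = first_match.groupdict()

print(area_codes)
print(parts)
```

Named groups tend to make longer patterns easier to read, since each piece of the match is retrieved by name instead of by position.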
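#### `re.sub`
`re.sub(pattern, replacement, string)` replaces all occurrences of `pattern` with `replacement` in the provided `string`. This method behaves like the Python string method `str.replace` but uses a regex to match patterns.
In the code below, we convert the dates to a common format by replacing the date separators with dashes.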
```python
messy_dates = '03/12/2018, 03.13.18, 03/14/2018, 03:15:2018'
regex = r'[/.:]'
re.sub(regex, '-', messy_dates)
```
```
'03-12-2018, 03-13-18, 03-14-2018, 03-15-2018'
```
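`re.sub` can also combine with regex groups: inside the replacement string, `\1`, `\2`, and so on refer to the text captured by each group. The sketch below reuses the grouped phone regex from the previous section; the target format `(382) 384-3840` is just an illustrative choice.

```python
import re

phone_re = r"([0-9]{3})-([0-9]{3})-([0-9]{4})"
text = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."

# \1, \2, \3 are backreferences to the three captured digit groups
formatted = re.sub(phone_re, r"(\1) \2-\3", text)

print(formatted)
```

Writing the replacement as a raw string (`r"..."`) keeps Python from interpreting the backslashes before `re` sees them.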
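#### `re.split`
`re.split(pattern, string)` splits the input `string` at each occurrence of the regex `pattern`. This method behaves like the Python string method `str.split` but uses a regex to make the split.
In the code below, we use `re.split` to separate the chapter names from their page numbers in a book's table of contents.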
```python
toc = '''
PLAYING PILGRIMS============3
A MERRY CHRISTMAS===========13
THE LAURENCE BOY============31
BURDENS=====================55
BEING NEIGHBORLY============76
'''.strip()
# First, split into individual lines
lines = re.split('\n', toc)
lines
```
```
['PLAYING PILGRIMS============3',
'A MERRY CHRISTMAS===========13',
'THE LAURENCE BOY============31',
'BURDENS=====================55',
'BEING NEIGHBORLY============76']
```
```python
# Then, split into chapter title and page number
split_re = r'=+' # Matches any sequence of = characters
[re.split(split_re, line) for line in lines]
```
```
[['PLAYING PILGRIMS', '3'],
['A MERRY CHRISTMAS', '13'],
['THE LAURENCE BOY', '31'],
['BURDENS', '55'],
['BEING NEIGHBORLY', '76']]
```
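## Regex and pandas
Recall that `pandas` Series objects have a `.str` property that supports string manipulation using Python string methods. Conveniently, the `.str` property also supports some functions from the `re` module. We demonstrate basic regex usage in `pandas`, leaving the complete method list to [the `pandas` documentation on string methods](https://pandas.pydata.org/pandas-docs/stable/text.html).
We've stored the text of the first five sentences of the novel *Little Women* in the DataFrame below. We can use the string methods that `pandas` provides to extract the spoken dialog in each sentence.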
```python
# HIDDEN
text = '''
"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug.
"It's so dreadful to be poor!" sighed Meg, looking down at her old dress.
"I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff.
"We've got Father and Mother, and each other," said Beth contentedly from her corner.
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."
'''.strip()
little = pd.DataFrame({
'sentences': text.split('\n')
})
```
```python
little
```
| | sentences |
| --- | ---: |
| 0 | "Christmas won't be Christmas without any pres... |
| 1 | "It's so dreadful to be poor!" sighed Meg, loo... |
| 2 | "I don't think it's fair for some girls to hav... |
| 3 | "We've got Father and Mother, and each other,"... |
| 4 | The four young faces on which the firelight sh... |
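Since spoken dialog lies within double quotation marks, we create a regex that captures an opening double quotation mark, a sequence of any characters except a double quotation mark, and the closing double quotation mark.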
```python
quote_re = r'"[^"]+"'
little['sentences'].str.findall(quote_re)
```
```
0 ["Christmas won't be Christmas without any pre...
1 ["It's so dreadful to be poor!"]
2 ["I don't think it's fair for some girls to ha...
3 ["We've got Father and Mother, and each other,"]
4 ["We haven't got Father, and shall not have hi...
Name: sentences, dtype: object
```
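Since the `Series.str.findall` method returns a list of matches, `pandas` also provides the `Series.str.extract` and `Series.str.extractall` methods to extract matches into a Series or DataFrame. These methods require the regex to contain at least one regex group.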
```python
# Extract text within double quotes
quote_re = r'"([^"]+)"'
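# expand=False returns a Series for a single group; newer pandas returns a DataFrame by default
spoken = little['sentences'].str.extract(quote_re, expand=False)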
spoken
```
```
0 Christmas won't be Christmas without any prese...
1 It's so dreadful to be poor!
2 I don't think it's fair for some girls to have...
3 We've got Father and Mother, and each other,
4 We haven't got Father, and shall not have him ...
Name: sentences, dtype: object
```
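`Series.str.extract` keeps only the first match in each string. When a string can contain several matches, `Series.str.extractall` returns one row per match instead. The sketch below uses a small made-up Series (not text from the novel) to show the difference.

```python
import pandas as pd

# Hypothetical example lines; the first contains two quoted passages
lines = pd.Series([
    '"Hello," said Jo. "Goodbye," said Meg.',
    'Nobody spoke.',
    '"Yes!" cried Amy.',
])
quote_re = r'"([^"]+)"'

# extract: first match per row, NaN where nothing matches
first_quotes = lines.str.extract(quote_re, expand=False)

# extractall: one row per match, indexed by (original row, match number)
all_quotes = lines.str.extractall(quote_re)

print(first_quotes)
print(all_quotes)
```

Rows with no match are dropped from the `extractall` result, so the row for `'Nobody spoke.'` does not appear in `all_quotes`.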
We can add the `spoken` series as a column of the `little` DataFrame:
```python
little['dialog'] = spoken
little
```
| | sentences | dialog |
| --- | ---: | ---: |
| 0 | "Christmas won't be Christmas without any pres... | Christmas won't be Christmas without any prese... |
| 1 | "It's so dreadful to be poor!" sighed Meg, loo... | It's so dreadful to be poor! |
| 2 | "I don't think it's fair for some girls to hav... | I don't think it's fair for some girls to have... |
| 3 | "We've got Father and Mother, and each other,"... | We've got Father and Mother, and each other, |
| 4 | The four young faces on which the firelight sh... | We haven't got Father, and shall not have him ... |
We can confirm that our string manipulation behaves as expected for the last sentence in our DataFrame by printing the original and extracted text:
```python
print(little.loc[4, 'sentences'])
```
```
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."
```
```python
print(little.loc[4, 'dialog'])
```
```
We haven't got Father, and shall not have him for a long time.
```
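Beyond extraction, the `.str` property also offers regex-aware counterparts of `re.search` and `re.sub`, namely `Series.str.contains` and `Series.str.replace`. The sketch below is a minimal illustration; the second sentence in the Series is invented for the example.

```python
import pandas as pd

sentences = pd.Series([
    '"It\'s so dreadful to be poor!" sighed Meg, looking down at her old dress.',
    'The four sisters sat by the fire.',  # hypothetical sentence with no dialog
])
quote_re = r'"[^"]+"'

# Like re.search applied to every row: True where the pattern appears
has_quote = sentences.str.contains(quote_re, regex=True)

# Like re.sub applied to every row: remove quoted dialog and trailing whitespace
narration = sentences.str.replace(quote_re + r'\s*', '', regex=True)

print(has_quote)
print(narration)
```

Passing `regex=True` explicitly to `str.replace` avoids relying on the default, which has differed between pandas versions.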
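## Summary
The `re` module in Python provides a useful group of methods for manipulating text using regular expressions. When working with DataFrames, we often use the analogous string manipulation methods implemented in `pandas`.
For the complete documentation on the `re` module, see [https://docs.python.org/3/library/re.html](https://docs.python.org/3/library/re.html)
For the complete documentation on `pandas` string methods, see [https://pandas.pydata.org/pandas-docs/stable/text.html](https://pandas.pydata.org/pandas-docs/stable/text.html)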