Python String Methods · UCB DS100 Principles and Techniques of Data Science

# Python String Methods

> Original: [Python String Methods](https://www.textbook.ds100.org/ch/08/text_strings.html)
>
> Proofread by: [Kitty Du](https://github.com/miaoxiaozui2017)
>
> Proudly translated with [Google Translate](https://translate.google.cn/)

```python
# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/08'))
```

```python
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
# Use the full option name; a bare 'precision' can match more than one option
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
```

Python provides a variety of methods for basic string manipulation. Although each method is simple, they serve as building blocks that combine into more complex string operations. We introduce Python's string methods through a common use case for working with text: data cleaning.

## Cleaning Text Data

Data often come from several different sources, each of which encodes information in its own way. In the example below, one table records the state that a county belongs to, and another records the county's population.

```python
# HIDDEN
state = pd.DataFrame({
    'County': [
        'De Witt County',
        'Lac qui Parle County',
        'Lewis and Clark County',
        'St John the Baptist Parish',
    ],
    'State': [
        'IL',
        'MN',
        'MT',
        'LA',
    ]
})
population = pd.DataFrame({
    'County': [
        'DeWitt ',
        'Lac Qui Parle',
        'Lewis & Clark',
        'St. John the Baptist',
    ],
    'Population': [
        '16,798',
        '8,067',
        '55,716',
        '43,044',
    ]
})
```

```python
state
```

| | County | State |
| --- | --- | --- |
| 0 | De Witt County | IL |
| 1 | Lac qui Parle County | MN |
| 2 | Lewis and Clark County | MT |
| 3 | St John the Baptist Parish | LA |

(IL is Illinois, MN is Minnesota, MT is Montana, and LA is Louisiana.)

```python
population
```

| | County | Population |
| --- | --- | --- |
| 0 | DeWitt | 16,798 |
| 1 | Lac Qui Parle | 8,067 |
| 2 | Lewis & Clark | 55,716 |
| 3 | St. John the Baptist | 43,044 |

We would naturally like to join the `state` and `population` tables on the `County` column. Unfortunately, not a single county is spelled the same way in both tables. This example illustrates the following common problems in text data:

1. Capitalization: qui vs. Qui
2. Different punctuation conventions: St. vs. St
3. Omitted words: `County`/`Parish` is missing from the `population` table
4. Use of whitespace: DeWitt vs. De Witt
5. Different abbreviation conventions: & vs. and

## String Methods

Python's string methods let us start resolving these problems. They are defined on every Python string, so no additional module needs to be imported. It is worth getting familiar with the [complete list of string methods](https://docs.python.org/3/library/stdtypes.html#string-methods), but the table below describes some of the most commonly used ones.

| Method | Description |
| --- | --- |
| `str[x:y]` | Slices `str`, returning the characters from index x (inclusive) to index y (exclusive) |
| `str.lower()` | Returns a copy of the string with all letters converted to lowercase |
| `str.replace(a, b)` | Returns a copy of `str` with every occurrence of substring `a` replaced by substring `b` |
| `str.split(a)` | Returns a list of the substrings of `str` obtained by splitting at each occurrence of substring `a` |
| `str.strip()` | Returns a copy of `str` with leading and trailing whitespace removed |

We select the strings for St. John the Baptist parish from the `state` and `population` tables and apply string methods to remove capitalization, punctuation, whitespace, and occurrences of `county`/`parish`.

```python
john1 = state.loc[3, 'County']
john2 = population.loc[3, 'County']

(john1
 .lower()
 .strip()
 .replace(' parish', '')
 .replace(' county', '')
 .replace('&', 'and')
 .replace('.', '')
 .replace(' ', '')
)
```

```
'stjohnthebaptist'
```

Applying the same chain of methods to `john2` lets us verify that the two strings are now identical.

```python
(john2
 .lower()
 .strip()
 .replace(' parish', '')
 .replace(' county', '')
 .replace('&', 'and')
 .replace('.', '')
 .replace(' ', '')
)
```

```
'stjohnthebaptist'
```

Satisfied with the result, we define a function called `clean_county` that normalizes an input county name.

```python
def clean_county(county):
    return (county
            .lower()
            .strip()
            .replace(' county', '')
            .replace(' parish', '')
            .replace('&', 'and')
            .replace(' ', '')
            .replace('.', ''))
```

We can now verify that `clean_county` produces matching names for every county in both tables:

```python
([clean_county(county) for county in state['County']],
 [clean_county(county) for county in population['County']]
)
```

```
(['dewitt', 'lacquiparle', 'lewisandclark', 'stjohnthebaptist'],
 ['dewitt', 'lacquiparle', 'lewisandclark', 'stjohnthebaptist'])
```

Because every county has the same transformed representation in both tables, we can join the two tables on the transformed county names.

## String Methods in pandas

In the code above we used a loop (a list comprehension) to transform each county. `pandas` Series objects provide a convenient way to apply string methods to every item in a series. First, here is the series of counties in the `state` table:

```python
state['County']
```

```
0                De Witt County
1          Lac qui Parle County
2        Lewis and Clark County
3    St John the Baptist Parish
Name: County, dtype: object
```

The `.str` accessor of a `pandas` Series exposes most of the same string methods as native Python. Calling a method on `.str` calls that method on each item in the series.

```python
state['County'].str.lower()
```

```
0                de witt county
1          lac qui parle county
2        lewis and clark county
3    st john the baptist parish
Name: County, dtype: object
```

This lets us transform every string in the series without writing a loop.

```python
# regex=False treats each pattern as literal text, so '.' matches only a period
(state['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '', regex=False)
 .str.replace(' county', '', regex=False)
 .str.replace('&', 'and', regex=False)
 .str.replace('.', '', regex=False)
 .str.replace(' ', '', regex=False)
)
```

```
0              dewitt
1         lacquiparle
2       lewisandclark
3    stjohnthebaptist
Name: County, dtype: object
```

We save the transformed counties back to their original tables:
```python
# regex=False treats each pattern as literal text, so '.' matches only a period
state['County'] = (state['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '', regex=False)
 .str.replace(' county', '', regex=False)
 .str.replace('&', 'and', regex=False)
 .str.replace('.', '', regex=False)
 .str.replace(' ', '', regex=False)
)

population['County'] = (population['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '', regex=False)
 .str.replace(' county', '', regex=False)
 .str.replace('&', 'and', regex=False)
 .str.replace('.', '', regex=False)
 .str.replace(' ', '', regex=False)
)
```

The two tables now contain the same string representation of each county:

```python
state
```

| | County | State |
| --- | --- | --- |
| 0 | dewitt | IL |
| 1 | lacquiparle | MN |
| 2 | lewisandclark | MT |
| 3 | stjohnthebaptist | LA |

```python
population
```

| | County | Population |
| --- | --- | --- |
| 0 | dewitt | 16,798 |
| 1 | lacquiparle | 8,067 |
| 2 | lewisandclark | 55,716 |
| 3 | stjohnthebaptist | 43,044 |

Once the counties match, joining the tables is straightforward.

```python
state.merge(population, on='County')
```

| | County | State | Population |
| --- | --- | --- | --- |
| 0 | dewitt | IL | 16,798 |
| 1 | lacquiparle | MN | 8,067 |
| 2 | lewisandclark | MT | 55,716 |
| 3 | stjohnthebaptist | LA | 43,044 |

## Summary

Python's string methods form a simple and useful set of string operations. `pandas` Series implement the same methods through the `.str` accessor, applying the underlying Python method to each string in the series. The complete documentation for Python's string methods is available [here](https://docs.python.org/3/library/stdtypes.html#string-methods), and the documentation for the `pandas` `str` methods is available [here](https://pandas.pydata.org/pandas-docs/stable/text.html#method-summary).
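
To make the five methods from the table in this section concrete, here is a minimal sketch that applies each one to a single string. The value `raw` is a hypothetical sample invented for illustration (it resembles the county names above but does not appear in either table), and the variable names are likewise only illustrative.

```python
# Hypothetical sample value: two leading spaces and one trailing space.
raw = '  St. John the Baptist Parish '

sliced = raw[2:5]               # characters at indices 2, 3 and 4
lowered = raw.lower()           # copy with every letter lowercased
no_dots = raw.replace('.', '')  # copy with every '.' removed
words = raw.split(' ')          # list of pieces between single spaces
stripped = raw.strip()          # copy without leading/trailing whitespace

print(repr(sliced))    # 'St.'
print(repr(lowered))   # '  st. john the baptist parish '
print(repr(no_dots))   # '  St John the Baptist Parish '
print(words)           # ['', '', 'St.', 'John', 'the', 'Baptist', 'Parish', '']
print(repr(stripped))  # 'St. John the Baptist Parish'

# Strings are immutable: every call returned a new object and left raw as it was.
print(repr(raw))       # '  St. John the Baptist Parish '
```

Each method returns a new value rather than modifying the string in place, which is why the calls can be chained as in `clean_county`. The empty strings in `words` come from the leading and trailing spaces, which is one reason to call `strip()` before splitting.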