# 7.3 Regular Expression Handling · Go Web Programming

A regular expression is a complex but powerful tool for pattern matching and text manipulation. Matching with a regular expression is slower than plain text matching, but it is far more flexible: by following its syntax rules you can build a pattern that picks almost any combination of characters out of the raw text. If your Web application needs to pull data out of a text source, constructing the right pattern string is all it takes to extract the meaningful parts.

Go provides official support for regular expressions through the standard `regexp` package. If you have used regular expressions in another programming language, Go's version will not feel unfamiliar, but there are small differences, because Go implements the RE2 syntax, with the exception of `\C`. The full syntax is described at `http://code.google.com/p/re2/wiki/Syntax`.

For plain string handling, the `strings` package already covers searching (Contains, Index), replacing (Replace), and splitting and joining (Split, Join). These are simple operations, though: the searches are case-sensitive and work only on fixed strings, so they cannot match text that varies. That said, if `strings` solves your problem, prefer it: its functions are simple, and they are faster and easier to read than a regular expression.

As you may remember, we already met regular expressions in the earlier section on form validation, where we used them to check that input satisfied some preset conditions. One thing to keep in mind when using them: all characters are treated as UTF-8 encoded. Now let's look at the `regexp` package in more depth.

## Testing whether a pattern matches

The `regexp` package has three functions that test for a match; each reports true when the input matches and false otherwise:

~~~
func Match(pattern string, b []byte) (matched bool, err error)
func MatchReader(pattern string, r io.RuneReader) (matched bool, err error)
func MatchString(pattern string, s string) (matched bool, err error)
~~~

All three do the same job: they report whether `pattern` matches the input, and they return an error if the pattern cannot be parsed. They differ only in the input source: a byte slice, a RuneReader, or a string.

How would you check whether an input is an IP address? Here is one implementation:

~~~
// IsIP reports whether ip looks like a dotted IPv4 address.
// It checks only the format (four groups of 1-3 digits); it does not
// check that each group lies within 0-255.
func IsIP(ip string) bool {
	m, _ := regexp.MatchString(`^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$`, ip)
	return m
}
~~~

As you can see, a `regexp` pattern looks just like the regular expressions we use elsewhere. Here is another example: the user enters a string, and we want to know whether it is valid input (here, whether it consists only of digits):

~~~
package main

import (
	"fmt"
	"os"
	"regexp"
)

func main() {
	if len(os.Args) == 1 {
		fmt.Println("Usage: regexp [string]")
		os.Exit(1)
	} else if m, _ := regexp.MatchString("^[0-9]+$", os.Args[1]); m {
		fmt.Println("number")
	} else {
		fmt.Println("not a number")
	}
}
~~~

In these two small examples we used Match(Reader|String) to check whether a string fits our description; they are very convenient to use.

## Extracting content with regular expressions

The Match functions can only test a string; they cannot cut out part of it, filter it, or extract a batch of substrings that satisfy a condition. For that we need the more complex features of regular expressions. Crawlers are a common need, so the following example uses one to show how to filter or trim fetched data with regular expressions:

~~~
package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
	"strings"
)

func main() {
	resp, err := http.Get("http://www.baidu.com")
	if err != nil {
		fmt.Println("http get error.")
		return // resp is nil here, so we must not touch resp.Body
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("http read error")
		return
	}

	src := string(body)

	// Convert all HTML tags to lower case.
	re := regexp.MustCompile(`<[\S\s]+?>`)
	src = re.ReplaceAllStringFunc(src, strings.ToLower)

	// Remove STYLE blocks.
	re = regexp.MustCompile(`<style[\S\s]+?</style>`)
	src = re.ReplaceAllString(src, "")

	// Remove SCRIPT blocks.
	re = regexp.MustCompile(`<script[\S\s]+?</script>`)
	src = re.ReplaceAllString(src, "")

	// Replace every remaining tag (anything in angle brackets) with a newline.
	re = regexp.MustCompile(`<[\S\s]+?>`)
	src = re.ReplaceAllString(src, "\n")

	// Collapse runs of two or more whitespace characters into one newline.
	re = regexp.MustCompile(`\s{2,}`)
	src = re.ReplaceAllString(src, "\n")

	fmt.Println(strings.TrimSpace(src))
}
~~~

As this example shows, working with a more complex regular expression starts with compiling it: the pattern is parsed and checked, and if it is valid a Regexp is returned, which can then be applied to any string. The functions that parse a regular expression are:

~~~
func Compile(expr string) (*Regexp, error)
func CompilePOSIX(expr string) (*Regexp, error)
func MustCompile(str string) *Regexp
func MustCompilePOSIX(str string) *Regexp
~~~

CompilePOSIX differs from Compile in that it restricts the pattern to POSIX syntax and uses leftmost-longest matching, whereas Compile uses leftmost-first matching: among the matches that start earliest, it picks the one a backtracking search would find first. For example, with the pattern `a|ab` and the text "ab", Compile matches a while CompilePOSIX matches ab. The functions prefixed with Must panic if the pattern is not valid syntax, whereas those without Must just return an error.

Now that we know how to create a Regexp, let's see what methods this struct provides for working with strings. First, the search methods:

~~~
func (re *Regexp) Find(b []byte) []byte
func (re *Regexp) FindAll(b []byte, n int) [][]byte
func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
func (re *Regexp) FindAllString(s string, n int) []string
func (re *Regexp) FindAllStringIndex(s string, n int) [][]int
func (re *Regexp) FindAllStringSubmatch(s string, n int) [][]string
func (re *Regexp) FindAllStringSubmatchIndex(s string, n int) [][]int
func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
func (re *Regexp) FindIndex(b []byte) (loc []int)
func (re *Regexp) FindReaderIndex(r io.RuneReader) (loc []int)
func (re *Regexp) FindReaderSubmatchIndex(r io.RuneReader) []int
func (re *Regexp) FindString(s string) string
func (re *Regexp) FindStringIndex(s string) (loc []int)
func (re *Regexp) FindStringSubmatch(s string) []string
func (re *Regexp) FindStringSubmatchIndex(s string) []int
func (re *Regexp) FindSubmatch(b []byte) [][]byte
func (re *Regexp) FindSubmatchIndex(b []byte) []int
~~~

These 18 methods can be reduced, by input source (byte slice, string, and io.RuneReader), to the following few; the others differ only in their input source and otherwise work in essentially the same way:

~~~
func (re *Regexp) Find(b []byte) []byte
func (re *Regexp) FindAll(b []byte, n int) [][]byte
func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
func (re *Regexp) FindIndex(b []byte) (loc []int)
func (re *Regexp) FindSubmatch(b []byte) [][]byte
func (re *Regexp) FindSubmatchIndex(b []byte) []int
~~~

The following example shows how they are used:

~~~
package main

import (
	"fmt"
	"regexp"
)

func main() {
	a := "I am learning Go language"
	b := []byte(a)

	re := regexp.MustCompile("[a-z]{2,4}")

	// Find the first match.
	one := re.Find(b)
	fmt.Println("Find:", string(one))

	// Find all matches; n < 0 returns every match, otherwise at most n.
	all := re.FindAll(b, -1)
	fmt.Printf("FindAll: %q\n", all)

	// Find the index of the first match: its start and end positions.
	index := re.FindIndex(b)
	fmt.Println("FindIndex:", index)

	// Find the indexes of all matches; n as above.
	allindex := re.FindAllIndex(b, -1)
	fmt.Println("FindAllIndex:", allindex)

	re2 := regexp.MustCompile("am(.*)lang(.*)")

	// FindSubmatch returns a slice: element 0 is the whole match,
	// element 1 is the first () group, element 2 is the second () group.
	// Here element 0 is "am learning Go language",
	// element 1 is " learning Go " (note the surrounding spaces),
	// and element 2 is "uage".
	submatch := re2.FindSubmatch(b)
	fmt.Printf("FindSubmatch: %q\n", submatch)
	for _, v := range submatch {
		fmt.Println(string(v))
	}

	// Like FindIndex, but also reports the positions of each group.
	submatchindex := re2.FindSubmatchIndex(b)
	fmt.Println("FindSubmatchIndex:", submatchindex)

	// FindAllSubmatch finds the submatches of every match.
	submatchall := re2.FindAllSubmatch(b, -1)
	fmt.Printf("FindAllSubmatch: %q\n", submatchall)

	// FindAllSubmatchIndex finds the indexes of all submatches.
	submatchallindex := re2.FindAllSubmatchIndex(b, -1)
	fmt.Println("FindAllSubmatchIndex:", submatchallindex)
}
~~~

We introduced the matching functions earlier. Regexp also defines three methods that behave exactly like the package-level functions of the same name; in fact, the package-level functions are implemented by calling these three methods:

~~~
func (re *Regexp) Match(b []byte) bool
func (re *Regexp) MatchReader(r io.RuneReader) bool
func (re *Regexp) MatchString(s string) bool
~~~

Next, let's look at the replacement functions:

~~~
func (re *Regexp) ReplaceAll(src, repl []byte) []byte
func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte
func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte
func (re *Regexp) ReplaceAllLiteralString(src, repl string) string
func (re *Regexp) ReplaceAllString(src, repl string) string
func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string
~~~

The crawler example above already shows these replacement functions in use. Next, let's look at Expand:

~~~
func (re *Regexp) Expand(dst []byte, template []byte, src []byte, match []int) []byte
func (re *Regexp) ExpandString(dst []byte, template string, src string, match []int) []byte
~~~

So what is Expand actually for? See the following example:

~~~
package main

import (
	"fmt"
	"regexp"
)

func main() {
	src := []byte(`
	call hello alice
	hello bob
	call hello eve
`)
	pat := regexp.MustCompile(`(?m)(call)\s+(?P<cmd>\w+)\s+(?P<arg>.+)\s*$`)
	res := []byte{}
	for _, s := range pat.FindAllSubmatchIndex(src, -1) {
		// Expand appends the template to res, replacing $cmd and $arg
		// with the named groups of the match described by s.
		res = pat.Expand(res, []byte("$cmd('$arg')\n"), src, s)
	}
	// Prints hello('alice') and hello('eve'), one per line.
	fmt.Println(string(res))
}
~~~

That completes our tour of Go's `regexp` package. Having walked through and demonstrated its main functions, you should now be able to carry out basic regular expression operations with it.