4. Tokens: Lexical Analysis · jQuery Source Architecture

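[toc]

## Tokens: lexical analysis

Lexical analysis is really a term from compiler construction, and borrowing it here feels slightly off, but the `tokenize` function in Sizzle does exactly the job of a lexer.

In the previous chapter we covered how Sizzle is used: it is essentially the jQuery.find function, although jQuery.fn.find is involved as well. jQuery.find is thorough. The #id, .class and TagName cases are all simple: a single regular expression, rquickExpr, splits them apart, and if the browser supports querySelectorAll, so much the better. The hard cases are CSS-style selectors such as `div > div.seq h2 ~ p , #id p`. Resolving such a selector from left to right is inefficient, while going from right to left is faster. This chapter introduces the `tokenize` function and looks at how it turns a complex selector into tokens.

We take `div > div.seq h2 ~ p , #id p` as the example. It is a fairly simple piece of CSS; the comma splits the expression into two parts. CSS has a few basic symbols that are worth spelling out here, namely `,`, the space, `>`, `+` and `~`:

1. `div,p`: `,` denotes a list, all div elements and all p elements;
2. `div p`: a space denotes descendants, all p elements inside a div element;
3. `div>p`: `>` denotes a child exactly one generation down, all p elements whose parent is a div;
4. `div+p`: `+` denotes the adjacent sibling, all p elements whose immediately preceding sibling is a div;
5. `div~p`: `~` denotes siblings, all p elements preceded by a sibling div.

Beyond these, there are the more specialised attribute and pseudo-class forms, often seen on a and input:

1. `a[target=_blank]` selects all a elements whose target is _blank;
2. `a[title=search]` selects all a elements whose title is search;
3. `input[type=text]` selects all input elements whose type is text;
4. `p:nth-child(2)` selects all p elements that are the second child of their parent.

Sizzle supports all of this syntax. If we call this step lexical analysis, what does its result look like?

Running `tokenize(selector)` on `div > div.seq h2 ~ p , #id p` returns an array, which the function calls groups. Because the example contains one comma, the array has two elements, groups[0] and groups[1], standing for the parts of the selector before and after the comma. Each group is itself an array (named tokens in the source), and each of its elements is a token object. A token object has the following structure:

```
token: {
  value: matched, // the matched string
  type: type,     // the token type
  matches: match  // the regex result array with value (the full match) removed
}
```

The type values in Sizzle are ID, CLASS, TAG, ATTR, PSEUDO, CHILD, bool and needsContext. I am not sure what every one of them means; CHILD covers child pseudo-selectors such as nth-child with even or odd arguments. These apply when matches exists. When matches does not exist, which is the case for combinators, type is simply value with the surrounding whitespace trimmed, as discussed later.

tokenize does not even skip spaces when processing a selector, because a space is itself a type, and an important one. The result for div > div.seq h2 ~ p is:

```
tokens: [
  { value: 'div',  type: 'TAG',   matches: Array[1] },
  { value: ' > ',  type: '>' },
  { value: 'div',  type: 'TAG',   matches: Array[1] },
  { value: '.seq', type: 'CLASS', matches: Array[1] },
  { value: ' ',    type: ' ' },
  { value: 'h2',   type: 'TAG',   matches: Array[1] },
  { value: ' ~ ',  type: '~' },
  { value: 'p',    type: 'TAG',   matches: Array[1] }
]
```

This array is handed to the next stage of Sizzle, which we will not discuss today.

## tokenize source

As usual, let's look at a few regular expressions first.

```
var rcomma = /^[\x20\t\r\n\f]*,[\x20\t\r\n\f]*/;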
rcomma.exec('div > div.seq h2 ~ p'); // null
rcomma.exec(' ,#id p');              // [" ,"]
```

The rcomma regex decides whether the selector has reached its next rule. If it has, the tokens processed so far have already been pushed into groups and a new group begins. In this regex, `[\x20\t\r\n\f]` matches whitespace-like characters; the core of the pattern is just the comma.

```
var rcombinators = /^[\x20\t\r\n\f]*([>+~]|[\x20\t\r\n\f])[\x20\t\r\n\f]*/;
rcombinators.exec(' > div.seq h2 ~ p'); // [" > ", ">"]
rcombinators.exec(' ~ p');              // [" ~ ", "~"]
rcombinators.exec(' h2 ~ p');           // [" ", " "]
```

With rcombinators in hand, the contents of the tokens array above should now be fully understandable. In the jQuery source, however, rcomma and rcombinators are not defined as literals like this but are built as follows:

```
var whitespace = "[\\x20\\t\\r\\n\\f]";
var rcomma = new RegExp( "^" + whitespace + "*," + whitespace + "*" ),
  rcombinators = new RegExp( "^" + whitespace + "*([>+~]|" + whitespace + ")" + whitespace + "*" ),
  rtrim = new RegExp( "^" + whitespace + "+|((?:^|[^\\\\])(?:\\\\.)*)" + whitespace + "+$", "g" );
```

You sometimes have to admire how jQuery does things: concise where it can be and economical where it can be, with the shared whitespace fragment defined once and reused in every pattern.

There are two more objects, Expr and matchExpr. Expr is a key object that holds almost all of the selector-handling configuration. Its most important members include:

```
Expr.filter = {
  "TAG": function(){...},
  "CLASS": function(){...},
  "ATTR": function(){...},
  "CHILD": function(){...},
  "ID": function(){...},
  "PSEUDO": function(){...}
}
Expr.preFilter = {
  "ATTR": function(){...},
  "CHILD": function(){...},
  "PSEUDO": function(){...}
}
```

filter and preFilter are the key steps in handling tokens by type, TAG for instance; selectors such as input[type=text] are handled by these functions too. They are fairly involved, and I got lost reading them myself.

Then there are the matchExpr regular expressions:

```
var identifier = "(?:\\\\.|[\\w-]|[^\0-\\xa0])+",

  attributes = "\\[" + whitespace + "*(" + identifier + ")(?:" + whitespace +
    // Operator (capture 2)
    "*([*^$|!~]?=)" + whitespace +
    // "Attribute values must be CSS identifiers [capture 5] or strings [capture 3 or capture 4]"
    "*(?:'((?:\\\\.|[^\\\\'])*)'|\"((?:\\\\.|[^\\\\\"])*)\"|(" + identifier + "))|)" + whitespace +
    "*\\]",

  pseudos = ":(" + identifier + ")(?:\\((" +
    // To reduce the number of selectors needing tokenize in the preFilter, prefer arguments:
    // 1. quoted (capture 3; capture 4 or capture 5)
    "('((?:\\\\.|[^\\\\'])*)'|\"((?:\\\\.|[^\\\\\"])*)\")|" +
    // 2. simple (capture 6)
    "((?:\\\\.|[^\\\\()[\\]]|" + attributes + ")*)|" +
    // 3. anything else (capture 2)
    ".*" +
    ")\\)|)",

  booleans = "checked|selected|async|autofocus|autoplay|controls|defer|disabled|hidden|ismap|loop|multiple|open|readonly|required|scoped";

var matchExpr = {
  "ID": new RegExp( "^#(" + identifier + ")" ),
  "CLASS": new RegExp( "^\\.(" + identifier + ")" ),
  "TAG": new RegExp( "^(" + identifier + "|[*])" ),
  "ATTR": new RegExp( "^" + attributes ),
  "PSEUDO": new RegExp( "^" + pseudos ),
  "CHILD": new RegExp( "^:(only|first|last|nth|nth-last)-(child|of-type)(?:\\(" + whitespace +
    "*(even|odd|(([+-]|)(\\d*)n|)" + whitespace + "*(?:([+-]|)" + whitespace +
    "*(\\d+)|))" + whitespace + "*\\)|)", "i" ),
  "bool": new RegExp( "^(?:" + booleans + ")$", "i" ),
  // For use in libraries implementing .is()
  // We use this for POS matching in `select`
  "needsContext": new RegExp( "^" + whitespace + "*[>+~]|:(even|odd|eq|gt|lt|nth|first|last)(?:\\(" +
    whitespace + "*((?:-\\d)?\\d*)" + whitespace + "*\\)|)(?=[^-]|$)", "i" )
}
```

matchExpr is an object of regular expressions. Each of its keys is a type; once a type has matched, the result is handed on to later functions.

The source of tokenize is as follows:

```
var tokenize = function (selector, parseOnly) {
  var matched, match, tokens, type, soFar, groups, preFilters,
    cached = tokenCache[selector + " "];

  // tokenCache is the token cache; it keeps the tokens of selectors already processed
  if (cached) {
    return parseOnly ?
      0 :
      cached.slice(0);
  }

  soFar = selector;
  groups = [];
  preFilters = Expr.preFilter;

  while (soFar) {
    // Start a new group on the first pass, or when a comma ends the current group
    if (!matched || (match = rcomma.exec(soFar))) {
      if (match) {
        // Remove the matched comma from the front of the string
        soFar = soFar.slice(match[0].length) || soFar;
      }
      groups.push((tokens = []));
    }

    matched = false;

    // Combinators (rcombinators)
    if ((match = rcombinators.exec(soFar))) {
      matched = match.shift();
      tokens.push({
        value: matched,
        type: match[0].replace(rtrim, " ")
      });
      soFar = soFar.slice(matched.length);
    }

    // Filters; Expr.filter and matchExpr were both introduced above
    for (type in Expr.filter) {
      if ((match = matchExpr[type].exec(soFar)) &&
          (!preFilters[type] || (match = preFilters[type](match)))) {
        matched = match.shift();
        // match is now the array left over after shift()
        tokens.push({
          value: matched,
          type: type,
          matches: match
        });
        soFar = soFar.slice(matched.length);
      }
    }

    if (!matched) {
      break;
    }
  }

  // parseOnly will presumably be used later; when it is set,
  // the length of the unparsed remainder is returned
  return parseOnly ?
    soFar.length :
    soFar ?
      Sizzle.error(selector) :
      // Store the result in the cache
      tokenCache(selector, groups).slice(0);
};
```

Strings have a slice method just as arrays do, and judging from the source, jQuery uses slice for all of its string truncation. The array.slice(0) call that appears in this code is also a handy way to make a shallow copy of an array.

When parseOnly is not set, the return value comes from the tokenCache function:

```
var tokenCache = createCache();

function createCache() {
  var keys = [];

  function cache( key, value ) {
    // Expr.cacheLength = 50
    if ( keys.push( key + " " ) > Expr.cacheLength ) {
      // Evict the oldest entry (first in, first out)
      delete cache[ keys.shift() ];
    }
    // The call as a whole returns value
    return (cache[ key + " " ] = value);
  }
  return cache;
}
```

So the value returned is groups. That wraps up tokenize; the next chapter covers what happens after it.

## Summary

For a complex selector, the tokenize process is far more involved than what was covered today. Today's example was on the simple side (though it is already fairly complex), and what comes next is more interesting.
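As a recap of the loop walked through in this chapter, here is a minimal sketch you can run on its own. It is not Sizzle's code: `miniTokenize` and `simpleMatchers` are hypothetical names made up for this illustration, the identifier pattern is reduced to `[\w-]+`, only ID, CLASS and TAG are recognised, and there is no preFilter step and no cache. The three regexes at the top are the ones shown earlier in the chapter.

```javascript
// Simplified sketch, NOT Sizzle's actual implementation.
// Assumptions: identifiers are just [\w-]+, and only TAG/CLASS/ID are handled.
var whitespace = "[\\x20\\t\\r\\n\\f]";
var rcomma = new RegExp("^" + whitespace + "*," + whitespace + "*");
var rcombinators = new RegExp(
  "^" + whitespace + "*([>+~]|" + whitespace + ")" + whitespace + "*"
);
var rtrim = new RegExp(
  "^" + whitespace + "+|((?:^|[^\\\\])(?:\\\\.)*)" + whitespace + "+$", "g"
);

// Hypothetical cut-down stand-in for matchExpr.
var simpleMatchers = {
  TAG: /^([\w-]+|[*])/,
  CLASS: /^\.([\w-]+)/,
  ID: /^#([\w-]+)/
};

function miniTokenize(selector) {
  var soFar = selector, groups = [], tokens, matched, match, type;

  while (soFar) {
    // Open a group on the first pass, or whenever a comma is consumed.
    if (!matched || (match = rcomma.exec(soFar))) {
      if (match) {
        soFar = soFar.slice(match[0].length) || soFar;
      }
      groups.push((tokens = []));
    }
    matched = false;

    // Combinators: '>', '+', '~' or a plain space.
    if ((match = rcombinators.exec(soFar))) {
      matched = match.shift();
      tokens.push({ value: matched, type: match[0].replace(rtrim, " ") });
      soFar = soFar.slice(matched.length);
    }

    // Simple selectors, tried in the key order of simpleMatchers.
    for (type in simpleMatchers) {
      if ((match = simpleMatchers[type].exec(soFar))) {
        matched = match.shift();
        tokens.push({ value: matched, type: type, matches: match });
        soFar = soFar.slice(matched.length);
      }
    }

    if (!matched) {
      break;
    }
  }

  if (soFar) {
    throw new Error("Syntax error, unrecognized expression: " + selector);
  }
  return groups;
}

var groups = miniTokenize("div > div.seq h2 ~ p , #id p");
console.log(groups.length); // 2
console.log(groups[1].map(function (t) { return t.type; })); // [ 'ID', ' ', 'TAG' ]
```

The first group comes out with the same eight tokens as the listing earlier in the chapter. Anything the cut-down matchers do not understand, such as an attribute selector, ends the loop early and raises the error, which mirrors what the `Sizzle.error(selector)` branch does in the real function.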