3.1.3 Parsing Links · Python 3 Crawler Notes

# 3.1.3 Parsing Links

## 1. Overview

The urllib library also provides the parse module, which defines a standard interface for working with URLs: extracting the individual components, combining them, and converting links. It handles URLs for the following protocols: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais.

## 2. urlparse\(\) {#1-urlparse}

The urlparse\(\) method identifies a URL and splits it into its components.

Example:

```text
from urllib.parse import urlparse

result = urlparse("https://www.google.com.hk/search;user?q=python#content")
print(type(result), result, sep='\n')
```

Output:

```text
<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.google.com.hk', path='/search', params='user', query='q=python', fragment='content')
```

The result is a ParseResult object with six parts: scheme, netloc, path, params, query, and fragment.

The example URL:

```text
https://www.google.com.hk/search;user?q=python#content
```

urlparse\(\) splits it into six parts at specific delimiters. Everything before :// is the scheme (the protocol). What follows, up to the first /, is the netloc (the domain). The path runs from that / up to the semicolon ;, the params follow the semicolon, the query follows the question mark ?, and the fragment follows the hash #. A standard link therefore has this format:

```text
scheme://netloc/path;parameters?query#fragment
```

urlparse\(\) API:

```text
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
```

* urlstring: required; the URL to parse.
* scheme: the default protocol (such as http or https), used when the link itself carries no protocol information.

Example:

```text
result = urlparse("www.google.com.hk/search;user?q=python#content", scheme="https")
print(result)
```

Output:

```text
ParseResult(scheme='https', netloc='', path='www.google.com.hk/search', params='user', query='q=python', fragment='content')
```

Note that netloc is empty here: without a leading //, urlparse\(\) treats www.google.com.hk as part of the path.

The scheme argument only takes effect when the URL contains no scheme of its own. If the URL does contain one, the parsed scheme is returned.

Example:

```text
result = urlparse("http://www.google.com.hk/search;user?q=python#content", scheme="https")
print(result)
```

Output:

```text
ParseResult(scheme='http', netloc='www.google.com.hk', path='/search', params='user', query='q=python', fragment='content')
```

* allow\_fragments: whether the fragment is recognized. When set to False, the fragment is not split out; it is parsed as part of the path, parameters, or query, and the fragment field is left empty.

Example:

```text
result = urlparse("http://www.google.com.hk/search;user?q=python#content", allow_fragments=False)
print(result)
```

Output:

```text
ParseResult(scheme='http', netloc='www.google.com.hk', path='/search', params='user', query='q=python#content', fragment='')
```

When the URL contains neither params nor query, the fragment is parsed as part of the path.

Example:

```text
result = urlparse("https://www.google.com.hk/webhp#content", allow_fragments=False)
print(result)
```

Output:

```text
ParseResult(scheme='https', netloc='www.google.com.hk', path='/webhp#content', params='', query='', fragment='')
```

The returned ParseResult is actually a tuple, so its fields can be read either by index or by attribute name.

Example:

```text
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result.scheme, result[0], result.netloc, result[1], sep='\n')
```

Output:

```text
http
http
www.baidu.com
www.baidu.com
```

## 3. urlunparse\(\) {#2-urlunparse}

* The inverse of urlparse\(\).
* It takes an iterable whose length must be exactly 6; otherwise an error about too few or too many values is raised.

Example:

```text
from urllib.parse import urlunparse

# scheme, netloc, path, params, query, fragment
data = ['http', 'www.google.com', 'index.html', 'name', 'q=6', 'comment']
print(urlunparse(data))
```

Output:

```text
http://www.google.com/index.html;name?q=6#comment
```

## 4. urlsplit\(\) {#3-urlsplit}

Very similar to urlparse\(\), except that it does not parse the parameters part separately and returns only five fields.

Example:

```text
from urllib.parse import urlsplit

result = urlsplit("https://www.google.com.hk/webhp#content")
print(result)
```

Output:

```text
SplitResult(scheme='https', netloc='www.google.com.hk', path='/webhp', query='', fragment='content')
```

The returned SplitResult is also a tuple, so its values can be read by attribute or by index.

Example:

```text
from urllib.parse import urlsplit

result = urlsplit("https://www.google.com.hk/webhp#content")
print(result.scheme, result[0])
```

Output:

```text
https https
```

## 5. urlunsplit\(\) {#4-urlunsplit}

* The inverse of urlsplit\(\).
* Like urlunparse\(\), it assembles the parts of a link into a complete URL. It also takes an iterable, in this case with five elements.

Example:

```text
from urllib.parse import urlunsplit

# scheme, netloc, path, query, fragment
data = ['http', 'www.google.com', 'index.html', 'q=6', 'comment']
print(urlunsplit(data))
```

Output:

```text
http://www.google.com/index.html?q=6#comment
```

## 6. urljoin\(\) {#5-urljoin}

There is another way to build links. With urljoin\(\) we pass a base\_url (the base link) as the first argument and the new link as the second. The method analyzes the scheme, netloc, and path of base\_url and fills in whichever of these are missing from the new link; any component present in the new link takes precedence. The parameters, query, and fragment of base\_url have no effect.

Example:

```text
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
# An absolute second argument overrides the base URL
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))
```

Output:

```text
http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2
```

## 7. urlencode\(\) {#6-urlencode}

Builds the query string of a GET request from a dictionary.

Example:

```text
from urllib.parse import urlencode

# Build the parameter dictionary
params = {
    'wd': 'python',
}
base_url = 'https://www.baidu.com/s?'
url = base_url + urlencode(params)
print(url)
```

Output:

```text
https://www.baidu.com/s?wd=python
```

## 8. parse\_qs\(\) {#7-parseqs}

The parse\_qs\(\) method converts a query string into a dictionary. Each key maps to a list of values, because a key may appear more than once (f appears twice below). Parameters with blank values, such as mock=, are dropped by default, and values are percent-decoded once, so %25252F becomes %252F.

Example:

```text
from urllib.parse import parse_qs

query = '''f=json&mock=&uin=777&key=777&pass_ticket=nFLy3qzW6g8xVh%25252FRdSuoEMZn%25252BYrRjEh0fsybociYtgE%25253D&wxtoken=777&devicetype=android-26&clientversion=26060739&appmsg_token=966_3pMS7R2ZHEtCjbLZ3O0EDgaTpZ9B-N7GrMG3lOqeNFz9EH9p3dcgPHSiCjE~&x5=1&f=json'''
print(parse_qs(query))
```

Output:

```text
{'f': ['json', 'json'], 'uin': ['777'], 'key': ['777'], 'pass_ticket': ['nFLy3qzW6g8xVh%252FRdSuoEMZn%252BYrRjEh0fsybociYtgE%253D'], 'wxtoken': ['777'], 'devicetype': ['android-26'], 'clientversion': ['26060739'], 'appmsg_token': ['966_3pMS7R2ZHEtCjbLZ3O0EDgaTpZ9B-N7GrMG3lOqeNFz9EH9p3dcgPHSiCjE~'], 'x5': ['1']}
```

## 9. parse\_qsl\(\)

The parse\_qsl\(\) method converts the parameters into a list of (key, value) tuples, keeping repeated keys as separate entries.

Example:

```text
from urllib.parse import parse_qsl

query = '''f=json&mock=&uin=777&key=777&pass_ticket=nFLy3qzW6g8xVh%25252FRdSuoEMZn%25252BYrRjEh0fsybociYtgE%25253D&wxtoken=777&devicetype=android-26&clientversion=26060739&appmsg_token=966_3pMS7R2ZHEtCjbLZ3O0EDgaTpZ9B-N7GrMG3lOqeNFz9EH9p3dcgPHSiCjE~&x5=1&f=json'''
print(parse_qsl(query))
```

Output:

```text
[('f', 'json'), ('uin', '777'), ('key', '777'), ('pass_ticket', 'nFLy3qzW6g8xVh%252FRdSuoEMZn%252BYrRjEh0fsybociYtgE%253D'), ('wxtoken', '777'), ('devicetype', 'android-26'), ('clientversion', '26060739'), ('appmsg_token', '966_3pMS7R2ZHEtCjbLZ3O0EDgaTpZ9B-N7GrMG3lOqeNFz9EH9p3dcgPHSiCjE~'), ('x5', '1'), ('f', 'json')]
```

## 10. quote\(\)

The quote\(\) method converts content into URL-encoded (percent-encoded) form. A URL that carries Chinese parameters can end up garbled, so we use this method to percent-encode the Chinese characters.

Example:

```text
from urllib.parse import quote

wd = "猫"
url = "https://www.baidu.com/s?wd=" + quote(wd)
print(url)
```

Output:

```text
https://www.baidu.com/s?wd=%E7%8C%AB
```

## 11. unquote\(\)

The unquote\(\) method performs URL decoding.

Example:

```text
from urllib.parse import unquote

url = "https://www.baidu.com/s?wd=%E7%8C%AB"
print(unquote(url))
```

Output:

```text
https://www.baidu.com/s?wd=猫
```
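
The functions in this section are often used together in a crawler. The following is a minimal sketch of one such round trip, not a prescribed workflow: the example.com page URL, the relative link, and the query parameters are made-up values for illustration and do not come from the notes above. It resolves a relative link with urljoin\(\), attaches an encoded query with urlencode\(\) and urlunsplit\(\), and reads the parameters back with parse\_qs\(\). It also shows that a leading // lets urlparse\(\) fill in netloc when a default scheme is supplied.

```python
from urllib.parse import (
    parse_qs, urlencode, urljoin, urlparse, urlsplit, urlunsplit,
)

# With a leading "//", urlparse recognizes the host as netloc and still
# applies the default scheme passed in the scheme argument.
bare = urlparse("//www.google.com.hk/search", scheme="https")
print(bare.scheme, bare.netloc, bare.path)

# Hypothetical inputs: the page a crawler is on and a relative link found there.
page_url = "https://www.example.com/docs/index.html?lang=en#top"
relative_link = "../search"

# 1. Resolve the relative link; the base URL's query and fragment are ignored.
search_url = urljoin(page_url, relative_link)

# 2. Build the query string; urlencode percent-encodes non-ASCII values itself.
query = urlencode({"wd": "猫", "page": 2})

# 3. Swap the query into the split URL (SplitResult is a named tuple,
#    so _replace returns a modified copy) and reassemble it.
parts = urlsplit(search_url)
full_url = urlunsplit(parts._replace(query=query))

# 4. Parse the query back; parse_qs decodes the percent-encoding.
decoded = parse_qs(urlsplit(full_url).query)

print(search_url)
print(full_url)
print(decoded)
```

Going through urlsplit\(\) and urlunsplit\(\) rather than concatenating strings avoids having to track whether the base already ends in ? or already carries a query of its own.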