11.1. Overview · Dive Into Python

# 11.1. Overview

You have learned about [HTML processing](../html_processing/index.html "Chapter 8. HTML Processing") and [XML processing](../xml_processing/index.html "Chapter 9. XML Processing"), and along the way you saw [how to download a web page](../html_processing/extracting_data.html#dialect.extract.urllib "Example 8.5. Introducing urllib") and [how to parse XML from a URL](../scripts_and_streams/index.html#kgp.openanything.urllib "Example 10.2. Parsing XML from a URL"). Now let's take a broader look at the topic of HTTP web services.

Simply stated, HTTP web services are programmatic ways of sending data to and receiving data from remote servers using the operations of HTTP directly. If you want to get data from the server, use a straight HTTP GET; if you want to send new data to the server, use HTTP POST. (Some more advanced HTTP web service APIs also define ways of modifying and deleting existing data, using HTTP PUT and HTTP DELETE.) In other words, the "verbs" built into the HTTP protocol (GET, POST, PUT, and DELETE) map directly to the application-level operations of receiving, sending, modifying, and deleting data.

The main advantage of this approach is simplicity, and that simplicity has proven popular with many different sites. Data (usually XML data) can be built and stored statically, or generated dynamically by a server-side script, and all major languages include an HTTP library for downloading it. Debugging is also easier, because you can load the web service in any web browser and see the raw data. Modern browsers will even format and pretty-print XML data for you, so that you can navigate through it quickly.

Examples of pure XML-over-HTTP web services:

* Amazon API allows you to retrieve product information from the Amazon.com online store.
* National Weather Service (United States) and [Hong Kong Observatory](http://demo.xml.weather.gov.hk/) (Hong Kong) offer weather alerts as a web service.
* Atom API is used for managing web-based content.
* Syndicated feeds from weblogs and news sites bring you the latest news from a variety of sites.

In later chapters, you will explore APIs that use HTTP as a transport for sending and receiving data but do not map application semantics to the underlying HTTP semantics. (They tunnel everything over HTTP POST.) This chapter, however, concentrates on using HTTP GET to get data from a remote server, and explores several HTTP features you can use to get the maximum benefit out of pure HTTP web services.

Here is a more advanced version of the `openanything` module that you saw in the [previous chapter](../scripts_and_streams/index.html "Chapter 10. Scripts and Streams"):

## Example 11.1. `openanything.py`

If you have not already done so, you can [download this and other examples](http://www.woodpecker.org.cn/diveintopython/download/diveintopython-exampleszh-cn-5.4b.zip "Download example scripts") used in this book.

```
import sys
import urllib2, urlparse, gzip
from StringIO import StringIO

USER_AGENT = 'OpenAnything/1.0 +http://diveintopython.org/http_web_services/'

class SmartRedirectHandler(urllib2.HTTPRedirectHandler):
    # follow the redirect as usual, but remember the original status code
    def http_error_301(self, req, fp, code, msg, headers):
        result = urllib2.HTTPRedirectHandler.http_error_301(
            self, req, fp, code, msg, headers)
        result.status = code
        return result

    def http_error_302(self, req, fp, code, msg, headers):
        result = urllib2.HTTPRedirectHandler.http_error_302(
            self, req, fp, code, msg, headers)
        result.status = code
        return result

class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):
    # return the HTTPError object instead of raising it
    def http_error_default(self, req, fp, code, msg, headers):
        result = urllib2.HTTPError(
            req.get_full_url(), code, msg, headers, fp)
        result.status = code
        return result

def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT):
    '''URL, filename, or string --> stream

    This function lets you define parsers that take any input source
    (URL, pathname to local or network file, or actual data as a string)
    and deal with it in a uniform manner.  Returned object is guaranteed
    to have all the basic stdio read methods (read, readline, readlines).
    Just .close() the object when you're done with it.

    If the etag argument is supplied, it will be used as the value of an
    If-None-Match request header.

    If the lastmodified argument is supplied, it must be a formatted
    date/time string in GMT (as returned in the Last-Modified header of
    a previous request).  The formatted date/time will be used
    as the value of an If-Modified-Since request header.

    If the agent argument is supplied, it will be used as the value of a
    User-Agent request header.
    '''

    if hasattr(source, 'read'):
        return source

    if source == '-':
        return sys.stdin

    if urlparse.urlparse(source)[0] == 'http':
        # open URL with urllib2
        request = urllib2.Request(source)
        request.add_header('User-Agent', agent)
        if etag:
            request.add_header('If-None-Match', etag)
        if lastmodified:
            request.add_header('If-Modified-Since', lastmodified)
        request.add_header('Accept-encoding', 'gzip')
        opener = urllib2.build_opener(SmartRedirectHandler(), DefaultErrorHandler())
        return opener.open(request)

    # try to open with native open function (if source is a filename)
    try:
        return open(source)
    except (IOError, OSError):
        pass

    # treat source as string
    return StringIO(str(source))

def fetch(source, etag=None, last_modified=None, agent=USER_AGENT):
    '''Fetch data and metadata from a URL, file, stream, or string'''
    result = {}
    f = openAnything(source, etag, last_modified, agent)
    result['data'] = f.read()
    if hasattr(f, 'headers'):
        # save ETag, if the server sent one
        result['etag'] = f.headers.get('ETag')
        # save Last-Modified header, if the server sent one
        result['lastmodified'] = f.headers.get('Last-Modified')
        if f.headers.get('content-encoding', '') == 'gzip':
            # data came back gzip-compressed, decompress it
            result['data'] = gzip.GzipFile(fileobj=StringIO(result['data'])).read()
    if hasattr(f, 'url'):
        result['url'] = f.url
        result['status'] = 200
    if hasattr(f, 'status'):
        result['status'] = f.status
    f.close()
    return result
```

## Further reading

* Paul Prescod believes that [pure HTTP web services are the future of the Internet](http://webservices.xml.com/pub/a/ws/2002/02/06/rest.html).
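
To make the header logic in `openAnything` easier to study in isolation, here is a minimal sketch that builds the same kind of conditional GET request without sending it. It is an illustration only and makes several assumptions: it targets Python 3 and uses `urllib.request` (the successor of the `urllib2` module used in the listing), and the function name `build_conditional_request`, the `ExampleClient/0.1` User-Agent string, and the `example.org` URL are hypothetical names that are not part of the book's example.

```python
# Python 3 sketch (assumption: the listing above is Python 2 and uses urllib2).
from urllib.request import Request

# Hypothetical User-Agent string, used only for this illustration.
USER_AGENT = 'ExampleClient/0.1'

def build_conditional_request(url, etag=None, last_modified=None, agent=USER_AGENT):
    """Build, but do not send, a GET request carrying cache validators."""
    request = Request(url)
    request.add_header('User-Agent', agent)
    if etag:
        # Ask the server to send the body only if the ETag has changed.
        request.add_header('If-None-Match', etag)
    if last_modified:
        # Ask the server to send the body only if it changed after this date.
        request.add_header('If-Modified-Since', last_modified)
    # Tell the server that a gzip-compressed response is acceptable.
    request.add_header('Accept-Encoding', 'gzip')
    return request

if __name__ == '__main__':
    req = build_conditional_request(
        'http://example.org/feed.xml',            # hypothetical URL
        etag='"abc123"',                          # value saved from an earlier response
        last_modified='Thu, 15 Apr 2004 19:45:21 GMT')
    print(req.get_method())
    for name, value in sorted(req.header_items()):
        print(name, value)
```

Because no data is attached, the request uses the GET verb. The `etag` and `last_modified` values would normally come from the `etag` and `lastmodified` entries that `fetch` saves from a previous response.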