Scrapy has a short getting-started tutorial that is well worth reading; I find the official documentation to be the most reliable and accurate source.
First, let's create a Scrapy project:
~~~
scrapy startproject weather
~~~
I am running Ubuntu 12.04. After creating the project, a weather folder appears in the home directory. We can inspect its structure with tree, which can be installed via sudo apt-get install tree.
~~~
tree weather
~~~
~~~
weather
├── scrapy.cfg
├── wea.json
├── weather
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── pipelines.py
│   ├── pipelines.py~
│   ├── pipelines.pyc
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── __init__.py
│       ├── __init__.pyc
│       ├── weather_spider1.py
│       ├── weather_spider1.pyc
│       ├── weather_spider2.py
│       ├── weather_spider2.py~
│       ├── weather_spider2.pyc
│       └── weather_spider.pyc
├── weather.json
└── wea.txt
~~~
The tree above shows the project after I had already written the crawler. To see what the files look like initially, let's create a fresh project called weathertest:
~~~
weathertest
├── scrapy.cfg
└── weathertest
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py
~~~
~~~
scrapy.cfg: the project's configuration file
weather/: the project's Python module; you will add your code here
weather/items.py: defines the elements to extract; acts as a container
weather/pipelines.py: where you handle saving items to a file or sending them elsewhere
weather/settings.py: the project's settings file
weather/spiders/: the directory where spider code lives
~~~
An Item is the container that holds the scraped data. It is used much like a Python dict, but adds an extra layer of protection: assigning to an undeclared field raises an error, so typos in field names are caught immediately.
~~~
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    city = scrapy.Field()
    date = scrapy.Field()
    dayDesc = scrapy.Field()
    dayTemp = scrapy.Field()
~~~
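To see that typo protection in action without running a crawl, here is a toy sketch of the idea. This is only an illustration of the behavior, not Scrapy's actual implementation; the StrictItem class and its field list are made up for the demo:

```python
# Toy illustration of the protection scrapy.Item gives you: assigning
# to a field that was never declared raises KeyError instead of
# silently creating a new key. (Not Scrapy's real implementation.)
class StrictItem(dict):
    fields = ('city', 'date', 'dayDesc', 'dayTemp')

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('item does not support field: %s' % key)
        dict.__setitem__(self, key, value)

item = StrictItem()
item['city'] = 'Beijing'       # declared field: accepted
try:
    item['citty'] = 'Beijing'  # typo: rejected immediately
    typo_caught = False
except KeyError:
    typo_caught = True
```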
Next we write today's spider number one, which uses XPath to pick the data out of the HTML tags. To create a Spider, you must subclass scrapy.Spider and define these three attributes:
1. name: identifies the Spider. It must be unique; you may not give two Spiders the same name.
2. start_urls: the list of URLs the Spider crawls at startup. The first pages fetched will come from this list; subsequent URLs are extracted from the data of those initial pages.
3. parse(): a method of the spider. When called, it receives the Response object generated for each downloaded start URL as its only argument. It is responsible for parsing the response data, extracting items, and generating Request objects for any further URLs to follow.
~~~
import scrapy
from weather.items import WeatherItem

class WeatherSpider(scrapy.Spider):
    name = 'weather_spider1'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://weather.sina.com.cn/beijing']

    def parse(self, response):
        item = WeatherItem()
        item['city'] = response.xpath("//*[@id='slider_ct_name']/text()").extract()
        tenDay = response.xpath('//*[@id="blk_fc_c0_scroll"]')
        item['date'] = tenDay.css('p.wt_fc_c0_i_date::text').extract()
        item['dayDesc'] = tenDay.css('img.icons0_wt::attr(title)').extract()
        item['dayTemp'] = tenDay.css('p.wt_fc_c0_i_temp::text').extract()
        return item
~~~
Scrapy uses a mechanism based on XPath and CSS expressions: Scrapy Selectors.
Here are some XPath expressions and their meanings:
/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text of that <title> element
//td: selects all <td> elements
//div[@class="mine"]: selects all div elements with the attribute class="mine"
These are just a few simple examples; XPath is far more powerful than this.
To work with XPath, Scrapy provides the Selector class, plus shortcuts that spare you from constructing a selector by hand every time you extract data from a response.
A Selector has four basic methods (see the API documentation for details):
xpath(): takes an XPath expression and returns a SelectorList of all matching nodes
css(): takes a CSS expression and returns a SelectorList of all matching nodes
extract(): serializes the node(s) to unicode strings and returns them as a list
re(): extracts data by applying the given regular expression and returns a list of unicode strings
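To try the XPath expressions above without a full crawl, you can experiment on a small document. The sketch below uses the standard library's xml.etree, whose XPath support is only a limited approximation of the full XPath that Scrapy's Selector (backed by lxml) understands; the sample HTML is made up:

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed snippet to exercise the expressions above.
html = ("<html><head><title>Weather</title></head>"
        "<body><div class='mine'><td>20</td><td>11</td></div></body></html>")
root = ET.fromstring(html)

# /html/head/title/text() -- ElementTree spells the text step as .text
title_text = root.find('head/title').text

# //td -- ElementTree needs an explicit .// prefix for "anywhere"
tds = root.findall('.//td')

# //div[@class="mine"]
mine_divs = root.findall(".//div[@class='mine']")
```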
Now we can write pipelines.py. If you just want to dump the results to a file, you can skip this file entirely and leave it as-is; instead, append -o weather.json when running the crawler:
~~~
scrapy crawl weather_spider1 -o weather.json
~~~
~~~
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

class WeatherPipeline(object):
    def __init__(self):
        self.file = open('wea.txt', 'w+')

    def process_item(self, item, spider):
        city = item['city'][0].encode('utf-8')
        self.file.write('city:' + str(city) + '\n\n')
        date = item['date']
        # night and day descriptions alternate in the scraped list
        desc = item['dayDesc']
        dayDesc = desc[1::2]
        nightDesc = desc[0::2]
        dayTemp = item['dayTemp']
        for d, dd, nd, temp in zip(date, dayDesc, nightDesc, dayTemp):
            # each temperature string holds "day / night"
            dt, nt = temp.split('/')
            txt = 'date: {0} \t\t day:{1}({2}) \t\t night:{3}({4}) \n\n'.format(
                d,
                dd.encode('utf-8'),
                dt.encode('utf-8'),
                nd.encode('utf-8'),
                nt.encode('utf-8'),
            )
            self.file.write(txt)
        return item
~~~
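The key trick in process_item is that the page lists night and day descriptions alternately, so the even-indexed entries are night and the odd-indexed ones are day, and each temperature string holds both values separated by '/'. Here is that reshaping run on made-up sample data (the real values come from the spider):

```python
# Made-up sample data in the same shape the spider produces.
date = ['05-11', '05-12']
desc = ['cloudy-n', 'cloudy-d', 'sunny-n', 'sunny-d']  # night/day alternate
dayTemp = ['20C / 11C', '27C / 11C']

dayDesc = desc[1::2]    # odd indices: daytime descriptions
nightDesc = desc[0::2]  # even indices: nighttime descriptions

lines = []
for d, dd, nd, temp in zip(date, dayDesc, nightDesc, dayTemp):
    day_t, night_t = temp.split('/')
    lines.append('date: %s day:%s(%s) night:%s(%s)'
                 % (d, dd, day_t.strip(), nd, night_t.strip()))
```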
Finally, configure settings.py and you are done. In settings.py you can set the identity (user agent) or proxy that the crawler presents when fetching a site.
~~~
# -*- coding: utf-8 -*-

# Scrapy settings for weather project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html

BOT_NAME = 'weather'

SPIDER_MODULES = ['weather.spiders']
NEWSPIDER_MODULE = 'weather.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'weather (+http://www.yourdomain.com)'
# Note: the value must not repeat the 'User-Agent:' header name itself
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'

DEFAULT_REQUEST_HEADERS = {
    'Referer': 'http://www.weibo.com',
}

ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 1,
}

DOWNLOAD_DELAY = 0.5
~~~
You can also scrape the page with BeautifulSoup instead of Selectors. Here is today's spider number two:
~~~
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from weather.items import WeatherItem

class WeatherSpider(scrapy.Spider):
    name = "weather_spider2"
    allowed_domains = ["sina.com.cn"]
    start_urls = ['http://weather.sina.com.cn']

    def parse(self, response):
        html_doc = response.body
        #html_doc = html_doc.decode('utf-8')
        soup = BeautifulSoup(html_doc)
        itemTemp = {}
        itemTemp['city'] = soup.find(id='slider_ct_name')
        tenDay = soup.find(id='blk_fc_c0_scroll')
        itemTemp['date'] = tenDay.findAll("p", {"class": 'wt_fc_c0_i_date'})
        itemTemp['dayDesc'] = tenDay.findAll("img", {"class": 'icons0_wt'})
        itemTemp['dayTemp'] = tenDay.findAll('p', {"class": 'wt_fc_c0_i_temp'})
        item = WeatherItem()
        for att in itemTemp:
            item[att] = []
            if att == 'city':
                # city is a single tag, not a list of tags
                item[att] = itemTemp.get(att).text
                continue
            for obj in itemTemp.get(att):
                if att == 'dayDesc':
                    item[att].append(obj['title'])
                else:
                    item[att].append(obj.text)
        return item
~~~
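findAll collects every tag that matches a name and attribute filter. If you want to see what that amounts to without installing bs4, here is a rough standard-library sketch of the one lookup the spider uses for dayDesc. The sample HTML is made up, and this is only the idea, not BeautifulSoup itself:

```python
from html.parser import HTMLParser

# Rough stdlib sketch of: tenDay.findAll("img", {"class": 'icons0_wt'}),
# keeping each match's title attribute. Not BeautifulSoup -- just the idea.
class IconTitles(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'img' and attrs.get('class') == 'icons0_wt':
            self.titles.append(attrs.get('title'))

parser = IconTitles()
parser.feed('<div id="blk_fc_c0_scroll">'
            '<img class="icons0_wt" title="cloudy">'
            '<img class="icons0_wt" title="sunny"></div>')
```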
Finally, cd into the weather folder and run Scrapy.
First, you can list the available Scrapy commands; note that running scrapy inside a project directory shows a different set of commands than running it elsewhere:
~~~
Scrapy 0.24.6 - project: weather

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  deploy        Deploy project in Scrapyd target
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
~~~
Now run scrapy crawl weather_spider1 or scrapy crawl weather_spider2. A wea.txt file is generated in the project folder; open it and you have today's weather:
~~~
city:北京
date: 05-11 day:多云(20°C ) night:多云( 11°C)
date: 05-12 day:晴(27°C ) night:晴( 11°C)
date: 05-13 day:多云(29°C ) night:晴( 17°C)
date: 05-14 day:多云(29°C ) night:多云( 19°C)
date: 05-15 day:晴(26°C ) night:晴( 12°C)
date: 05-16 day:晴(27°C ) night:晴( 16°C)
date: 05-17 day:陰(29°C ) night:晴( 19°C)
date: 05-18 day:晴(29°C ) night:少云( 16°C)
date: 05-19 day:局部多云(31°C ) night:少云( 16°C)
date: 05-20 day:局部多云(29°C ) night:局部多云( 16°C)
~~~