Scrapy has a short getting-started tutorial that is well worth reading; I find the official documentation to be the most reliable and accurate source.
First, let's create a Scrapy project:
~~~
scrapy startproject weather
~~~
I am running Ubuntu 12.04. After creating the project, a weather folder appears in the home directory. We can inspect its structure with tree, which can be installed via sudo apt-get install tree.
~~~
tree weather
~~~
~~~
weather
├── scrapy.cfg
├── wea.json
├── weather
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── pipelines.py
│   ├── pipelines.py~
│   ├── pipelines.pyc
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── __init__.py
│       ├── __init__.pyc
│       ├── weather_spider1.py
│       ├── weather_spider1.pyc
│       ├── weather_spider2.py
│       ├── weather_spider2.py~
│       ├── weather_spider2.pyc
│       └── weather_spider.pyc
├── weather.json
└── wea.txt
~~~
The tree above shows the project after I had already written the crawler. To see what the files look like initially, let's create a fresh project called weathertest:
~~~
weathertest
├── scrapy.cfg
└── weathertest
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py
~~~
~~~
scrapy.cfg: the project's configuration file
weather/: the project's Python module; you will add your code here
weather/items.py: defines the elements to extract; acts as a container
weather/pipelines.py: where you handle saving items to a file or sending them elsewhere
weather/settings.py: the project's settings file
weather/spiders/: the directory where spider code lives
~~~
An Item is the container that holds the scraped data. It is used much like a Python dict, but adds an extra layer of protection: assigning to an undeclared field raises an error, so typos in field names are caught immediately.
~~~
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    city = scrapy.Field()
    date = scrapy.Field()
    dayDesc = scrapy.Field()
    dayTemp = scrapy.Field()
~~~
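To see that typo protection in action without running a crawl, here is a toy sketch of the idea. This is only an illustration of the behavior, not Scrapy's actual implementation; the StrictItem class and its field list are made up for the demo:

```python
# Toy illustration of the protection scrapy.Item gives you: assigning
# to a field that was never declared raises KeyError instead of
# silently creating a new key. (Not Scrapy's real implementation.)
class StrictItem(dict):
    fields = ('city', 'date', 'dayDesc', 'dayTemp')

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('item does not support field: %s' % key)
        dict.__setitem__(self, key, value)

item = StrictItem()
item['city'] = 'Beijing'       # declared field: accepted
try:
    item['citty'] = 'Beijing'  # typo: rejected immediately
    typo_caught = False
except KeyError:
    typo_caught = True
```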
Next we write today's spider number one, which uses XPath to pick the data out of the HTML tags. To create a Spider, you must subclass scrapy.Spider and define these three attributes:
1. name: identifies the Spider. It must be unique; you may not give two Spiders the same name.
2. start_urls: the list of URLs the Spider crawls at startup. The first pages fetched will come from this list; subsequent URLs are extracted from the data of those initial pages.
3. parse(): a method of the spider. When called, it receives the Response object generated for each downloaded start URL as its only argument. It is responsible for parsing the response data, extracting items, and generating Request objects for any further URLs to follow.
~~~
import scrapy
from weather.items import WeatherItem

class WeatherSpider(scrapy.Spider):
    name = 'weather_spider1'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://weather.sina.com.cn/beijing']

    def parse(self, response):
        item = WeatherItem()
        item['city'] = response.xpath("//*[@id='slider_ct_name']/text()").extract()
        tenDay = response.xpath('//*[@id="blk_fc_c0_scroll"]')
        item['date'] = tenDay.css('p.wt_fc_c0_i_date::text').extract()
        item['dayDesc'] = tenDay.css('img.icons0_wt::attr(title)').extract()
        item['dayTemp'] = tenDay.css('p.wt_fc_c0_i_temp::text').extract()
        return item
~~~
Scrapy uses a mechanism based on XPath and CSS expressions: Scrapy Selectors.
Here are some XPath expressions and their meanings:
/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text of that <title> element
//td: selects all <td> elements
//div[@class="mine"]: selects all div elements with the attribute class="mine"
These are just a few simple examples; XPath is far more powerful than this.
To work with XPath, Scrapy provides the Selector class, plus shortcuts that spare you from constructing a selector by hand every time you extract data from a response.
A Selector has four basic methods (see the API documentation for details):
xpath(): takes an XPath expression and returns a SelectorList of all matching nodes
css(): takes a CSS expression and returns a SelectorList of all matching nodes
extract(): serializes the node(s) to unicode strings and returns them as a list
re(): extracts data by applying the given regular expression and returns a list of unicode strings
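To try the XPath expressions above without a full crawl, you can experiment on a small document. The sketch below uses the standard library's xml.etree, whose XPath support is only a limited approximation of the full XPath that Scrapy's Selector (backed by lxml) understands; the sample HTML is made up:

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed snippet to exercise the expressions above.
html = ("<html><head><title>Weather</title></head>"
        "<body><div class='mine'><td>20</td><td>11</td></div></body></html>")
root = ET.fromstring(html)

# /html/head/title/text() -- ElementTree spells the text step as .text
title_text = root.find('head/title').text

# //td -- ElementTree needs an explicit .// prefix for "anywhere"
tds = root.findall('.//td')

# //div[@class="mine"]
mine_divs = root.findall(".//div[@class='mine']")
```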
Now we can write pipelines.py. If you just want to dump the results to a file, you can skip this file entirely and leave it as-is; instead, append -o weather.json when running the crawler:
~~~
scrapy crawl weather_spider1 -o weather.json
~~~
~~~
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

class WeatherPipeline(object):
    def __init__(self):
        self.file = open('wea.txt', 'w+')

    def process_item(self, item, spider):
        city = item['city'][0].encode('utf-8')
        self.file.write('city:' + str(city) + '\n\n')
        date = item['date']
        # night and day descriptions alternate in the scraped list
        desc = item['dayDesc']
        dayDesc = desc[1::2]
        nightDesc = desc[0::2]
        dayTemp = item['dayTemp']
        for d, dd, nd, temp in zip(date, dayDesc, nightDesc, dayTemp):
            # each temperature string holds "day / night"
            dt, nt = temp.split('/')
            txt = 'date: {0} \t\t day:{1}({2}) \t\t night:{3}({4}) \n\n'.format(
                d,
                dd.encode('utf-8'),
                dt.encode('utf-8'),
                nd.encode('utf-8'),
                nt.encode('utf-8'),
            )
            self.file.write(txt)
        return item
~~~
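The key trick in process_item is that the page lists night and day descriptions alternately, so the even-indexed entries are night and the odd-indexed ones are day, and each temperature string holds both values separated by '/'. Here is that reshaping run on made-up sample data (the real values come from the spider):

```python
# Made-up sample data in the same shape the spider produces.
date = ['05-11', '05-12']
desc = ['cloudy-n', 'cloudy-d', 'sunny-n', 'sunny-d']  # night/day alternate
dayTemp = ['20C / 11C', '27C / 11C']

dayDesc = desc[1::2]    # odd indices: daytime descriptions
nightDesc = desc[0::2]  # even indices: nighttime descriptions

lines = []
for d, dd, nd, temp in zip(date, dayDesc, nightDesc, dayTemp):
    day_t, night_t = temp.split('/')
    lines.append('date: %s day:%s(%s) night:%s(%s)'
                 % (d, dd, day_t.strip(), nd, night_t.strip()))
```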
Finally, configure settings.py and you are done. In settings.py you can set the identity (user agent) or proxy that the crawler presents when fetching a site.
~~~
# -*- coding: utf-8 -*-

# Scrapy settings for weather project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html

BOT_NAME = 'weather'

SPIDER_MODULES = ['weather.spiders']
NEWSPIDER_MODULE = 'weather.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'weather (+http://www.yourdomain.com)'
# Note: the value must not repeat the 'User-Agent:' header name itself
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'

DEFAULT_REQUEST_HEADERS = {
    'Referer': 'http://www.weibo.com',
}

ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 1,
}

DOWNLOAD_DELAY = 0.5
~~~
You can also scrape the page with BeautifulSoup instead of Selectors. Here is today's spider number two:
~~~
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from weather.items import WeatherItem

class WeatherSpider(scrapy.Spider):
    name = "weather_spider2"
    allowed_domains = ["sina.com.cn"]
    start_urls = ['http://weather.sina.com.cn']

    def parse(self, response):
        html_doc = response.body
        #html_doc = html_doc.decode('utf-8')
        soup = BeautifulSoup(html_doc)
        itemTemp = {}
        itemTemp['city'] = soup.find(id='slider_ct_name')
        tenDay = soup.find(id='blk_fc_c0_scroll')
        itemTemp['date'] = tenDay.findAll("p", {"class": 'wt_fc_c0_i_date'})
        itemTemp['dayDesc'] = tenDay.findAll("img", {"class": 'icons0_wt'})
        itemTemp['dayTemp'] = tenDay.findAll('p', {"class": 'wt_fc_c0_i_temp'})
        item = WeatherItem()
        for att in itemTemp:
            item[att] = []
            if att == 'city':
                # city is a single tag, not a list of tags
                item[att] = itemTemp.get(att).text
                continue
            for obj in itemTemp.get(att):
                if att == 'dayDesc':
                    item[att].append(obj['title'])
                else:
                    item[att].append(obj.text)
        return item
~~~
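findAll collects every tag that matches a name and attribute filter. If you want to see what that amounts to without installing bs4, here is a rough standard-library sketch of the one lookup the spider uses for dayDesc. The sample HTML is made up, and this is only the idea, not BeautifulSoup itself:

```python
from html.parser import HTMLParser

# Rough stdlib sketch of: tenDay.findAll("img", {"class": 'icons0_wt'}),
# keeping each match's title attribute. Not BeautifulSoup -- just the idea.
class IconTitles(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'img' and attrs.get('class') == 'icons0_wt':
            self.titles.append(attrs.get('title'))

parser = IconTitles()
parser.feed('<div id="blk_fc_c0_scroll">'
            '<img class="icons0_wt" title="cloudy">'
            '<img class="icons0_wt" title="sunny"></div>')
```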
Finally, cd into the weather folder and run Scrapy.
First, you can list the available Scrapy commands; note that running scrapy inside a project directory shows a different set of commands than running it elsewhere:
~~~
Scrapy 0.24.6 - project: weather

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  deploy        Deploy project in Scrapyd target
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
~~~
Now run scrapy crawl weather_spider1 or scrapy crawl weather_spider2. A wea.txt file is generated in the project folder; open it and you have today's weather:
~~~
city:北京
date: 05-11 day:多云(20°C ) night:多云( 11°C)
date: 05-12 day:晴(27°C ) night:晴( 11°C)
date: 05-13 day:多云(29°C ) night:晴( 17°C)
date: 05-14 day:多云(29°C ) night:多云( 19°C)
date: 05-15 day:晴(26°C ) night:晴( 12°C)
date: 05-16 day:晴(27°C ) night:晴( 16°C)
date: 05-17 day:陰(29°C ) night:晴( 19°C)
date: 05-18 day:晴(29°C ) night:少云( 16°C)
date: 05-19 day:局部多云(31°C ) night:少云( 16°C)
date: 05-20 day:局部多云(29°C ) night:局部多云( 16°C)
~~~