For learning urllib2, a good starting point is the tutorial by the author of *IronPython In Action*; it has plenty of concise examples along with very thorough explanations of the underlying principles: [http://www.voidspace.org.uk/python/articles/urllib2.shtml](http://www.voidspace.org.uk/python/articles/urllib2.shtml)

The most basic crawler boils down to two calls: urllib2.urlopen() and re.compile().

## 1. A simple page-fetch example

Start with the simplest case: fetch the Baidu Music page and get its html back as a string.

~~~
# -*- coding: utf8 -*-
import urllib2

# open the URL and read the whole response body as a string
response = urllib2.urlopen('http://music.baidu.com')
html = response.read()
print html
~~~

![](https://box.kancloud.cn/2016-02-18_56c5641c9b026.jpg)

This example mainly illustrates urllib2.urlopen(), which maps the outgoing request to a request object (the URL scheme need not be http:; ftp: and file: URLs also work). HTTP is a request/response protocol: the client sends a request and the server answers with a response.

urllib2 creates a Request object from the address you ask for; passing it to urlopen() returns a response object whose content can be read with .read(). So the program above can equally be written as:

~~~
# -*- coding: utf8 -*-
import urllib2

# build the Request object explicitly, then open it
request = urllib2.Request('http://music.baidu.com')
response = urllib2.urlopen(request)
html = response.read()
print html
~~~

## 2. A Sina Weibo crawler example

Sticking with the earlier Weibo crawler: fetch all pages under one Sina Weibo topic and save each page as an html file in the current project directory, url=http://s.weibo.com/wb/蘋果手機&nodup=1&page=20

Source:

~~~
# -*- coding:utf-8 -*-
'''
#=====================================================
#     FileName: sina_html.py
#         Desc: download html pages from sina_weibo and save to local files
#       Author: DianaCody
#      Version: 1.0
#        Since: 2014-09-27 15:20:21
#=====================================================
'''

import string, urllib2

# sina tweet's url = 'http://s.weibo.com/wb/topic&nodup=1&page=20'

def writeHtml(url, start_page, end_page):
    for i in range(start_page, end_page + 1):
        FileName = string.zfill(i, 3)    # zero-pad the page number: 1 -> '001'
        HtmlPath = FileName + '.html'
        print 'Downloading No.' + str(i) + ' page and save as ' + FileName + '.html...'
        f = open(HtmlPath, 'w+')
        html = urllib2.urlopen(url + str(i)).read()    # url already ends with 'page='
        f.write(html)
        f.close()

def crawler():
    url = 'http://s.weibo.com/wb/iPhone&nodup=1&page='
    s_page = 1
    e_page = 10
    print 'Now begin to download html pages...'
    writeHtml(url, s_page, e_page)

if __name__ == '__main__':
    crawler()
~~~

Once the program finishes, the html pages sit in the current project directory; refresh the Package Explorer on the left and the fetched pages show up. Only 10 pages were fetched here. Opening one:

![](https://box.kancloud.cn/2016-02-18_56c5641cb5207.jpg)

The html source of a page:

![](https://box.kancloud.cn/2016-02-18_56c5641ccef76.jpg)

All that remains is extracting the fields with regular expressions, which mainly means Python's re module; a short sketch follows.
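As a minimal, self-contained illustration of that step, the snippet below uses re.compile() and findall() with a capturing variant of the `"content"` pattern that the full crawler in the next section uses. The sample string is made up to stand in for a saved page such as 001.html, so treat it as a sketch rather than a faithful copy of Weibo's markup:

~~~
# -*- coding: utf8 -*-
import re

# capturing variant of the crawler's '"content":\s".*?",' pattern
pattern = re.compile(r'"content":\s*"(.*?)"', re.DOTALL)

# made-up stand-in for the markup of a saved page like 001.html
sample = '{"content": "first tweet"}, {"content": "second tweet"}'

# findall() returns the captured group of every match
for text in pattern.findall(sample):
    print text
~~~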
## 3. NetEase Weibo crawler software (Python version)

The above covers only the basic fetch loop. After adding regex extraction of the tweet text, handling of Chinese character encodings, and so on, here is the finished crawler software (it has also been converted into a runnable exe).

Full source:

~~~
# -*- coding:utf-8 -*-
'''
#=====================================================
#     FileName: tweet163_crawler.py
#         Desc: download html pages from 163 tweet and save to local files
#       Author: DianaCody
#      Version: 1.0
#        Since: 2014-09-27 15:20:21
#=====================================================
'''

import string
import urllib2
import re
import chardet

# sina tweet's url = 'http://s.weibo.com/wb/topic&nodup=1&page=20'
# 163 tweet's url = 'http://t.163.com/tag/topic&nodup=1&page=20'

def writeHtml(url, start_page, end_page):
    for i in range(start_page, end_page + 1):
        FileName = string.zfill(i, 3)    # zero-pad the page number: 1 -> '001'
        HtmlPath = FileName + '.html'
        print 'Downloading No.' + str(i) + ' page and save as ' + FileName + '.html...'
        f = open(HtmlPath, 'w+')
        html = urllib2.urlopen(url + str(i)).read()
        f.write(html)
        f.close()

def crawler(key, s_page, e_page):
    url = 'http://t.163.com/tag/' + key + '&nodup=1&page='
    print 'Now begin to download html pages...'
    writeHtml(url, s_page, e_page)

def regex(start_page, end_page):
    for i in range(start_page, end_page + 1):
        HtmlPath = string.zfill(i, 3) + '.html'    # same zero-padded names writeHtml used
        page = open(HtmlPath).read()
        # detect the page's encoding and normalize everything to utf-8
        charset = chardet.detect(page)['encoding']
        if charset != 'utf-8' and charset != 'UTF-8':
            page = page.decode('gb2312', 'ignore').encode('utf-8')
        unicodePage = page.decode('utf-8')
        # every "content" field holds one tweet's text
        pattern = re.compile('"content":\s".*?",', re.DOTALL)
        contents = pattern.findall(unicodePage)
        for content in contents:
            print content

if __name__ == '__main__':
    key = str(raw_input(u'please input your search key: \n'))
    begin_page = int(raw_input(u'input begin pages:\n'))
    end_page = int(raw_input(u'input end pages:\n'))
    crawler(key, begin_page, end_page)
    print 'Crawler finished...\n'
    print 'The contents are: '
    regex(begin_page, end_page)
    raw_input()
~~~

**You type in a search keyword and a page range; the program fetches those pages and extracts the tweet data matching the keyword.**

- custom search keyword
- custom number of pages to crawl
- no login required; the day's tweet data is stored in local files
- parses the fetched tweet pages and extracts the tweet text
- ships as an exe, so it runs even without a Python environment (a packaging sketch follows the download links)

1. Features

Crawls live tweet data; data source: [http://t.163.com/tag/searchword/](http://t.163.com/tag/yourkey/)

2. Demo

1. Custom keyword and number of pages to fetch:

![](https://box.kancloud.cn/2016-02-18_56c5641ce5e84.jpg)

2. The crawl results, showing the tweet text:

![](https://box.kancloud.cn/2016-02-18_56c5641d0b4ce.jpg)

3. Download

The software has been put on GitHub: [https://github.com/DianaCody/Spider_python](https://github.com/DianaCody/Spider_python/tree/master/Tweet163_Crawler/release)

Software path: [https://github.com/DianaCody/Spider_python/tree/master/Tweet163_Crawler/release](https://github.com/DianaCody/Spider_python/tree/master/Tweet163_Crawler/release)

The exe can also be downloaded here: [download](http://download.csdn.net/detail/dianacody/7659093) [http://download.csdn.net/detail/dianacody/8001441](http://download.csdn.net/detail/dianacody/8001441)
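To turn the script into such an exe yourself, a minimal py2exe setup script is enough; py2exe is an assumption here, since the post does not name the tool that produced its exe:

~~~
# setup.py -- hypothetical packaging script; py2exe is assumed, as the
# post does not say which tool built the released exe
from distutils.core import setup
import py2exe    # importing registers the 'py2exe' distutils command

# build a console exe from the crawler: python setup.py py2exe
setup(console=['tweet163_crawler.py'])
~~~

Running `python setup.py py2exe` leaves the exe, along with the Python runtime it needs, in a `dist/` folder.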
Original article; please credit the source when reposting: [http://blog.csdn.net/dianacody/article/details/39741413](http://blog.csdn.net/dianacody/article/details/39741413)